Hi everyone who lands on this page! Welcome back to “I am confused about cybersecurity.” This is my second write-up in the series, and today we’re diving into the recent global IT outage caused by none other than a security update from CrowdStrike. Funnily enough, the software that’s supposed to protect you from cyberattacks ended up acting like one. Talk about irony! Now hackers can’t hack you because your systems won’t even start. Best protection ever, right?
What Happened?
On July 19th, 2024 (UTC), a minor content update to CrowdStrike’s Falcon sensor caused Windows systems worldwide to crash with the infamous blue screen of death (BSOD). While macOS and Linux systems were spared, the update wreaked havoc on Windows endpoints, demonstrating that even with robust IT infrastructures, a single point of failure can still bring everything down.
The Culprit: A Rogue Channel Update File
The disaster was triggered by channel update files with names matching the pattern “C-00000291*.sys” in the `%WINDIR%\System32\drivers\CrowdStrike` directory. The file was invalidly formatted, causing the kernel-level CrowdStrike driver that reads it to crash. Different customers received different versions of the problematic file, which further complicated diagnosis and cleanup.
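If you want to check whether a machine picked up the bad file, here is a minimal sketch (mine, not an official CrowdStrike tool) that lists any matches. It assumes the default Windows install path and that you run it from a recovery environment, since an affected machine won’t boot normally:

```python
import glob
import os

# Default CrowdStrike driver directory on a standard Windows install.
CHANNEL_DIR = os.path.expandvars(r"%WINDIR%\System32\drivers\CrowdStrike")

# The problematic channel files were reported to match this name pattern.
BAD_PATTERN = "C-00000291*.sys"

def find_bad_channel_files(directory: str = CHANNEL_DIR) -> list[str]:
    """Return paths of channel files matching the problematic pattern."""
    return glob.glob(os.path.join(directory, BAD_PATTERN))

if __name__ == "__main__":
    matches = find_bad_channel_files()
    if matches:
        print("Problematic channel file(s) found:")
        for path in matches:
            print("  " + path)
    else:
        print("No matching channel files found.")
```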
The Fix
The immediate fix was to reboot into safe mode, delete the rogue file, and then boot the system normally. For cloud systems, the fix is to detach the volume from the affected virtual machine, attach it to a temporary recovery VM, delete the file there, detach the volume from the temporary VM, and reattach it to the original virtual machine.
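Here’s a rough sketch of that cleanup step in script form, again just an illustration rather than an official tool. The `system_root` parameter exists for the cloud case: when the affected volume is attached to a recovery VM it shows up under a different drive letter, so you point the script at that mount instead of `C:\Windows`.

```python
import glob
import os

def remove_bad_channel_files(system_root: str = r"C:\Windows") -> int:
    """Delete channel files matching the problematic pattern under the given
    Windows directory. For a cloud recovery VM, pass the mount point of the
    attached volume (for example the Windows folder on the E: drive)."""
    pattern = os.path.join(system_root, r"System32\drivers\CrowdStrike",
                           "C-00000291*.sys")
    removed = 0
    for path in glob.glob(pattern):
        os.remove(path)
        print("Deleted", path)
        removed += 1
    return removed

if __name__ == "__main__":
    count = remove_bad_channel_files()
    print(f"Removed {count} file(s); reboot (or reattach the volume) and boot normally.")
```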
Official Source: https://www.crowdstrike.com/blog/statement-on-falcon-content-update-for-windows-hosts/
Why Did This Happen?
The invalid format of the file could have caused the BSOD for several reasons:
1. Invalid Memory Access: The driver may take a pointer or offset from the malformed file and dereference it without checking it first; in kernel mode that brings down the entire system. (Edit: yes, it was a null pointer access. At the time of writing this was still under analysis, so I’m updating it now.)
2. Unhandled Exceptions: The driver might not have proper error handling for malformed files. When it encounters unexpected data, it can throw an unhandled exception, causing the system to crash.
3. Critical System Operations: The file might be critical to the driver’s operation. An invalid file could disrupt critical functions the driver performs, such as security checks or communication with other system components, leading to a BSOD.
These scenarios highlight the importance of validating and thoroughly testing all updates before deployment, especially those involving low-level system components like drivers.
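To make the validation point concrete, here’s a minimal Python sketch of the idea. The real channel file format is proprietary and undocumented, so the magic bytes, header layout, and record size below are purely hypothetical; the point is simply that the parser checks the structure of the content before using it and rejects a malformed file with a clear error, instead of blowing up later at an invalid access.

```python
import struct

class MalformedChannelFile(Exception):
    """Raised when a content file fails validation and must be rejected."""

# Hypothetical layout: 4-byte magic, 4-byte version, 4-byte record count,
# then fixed-size 32-byte records. None of this reflects the real format.
MAGIC = b"CHNL"
HEADER = struct.Struct("<4sII")
RECORD_SIZE = 32

def parse_channel_file(data: bytes) -> list[bytes]:
    """Validate and parse a (hypothetical) channel file, or raise."""
    if len(data) < HEADER.size:
        raise MalformedChannelFile("file shorter than header")
    magic, version, count = HEADER.unpack_from(data, 0)
    if magic != MAGIC:
        raise MalformedChannelFile(f"bad magic: {magic!r}")
    expected = HEADER.size + count * RECORD_SIZE
    if len(data) != expected:
        # This is the kind of mismatch that, left unchecked, leads to reads
        # past the end of a buffer (or through a bogus pointer) in C code.
        raise MalformedChannelFile(f"expected {expected} bytes, got {len(data)}")
    return [data[HEADER.size + i * RECORD_SIZE: HEADER.size + (i + 1) * RECORD_SIZE]
            for i in range(count)]

if __name__ == "__main__":
    try:
        parse_channel_file(b"\x00" * 8)  # deliberately malformed input
    except MalformedChannelFile as err:
        print("Rejected:", err)
```

A parser that enforces checks like these can skip or quarantine a bad content update and keep the machine running, instead of taking the whole OS down with it.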
Who’s to Blame?
Blaming any one party for this mess is not straightforward: CrowdStrike shipped the faulty update, but organizations also let these agents push updates straight onto critical systems with very little scrutiny of their own.
Lessons Learned
1. Rigorous Validation: Updates, especially those rolled out to production environments, need thorough validation.
2. Manual Review: Having a manual review step for critical updates can prevent such widespread issues.
3. Comprehensive Testing: Testing updates across all systems can catch potential problems before they cause major outages (a rough sketch of such a pre-release gate follows this list).
4. Clear Communication: Open communication channels for reporting and addressing issues quickly are crucial.
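On the testing point, here’s what a pre-release gate could look like. Everything in this sketch is hypothetical: the `boot_test_vm` helper and the list of targets are stand-ins for whatever test infrastructure a vendor actually runs. The idea is simple: load the new content on a matrix of representative machines and refuse to ship it if any of them fail.

```python
from dataclasses import dataclass

@dataclass
class TestTarget:
    name: str   # e.g. "Windows 10 22H2", "Windows Server 2019"
    image: str  # VM image identifier in the test infrastructure

def boot_test_vm(target: TestTarget, update_path: str) -> bool:
    """Hypothetical helper: provision a VM from the image, install the
    content update, reboot, and report whether the machine came back
    healthy. A real implementation would talk to actual test infra."""
    raise NotImplementedError("wire this up to your test environment")

def release_gate(update_path: str, targets: list[TestTarget]) -> bool:
    """Return True only if the update survives every test target."""
    failures = []
    for target in targets:
        try:
            ok = boot_test_vm(target, update_path)
        except Exception as err:
            ok = False
            print(f"[{target.name}] test error: {err}")
        if not ok:
            failures.append(target.name)
    if failures:
        print("Blocking release; failures on:", ", ".join(failures))
        return False
    print("All targets healthy; safe to start a staged rollout.")
    return True

if __name__ == "__main__":
    targets = [TestTarget("Windows 10 22H2", "win10-22h2-base"),
               TestTarget("Windows Server 2019", "winsrv2019-base")]
    release_gate("C-00000291-new.sys", targets)
```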
Historically, we’ve seen similar incidents where a small oversight in the update process led to significant disruptions. For example, Microsoft’s infamous Windows 10 October 2018 update deleted user files due to a bug that slipped through the cracks. This highlights the need for meticulous testing and validation.
Conclusion
This CrowdStrike incident serves as a wake-up call for all of us in the cybersecurity field. It underscores the importance of robust validation processes, comprehensive testing, and the need for manual oversight in critical updates. Let’s hope this incident pushes companies to strengthen their processes and prevent such disasters in the future.
Treating endpoint security as a compliance checkbox is another reason these tools are trusted so blindly.
Hope you enjoyed reading this and learned something new! Happy learning!
If you would like to support me, please buy me a coffee. :)