Cybersecurity Vendor’s Preliminary Review Details Problems, Promises Improvements
CrowdStrike has blamed internal testing failures, including buggy testing software, for failing to catch the faulty “rapid content update” it pushed Friday, which caused worldwide disruption.
The company on Tuesday published its preliminary review into the incident, involving the faulty “Channel File 291” for its Falcon endpoint detection and response software.
After receiving the threat-update data, 8.5 million online Falcon-using Windows hosts crashed out to a “blue screen of death,” rebooted, then got stuck in an endless crash and reboot loop. Reflecting the types of organizations that use Falcon, the disruption led to serious outages across numerous critical sectors, including for major healthcare, banking, stock market and media organizations, as well as railways and airlines.
The report from CrowdStrike offers details of what happened and when, as well as the steps the company will take to try to prevent a repeat occurrence. The company has also pledged to release a full “root cause analysis” into the incident once it completes its investigation.
Security experts have saluted the timeliness and detail contained in CrowdStrike’s initial review. “It’s good and really honest,” said British cybersecurity expert Kevin Beaumont.
One “key takeaway,” he said, is that CrowdStrike has committed to a “smart” change, in the form of no longer deploying threat updates simultaneously to every Falcon endpoint, but rather in a more careful, gradual and well-monitored process.
Many other security software vendors, including Microsoft, already don’t push endpoint protection platform updates simultaneously to every client. Not doing so helps the initial deployments serve as a canary in the coal mine, in case something unexpected occurs.
CrowdStrike pushed the faulty Falcon configuration update Friday at 04:09 UTC, leading to crashes. Seventy-eight minutes later, the company “reverted” the file. Some systems successfully rebooted, received the new file, and recovered. Many more systems have required manual intervention.
Multiple airlines were temporarily grounded Friday due to the incident, stranding travelers. U.S. carrier Delta has been especially hard hit, although it has been recovering. On Tuesday, the airline canceled just 14% of its flights, compared to 36% on Sunday, reported flight-tracking service FlightAware.
As of Monday, IT asset tracking provider Sevco Security reported seeing 93% recovery rates of CrowdStrike Falcon software among its client base.
Both CrowdStrike and Microsoft have released tools to help automate the process, but many must be run from bootable USB drives, which requires remote workers to come on-site to get a fix.
On Tuesday, CrowdStrike delivered an update it had previewed, adding the faulty file to the CrowdStrike Cloud’s known-bad file list, since the faulty file likely still resided on numerous systems, even if it was no longer being accessed. The update was effective immediately for customers who use its US-1, US-2 and EU clouds, and available on demand for government customers.
One immediate upside from the move is that “for impacted systems with strong network connectivity, this action could also result in the automatic recovery of systems in a boot loop,” since affected systems may attempt to contact the CrowdStrike Cloud for updates, and receive instructions to excise the bad file, it said.
For organizations that use full-disk encryption, which is considered a best practice and also required by some regulations, recovering systems often requires entering a unique 48-digit key to unlock BitLocker full-disk encryption, which adds time and complexity to the recovery process (see: CrowdStrike Disruption Restoration Is Taking Time).
Preliminary Report
In its preliminary review into the incident, CrowdStrike said that on Feb. 28, it released an update to its Falcon sensor in the form of version 7.11, giving it new functionality for detecting threats, via what it calls an InterProcessCommunication, or IPC, template type. These templates are designed “to detect novel attack techniques that abuse named pipes,” an operating system mechanism for communication between processes.
The IPC templates get distributed “in a proprietary binary file that contains configuration data,” which CrowdStrike said “is not code or a kernel driver.” The configuration data “maps to specific behaviors for the sensor to observe, detect or prevent.”
The company said it successfully stress-tested the new IPC template type on March 5, using “a variety of operating systems and workloads.” In April, the company pushed three new, separate IPC templates to users, which “performed as expected in production.”
The opposite happened Friday, when it pushed two new IPC templates to Falcon endpoints, and one of the templates “passed validation despite containing problematic content data,” it said. “When received by the sensor and loaded into the Content Interpreter, problematic content in Channel File 291 resulted in an out-of-bounds memory read triggering an exception. This unexpected exception could not be gracefully handled, resulting in a Windows operating system crash (BSOD).”
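CrowdStrike has not published the Content Interpreter’s internals, but the failure mode it describes, an out-of-bounds read on malformed content data, can be sketched in a few lines of Python. Every name here (the parser, the field layout) is a hypothetical illustration of the general technique, not CrowdStrike’s code; the point is that bounds-checking each read lets malformed content be rejected gracefully rather than crash the host:

```python
import struct

def parse_template(blob: bytes) -> dict:
    """Parse a hypothetical binary template: a 4-byte field count,
    then, per field, a 2-byte length prefix and that many bytes.
    Bounds are checked before every read, so malformed content is
    rejected with an exception instead of an out-of-bounds access."""
    if len(blob) < 4:
        raise ValueError("truncated header")
    (count,) = struct.unpack_from("<I", blob, 0)
    offset, fields = 4, []
    for _ in range(count):
        if offset + 2 > len(blob):        # bounds check: length prefix
            raise ValueError("truncated length prefix")
        (length,) = struct.unpack_from("<H", blob, offset)
        offset += 2
        if offset + length > len(blob):   # bounds check: field body
            raise ValueError("field overruns buffer")
        fields.append(blob[offset:offset + length])
        offset += length
    return {"fields": fields}

# A well-formed blob parses; a malformed one is rejected gracefully.
good = struct.pack("<IH", 1, 3) + b"abc"
bad = struct.pack("<IH", 1, 999) + b"abc"   # claims 999 bytes, has 3
print(parse_template(good)["fields"])       # [b'abc']
try:
    parse_template(bad)
except ValueError as e:
    print("rejected:", e)
```

In user-space Python a missed check raises a recoverable exception; in a kernel-mode component, as in the Falcon sensor’s case, the same class of read crashes the operating system, which is why validation has to happen before the content ever reaches the endpoint.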
Upcoming Testing and Deployment Changes
The company has promised to introduce a number of software resiliency and testing improvements, ranging from more thorough and varied types of testing to updating the Content Interpreter in its software to better handle unexpected errors.
For rolling out future rapid response content, CrowdStrike said it will “implement a staggered deployment strategy,” gradually rolling out updates globally after “starting with a canary deployment.” The company said it will also give customers “greater control” over updates, including “when and where” they get deployed, and more closely monitor collective “sensor and system performance” to guide future content rollouts.
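The staggered strategy CrowdStrike describes can be sketched as a series of successively larger deployment rings, where each ring must report healthy before the rollout widens. The ring sizes and health check below are illustrative assumptions, not CrowdStrike’s actual rollout logic:

```python
import random

def staged_rollout(hosts, rings=(0.01, 0.10, 0.50, 1.0), healthy=lambda h: True):
    """Deploy to successively larger fractions of the fleet,
    halting the rollout if any ring reports an unhealthy host."""
    random.shuffle(hosts)  # spread each ring across the fleet
    deployed = 0
    for fraction in rings:
        target = int(len(hosts) * fraction)
        ring = hosts[deployed:target]
        # Deploy to this ring, then gate on its health before widening.
        if not all(healthy(h) for h in ring):
            return deployed, "halted"
        deployed = target
    return deployed, "complete"

hosts = [f"host-{i}" for i in range(1000)]
print(staged_rollout(hosts))   # (1000, 'complete') when every ring is healthy
```

The first ring is the “canary deployment” the company mentions: had Friday’s update gone to 1% of hosts and been monitored before wider release, the boot loops would have halted the rollout at that stage.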
Security experts said the impact of a single faulty CrowdStrike software update reveals bigger-picture industry problems tied not just to technology but also interconnectivity (see: CrowdStrike, Microsoft Outage Uncovers Big Resiliency Issues).
“We have a small number of cyber companies effectively operating as God Mode on the world’s economy now,” Beaumont said in a blog post, when a more ideal scenario would involve customers being able to “have zero trust in cybersecurity vendors.”
Given the interconnectedness of software to the safe functioning of so many different parts of society, “there needs to be some way to enforce less risky behavior across all vendors,” he said. “This should also include Microsoft’s security solutions.”