Business Continuity Management / Disaster Recovery, Endpoint Security, Governance & Risk Management
CIOs Recommend Resilience Strategies in the Wake of Global CrowdStrike Outage
A faulty CrowdStrike Falcon sensor update last Thursday forced 8.5 million Windows PCs to crash and repeatedly reboot, displaying the infamous “blue screen of death.” The global outage, arguably the largest ever, disrupted businesses, airports and government agencies. But many CIOs believe the fallout could have been mitigated by investing in comprehensive data resilience strategies that can help restore corrupted data through orchestrated recovery.
According to an assessment by supply chain risk resilience company Interos, the outage affected 674,620 direct customer relationships and more than 49 million indirect ones. The U.S. was the most disrupted country, accounting for 41% of affected organizations, but major ports and air freight hubs across Europe and Asia were also hit. The U.K., Germany, Italy, France, Spain and the Netherlands together accounted for 27.68% of affected entities, the report said.
“This is nobody’s fault but CrowdStrike’s,” Howard Holton, COO of GigaOm, said in a blog post. “While they have committed to helping everyone affected, they have 24,000 customers, all of whom are impacted, so they cannot give each the attention they need. Billions of dollars in damage are being done to those companies from this outage.”
To protect Windows systems, CrowdStrike’s Falcon software runs in the kernel – the core of the operating system. This tight integration can cause major problems when updates are not properly tested.
“Such incidents are not uncommon in the cybersecurity industry, but this one is particularly damaging because it stems from a QA and testing issue, not a cybersecurity breach,” Holton said. “The tight integration between Falcon and the OS made the damage far more widespread and the recovery process far more onerous.”
CrowdStrike counts nearly 60% of Fortune 500 companies and more than half of the Fortune 1,000 among its clients. Eight of the top 10 financial services firms and an equal number of leading tech companies deploy its services. It also serves six of the top 10 companies in the healthcare sector and seven of the top 10 in the manufacturing sector.
Because the issue could not be rectified remotely or automatically – each affected system had to be booted manually into Safe Mode and the faulty CrowdStrike file deleted – resolution was not immediate, though CIOs hoped Microsoft and CrowdStrike would expedite restoration.
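For illustration, here is a minimal Python sketch of the per-machine cleanup step described above, assuming the widely reported CrowdStrike driver path and channel-file pattern; in practice administrators performed this manually from Safe Mode or the Windows Recovery Environment command prompt rather than via a script.

```python
# Illustrative sketch only -- the actual remediation was carried out by hand
# from Safe Mode or the Windows Recovery Environment.
# Path and file pattern follow the widely reported workaround guidance.
from pathlib import Path

CROWDSTRIKE_DIR = Path(r"C:\Windows\System32\drivers\CrowdStrike")
FAULTY_PATTERN = "C-00000291*.sys"  # the channel file implicated in the outage

def remove_faulty_channel_files() -> list[Path]:
    """Delete channel files matching the faulty pattern and return what was removed."""
    removed = []
    for channel_file in CROWDSTRIKE_DIR.glob(FAULTY_PATTERN):
        channel_file.unlink()        # delete the faulty channel file
        removed.append(channel_file)
    return removed

if __name__ == "__main__":
    for path in remove_faulty_channel_files():
        print(f"Removed {path}")
```

Once the file is removed, the machine can reboot normally; drives protected by BitLocker reportedly required recovery keys first, adding to the manual effort.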
Response and Remediation
Bennett, Coleman & Co. Ltd., the largest Indian media conglomerate, assembled a dedicated tech team to brainstorm and identify the problem. “The damage was significant, and did not have an automated route,” CIO Rajeev Batra said.
“Almost simultaneously, CrowdStrike detailed a workaround that gave us the confidence to roll it out expediently,” he said. “We finally got the systems back on time as the editors resumed their work in the newsrooms pan India.”
Meanwhile, Microsoft and CrowdStrike reached out to customers through blog updates providing remediation guidance for system administrators.
In a Microsoft blog post, David Weston, vice president of enterprise and OS security, detailed the remediation plan: “CrowdStrike has helped us develop a scalable solution that will help Microsoft’s Azure infrastructure accelerate a fix for CrowdStrike’s faulty update. We have also worked with both AWS and GCP to collaborate on the most effective approaches.”
But when the clock is ticking, global CIOs must assume a dual responsibility, Batra said. They need to assure management and shareholders of a quick resolution while minimizing damage to the business, and they must work closely with their teams, guiding them through numerous “on the fly” decisions.
Staging and Testing
Pushing updates to production in haste, without testing, can have disastrous outcomes. Network administrators usually test new updates in sandboxed environments before rolling them out to all systems on their networks.
Tiaan van Zyl, CIO of DataNoble, said there is always the possibility that testing in sandbox environments “misses things,” especially when software is deployed on such a large scale. “The real world has so many variables that leave blind spots when we do testing,” he said.
“Lack of good QA practices by CrowdStrike is deeply upsetting. They should have caught this issue in testing before releasing it to the public. The fact that it affected every Windows OS since 2008 is inexcusable,” Holton said.
Sarbjeet Johal, founder and CEO of Stackpane, recommended a staging process for testing updates. “Microsoft must investigate their processes around how they push updates to the field – some validation must be done in staging area,” he said. “It’s a wake-up call for global society about its reliance on digital systems.”
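As a rough illustration of the staged rollout Johal describes, the sketch below promotes an update to a wider ring of machines only after the previous ring stays within a failure budget; the ring names, the threshold and the deploy/health-check callbacks are hypothetical, not any vendor’s actual pipeline.

```python
# Hypothetical sketch of ring-based (staged) update promotion.
# Ring names, the failure budget and the callbacks are illustrative assumptions.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Ring:
    name: str
    hosts: list[str]

def staged_rollout(
    rings: list[Ring],
    deploy: Callable[[str], None],    # pushes the update to one host
    healthy: Callable[[str], bool],   # True if the host is stable after the update
    max_failure_rate: float = 0.01,
) -> bool:
    """Deploy ring by ring; halt the rollout if any ring exceeds the failure budget."""
    for ring in rings:
        failures = 0
        for host in ring.hosts:
            deploy(host)              # push the update to this host
            if not healthy(host):
                failures += 1
        if ring.hosts and failures / len(ring.hosts) > max_failure_rate:
            print(f"Halting rollout: ring '{ring.name}' exceeded its failure budget")
            return False
        print(f"Ring '{ring.name}' healthy; promoting update to the next ring")
    return True
```

The specific numbers matter less than the principle: a small canary population absorbs a bad update before it can reach the entire fleet.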
Incident Response and Business Continuity Plan
Incident response teams and plans are activated immediately as part of the disaster recovery strategy and business continuity plan, or BCP. The teams triage intensively to find the root cause of the incident and then prepare detailed reports for management. These reports are used to update the current incident response plan and BCP to mitigate impact and risk should the same incident recur.
“Ensure your contracts allow you to seek damages, as that may be the only recourse in such situations,” Holton said. He advised organizations to prepare for and prevent similar issues by developing and testing their recovery plans. “Consider using a completely different set of security tools for backup and recovery to avoid similar attack vectors. Treat backup and recovery infrastructure as a critical business function and harden it as much as possible,” he said.
Ajay Sabhlok, global CIO at Rubrik, said the outage is a “grim reminder about software quality control practices that have human dependencies and despite airtight DevOps processes, may result in bug leakage to production that can wreak havoc.”
“CIOs of the affected companies could have prevented widespread damage from this incident by investing in comprehensive data resilience that can help restore corrupted data through orchestrated recovery,” he said. “Data resilience is a reliable way to protect all data and recover from several disasters such as ransomware, data corruption and natural disasters.”
Lessons Learned
This incident has served as a critical reminder for CIOs to ensure comprehensive data resilience and disaster recovery plans.
First, dependency on a single vendor for a security solution is a risky proposition. Sandeep Sen, global CIO of Linde, said the company was able to mitigate the impact of this incident as it uses two separate EDR solutions for plant and office networks.
“Organizations might even review the need to separate EDR solutions within the office network, for instance, between servers and users,” he said. He advised organizations to “rethink architecture, not the vendor.”
Subhamoy Chakraborti, CTO of ABP, stressed the need for close coordination between functional teams during such incidents. “Stay calm and keep both the users and the senior management informed. Work closely with your teams and coordinate with the tech fraternity to find what others are doing in such a situation,” he said.
Krishnan Kutty C, general manager of IT at Gammon Engineers and Contractors, recommended redesigning the DR architecture to include multiple fallbacks. This may not always be cost-effective, but it can be driven by the business criticality of the application.
“Where possible, patching strategy could be redefined wherein latest minus one patch number could be applied instead of applying up to the latest one as soon as it is released,” he said. “Security patches are an exception to this as the latest one needs to be updated as soon as it is released to prevent zero-day attacks.”
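To make the “latest minus one” approach concrete, here is a hypothetical sketch that selects the second-newest release for routine patching but takes the newest release whenever it is flagged as a security fix; the Patch structure and the is_security flag are assumptions for illustration.

```python
# Hypothetical "latest minus one" (N-1) patch selection.
# The Patch structure and is_security flag are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Patch:
    version: tuple[int, int, int]   # e.g. (7, 15, 2)
    is_security: bool = False

def select_patch(available: list[Patch]) -> Patch | None:
    """Pick N-1 for routine patches, but always take the newest security patch."""
    if not available:
        return None
    ordered = sorted(available, key=lambda p: p.version, reverse=True)
    newest = ordered[0]
    if newest.is_security:
        return newest               # apply immediately to guard against zero-days
    return ordered[1] if len(ordered) > 1 else newest  # otherwise stay one release behind

# Example: the newest routine release is skipped in favor of the prior one.
patches = [Patch((7, 15, 0)), Patch((7, 15, 1)), Patch((7, 15, 2))]
print(select_patch(patches).version)   # prints (7, 15, 1)
```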