Explaining the largest IT outage in history and what’s next
What caused the outage? The CrowdStrike Falcon platform was the main culprit. Falcon is widely used by organizations big and small that run Microsoft Windows, and the flaw was in a sensor configuration update. With channel file 291, CrowdStrike inadvertently introduced a logic error that crashed the Falcon sensor and, subsequently, the Windows systems on which it was installed.
What services were affected? Microsoft estimated that approximately 8.5 million Windows devices were directly affected by the CrowdStrike logic error. That’s less than 1% of Microsoft’s global Windows install base. Kerner lists the following services affected by this outage:
Airlines and airports – The outage grounded flights and caused significant delays and cancellations around the world.
Public transit – Public transit systems were affected in major US cities and elsewhere around the world.
Healthcare – Hospitals and healthcare clinics around the world faced significant disruptions in appointment systems, leading to delays and cancellations.
Financial services – Multiple payment platforms were directly affected, and there were individuals who did not get their paychecks when expected.
Media and broadcasting – Multiple media and broadcast outlets around the world, including British broadcaster Sky News, were taken off the air by the outage.
Test all updates before deploying to production. “It has been a best practice for years to allow automated updates to ensure systems are always up to date. However, the CrowdStrike issue laid bare the underlying risk with that approach. For mission-critical systems, testing updates before deployment or having some form of staging environment before pushing updates to production might help to mitigate some risk.”
Develop and document manual workarounds. “Manual workarounds ensure critical business processes can continue even when technology fails. This approach was common before the digital age and, in the event of outage, can serve as a fallback. Documenting and practicing manual procedures can help mitigate the effect of outages, ensuring businesses can still operate and serve their customers, even during an outage.”
Perform disaster recovery and business continuity planning. “Outages happen for any number of different reasons. Having extensive disaster recovery and business continuity practices and plans in place is critical. Part of that effort should include the use of redundant systems and infrastructure to minimize downtime and ensure critical functions can switch to backup systems when needed.”
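The first recommendation above, testing updates in a staging environment before they reach production, can be sketched as a staged rollout that stops as soon as one group of machines looks unhealthy. This is only an illustrative sketch, not how CrowdStrike or any particular vendor deploys updates: the `Ring` type, the `staged_rollout` function, the host names, and the `apply_update` and `is_healthy` callbacks are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Ring:
    """A hypothetical deployment ring: a named group of hosts."""
    name: str
    hosts: Sequence[str]


def staged_rollout(
    rings: Sequence[Ring],
    apply_update: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    max_failure_rate: float = 0.0,
) -> list[str]:
    """Apply an update ring by ring, halting at the first unhealthy ring.

    Returns the names of the rings that passed their health check.
    """
    completed: list[str] = []
    for ring in rings:
        for host in ring.hosts:
            apply_update(host)
        failures = sum(1 for host in ring.hosts if not is_healthy(host))
        failure_rate = failures / len(ring.hosts) if ring.hosts else 0.0
        if failure_rate > max_failure_rate:
            # Stop here so later, larger rings never receive the bad update.
            break
        completed.append(ring.name)
    return completed


# Example run with made-up hosts: the canary host crashes after updating.
rings = [
    Ring("staging", ["stg-1"]),
    Ring("canary", ["prod-1"]),
    Ring("production", ["prod-2", "prod-3"]),
]
updated: list[str] = []
result = staged_rollout(rings, updated.append, lambda host: host != "prod-1")
print(result)   # ['staging']
print(updated)  # ['stg-1', 'prod-1']
```

In this run the update reaches the staging host and one canary host, the canary fails its health check, and the two remaining production hosts are never touched. A real pipeline would also need a rollback step and a waiting period between rings, which are omitted here.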