Explaining the largest IT outage in history and what’s next

By now, most people have heard about the massive IT outage at the end of last week, caused by a CrowdStrike update that crashed millions of Windows systems. Although it was not a ransomware or other cyberattack, critical services and business operations were severely disrupted. The outage, which occurred on July 19th, caused many Windows systems to fail and display the infamous blue screen of death (BSOD). Beyond disrupting services and operations, it also exposed how vulnerable businesses can be. IT expert Sean Michael Kerner shares an article on TechTarget explaining the outage, the businesses affected, and how long it will take businesses to get “back to normal”.

What caused the outage? The CrowdStrike Falcon platform was the main culprit. The platform is widely used by organizations big and small that run Microsoft Windows, and the flaw was in a sensor configuration update. With channel file 291, CrowdStrike inadvertently introduced a logic error that caused the Falcon sensor to crash and, subsequently, the Windows systems in which it was integrated.

What services were affected? Microsoft estimated that approximately 8.5 million Windows devices were directly affected by the CrowdStrike logic error flaw. That’s less than 1% of Microsoft’s global Windows install base. Kerner lists the following services affected by this outage:

Airlines and airports – The outage grounded flights and caused significant delays and cancellations around the world.

Public transit – Public transit systems were affected in major US cities and around the world.

Healthcare – Hospitals and healthcare clinics around the world faced significant disruptions in appointment systems, leading to delays and cancellations.

Financial services – Multiple payment platforms were directly affected, and there were individuals who did not get their paychecks when expected.

Media and broadcasting – Multiple media and broadcast outlets around the world, including British broadcaster Sky News, were taken off the air by the outage.

How long will it take businesses to recover from this outage? CrowdStrike was able to identify and deploy a fix for the issue in 79 minutes. Although the fix was found and deployed quickly, the recovery process for businesses is complex and time-consuming. Kerner notes that one of the issues is that, once the problematic update was installed, the underlying Windows OS would trigger a BSOD, leaving the system unable to start through the normal boot process. “Some businesses were able to apply the fix within a few days. However, the process was not straightforward for all, particularly those with extensive IT infrastructure and encrypted drives. The use of the Microsoft Windows BitLocker encryption technology by some organizations made it significantly more time-consuming to recover as BitLocker recovery keys were required.” It could take months for some organizations to fully recover all affected systems from the outage.

How can businesses be better prepared for tech outages?

While it wasn’t a cyberattack, the CrowdStrike Windows outage highlighted how vulnerable society is because of its heavy reliance on technology. Kerner shares a few tips that businesses can follow to be better prepared for tech outages, including the following.

Test all updates before deploying to production. “It has been a best practice for years to allow automated updates to ensure systems are always up to date. However, the CrowdStrike issue laid bare the underlying risk with that approach. For mission-critical systems, testing updates before deployment or having some form of staging environment before pushing updates to production might help to mitigate some risk.”

Develop and document manual workarounds. “Manual workarounds ensure critical business processes can continue even when technology fails. This approach was common before the digital age and, in the event of outage, can serve as a fallback. Documenting and practicing manual procedures can help mitigate the effect of outages, ensuring businesses can still operate and serve their customers, even during an outage.”

Perform disaster recovery and business continuity planning. “Outages happen for any number of different reasons. Having extensive disaster recovery and business continuity practices and plans in place is critical. Part of that effort should include the use of redundant systems and infrastructure to minimize downtime and ensure critical functions can switch to backup systems when needed.”
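
To make the first tip more concrete, here is a minimal Python sketch of a staged rollout: an update goes to a small staging group first and reaches production only if the earlier groups stay healthy. This is an illustration only, not CrowdStrike's or Kerner's method. The names `Ring`, `staged_rollout`, `apply_update`, and `is_healthy`, along with the host names, are hypothetical and stand in for whatever deployment and monitoring tooling a business actually uses.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Ring:
    """A named group of hosts that receives an update together (hypothetical)."""
    name: str
    hosts: Sequence[str]


def staged_rollout(
    rings: Sequence[Ring],
    apply_update: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    max_failure_rate: float = 0.0,
) -> List[str]:
    """Apply an update ring by ring and stop at the first unhealthy ring.

    Returns the names of the rings that were updated and passed the health check.
    """
    completed = []
    for ring in rings:
        for host in ring.hosts:
            apply_update(host)
        failures = sum(1 for host in ring.hosts if not is_healthy(host))
        failure_rate = failures / len(ring.hosts) if ring.hosts else 0.0
        if failure_rate > max_failure_rate:
            break  # halt the rollout; later rings never receive the update
        completed.append(ring.name)
    return completed


# Example run with made-up hosts: the update "breaks" the canary host.
rings = [
    Ring("staging", ["stage-01", "stage-02"]),
    Ring("canary", ["prod-01"]),
    Ring("production", ["prod-02", "prod-03", "prod-04"]),
]
updated = []          # records which hosts received the update
broken = {"prod-01"}  # assumption: this host fails after updating
completed = staged_rollout(rings, updated.append, lambda host: host not in broken)
print(completed)  # ['staging']
```

In this run the rollout stops at the canary group, so the three production hosts never receive the faulty update. A real process would also need a way to roll back the hosts that were already updated.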

For Full Article, Click Here
