Commentary

CrowdStrike and digital ecosystem transitivity

Kevin C. Desouza, Richard Watson (Research Director, Digital Frontier Partners), and Yancong Xie (Lecturer, Royal Melbourne Institute of Technology)

August 28, 2024


  • On July 19, 2024, a major global outage was caused by a faulty software update issued by CrowdStrike, a prominent cybersecurity firm, with rippling effects across the world.
  • Such cascading failures are a consequence of an increasingly interdependent digital society.
  • By drawing lessons from previous outages and crises, policymakers can mitigate the range and extent of these failures by managing dependencies and making linkages more robust.
Air Asia passengers queue at counters inside Don Mueang International Airport Terminal 1 amid the global outage that disrupted the airline's operations, in Bangkok, Thailand, on July 19, 2024. REUTERS/Chalinee Thirasupa

On July 19, 2024, a major global outage was caused by a faulty software update issued by CrowdStrike, a prominent cybersecurity firm. This update disrupted computers running Microsoft Windows, affecting many parts of the economy, including airlines, media outlets, hospitals, banks, supply chains, small businesses, government, and emergency services. For example, the New York Blood Center shifted to emergency ground transport after nearly 9% of U.S. flights were canceled and 57% of scheduled airline routes experienced significant delays. Essential services, like national driver’s license offices, also closed that day, and the outage had international impact as well, affecting British Columbia’s health system and the Canada Border Services Agency. CrowdStrike confirmed that the outage was not the result of a cyberattack and ultimately fixed the issue, although disruptions persisted for several hours that day.

Accidental or deliberate outages are becoming realities in a highly connected global economy. By drawing lessons from previous outages and crises, policymakers can mitigate the range and extent of the damage caused by failures cascading through digital ecosystems.

Digital transitivity in Australia

In 2023, we wrote about the effects of mass digital transitivity: the result of a global network of dependencies where many organizations are linked to many others, and the failure of one can cause a cascading sequence of failures. On November 8, 2023, an internet outage from Australia’s second-largest telecommunications company, Optus, caused disruptions across many vital sectors, including transportation, finance, health care, and security. The outage highlighted the interconnectedness of different systems and how a failure in one can propagate into others, creating substantial economic and social impacts. In this instance, individuals and businesses lost access to essential digital services, and transport systems in Melbourne were severely hindered, with about 500 train services being canceled due to communications errors.

The rapid spread of the outage’s impacts was attributed to Australian society’s heavy dependence on Optus’s communication services and a lack of investment in the robustness needed to handle such disruptions. The incident revealed how complex it is to analyze transitive influences and trace failures to their root causes in interconnected digital systems. Post-incident analysis of the Optus shutdown indicated that the outage resulted from a planned software upgrade at a Singtel internet exchange in North America, which misfired and triggered safety responses in company routers. The incident underscored that Australia, like many other digitally dependent economies, needs an effective ecosystem-wide regulatory design to mitigate cascading disruptions.

Comparison to the global financial crisis

The global financial crisis of 2007 to 2009 highlighted the intricate dependencies within modern society and the cascading failures that can result. The collapse of Lehman Brothers in 2008 exemplified this, as the firm had extensive global ties through subsidiaries and derivatives contracts. Because there was no global standard for entity identification, key stakeholders could not assess their exposure, which exacerbated the economic contagion. Ensuring economic resiliency requires mechanisms to prevent the failure of one party from causing an expanding chain of dislocations, particularly in a connected society where digital commitments are crucial.

In response to the financial crisis, the Global Legal Entity Identifier (LEI) was introduced to uniquely identify a legal entity, such as a bank, and its linkages, thus improving financial resiliency. The LEI helps identify critical transitive relationships and exposures, providing a foundation for increasing global resiliency. Understanding capital and system exposures is vital for building resilient organizations in a highly interconnected society. To manage these risks, one proposal is to extend the LEI to record entities’ commitments, beyond financial obligations, to other entities. This would include access to essential equipment (e.g., chip-making technology) and services (e.g., payroll processing), which would help to pre-emptively identify and manage risks posed by transitive dependencies.
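To make the idea of recorded dependencies concrete, the following is a minimal sketch, not a description of how the LEI system works today. It assumes a hypothetical registry in which each entity lists the entities it relies on for essential equipment or services. The identifiers (`HOSPITAL`, `TELCO_A`, and so on) and the function names are invented for illustration; a real registry would use actual LEI codes.

```python
from collections import deque

# Hypothetical registry: each key stands in for an entity's identifier, and
# each value lists the entities it depends on for essential services.
DEPENDS_ON = {
    "HOSPITAL": ["TELCO_A", "PAYROLL_CO"],
    "RAIL_OPERATOR": ["TELCO_A"],
    "PAYROLL_CO": ["CLOUD_X"],
    "TELCO_A": ["ROUTER_VENDOR"],
    "BANK": ["CLOUD_X", "TELCO_B"],
}


def transitive_dependencies(entity, depends_on):
    """Return every entity reachable from `entity` by following dependencies."""
    seen = set()
    queue = deque(depends_on.get(entity, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already visited; also protects against cycles
        seen.add(current)
        queue.extend(depends_on.get(current, []))
    return seen


def exposed_entities(failed, depends_on):
    """Return every registered entity that directly or indirectly depends on `failed`."""
    return {
        entity
        for entity in depends_on
        if failed in transitive_dependencies(entity, depends_on)
    }
```

With this toy data, `exposed_entities("ROUTER_VENDOR", DEPENDS_ON)` returns `TELCO_A`, `HOSPITAL`, and `RAIL_OPERATOR`, even though only `TELCO_A` names the router vendor directly. That indirect exposure is what a dependency register would let regulators see in advance.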

What the 2007 financial crisis and the recent computer outages reveal is that governments must require corporate reporting of dependencies, particularly those that affect an organization’s stability and ongoing operations. One option is to extend global financial reporting standards to include a comprehensive evaluation of ecosystem-level transitivity issues. To avoid future disruptions, policymakers need to design a regulatory framework to manage transitivity, one that clarifies the importance of managing dependencies, identifies key stakeholders, and outlines their rights and obligations. This framework would ensure that citizens and organizational leaders are informed about their country’s dependence on others for critical resources, thereby enhancing national economic resilience.

Managing dependencies is a process

Reducing transitivity involves decreasing dependencies between connected organizations’ core assets, such as energy, communication, and raw materials, to prevent rippling waves of failures that could lead to significant disruptions. For example, if a nation’s primary telecommunications providers are heavily interdependent, a single service outage could affect the entire country. However, reducing dependencies must be balanced against maintaining enough service integration that emergency services can switch to alternative suppliers during outages. Transitivity solutions should also be “recovery-safe” to avoid introducing new risks. Resuming operations after a mass outage should be rapid and should not create another outage or other problems, such as financial costs to consumers from related disruptions. The CrowdStrike patch, for example, was not quickly reversible. Systems had to be rebooted as many as 15 times, and this delay contributed to the billions lost by both companies and consumers before systems were reverted to their pre-patch state.

Digital twins of major networks, like a national grid, could be used to help simulate and identify major transitivity exposures, such as single points of connection whose failure would propagate rapidly and widely. Such digital twins, continuously recalibrated with sensor data, can be used to simulate disruptions and test intervention responses. Intelligent algorithms can also monitor infrastructures in real time and initiate circuit breakers to isolate disruptions. Another intervention that would have helped in the Optus failure is automatic network switching for emergency services to ensure vital services are not dependent on a single provider. In 2022, the EU introduced a single emergency number for use anywhere in the EU across all telecommunications providers.
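One way a digital twin might flag single points of connection is to treat the network as a graph and look for nodes whose removal would split it. The sketch below does this with a standard depth-first search for articulation points. It is an illustration under simplifying assumptions: the network is undirected, has no parallel links, and is small enough for recursion, and the node names in `NETWORK` are hypothetical.

```python
def single_points_of_failure(graph):
    """Return the nodes whose removal disconnects an undirected graph.

    `graph` maps each node to a list of its neighbors. These nodes are the
    graph's articulation points, found with one depth-first search.
    """
    discovery, low, points = {}, {}, set()
    counter = [0]  # visit order, held in a list so the nested function can update it

    def visit(node, parent):
        discovery[node] = low[node] = counter[0]
        counter[0] += 1
        children = 0
        for neighbor in graph[node]:
            if neighbor == parent:
                continue
            if neighbor in discovery:
                # Back edge: an alternative route to an earlier node.
                low[node] = min(low[node], discovery[neighbor])
            else:
                children += 1
                visit(neighbor, node)
                low[node] = min(low[node], low[neighbor])
                # The subtree under `neighbor` has no route around `node`.
                if parent is not None and low[neighbor] >= discovery[node]:
                    points.add(node)
        # The search root is critical only if it joins two or more subtrees.
        if parent is None and children > 1:
            points.add(node)

    for node in graph:
        if node not in discovery:
            visit(node, None)
    return points


# Hypothetical grid: two plants feed one substation, which feeds a hub
# serving two cities that are also linked to each other.
NETWORK = {
    "plant_1": ["substation_a"],
    "plant_2": ["substation_a"],
    "substation_a": ["plant_1", "plant_2", "hub"],
    "hub": ["substation_a", "city_east", "city_west"],
    "city_east": ["hub", "city_west"],
    "city_west": ["hub", "city_east"],
}
```

Here `single_points_of_failure(NETWORK)` identifies `substation_a` and `hub`. The two cities are not flagged because the direct link between them gives each an alternative path, which is the kind of redundancy the simulation is meant to surface.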

Setting safe dependency guardrails

Going forward, policymakers need to establish regulations that set “safe” dependency levels, similar to the limits placed on large banks to avoid systemic failure. Zero-trust mechanisms can help govern dependencies and limit transitivity effects. In a zero-trust system, no device or entity is trusted by default; each is authenticated whenever it requests access, even if it was previously authenticated. “Guard” transitivity mitigates cascading failures by creating electronic guards that reconfigure operations in the event of a failure. For instance, if guard transitivity had existed between Optus or CrowdStrike and their dependencies, critical traffic could have been redirected during an outage. These guards must themselves be “recovery-safe,” with safeguards that prevent them from spreading malware or accelerating disruptions. Policymakers must define conditions and safeguards for these mechanisms and perhaps even consider a “National Resilience Board” to monitor and investigate guard mechanisms to reveal system weaknesses and guide policy revisions.
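As a rough illustration of what an electronic guard could look like in software, the sketch below redirects traffic to a backup provider once a primary provider has failed a set number of times in a row. Everything here is an assumption made for the example: the `Guard` class, the threshold, and the two provider functions are hypothetical and do not describe any mechanism used by Optus or CrowdStrike.

```python
class Guard:
    """Hypothetical guard that reroutes traffic away from a failing provider."""

    def __init__(self, primary, backup, failure_threshold=3):
        self.primary = primary
        self.backup = backup
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    @property
    def tripped(self):
        """True once the primary has failed `failure_threshold` times in a row."""
        return self.consecutive_failures >= self.failure_threshold

    def send(self, message):
        """Route via the primary unless the guard has tripped."""
        if not self.tripped:
            try:
                result = self.primary(message)
                self.consecutive_failures = 0  # a success clears the count
                return result
            except ConnectionError:
                self.consecutive_failures += 1
        # The primary just failed, or the guard is tripped: use the backup.
        return self.backup(message)

    def reset(self):
        """Restore the primary route, for example after operators confirm recovery."""
        self.consecutive_failures = 0


def failing_provider(message):
    """Stand-in for a provider that is experiencing an outage."""
    raise ConnectionError("provider outage")


def backup_provider(message):
    """Stand-in for an alternative provider."""
    return f"backup delivered: {message}"
```

In this sketch the guard never switches back on its own. Requiring an explicit `reset` is one simple way to approximate a “recovery-safe” design, since traffic returns to the primary provider only after someone has confirmed that it is healthy.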

As accidental outages have amply demonstrated, a connected digital global society presents embedded and increasing risks as organizations seek the benefits of connectivity. A national transitivity policy and associated electronic guards are necessary to minimize the fires and floods that happen in a digital society.

Acknowledgements and disclosures

Microsoft is a general, unrestricted donor to the Brookings Institution. The findings, interpretations, and conclusions posted in this piece are solely those of the authors and are not influenced by any donation.