The recent IT outage that sent shockwaves through global enterprises underscores a fundamental truth: the digital age, while transformative, is fraught with risks that can disrupt even the most well-prepared organisations. The incident, which reverberated across sectors, highlighted the need for robust resilience strategies and transparent communication.
In an interconnected world, where cybersecurity measures like Endpoint Detection and Response (EDR) systems have become a staple, the paradox of protection is evident: a system designed to fortify defences inadvertently triggered widespread outages. This incident is a stark reminder that resilience must go beyond individual solutions and take a truly enterprise-wide approach to protect what matters most.
As organisations complete their recovery and undertake post-incident reviews, we share our view and recommended actions for any organisation seeking to enhance its resilience.
Across all sectors, technical changes—from simple maintenance to major implementations—are a primary cause of IT incidents, often disabling resilience measures such as redundancies and failover capabilities.
Organisations' increasingly complex and integrated technology environments, which rely on numerous third-party services, make it difficult to understand how services interact. Changes to these environments therefore require rigorous control: where change management falls short, the risk of an outage rises. Recent IT incidents show that even minor changes can cause major disruptions.
From a wider resilience and recovery preparedness perspective, organisations need to prepare and test for major incidents that inflict greater damage on their technical environment; without that preparation, recovery is not certain. A successful ransomware attack presents responders with a far more severe challenge, as it logically destroys an environment and leaves a complicated and slow recovery from backups, which may themselves be compromised, as the only route back. Lessons from cyber recovery have a key role to play in guiding secure recovery from accidental IT disruption.
Preventative controls are essential, but organisations must also prepare for inevitable disruption by planning for severe yet plausible scenarios. The complexity of modern enterprises, with their myriad dependencies and 'black box' technologies, often hinders effective business continuity planning. True resilience necessitates an end-to-end understanding of service delivery, beyond functional silos.
Understanding the end-to-end delivery of critical business services is challenging but essential. Organisations that have done this planning would have been able to absorb the disruption from the IT outage quickly by switching to tested workarounds, keeping the impact to a minimum.
Tracking the status of critical services in real time requires technology. Tech-powered dashboards enable executives to visualise interdependent operations and prioritise actions when faced with disruption.
Those with resilience technology platforms would have been able to establish the impact easily and invoke recovery strategies so that the impact of the outage remained within the tolerances set by management.
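As a rough illustration of how such a platform might compare an ongoing disruption with the tolerances set by management, the following minimal Python sketch ranks disrupted services by how much tolerance they have left. The service names, tolerance values and the `ServiceStatus` and `prioritise` names are hypothetical and chosen purely for illustration; a real platform would draw this data from live monitoring.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class ServiceStatus:
    name: str
    tolerance: timedelta      # maximum tolerable disruption, set by management
    disrupted_for: timedelta  # how long the service has been disrupted so far

    def remaining(self) -> timedelta:
        """Tolerance left before the limit is breached (negative if breached)."""
        return self.tolerance - self.disrupted_for


def prioritise(services):
    """Return disrupted services, least remaining tolerance first."""
    disrupted = [s for s in services if s.disrupted_for > timedelta(0)]
    return sorted(disrupted, key=lambda s: s.remaining())


# Hypothetical snapshot three hours into an outage.
services = [
    ServiceStatus("payments", timedelta(hours=4), timedelta(hours=3)),
    ServiceStatus("payroll", timedelta(hours=24), timedelta(hours=3)),
    ServiceStatus("customer-support", timedelta(hours=2), timedelta(hours=3)),
    ServiceStatus("reporting", timedelta(hours=48), timedelta(0)),
]

for s in prioritise(services):
    hours_left = s.remaining().total_seconds() / 3600
    state = "BREACHED" if hours_left < 0 else f"{hours_left:.0f}h remaining"
    print(f"{s.name}: {state}")
```

Services whose tolerance is already breached, or closest to being breached, appear first, which is one simple way to decide where to invoke recovery strategies.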
The outage underscores the critical need for enhanced collaboration between Third Party Risk Management (TPRM), IT, and service owners. TPRM professionals need to work closely with IT to better understand digitisation, product development, and the technology architecture that underpins critical business services. For example, an EDR provider needs to be listed as a critical resource.
The digital supply chain, with its inherent complexity and opacity, poses significant resilience challenges. Organisations must adopt a ‘resilience by design’ approach, emphasising comprehensive understanding and proactive management of third-party dependencies.
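One way to make third-party dependencies visible is to record them as a dependency map and trace which critical business services ultimately rely on a given provider. The Python sketch below is illustrative only: the service names, the contents of `DEPENDS_ON` and the function names are assumptions, not a description of any particular organisation or tool.

```python
from collections import deque

# Hypothetical dependency map: each key depends directly on the listed resources.
DEPENDS_ON = {
    "payments": ["payments-app", "card-processor"],
    "payments-app": ["windows-servers"],
    "windows-servers": ["edr-provider"],
    "customer-support": ["crm-saas"],
    "payroll": ["hr-platform"],
    "hr-platform": ["windows-servers"],
}

# Hypothetical list of critical business services.
CRITICAL_SERVICES = ["payments", "customer-support", "payroll"]


def all_dependencies(service, depends_on):
    """Return every direct and indirect dependency of a service."""
    seen = set()
    queue = deque(depends_on.get(service, []))
    while queue:
        resource = queue.popleft()
        if resource in seen:
            continue
        seen.add(resource)
        queue.extend(depends_on.get(resource, []))
    return seen


def services_exposed_to(resource, services, depends_on):
    """List the services that rely, directly or indirectly, on a resource."""
    return [s for s in services if resource in all_dependencies(s, depends_on)]


print(services_exposed_to("edr-provider", CRITICAL_SERVICES, DEPENDS_ON))
```

In this made-up example, the EDR provider sits several layers below the payments and payroll services, which is the kind of indirect dependency that is easy to miss when third parties are reviewed one contract at a time.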
A well-coordinated response to IT disruptions extends beyond IT teams, requiring organisational alignment and strategic decision-making. In this outage, crisis teams had to stand up their ‘out of band’ communication tools to help them assess and respond to the situation. An effective crisis response would have required planning and the rehearsal of defined roles, responsibilities, and communication strategies. Those who responded well recognised that this crisis presented an opportunity to demonstrate resilience and accountability.
Effective crisis-management skills are developed through frequent exposure to the characteristics, pressures, and demands faced when disruption occurs. Leaders need to continue developing relevant skills, mindsets, and behaviours through tech-based microsimulations or simple scenario-planning discussions.
Finally, it is vital that there is a clear understanding of the contractual frameworks that an organisation operates under and, critically, where it is protected when things do go wrong. Organisations need to make sure they are capturing precise data and information from the start of a response through to recovery, to support a credible and evidenced claim for any compensation, whether under service level commitments or under business insurance policies. Whilst businesses can’t rely on insurance as their only mitigation, many organisations won’t have tested the breadth and limits of their coverage against scenarios like this.
Do we know how resilient our organisation is to unforeseen disruptions, including IT system failures and third-party dependencies?
Are our change management procedures sufficiently robust?
Have we tested our response capabilities for severe but plausible scenarios?
Are we investing in making the most critical parts of our business resilient?
How are we using technology to identify and monitor our vulnerabilities?
Do we have a clear understanding of our contractual protections, including the role of insurance?
This IT outage reconfirms that evolving risk landscapes necessitate a transformative approach to enterprise resilience. By addressing vulnerabilities, leveraging opportunities and preparing for severe but plausible events, enterprises can not only withstand disruptions but thrive amidst them.