PwC and AWS Alliance

Continuous resilience: Building confidence in your ability to recover from disruptions

  • Blog
  • 5 minute read
  • October 16, 2024

Ross Chernick

Director, Cloud & Digital Transformation, AWS Ambassador, PwC US

Nausheen Jawed

Director, Cloud & Digital Transformation, PwC US

Why do organizations conduct large-scale game days?

  • Game days help identify weaknesses in an organization's systems, processes, and response plans. Through these simulations, organizations can pinpoint areas that need improvement and address them before a real crisis occurs.
  • Overall, resilience game days are a proactive way for organizations to strengthen their ability to withstand and recover from disruptions, reducing the impact on operations, reputation, and stakeholder trust.

But game days require resources. They involve designing scenarios, coordinating participants, conducting the simulation, and analyzing the results. Participants also need time to prepare for and take part in the simulation. Proper documentation and analysis tools are essential for capturing the outcomes and identifying areas for improvement.

What if there were ways to continuously test resilience?

Continuous testing would give organizations ongoing insight into their ability to withstand and recover from various challenges and disruptions. The benefits would be:

  • Early detection of misconfiguration in the recovery environment: Detecting configuration issues before they escalate into major problems lets organizations address them proactively, reducing the likelihood and impact of disruptions.
  • Fewer failed failover attempts: By continuously verifying configurations in the recovery environment, teams can avoid failovers that fail because required configurations were not deployed there.
  • Leadership and developer confidence: Demonstrating a commitment to continuous resiliency testing can enhance stakeholders' confidence in the organization's ability to manage crises effectively. The improvement areas that continuous testing surfaces also help organizations prioritize resource allocation.

The disaster recovery orchestrator framework, developed by PwC, lets you continuously verify the health of resources deployed in a secondary AWS Region or environment. It provides ongoing signals on the configuration of that secondary Region, confirming that your recovery environment is ready when you need to conduct a real recovery exercise. The framework is a consolidated solution that supports two integrations:

  • Amazon Route 53 Application Recovery Controller (Route 53 ARC) readiness checks:
    • A readiness check in Route 53 ARC continually audits for mismatches in AWS provisioned capacity, service quotas, throttle limits, and configuration and version discrepancies for the resources included in the check.
  • Custom readiness check using DR Orchestrator:
    • If your organization is not using Amazon Route 53 ARC today, or your application requires custom readiness checks, our disaster recovery orchestrator framework lets you poll infrastructure deployed in the recovery environment to evaluate provisioned capacity and configurations.
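
As an illustration of the kind of comparison a custom readiness check performs, here is a minimal Python sketch. It is not the DR Orchestrator's actual interface: the names `Mismatch`, `compare_configs`, and `is_ready`, and the sample settings, are hypothetical. It assumes the configuration of each environment has already been collected into a plain dictionary (for example, from AWS describe calls) and simply reports the settings that differ.

```python
from dataclasses import dataclass
from typing import Any, Iterable, Mapping


@dataclass(frozen=True)
class Mismatch:
    """One setting whose value differs between the two environments (hypothetical type)."""
    key: str
    primary: Any
    recovery: Any


def compare_configs(
    primary: Mapping[str, Any],
    recovery: Mapping[str, Any],
    keys: Iterable[str],
) -> list[Mismatch]:
    """Return the settings in `keys` whose values differ.

    A key missing from either side is read as None, so a setting that was
    never deployed to the recovery environment shows up as a mismatch.
    """
    mismatches = []
    for key in keys:
        primary_value = primary.get(key)
        recovery_value = recovery.get(key)
        if primary_value != recovery_value:
            mismatches.append(Mismatch(key, primary_value, recovery_value))
    return mismatches


def is_ready(
    primary: Mapping[str, Any],
    recovery: Mapping[str, Any],
    keys: Iterable[str],
) -> bool:
    """The recovery environment is 'ready' when no checked setting differs."""
    return not compare_configs(primary, recovery, keys)


if __name__ == "__main__":
    # Hypothetical settings for one service in each environment.
    primary_env = {"desired_count": 4, "instance_type": "m5.large", "encrypted": True}
    recovery_env = {"desired_count": 2, "instance_type": "m5.large", "encrypted": True}
    checked = ["desired_count", "instance_type", "encrypted"]

    for mismatch in compare_configs(primary_env, recovery_env, checked):
        print(mismatch)
    print("ready:", is_ready(primary_env, recovery_env, checked))
```

Restricting the comparison to an explicit list of keys is a deliberate choice in this sketch: some settings (such as identifiers or endpoints) are expected to differ between environments and should not count as drift.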

Combining these two integrations lets you run verification checks continuously rather than only during scheduled exercises. Example checks across AWS services that can be conducted using this approach are as follows:

| AWS service | Sample continuous resilience check | Enablement method |
|---|---|---|
| Amazon Aurora | RdsGlobalReplicaLag: Inspects each Aurora cluster to confirm that it has a Global Replica Lag of less than 30 seconds. | Amazon Route 53 Application Recovery Controller |
| Amazon Aurora | AuroraBinlogReplicaLag: The amount of time that a binary log replica DB cluster running on Aurora MySQL-Compatible Edition lags behind the binary log replication source. A lag means that the source is generating records faster than the replica can apply them. | Disaster recovery orchestrator framework |
| Amazon DynamoDB | DynamoConfiguration: Inspects all DynamoDB tables to confirm that they have the same keys, attributes, server-side encryption, and streams configurations. | Amazon Route 53 Application Recovery Controller |
| Amazon Kinesis | KinesisConfiguration: Inspects all Kinesis streams to confirm that they have the same configurations in multiple regions. | Disaster recovery orchestrator framework |
| Amazon Elastic Container Service | ECSConfiguration: Inspects all ECS clusters to confirm that they have the same configurations in multiple regions. | Disaster recovery orchestrator framework |
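
A lag-style check such as RdsGlobalReplicaLag reduces to comparing recent lag samples against a threshold on a schedule. The following Python sketch shows that logic only, under stated assumptions: the function names are hypothetical, the samples are plain numbers in seconds supplied by a caller-provided function (in practice they would come from a monitoring source), and treating "no samples" as a failed check is a design assumption, not documented behavior of either tool.

```python
import time
from typing import Callable, Sequence

# Threshold quoted for RdsGlobalReplicaLag in the table above.
LAG_THRESHOLD_SECONDS = 30.0


def lag_check_passes(
    samples_seconds: Sequence[float],
    threshold_seconds: float = LAG_THRESHOLD_SECONDS,
) -> bool:
    """Pass only if every recent sample is strictly below the threshold.

    An empty sample list fails the check (assumption: missing data means
    readiness cannot be confirmed).
    """
    if not samples_seconds:
        return False
    return max(samples_seconds) < threshold_seconds


def run_checks(
    fetch_samples: Callable[[], Sequence[float]],
    rounds: int,
    interval_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> list[bool]:
    """Evaluate the lag check `rounds` times, waiting between evaluations.

    `fetch_samples` is a hypothetical callable that returns the latest lag
    samples; `sleep` is injectable so the loop can be tested without waiting.
    """
    results = []
    for round_index in range(rounds):
        results.append(lag_check_passes(fetch_samples()))
        if round_index < rounds - 1:
            sleep(interval_seconds)
    return results
```

For example, `run_checks(fetch, rounds=3)` with a `fetch` function of your own would return one pass/fail result per round, which could then feed an alert or dashboard.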

The resilience lifecycle is an ongoing process. Continuous resiliency testing helps organizations find and fix resilience issues before they impact end users. In a nutshell, it can help enable resilient, fault-tolerant applications with increased uptime and customer trust, resulting in an enhanced end-user experience.

If you would like help in addressing these questions or want to explore any of the focus areas listed above, reach out to us.
