PwC and AWS Alliance

Chaos engineering: Finding resilience gaps in workloads

  • October 16, 2024

Ross Chernick

Director, Cloud & Digital Transformation, AWS Ambassador, PwC US


Nausheen Jawed

Director, Cloud & Digital Transformation, PwC US


In recent years, chaos engineering has often been misunderstood as a way to intentionally cause failures in production systems, leading many companies to avoid implementing it. However, the true objective of chaos engineering is not to break production systems.

Instead, chaos engineering gives teams insight into how their workloads behave under failure and helps them find resilience gaps in existing workloads. By conducting controlled chaos experiments that are based on real-world hypotheses, teams can better understand the impact of potential failures.

Our chaos engineering framework

Steady state definition

  • Define the steady state of your system, that is, its normal operating conditions, against which the effects of the injected chaos are measured.
  • Collect and analyze data from the system during stable conditions to establish performance baselines and identify patterns of normal behavior.
  • Formulate hypotheses or assumptions about potential weaknesses or vulnerabilities in the system that can be tested through chaos experiments.
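As a minimal sketch of the steps above, the following Python snippet summarizes steady-state samples into a baseline and derives a testable hypothesis from it. The names `Baseline`, `Hypothesis`, `build_baseline`, and `hypothesis_from_baseline`, the chosen metrics (latency and error rate), and the default thresholds are all illustrative assumptions, not part of the framework described in this post.

```python
from dataclasses import dataclass
from statistics import mean, quantiles


@dataclass(frozen=True)
class Baseline:
    """Summary of normal behavior collected during stable conditions."""
    mean_latency_ms: float
    p95_latency_ms: float
    error_rate: float


@dataclass(frozen=True)
class Hypothesis:
    """A statement that a chaos experiment can confirm or refute."""
    description: str
    max_p95_latency_ms: float
    max_error_rate: float


def build_baseline(latencies_ms: list[float], errors: int, requests: int) -> Baseline:
    """Summarize steady-state samples into a performance baseline."""
    if len(latencies_ms) < 2 or requests <= 0:
        raise ValueError("need at least two latency samples and one request")
    # quantiles(n=20) returns 19 cut points; index 18 is the 95th percentile.
    p95 = quantiles(latencies_ms, n=20)[18]
    return Baseline(mean(latencies_ms), p95, errors / requests)


def hypothesis_from_baseline(
    description: str,
    baseline: Baseline,
    latency_headroom: float = 1.5,   # assumed tolerance: p95 may grow by 50%
    max_error_rate: float = 0.01,    # assumed tolerance: at most 1% errors
) -> Hypothesis:
    """Turn a baseline into explicit pass/fail limits for an experiment."""
    return Hypothesis(
        description,
        baseline.p95_latency_ms * latency_headroom,
        max_error_rate,
    )
```

For example, `hypothesis_from_baseline("Losing one replica keeps p95 latency within 1.5x of normal", baseline)` yields limits that a later experiment can be checked against. Appropriate metrics and tolerances depend on the workload.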

Define chaos engineering test

  • Plan and design controlled chaos experiments that simulate various failure scenarios or disruptions within the steady state.
  • Identify specific components or services to target and determine the scope and severity of the experiments.
  • Use AWS services such as AWS Fault Injection Service (AWS FIS, formerly AWS Fault Injection Simulator), which is built specifically for running chaos engineering experiments on AWS workloads.

Execute experiments

  • Set up the necessary infrastructure to conduct the chaos engineering tests.
  • Create test environments, configure monitoring and logging systems, and ensure that backup and recovery mechanisms are in place.
  • Run the defined experiments.
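The "run the defined experiments" step can be sketched as follows for an AWS FIS experiment. The function takes a client object shaped like the boto3 `fis` client and uses its `start_experiment`, `get_experiment`, and `stop_experiment` calls; the function name, the polling interval, the timeout, and the set of terminal states are assumptions made for illustration and should be checked against the current AWS FIS documentation.

```python
import time
import uuid
from typing import Any, Callable

# Assumed set of states in which an experiment is no longer progressing.
TERMINAL_STATES = {"completed", "stopped", "failed", "cancelled"}


def run_experiment(
    fis_client: Any,
    template_id: str,
    poll_seconds: float = 15.0,
    timeout_seconds: float = 1800.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Start an experiment from a template and wait for a terminal state."""
    started = fis_client.start_experiment(
        clientToken=str(uuid.uuid4()),  # idempotency token
        experimentTemplateId=template_id,
    )
    experiment_id = started["experiment"]["id"]

    waited = 0.0
    while True:
        state = fis_client.get_experiment(id=experiment_id)["experiment"]["state"]
        if state["status"] in TERMINAL_STATES:
            return state["status"]
        if waited >= timeout_seconds:
            # Safety net: do not leave a fault running indefinitely.
            fis_client.stop_experiment(id=experiment_id)
            raise TimeoutError(f"experiment {experiment_id} exceeded the timeout")
        sleep(poll_seconds)
        waited += poll_seconds
```

With boto3 installed and credentials configured, this could be called as `run_experiment(boto3.client("fis"), template_id)`, where `template_id` refers to an experiment template that already exists in your account.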

Identify failures, deploy fixes and re-test

  • Monitor and observe the system's behavior during chaos experiments.
  • Collect data and analyze the results to assess the impact on performance, stability, and resilience.
  • Compare the metrics and performance data collected during chaos experiments with the baselines established during the steady state.
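The comparison against the steady-state baseline might look like the sketch below. The metric names, the ratio-based tolerance, and the helper names `compare_to_baseline` and `resilience_gaps` are illustrative assumptions; a real analysis would use whatever metrics and tolerances were defined for the steady state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricCheck:
    name: str
    baseline: float
    observed: float
    allowed_ratio: float

    @property
    def passed(self) -> bool:
        # Note: a baseline of 0 allows no degradation at all under this rule.
        return self.observed <= self.baseline * self.allowed_ratio


def compare_to_baseline(
    baseline: dict[str, float],
    observed: dict[str, float],
    allowed_ratio: dict[str, float],
) -> list[MetricCheck]:
    """Build one check per baseline metric for the experiment window."""
    checks = []
    for name, base_value in baseline.items():
        # A metric missing from the experiment data is treated as a failure,
        # because it points to a monitoring gap.
        seen = observed.get(name, float("inf"))
        checks.append(MetricCheck(name, base_value, seen, allowed_ratio.get(name, 1.0)))
    return checks


def resilience_gaps(checks: list[MetricCheck]) -> list[str]:
    """Return the names of metrics that degraded beyond their tolerance."""
    return [check.name for check in checks if not check.passed]
```

Each name returned by `resilience_gaps` is a candidate for a fix followed by a re-test of the same experiment.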

Continuously test and improve

  • Repeat the above steps periodically to confirm that the system remains resilient and can withstand potential failures over time.

Our disaster recovery orchestrator framework allows you to run controlled chaos engineering tests to find resilience gaps in your existing workloads.

The framework serves as a consolidated platform that supports two types of experiments:

  • AWS Fault Injection Service (AWS FIS) experiments:
    • AWS Fault Injection Service (AWS FIS) is a managed service that enables you to perform fault injection experiments on your AWS workloads.
  • Custom experiments using disaster recovery orchestrator framework:
    • In scenarios where your organization is not using AWS FIS experiments today or your application requires custom experiments, our disaster recovery orchestrator framework allows you to run those experiments.

AWS service | Sample chaos engineering experiments | Enablement method
Amazon Aurora | Simulate network latency between Aurora instances | Disaster recovery orchestrator framework
Amazon Aurora | Introduce failures in Aurora replica instances | Disaster recovery orchestrator framework
Amazon Aurora | Test the impact of increased load on Aurora read and write capacity | Disaster recovery orchestrator framework
Amazon Kinesis | Simulate increased data ingestion rate to test the scalability of Kinesis streams | Disaster recovery orchestrator framework
Amazon EC2 | Test Spot Instance interruptions | AWS Fault Injection Service (AWS FIS)
Amazon DynamoDB | Deny traffic to and from the Regional endpoint for DynamoDB in the current Region | AWS Fault Injection Service (AWS FIS)
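
To make the Amazon EC2 Spot Instance interruption experiment in the table above more concrete, here is a sketch of an AWS FIS experiment template expressed as a Python dictionary. The function name, the tag-based targeting, and the example tag are assumptions for illustration. The action ID, target resource type, and parameter name follow the AWS FIS action reference as we understand it; verify them against the current documentation before use.

```python
import json
import uuid


def spot_interruption_template(
    role_arn: str,
    stop_alarm_arn: str,
    tag_key: str,
    tag_value: str,
    notice: str = "PT2M",  # ISO 8601 duration between notice and interruption
) -> dict:
    """Build an experiment template that interrupts one tagged Spot Instance."""
    return {
        "clientToken": str(uuid.uuid4()),
        "description": "Interrupt one running Spot Instance selected by tag",
        "roleArn": role_arn,
        # Stop the experiment automatically if this CloudWatch alarm fires.
        "stopConditions": [
            {"source": "aws:cloudwatch:alarm", "value": stop_alarm_arn}
        ],
        "targets": {
            "oneSpotInstance": {
                "resourceType": "aws:ec2:spot-instance",
                "resourceTags": {tag_key: tag_value},
                "filters": [{"path": "State.Name", "values": ["running"]}],
                "selectionMode": "COUNT(1)",  # limit the scope to one instance
            }
        },
        "actions": {
            "interruptSpotInstance": {
                "actionId": "aws:ec2:send-spot-instance-interruptions",
                "parameters": {"durationBeforeInterruption": notice},
                "targets": {"SpotInstances": "oneSpotInstance"},
            }
        },
        "tags": {"Purpose": "chaos-engineering"},
    }


if __name__ == "__main__":
    # Placeholder ARNs and tag values; replace them with your own resources.
    template = spot_interruption_template(
        "arn:aws:iam::123456789012:role/example-fis-role",
        "arn:aws:cloudwatch:us-east-1:123456789012:alarm:example-stop-alarm",
        "ChaosReady",
        "true",
    )
    print(json.dumps(template, indent=2))
```

The resulting dictionary is shaped so that it could be passed to boto3 as `boto3.client("fis").create_experiment_template(**template)`. Limiting the selection to a single tagged instance keeps the scope and severity of the experiment small.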

Organizations are seeking ways to conduct chaos engineering experiments regularly using their existing deployment pipelines. Doing so gives you the ability to find resilience gaps in existing workloads and document the risks in your application. Running the chaos engineering tests within your pipelines also allows you to assess your operational practices, for example by verifying that your monitoring and alerting processes are in place to inform you about scenarios that could cause an outage in your application.
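
One way a pipeline stage might check that alerting worked during an experiment is sketched below. It assumes you have already gathered the alarm's state transitions as `(timestamp, state)` pairs, for instance from CloudWatch alarm history, and that the alarm being checked is a detection alarm separate from any stop-condition alarm. The function names, the grace period, and the exit-code convention are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def alarm_fired_during(
    transitions: list[tuple[datetime, str]],
    start: datetime,
    end: datetime,
    grace: timedelta = timedelta(minutes=5),  # assumed allowance for alarm delay
) -> bool:
    """Return True if the alarm entered ALARM within the experiment window."""
    return any(
        state == "ALARM" and start <= timestamp <= end + grace
        for timestamp, state in transitions
    )


def pipeline_exit_code(experiment_status: str, alarm_fired: bool) -> int:
    """Map experiment results to a pipeline exit code (0 passes the stage)."""
    if experiment_status != "completed":
        return 1  # the experiment itself did not finish cleanly
    if not alarm_fired:
        return 1  # the fault went undetected: a monitoring or alerting gap
    return 0


if __name__ == "__main__":
    # Illustrative timestamps only; a real stage would use recorded values.
    start = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    end = start + timedelta(minutes=10)
    history = [(start + timedelta(minutes=3), "ALARM")]
    raise SystemExit(pipeline_exit_code("completed", alarm_fired_during(history, start, end)))
```

Failing the stage when no alarm fired turns a silent monitoring gap into a visible pipeline result that can be documented as a risk.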

If you would like help finding resilience gaps in your workloads or want to explore any of the areas described above, reach out to us.
