PwC and AWS Alliance

Chaos engineering: Finding resilience gaps in workloads

October 16, 2024

Ross Chernick

Director, Cloud & Digital Transformation, AWS Ambassador, PwC US

Nausheen Jawed

Director, Cloud & Digital Transformation, PwC US

In recent years, chaos engineering has often been misunderstood as a way to intentionally cause failures in production systems, leading many companies to avoid implementing it. However, the true objective of chaos engineering is not to break production systems.

Instead, chaos engineering gives teams a way to find resilience gaps in their existing workloads. By running controlled experiments based on real-world hypotheses, teams can better understand the impact of potential failures before those failures occur on their own.

Our chaos engineering framework

Steady state definition

Define a steady state for your system: the normal operating conditions against which the effects of the injected chaos are measured.
Collect and analyze data from the system during stable conditions to establish performance baselines and identify patterns of normal behavior.
Formulate hypotheses or assumptions about potential weaknesses or vulnerabilities in the system that can be tested through chaos experiments.
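
As an illustration of the three steps above, here is a minimal Python sketch. It assumes latency is the steady-state metric and that a 20% rise in 95th-percentile latency is the acceptable limit; the sample values, the function names, and the threshold are hypothetical and not part of the framework.

```python
import statistics

def build_baseline(samples_ms):
    """Summarize latency samples collected under normal operating conditions."""
    # With n=20 there are 19 cut points; the last one is the 95th percentile.
    # method="inclusive" interpolates within the observed range.
    p95 = statistics.quantiles(samples_ms, n=20, method="inclusive")[-1]
    return {"mean_ms": statistics.fmean(samples_ms), "p95_ms": p95}

def hypothesis_holds(baseline, observed_p95_ms, allowed_increase=0.20):
    """Hypothesis: during the fault, p95 latency stays within
    allowed_increase (a fraction) of the steady-state p95."""
    return observed_p95_ms <= baseline["p95_ms"] * (1 + allowed_increase)

# Hypothetical samples gathered while the system was stable.
steady_state_samples = [100, 102, 98, 105, 110, 97, 103, 101, 99, 104]
baseline = build_baseline(steady_state_samples)
print(baseline)
```

Writing the hypothesis as a function of the baseline keeps the pass/fail criterion explicit before any fault is injected.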

Define chaos engineering test

Plan and design controlled chaos experiments that simulate various failure scenarios or disruptions within the steady state.
Identify specific components or services to target and determine the scope and severity of the experiments.
Use AWS services such as AWS Fault Injection Service (AWS FIS, formerly AWS Fault Injection Simulator), which is built specifically for running chaos engineering experiments on AWS workloads.
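
To make the scope and severity decisions concrete, the sketch below builds an AWS FIS experiment template as a plain Python dictionary for a Spot Instance interruption test. The field names follow the AWS FIS experiment template format as we understand it; the ARNs, tag values, and logical names such as `spotInstances` and `interruptSpot` are placeholders, so check the result against the current AWS FIS documentation before using it.

```python
import json

def spot_interruption_template(role_arn, alarm_arn, tag_key, tag_value):
    """Build an experiment template that interrupts one tagged Spot Instance."""
    return {
        "description": "Interrupt one tagged Spot Instance",
        "roleArn": role_arn,  # IAM role that AWS FIS assumes (placeholder)
        "targets": {
            "spotInstances": {
                "resourceType": "aws:ec2:spot-instance",
                # Scope: only instances carrying this tag are eligible.
                "resourceTags": {tag_key: tag_value},
                "filters": [{"path": "State.Name", "values": ["running"]}],
                # Severity: limit the blast radius to a single instance.
                "selectionMode": "COUNT(1)",
            }
        },
        "actions": {
            "interruptSpot": {
                "actionId": "aws:ec2:send-spot-instance-interruptions",
                # ISO 8601 duration: two-minute notice before interruption.
                "parameters": {"durationBeforeInterruption": "PT2M"},
                "targets": {"SpotInstances": "spotInstances"},
            }
        },
        # Stop the experiment if this CloudWatch alarm goes into ALARM.
        "stopConditions": [{"source": "aws:cloudwatch:alarm", "value": alarm_arn}],
        "tags": {"Name": "spot-interruption-sketch"},
    }

template = spot_interruption_template(
    role_arn="arn:aws:iam::111122223333:role/example-fis-role",
    alarm_arn="arn:aws:cloudwatch:us-east-1:111122223333:alarm:example-alarm",
    tag_key="ChaosReady",
    tag_value="true",
)
print(json.dumps(template, indent=2))
```

A template like this could then be registered through the AWS CLI or an SDK; the stop condition is what keeps the experiment controlled.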

Execute experiments

Set up the necessary infrastructure to conduct the chaos engineering tests.
Create test environments, configure monitoring and logging systems, and ensure that backup and recovery mechanisms are in place.
Run the defined experiments.

Identify failures, deploy fixes and re-test

Monitor and observe the system's behavior during chaos experiments.
Collect data and analyze the results to assess the impact on performance, stability, and resilience.
Compare the metrics and performance data collected during chaos experiments with the baselines established during the steady state.
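
The comparison step can be sketched in a few lines of Python. This example assumes every metric is "higher is worse" (latency, error rate) and that each has a tolerance expressed as a fraction of its baseline; the metric names, values, and tolerances are hypothetical.

```python
def find_resilience_gaps(baseline, observed, tolerances):
    """Return a description for each metric that exceeded its tolerated limit."""
    gaps = []
    for metric, base_value in baseline.items():
        # Metrics without an explicit tolerance must not exceed the baseline.
        limit = base_value * (1 + tolerances.get(metric, 0.0))
        value = observed.get(metric)
        if value is None:
            gaps.append(f"{metric}: no data collected during the experiment")
        elif value > limit:
            gaps.append(f"{metric}: observed {value} exceeds limit {limit:.4g}")
    return gaps

baseline = {"p95_latency_ms": 200.0, "error_rate": 0.01}
observed = {"p95_latency_ms": 230.0, "error_rate": 0.05}
tolerances = {"p95_latency_ms": 0.25, "error_rate": 1.0}

for gap in find_resilience_gaps(baseline, observed, tolerances):
    print(gap)
```

Treating a missing metric as a gap is a deliberate choice here: if monitoring produced no data during the fault, that is itself a finding.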

Continuously test and improve

Repeat the above steps periodically to confirm that the system remains resilient and can withstand potential failures over time.

Our disaster recovery orchestrator framework allows you to run controlled chaos engineering tests to find resilience gaps in your existing workloads.

The framework serves as a consolidated platform that supports two kinds of experiments:

AWS Fault Injection Service (AWS FIS) experiments:
- AWS FIS is a managed service that enables you to perform fault injection experiments on your AWS workloads.
Custom experiments using the disaster recovery orchestrator framework:
- If your organization is not using AWS FIS today, or your application requires experiments that AWS FIS does not cover, the disaster recovery orchestrator framework allows you to run those experiments.

AWS service	Sample chaos engineering experiments	Enablement method
Amazon Aurora	Simulate network latency between Aurora instances; introduce failures in Aurora replica instances; test the impact of increased load on Aurora read and write capacity	Disaster recovery orchestrator framework
Amazon Kinesis	Simulate an increased data ingestion rate to test the scalability of Kinesis streams	Disaster recovery orchestrator framework
Amazon EC2	Test Spot Instance interruptions	AWS Fault Injection Service (AWS FIS)
Amazon DynamoDB	Deny traffic to and from the Regional endpoint for DynamoDB in the current Region	AWS Fault Injection Service (AWS FIS)

Organizations are seeking ways to run chaos engineering experiments regularly using their existing deployment pipelines. Doing so lets you find resilience gaps in existing workloads and document the risks in your application. Running the tests within your pipelines also lets you assess your operational practices, for example by verifying that your monitoring and alerting processes actually inform you about scenarios that could cause an outage in your application.
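
One way to picture such a pipeline check is a small gate that fails the stage when an expected alert did not fire or the system did not return to its steady state. The Python sketch below is an assumption about how a gate could look, not a description of the framework; the alarm names and the `pipeline_gate` function are hypothetical.

```python
def pipeline_gate(expected_alarms, fired_alarms, steady_state_restored):
    """Return a list of problems; an empty list means the stage can pass."""
    # Any alarm that should have fired during the fault but did not
    # points to a monitoring or alerting gap.
    missing = sorted(set(expected_alarms) - set(fired_alarms))
    problems = [f"alarm did not fire: {name}" for name in missing]
    if not steady_state_restored:
        problems.append("steady state was not restored after the experiment")
    return problems

problems = pipeline_gate(
    expected_alarms=["db-replica-lag-high", "api-5xx-rate-high"],
    fired_alarms=["api-5xx-rate-high"],
    steady_state_restored=True,
)
for problem in problems:
    print(problem)

# A pipeline step would typically exit non-zero when problems exist.
exit_code = 1 if problems else 0
```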

If you would like help running chaos engineering experiments or want to explore any of the areas described above, reach out to us.