Bob Violino
Contributing writer

How chaos engineering can improve network resiliency

Feature
Oct 24, 2022 · 8 mins
Networking

Controlled experiments aimed at breaking the network can uncover previously unknown vulnerabilities

Conventional wisdom says, ‘If it ain’t broke, don’t fix it.’ Chaos engineering says, ‘Let’s try to break it anyway, just to see what happens.’

The online group Chaos Community defines chaos engineering as “the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.”

Practitioners of chaos engineering essentially stress test the system and then compare what they think might happen with what actually does. The goal is to improve resiliency.

For network practitioners who have spent their entire careers focused on keeping the network up and running, the idea of intentionally trying to bring it down might seem a little crazy.

Why chaos engineering makes sense

But David Mooter, a senior analyst at Forrester Research, argues that chaos engineering is a logical response to an environment in which networks are distributed across multi-cloud platforms and are increasingly under cyberattack.

“The issue is that distributed systems are too complex for us to fully comprehend,” says Mooter. “This means they will violate our assumptions and do unexpected things. Modern resilience efforts must be grounded in the assumption that we cannot fully understand and predict how our systems behave.”

“The network isn’t always reliable,” adds Nora Jones, founder and CEO of incident management software provider Jeli, and a pioneer of chaos engineering when she worked at streaming service Netflix.

“The concept of testing the network is the same as testing CPU or anything else—to simulate unfavorable events and surface the unknown unknowns,” Jones says. Chaos engineering supports the concept of continuous verification, the idea that things are never totally reliable, and failure is constantly around the corner. “This is a constant battle to stay ahead of the eight ball, and it requires a mindset shift in how you approach operations.”

What is an example of chaos engineering?

Mooter says he worked with a company that did a simple chaos experiment involving misconfiguring a port. “The hypothesis was that a misconfigured port would be detected and blocked by the firewall, then logged to immediately alert the security team,” Mooter says.

The company ran the chaos experiment by periodically introducing a misconfigured port into production. Half the time, the firewall did what was expected, but the rest of the time the firewall failed to block the port. However, a secondary cloud configuration tool always blocked it.

“The problem was that secondary tool did not alert the security team, so they were blind to these incidents,” Mooter says. “Thus, the experimentation showed not just a fault in the firewall, but also flaws in the ability of the security team to detect and respond to an incident.”

Test methodically, not at random

Chaos engineering wouldn’t be useful if it randomly introduced faults that network or security teams weren’t aware of and that actually took down the production network or caused performance issues.

The chaos engineering methodology is very specific. To begin with, chaos engineering is primarily performed in non-production environments, Mooter says.

He adds, “You don’t break things randomly, but rather intelligently identify unacceptable risk, form a hypothesis about that risk, and run a chaos experiment to confirm that hypothesis is true.

“You would have a test group and control group so that you can be 100% confident anything that goes haywire is due to the fault you injected into the test group, not something unrelated that coincidentally happened at the time you ran the experiment.”

Like a scientific experiment, the hypothesis should be falsifiable, Mooter says. “Every time I run the experiment and the experiment succeeds, I gain more confidence that my hypothesis is correct,” he says. “And if it fails, then I’ve discovered new information about my system to correct my false assumptions.”
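The test-group and control-group design Mooter describes can be sketched in a few lines of Python. This is a minimal simulation under stated assumptions, not a real chaos tool: `handle_request`, the 1% background error rate, and the 2% tolerance are all hypothetical stand-ins for a real service and its real service-level targets. The hypothesis being tested is that the injected fault does not push the test group's error rate meaningfully above the control group's.

```python
import random


def handle_request(fault_failure_rate: float, rng: random.Random) -> bool:
    """Hypothetical service call; returns True when the request succeeds."""
    base_failure_rate = 0.01  # assumed background error rate, illustrative only
    return rng.random() >= base_failure_rate + fault_failure_rate


def error_rate(results: list) -> float:
    """Fraction of requests that failed."""
    return results.count(False) / len(results)


def run_experiment(fault_failure_rate: float, requests_per_group: int = 10_000,
                   tolerance: float = 0.02, seed: int = 42) -> dict:
    """Compare a control group (no fault) with a test group (fault injected).

    Hypothesis: the system absorbs the fault, so the test group's error rate
    stays within `tolerance` of the control group's.
    """
    rng = random.Random(seed)
    control = [handle_request(0.0, rng) for _ in range(requests_per_group)]
    test = [handle_request(fault_failure_rate, rng) for _ in range(requests_per_group)]
    control_error = error_rate(control)
    test_error = error_rate(test)
    return {
        "control_error": control_error,
        "test_error": test_error,
        "hypothesis_holds": (test_error - control_error) <= tolerance,
    }


if __name__ == "__main__":
    # A fault the simulated service cannot absorb falsifies the hypothesis.
    print(run_experiment(fault_failure_rate=0.20))
```

Because both groups run under the same conditions and differ only in the injected fault, a gap between their error rates can be attributed to the fault rather than to something unrelated that happened at the same time.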

One of the main benefits of this approach is that it surfaces issues before they can seriously affect the business.

“Suppose there’s some obscure condition that will bring your payments service offline,” Mooter says. “Do you want to discover that in a controlled environment—probably non-production—where you can immediately shut off the fault and when people are actively monitoring the situation? Or do you want it to happen unexpectedly on a Friday evening when some key operations employees coincidentally happened to be on vacation?”

Best practices in chaos engineering

There are several best practices that organizations can apply when experimenting with chaos engineering:

  • Include application developers: Mooter says, “With complex distributed architectures, developers don’t have good intuition for the limits of their applications. When chaos engineering becomes part of software delivery, developers see more and more examples where their assumptions were wrong. This builds a habit of being more proactive in questioning your assumptions.”
  • Improve communication: At Netflix, where the company built its own chaos engineering tools and later open-sourced them, the idea “was to create a forcing function for engineers to build resilient systems,” Jones says. “Everyone knew that servers would randomly be shut down, and the system needed to be able to handle it. And not only that, people needed to know how to communicate with the right parties when this happened.”
  • Pick the right experiments: Networking chaos experiments “are arguably the most popular tests to model outages that cause unplanned downtime in today’s complex distributed systems,” says Uma Mukkara, head of chaos engineering at Harness, which provides chaos engineering tools and support services. Enterprises can leverage chaos engineering for specific experiments such as validating network latency between two services, checking resilience mechanisms in code, dropping traffic on a service call to understand the impact on any upstream dependencies, or introducing packet corruption into a network stream to understand application or service resilience, Mukkara says.
  • Loop in security teams: Chaos engineering can be applied to any complex distributed system, including network security, Mooter says. “For security, the mindset is to assume security controls will fail no matter how hard you try to be perfect,” he says. For example, a bank used chaos engineering to change what indicators it was measuring. Instead of simply keeping track of time without security incidents, it began measuring which specific security safeguards were known to be working, Mooter says.
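The networking experiments Mukkara lists (added latency, dropped calls, corrupted packets) can be illustrated with a small application-level wrapper. This is a rough sketch, not how any particular chaos tool works: `NetworkFault`, `DroppedCall`, and `call_with_fault` are hypothetical names, and real tools typically inject such faults at the network layer rather than in application code.

```python
import random
import time
from dataclasses import dataclass


@dataclass
class NetworkFault:
    """Hypothetical fault profile for one experiment."""
    latency_s: float = 0.0     # extra delay added before every call
    drop_rate: float = 0.0     # probability that a call is dropped
    corrupt_rate: float = 0.0  # probability that one payload byte is flipped


class DroppedCall(Exception):
    """Raised when the injected fault drops the call."""


def call_with_fault(send, payload: bytes, fault: NetworkFault,
                    rng: random.Random, sleep=time.sleep) -> bytes:
    """Apply the fault profile, then forward the payload to `send`."""
    if fault.latency_s > 0:
        sleep(fault.latency_s)  # simulate a slow link
    if rng.random() < fault.drop_rate:
        raise DroppedCall("injected drop")  # simulate lost traffic
    if payload and rng.random() < fault.corrupt_rate:
        # Flip every bit of one randomly chosen byte to simulate corruption.
        i = rng.randrange(len(payload))
        payload = payload[:i] + bytes([payload[i] ^ 0xFF]) + payload[i + 1:]
    return send(payload)


if __name__ == "__main__":
    echo = lambda data: data  # stand-in for a real downstream service
    fault = NetworkFault(latency_s=0.1, corrupt_rate=1.0)
    print(call_with_fault(echo, b"ping", fault, random.Random(0)))
```

Wrapping a service call this way lets a team observe how callers react, for example whether retries, timeouts, or checksum validation behave the way the team assumed.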

Tips for controlling the chaos

Chaos engineering can come with risks such as bringing down a network during a busy, or even not-so-busy, time. That’s why it’s important to follow these guidelines.

Place limits on chaos engineering projects

“I don’t think you should give every engineer the keys to go around breaking things,” Jones says. “It’s a discipline—and more specifically it’s a people discipline more than a tooling one—so instilling the appropriate culture of psychological safety and learning is a prerequisite before chaos engineering can be effective.”

Learn from existing incident-response systems

Organizations should take time to ensure they are learning from the incidents they’re already having, Jones says. “If you’re considering chaos engineering, I guarantee there’s a wealth of information in incidents you’ve already had,” she says. “Explore those first and surface patterns from them” that will help in understanding the best types of experiments to run.

Have a way to pull the plug quickly

It’s a good idea to have an automated way to immediately abort a chaos activity when necessary, Mooter says. “Every chaos experiment should be designed to minimize the blast radius should things go wrong,” he says. “This can be at the infrastructure, application, or business layers.” For example, at the infrastructure layer, isolate the fault to a limited set of connections.
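One way to picture both ideas, a limited blast radius and an automated abort, is the Python sketch below. It rests on assumptions: `inject`, `rollback`, and `read_error_rate` are hypothetical callbacks a team would supply, and the 10% fraction and 5% threshold are illustrative values, not recommendations.

```python
from typing import Callable, Sequence


def limit_blast_radius(connections: Sequence[str], max_fraction: float = 0.1) -> list:
    """Pick a small subset of connections to receive the fault."""
    count = max(1, int(len(connections) * max_fraction))
    return list(connections[:count])


def run_with_abort(inject: Callable[[], None], rollback: Callable[[], None],
                   read_error_rate: Callable[[], float],
                   abort_threshold: float = 0.05, max_checks: int = 10) -> str:
    """Inject a fault, watch a health metric, and always remove the fault.

    Returns "aborted" if the metric crosses the threshold, else "completed".
    A real monitor would wait between checks; that delay is omitted here.
    """
    inject()
    try:
        for _ in range(max_checks):
            if read_error_rate() > abort_threshold:
                return "aborted"  # stop early; rollback still runs below
        return "completed"
    finally:
        rollback()  # the fault is removed on every exit path


if __name__ == "__main__":
    targets = limit_blast_radius([f"conn-{i}" for i in range(50)])
    readings = iter([0.01, 0.02, 0.30])  # simulated error-rate samples
    status = run_with_abort(
        inject=lambda: print("fault on", targets),
        rollback=lambda: print("fault off"),
        read_error_rate=lambda: next(readings),
    )
    print(status)
```

Putting the rollback in a `finally` clause means the fault is removed whether the experiment completes, aborts, or raises an unexpected error.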

Federate the chaos engineering program

“Centralized chaos engineering teams don’t scale,” Mooter says. “Delivery teams do not learn and build intuition for resilience if they are not directly involved, so you lose the culture change benefit if it’s centralized.” It doesn’t make sense to create an “us vs. them” dynamic between the central chaos team and delivery teams, Mooter says.

“For example, a software firm found that in the past, a development team would point the finger at infrastructure for not providing enough disk space while the infrastructure team pointed back and asked why the developers wrote code that consumed so much space,” he says.

After embracing the chaos-engineering mindset, both sides have pivoted away from arguing over why the disk is full and progressed to asking how to make the system resilient against a filled disk, Mooter says.

Change the culture

Organizations using chaos engineering would be wise to create an experimentation culture, Mukkara says.

“No system can be 100% reliable,” she says. “However, your customer wants it to be available when they need it. You need to build a system that can withstand common failures and train your team to respond to unknown failures. This starts with experimenting to learn how your system behaves and functions and iterating on improvements over time.”

Visibility and transparency

Mukkara adds: “Report and share learnings with multiple stakeholders of the issues you are finding and reliability improvements you are making to your system, to get the business engaged.”

For example, report to product management leadership what failure modes a system is protected against, and how resilience mechanisms have been successfully tested. “This will give them confidence in understanding the system and the availability it should maintain,” Mukkara says. “You can also let them know what failure modes your system is susceptible to, so the issue can be prioritized or at a minimum acknowledged as an acceptable risk.”