An Introduction To Chaos Engineering With Christophe Rochefolle
Christophe and I met for the first time in October 2017. He was organizing a meet-up on Chaos Engineering. To my delight, my book Chaos, a User’s Guide was on the recommended reading list, and Christophe was kind enough to invite me as well. He is a true practitioner of what could be called the art of mastering chaos. Also, being an engineer, he is the kind of guy who won’t let a theory fade into oblivion or remain mere talk, but will put it to the test.
Watch the Interview (in French)
How Does One Become a Chaos Engineer?
Perhaps as an occupational hazard, Christophe Rochefolle doesn’t say much about his career path. He seems to live far more in the present, the here-and-now, than in any narrative. A true chaote cannot let himself be locked up inside a narrative, after all. Life is always more complex, moving, and messy than any neat storyteller would willingly admit.
Regardless, Christophe mentions two factors that led him to a still little-known practice. The first is his relentless interest in transdisciplinary subjects: every bounded field seemed all too limited, pushing him to look for links rather than taken-for-granted boundaries. The second, which emerged later in his career, was a focus on incremental improvements. Not through micromanagement (“I’m coming up with my cap and I’m telling you, this is OK, this isn’t”) but rather through continuous change and frequent modifications.
After years as a quality assurance (QA) manager at various companies, Christophe saw the potential of Agile development methods in their early years. In June 2009, he managed to convince vente-privee.com that they needed them, and he helped the company become a leading Internet player.
Indeed, vente-privee.com, whose name means “private sale” and which was created in 2001 as a simple destocking shop, is now among the top 3 French e-commerce websites (FR). Its short-term, slashed-price sales campaigns were a perfect fit for a time of economic contraction, and it needed truly able IT people to keep it available as it became one of the most visited websites in the country.
Later, in 2016, Christophe was hired by the SNCF, the national railroad company, to make sure that their latest website, oui.sncf, remained alive and well. This website is now the first in the country in sales volume.
“Before, software development was quite different. A lone developer or head developer would spend a year and a half coding an app. Then he would deliver a long string of code obeying a 180-page specification.” This worked during the 80s and 90s, when IT was simply trotting along. In the course of the 2000s, this development mode started to prove inadequate. In a year, “the market evolves, the programming and language standards evolve”, and a product conceived before that will fail to satisfy expectations that have changed.
Agility was an answer to the quicker pace of information technology. Coders do not work in a bubble anymore but work in one- or two-week “sprints.” Then, they meet with the customer(s) and other stakeholders to agree on small or bigger adjustments. As a result, products are finished more rapidly, the risk of drifting away from expectations becomes much smaller, and all stakeholders have many more options to make the product evolve.
Christophe is also a long-term proponent of DevOps. In IT engineering, this trend is about bringing together or even fusing software development (dev) with system administration or “operations” (ops). At vente-privee.com, the DevOps mindset coalesced into a quest for functional improvement: how could the whole supply chain become more agile? Or perhaps even lean? Christophe worked to tighten the triangle “ideas-value proposition-feedback” so that the latter could lead to quicker insights and implementations.
Why Science-Fiction Has (Almost) Died
In that field, accelerating history is not a textbook thing. In 2012, voyages-sncf.com—the older version of oui.sncf—had 4 updates a year. In 2018, “we have one or two updates every day” for a total of 446 updates. The current version has 17 IT teams. This means a high rate of instability and many challenges, with innumerable potential bugs. Coordinating all that is definitely not easy.
The oui.sncf Internet apparatus is mostly made of webservices and microservices. Instead of “big blocks”, which are frail, the apparatus is built of many small processes that can be tailored, customized, re-used or deleted. Just like spare parts, with the important difference that each noticeable feature involves hundreds of microservices. And said features do not even include externalized processes like the payment system.
Complexity levels in IT engineering are today at an all-time high. Things have never been this complicated, not even ten years ago. Christophe also mentions another complexity factor, namely AI: programs have their own behaviors. Conditioned by users’ actions, these programs interpret natural language, build their own patterns through machine learning and sometimes show puzzling behavior. “With AI, we are still apprentices”, Christophe says.
How can one thrive here? More precisely, how can all that complexity allow for emergence or even remain sustainable rather than collapsing under its own weight?
The Current Year world is unpredictable. This is why sci-fi novels have become passé. It was easy once to project a linear view of evolution onto the future, imagining that particular technologies would make their mark one after another. As so many innovations come together and blend at multiple scales, the best we can have now are near-future scenarios. No more “20.000 years later” giant arcs of the kind Isaac Asimov mused on in the 1950s. Just like depictions of the future from 1900, long-term forecasting is now more an art than a science.
As is often the case, what happens in one field holds true in others, which is no surprise in a self-recursive fractal world. A trader once told me that as late as 2000, one could anticipate effect Y from something the Fed chairman said and think along the same pattern with a number of other events. Now, he added, “you can’t connect anything” to anything else.
Thus, instead of predicting, one must assess and react quickly. To Christophe, this means prospecting for weaknesses. Hidden or “dark” weaknesses are the stuff of zero-day exploits and, sometimes, of costly malfunctions: better to find them before something bad happens. Although mostly “found out from their effects”, such weaknesses can be looked for and intuited before they are really uncovered.
A small, relatively non-complex system can be thought of according to rather simple rules. Your computer, for example, has internal mechanisms of constant monitoring. It has a number of sensors, which you can use to know how hot your core is, how fast your CPU runs, how much memory is used, how much power is consumed… In each case, if any value exceeds a known limit, an alarm should ring and/or an automated protective reaction should kick in. The software version of this lies in the error messages we are all familiar with. When you get a blue screen with something like “0xc000021a”, it means the error was thought of beforehand and the computer reacted as planned.
In the case of a much more complex system, targeted monitoring gives way to observability. Instead of a mode where only this and that are considered relevant information, everything is now monitored. Such big data has a cost, but it is also a means to get a wider picture of what happens. If something, anything, needs to be analyzed, any slice of events can be pulled out for further study. “Everything is logged, so it can be tracked.”
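To make the idea of “known limits” concrete, here is a minimal Python sketch of targeted monitoring. The sensor names and limit values are assumptions chosen for illustration; they do not come from the interview or from any real hardware specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limit:
    name: str       # name of the monitored reading
    maximum: float  # highest value considered safe
    unit: str


# Hypothetical limits: real values depend on the machine being monitored.
LIMITS = (
    Limit("cpu_temperature", 90.0, "°C"),
    Limit("memory_used", 95.0, "%"),
    Limit("power_draw", 250.0, "W"),
)


def check_readings(readings, limits=LIMITS):
    """Return one alarm message per reading that exceeds its known limit."""
    alarms = []
    for limit in limits:
        value = readings.get(limit.name)
        # Readings without a sensor value are simply ignored.
        if value is not None and value > limit.maximum:
            alarms.append(
                f"ALARM {limit.name}: {value}{limit.unit} "
                f"exceeds {limit.maximum}{limit.unit}"
            )
    return alarms


alarms = check_readings({"cpu_temperature": 97.5, "memory_used": 40.0})
```

Only problems that were anticipated, that is, readings with a declared limit, can raise an alarm here. Anything nobody thought of in advance goes unnoticed.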
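As a rough illustration of “taking a slice of events”, the Python sketch below keeps every event and filters afterwards by time window and by arbitrary fields. The event fields and service names are hypothetical, and a real observability stack would rely on a dedicated log store rather than an in-memory list.

```python
from datetime import datetime, timedelta


def slice_events(events, start, end, **filters):
    """Return events in [start, end) whose fields match all the filters."""
    return [
        event for event in events
        if start <= event["time"] < end
        and all(event.get(key) == value for key, value in filters.items())
    ]


t0 = datetime(2018, 6, 1, 12, 0, 0)
# Hypothetical log: everything is recorded, relevant or not.
events = [
    {"time": t0, "service": "search", "status": 200},
    {"time": t0 + timedelta(seconds=5), "service": "booking", "status": 500},
    {"time": t0 + timedelta(seconds=9), "service": "booking", "status": 200},
    {"time": t0 + timedelta(minutes=10), "service": "booking", "status": 500},
]

# Pull out only the failed booking calls during the first minute.
failed = slice_events(
    events, t0, t0 + timedelta(minutes=1), service="booking", status=500
)
```

The question is decided at analysis time, not at collection time. That is the difference from the alarm-on-threshold approach.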
Jesse Robbins, one of the pioneers of chaos engineering, was a “master of disaster” at Amazon. Before working in IT, Robbins had been a firefighter. The experience gave him a knack for emergency management. He spearheaded a wide training exercise, the Game Day. The company simulates a major failure, something that could happen, and the evaluators see how IT teams react to the problem, assess it and solve it (or not!). I hope the HR department didn’t become jealous: after all, the point lies in checking human behavior rather than mere software functioning.
As he joined oui.sncf, Christophe Rochefolle used Robbins’ recipe to create his own training exercise. Dubbed “Days of Chaos”, these take the form of a competitive game played by teams of volunteer IT people. The game lasts half a day. It features 30-minute sessions of mock failures. The teams only have so much time to find where the problem lies and react. If, after half an hour, no one has found the problem, the session stops and the failure is explained.
Christophe insists on the importance of marketing. “Had I dubbed this a failure detection training, no one would have been interested. But ‘chaos’ hits a nerve.” The word instinctively resonates within the psyche and makes everyone attentive to what’s going on. Sometimes the orchestrated failures can be simple (say, a server fails) and sometimes infinitely more complex. In the latter case, disentangling the knot becomes a true challenge, which comes with an experimental value as well.
The first day of chaos was organized on Friday, 13 January 2017. Christophe and the other ones in charge were definitely not afraid to, as we say in French, “tempt the Devil”! 113 computer specialists registered. Later, there were over 180.
Games are a way of learning. Training is already important as hands-on practice, but “gamifying” the corporate culture pushes it to another level: we often remember what we’ve learned while actively engaged, even having fun, better than what we were taught during a boring compulsory session. At the end of the game, the best participants receive awards and teams are ranked. Christophe cites Ender’s Game, a sci-fi novel featuring a gifted boy who becomes an excellent real-life war strategist through relentless game training, as a prime inspiration.
“We made a video with footage of trains, subways, even company board members that started to panic as the network descended into chaos”, Christophe recalls. “There was a lot of marketing here.” Then, in the last week before the day of chaos, all company computers displayed special wallpapers with a clear warning: chaos is coming…
“One can come up with many strategies, but every strategy can be eaten up by the situation.” No planned strategy avoids blind spots and pitfalls. The training sharpens something more primal, an ability to probe and intuit within the dense apparatus one is supposed to care for.
“Each team has a team dashboard. Advancement, team spirit, everything relevant is monitored and shown there. These metrics allowed the teams to see that production and business indicators”, even though they did not belong to the game specifically, “were really important. For two years, they had been told to check these indicators on a regular basis and they didn’t… then, by the second day of chaos, they were all familiar with these.” A small change that may yield serious advantages. Once again, it came through practice, not micromanagement.
The Monkeys are Coming
Then, after the days of chaos, came the chaos monkeys. The idea is more or less the same, although pushed to the next level. Now, instead of only simulating the problem by putting it into a game, a wrench will truly be thrown into a well-functioning machine. And instead of an all-seeing organizer deciding to plant this or that failure, the chaote task is handed to a small piece of software that will create a problem at random.
Imagine a monkey entering a ‘data center’, one of these ‘farms’ of servers that host all the critical functions of our online activities. The monkey randomly rips out cables, destroys devices and knocks over everything within its reach. The challenge for IT managers is to design the information system they are responsible for so that it can work despite these monkeys, since no one ever knows when they will arrive or what they will destroy. (Source)
The “chaos monkey” can be implemented as a permanent feature. It forces the IT people to be proactive. Every X days, or weeks, or at random, the chaos monkey will create a failure. It will happen, you won’t be “lucky” enough to face no problem, so you had better cope with that. As a result, both human and machine resources become more resilient, and when an involuntary failure happens, the problem is much less acute than it would have been without training.
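The principle can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the tool used at oui.sncf: the `Service` class, the service names and the `probability` parameter are all hypothetical, and “breaking” a service here only flips a flag.

```python
import random


class Service:
    """Stand-in for a real server or microservice (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.healthy = True


def chaos_monkey(services, rng, probability=1.0):
    """With the given probability, break one healthy service chosen at random.

    Returns the broken service, or None if the monkey stayed quiet.
    """
    candidates = [service for service in services if service.healthy]
    if not candidates or rng.random() >= probability:
        return None
    victim = rng.choice(candidates)
    victim.healthy = False  # in real life: stop a process, cut a link, etc.
    return victim


services = [Service(name) for name in ("search", "booking", "payment-gateway")]
rng = random.Random(13)  # seeded so that a given run can be replayed
victim = chaos_monkey(services, rng)
```

Scheduling this function every few days, or with a low `probability` on every run, gives the “it will happen, but you don’t know when” property described above. Seeding the generator is a design choice that makes a given exercise reproducible for the debrief.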
Do you know how we call resilience at work? We say, hey, it didn’t even hurt!
Beyond a certain size or scope, a company’s IT is absolutely bound to fail. The failures don’t have to be big; most of the involuntary ones are small, but in a complex system a small problem can have large consequences. (This is an example of the butterfly effect.)
No wonder Amazon had a keen interest in reliability engineering: when one handles dozens of data centers all over the world, totaling hundreds or rather thousands of servers, one can be sure that there will be failures. These may happen all the time and cannot be avoided. However, the impact of said failures can be minimized so that entropy remains marginal or insignificant on the users’ side.
How Can You Sell Chaos?
At this point in the interview, I can’t help but remember what my friend Luc Taesch, a specialist in agile development, told me: agility comes from the world of coding, not from management. Managers tend to dislike agile development because of the quick pace and the number of mid-course changes, not to mention increases in cost. Now, if this is true in the case of agile development, this should be even truer here: how can one go to board members and tell them, “I’ll throw some monkey wrenches here and there and you’ll pay me for that”?
Admittedly, Christophe answers, the idea is rather counterintuitive. Nevertheless, it can be sold to managers. First, they know how quick the pace tends to be, how unstable the world is, and they also know how important resilience can be. Many of them already train by practicing a sport, so why wouldn’t they admit that the same thing applies to IT? As Christophe mentions, the typical “how could this particular problem never happen again?” question has become moot: systemic complexity allows for infinitely varied problems to happen and the only answer to them is resilience, not rigid protections.
Second, and I’d bet that all managers will be sensitive to this argument, chaos engineering saves money. It limits losses, which means savings and a healthier financial balance.
According to a CIRCA (Communication and Information Resource Centre Administrator, part of the European Commission) report, major European companies have on average three big problems per month. Each of these costs them 115.000 euros. In addition to these direct losses, such companies tend to be well known, which means that they must remain available and will take a reputational blow if they fail to serve their customers.
Availability has become a top priority: since 2016, many managers perceive it as more important than security. Good marketing works wonders. Tell a manager that the company was completely unavailable for 5 minutes and he’ll probably shrug. Tell him that “we lost a 100.000-euro contract because we were unavailable” and he will understand without a shred of doubt. Money binds everyone, and is definitely a good argument to sell chaos engineering.
Once the unavailability rate has been contained below a certain limit, the company should be safeguarded from financial or reputational losses related to not being “here” a hundred per cent of the time.
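A quick back-of-the-envelope calculation shows the order of magnitude. The two input figures are the ones quoted from the report. The scenario in which one incident out of three is avoided is purely an assumption for illustration, not a result from the report or from the interview.

```python
INCIDENTS_PER_MONTH = 3          # figure quoted from the report
COST_PER_INCIDENT_EUR = 115_000  # figure quoted from the report


def yearly_incident_cost(incidents_per_month, cost_per_incident):
    """Direct yearly cost of incidents, ignoring reputational damage."""
    return incidents_per_month * cost_per_incident * 12


baseline = yearly_incident_cost(INCIDENTS_PER_MONTH, COST_PER_INCIDENT_EUR)

# Hypothetical scenario: better resilience avoids one incident per month.
with_training = yearly_incident_cost(
    INCIDENTS_PER_MONTH - 1, COST_PER_INCIDENT_EUR
)
savings = baseline - with_training

print(f"baseline: {baseline} EUR/year, hypothetical savings: {savings} EUR/year")
```

With the quoted figures, the baseline comes to 4,140,000 euros a year in direct costs alone. This is the kind of number that makes the argument tangible to a board.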
“In 1990,” Christophe remembers, “if an incident happened, only a small clique of specialists knew. In 2008, voyages-sncf.com was featured on the 8 o’clock news because of a serious IT incident.” Since then, fortunately, it has never happened again, even though the conditions did change. “Today, if we have a noticeable failure, users will start mentioning it on Twitter and Facebook within a minute.” If the reputational risk looks higher, the website has also managed to turn the tables: specialists monitor the social networks just in case unsatisfied users start mentioning a particular problem. In other words, the company uses random users as unknowing probes!
The difference between chaos and entropy is sometimes blurry. Put too much volatility within the system and it may completely break down. Although not evident in practice, the difference matters much, for chaos can have an infinite potential whereas entropy has almost none. How does a chaos engineer make sure that his willful mayhem remains tolerable and does not utterly destroy the system?
“You have to set a perimeter”, Christophe answers. “This is about toying, experimenting, testing, not breaking.” These intents are all the more important as the practice remains ambiguous: problems used for chaos monkeying should avoid having too much impact on the rest of the system, but should also be relatively unknown so that something is learned from solving them. I guess the ambiguity is impossible to alleviate completely when chaos is involved.
If a weak link is found, it should be improved and tested again until no weakness is detected. Then, the chaos engineer moves to another perimeter, or enlarges the perimeter so as to learn more. “In 2017, we focused on production processes, in 2018 we were rather focused on apps.”
Beyond the days of chaos and the chaos monkey stands the Chaos Kong. This one is basically chaos monkeying on a major scale. It simulates the loss of an entire data center, something dramatic as the entire oui.sncf IT apparatus runs on 2 data centers. “We did it twice and we want to do it another time.” Despite the risks, an element of reality is kept: some network flows are genuinely cut, which means that the mock aspect could leak into the real workings of the app.
“The first Chaos Kong had zero user impact. The public didn’t see anything. However, the teams involved failed to find where the fault lay. They were good at minimizing the impact and making the system work on half its usual hardware, but their diagnosis fell somewhat short”, Christophe says.
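One way to picture a perimeter is as an allow-list plus a stop condition. The Python sketch below is an assumption about how such a guard could look, not a description of the oui.sncf tooling: `run_experiment`, the target names and the 5% error threshold are all hypothetical.

```python
def run_experiment(targets, perimeter, inject, error_rate, max_error_rate=0.05):
    """Inject failures only inside the perimeter; stop if the impact grows too large.

    inject(target) causes the failure; error_rate() returns the current
    share of failing user requests (both are supplied by the caller).
    """
    report = {"injected": [], "skipped": [], "aborted": False}
    for target in targets:
        if target not in perimeter:
            report["skipped"].append(target)  # outside the perimeter: hands off
            continue
        inject(target)
        report["injected"].append(target)
        if error_rate() > max_error_rate:
            report["aborted"] = True  # experimenting, not breaking
            break
    return report


broken = []
report = run_experiment(
    targets=["staging-search", "staging-booking", "prod-payment"],
    perimeter={"staging-search", "staging-booking"},
    inject=broken.append,                    # fake injection for the example
    error_rate=lambda: 0.02 * len(broken),   # fake impact measurement
)
```

Enlarging the perimeter later simply means adding targets to the allow-list once the current ones no longer reveal weaknesses.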
To make the game even harder and more realistic, the next—and third—Chaos Kong will be a surprise exercise.
Chaos Engineering as Part of the Corporate Culture
As said earlier, to work well, chaos engineering must become a staple like any other within companies. The monkey wrench-throwing should not be an exceptional and rather annoying event, but part of a continuous process. Besides the ability to react and assess well, these trainings foster teamwork. They help the workers to self-organize, to function like a truly living unit.
Once, Christophe remembers, the international oui.sncf-related website went through a major overhaul. The transformation was so big it made many members, IT and non-IT alike, fear a major failure and consequent financial loss. To counter the problem, Christophe and others organized a whole month of chaos. Before the overhaul was released to the public, it was used as a beta version in a purely internal manner, and every day, a new failure would be purposefully injected here and there. IT teams had to find the failure(s), assess and solve them. This made them far more confident in the update, which became, in turn, more resilient.
This exercise is a prime example of making chaos one’s friend. Instead of being passively feared, chaos is brought about in a controlled manner and turned into a means to detect weaknesses.
Another memory comes back to Christophe’s mind. The oui.sncf data centers were equipped with shifting modules to ensure that, if a data center becomes unavailable, its data shifts automatically to the other data center and the system keeps working. A chaos monkey was launched and an unavailable data center simulated. The game worked great. Then, one month later, the scenario really happened: someone pushed the emergency button and the whole data center was emergency-deactivated. The shifting module worked fine for the website, but not for the mobile app, which started crashing.
Why didn’t the shift work for the mobile app data? A quick investigation uncovered the culprit: a route, a single route among the whole table, did not react properly to the shifting and prevented it from taking place as it should. This route was a rather small one. Deemed unimportant, it had not been tested as well as the others.
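The lesson, that every route must be checked and not only the important ones, can be illustrated with a small Python sketch. It is a deliberate simplification: the route table layout, the route names and the data center labels are hypothetical, and the real fault was a route that reacted improperly to the shift rather than one with a missing backup.

```python
def routes_missing_failover(route_table, failed_dc):
    """Return the routes served by the failed data center with no usable backup."""
    return [
        route["name"]
        for route in route_table
        if route["primary"] == failed_dc
        and route.get("backup") in (None, failed_dc)
    ]


# Hypothetical route table; "mobile-legacy" plays the small, overlooked route.
route_table = [
    {"name": "web", "primary": "dc1", "backup": "dc2"},
    {"name": "mobile-api", "primary": "dc1", "backup": "dc2"},
    {"name": "mobile-legacy", "primary": "dc1", "backup": None},
]

at_risk = routes_missing_failover(route_table, "dc1")
```

Because the check walks the whole table, a route cannot escape testing just because someone deemed it unimportant.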
What About Non-IT Jobs?
The IT field stands at the vanguard of the business world. Problems related to complexity, speed and unpredictability often arise first in IT. Thus, engineers have to come up with methods, assessments and answers that are then picked up by other fields as well.
Consider the agile method. Created for software development, it is now used in architecture. A construction site involves many different jobs, different yet related viewpoints, and estimates that are far from precise. No single architect can know everything that happens, much less calculate proper estimates. Thus, on a particular construction site, stakeholders meet every two weeks to share their own diagnosis and discuss precise questions, allowing everyone involved to know better how things are going and what potential problems or delays may still occur.
The advantages of agility, after all, are quite obvious. Customers are more engaged, their expectations are better understood and those who work form a better, more productive team. But how could chaos engineering be brought out of IT?
Perhaps it is the other way around. Before chaos became an unavoidable reality to engineers, it had been a daily occurrence for manual workers from time immemorial. Firefighters spend their lives fighting an intrinsically unstable element and must adapt to it. The military knows this as well: any war is full of harsh and unplanned events that must be confronted immediately. And even though the army is the prototypical example of rigid hierarchy, armies like the US Military have gone agile by allowing small teams to make their own decisions.
If you have children, you can turn chaos preparation into a game even more easily. Tell your children to, say, stay at school or go to grandpa’s home if this or that problem happens, then use any opportunity to trigger the reaction you taught them. It will be fun, and in case a really chaotic situation happens, what could be more important than leading your progeny to safety?
What About Fractals?
If you watched the video, you may have noticed that, after more than 40 minutes of discussion, Christophe still didn’t talk about the “fractal company” scheme he mentioned at the beginning. I look at the timer and, even though everything Christophe said seems teeming with a disruptive yet improving potential, I can’t stifle a hint of eagerness. What about the fractal company? What does it mean—not to me, not in my book, but to someone who strives to do it?
His answer lies in three points:
1. Continuous integration. Many fractal objects in nature, such as clouds or snowflakes, tend to form quickly. Their fractal pattern is not set in stone but emerges through continuous change. To developers or managers, a fractal company means a place where continuous integration becomes a permanent principle: instead of “reorganizations” taking place every 6 months, change should take place whenever it is needed. The same principle can be found behind the DevOps trend, where development stops being an early stage and becomes a permanent feature.
2. Self-similarity or scalability. Thanks to continuous change, the system has more flexibility. It can grow, divide its parts if more of these are needed, or merge them if they shrink too much. However, no matter the size it gains or loses, its parts must remain roughly similar to each other.
Agility has three principles: the product, or the aim; the system, which is concerned with the team and its well-being; and the technological, with its own focus, methods and rules. Each of these principles must be embodied by specific head persons. This is true within teams and should also hold at higher scales. For example, if a change touches several features, and hence involves the work of several teams, these should form a “tribe” (Christophe’s word) where other head persons embody these roles again at the scale of the tribe.
3. A living model. Teams are organized like cells, and tribes are a bunch of cells. A cell is not sealed off like a hermetic silo or ivory tower, but has a membrane; it keeps exchanging with neighboring cells while still upholding some limitations. “A cell that doesn’t exchange dies off, a cell that lets everything come in will die as well.” A cell can also move, and components of a cell can move as well from one cell to another, without the least problem. As always, nature remains ahead of science and we end up inspired by her.
The more history accelerates, the closer normalcy will be to the vanguard. What took ten years to become normal now takes some months. More than ever, discussing with those at the cutting edge helps to anticipate what is going to be.