Blue Flower

Chaos Monkey is a tool developed by Netflix to test the resiliency of their servers on the Amazon cloud when faced with failures. It periodically terminates a random virtual machine that is running their application. Their automated error recovery is supposed to spin up a new virtual machine to replace the one that failed, and do so in a manner that appears seamless to customers.

Rather than just implementing the error recovery code, testing it once, and assuming that it will do the job, they are constantly testing it, figuring that if it really can seamlessly recover from failures, there should be no problem with randomly blowing away virtual machines.

Realizing that there is the possibility that the recovery could fail, they run Chaos Monkey between 9 AM and 3 PM on weekdays, so if a problem does occur, there will be people present who can deal with it. They also have a way for applications that they know are not ready for this to opt out.
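To make those guard rails concrete, here is a minimal sketch of the selection logic as described above. It is not Netflix's actual code: the function names, the dictionary shape of an instance, and the `opted_out` set are all assumptions made up for illustration, and the actual termination step is left out entirely.

```python
import random
from datetime import datetime


def in_chaos_window(now: datetime) -> bool:
    """True on Monday through Friday, from 9 AM up to (not including) 3 PM."""
    return now.weekday() < 5 and 9 <= now.hour < 15


def pick_victim(instances, opted_out, now, rng=random):
    """Pick one instance to terminate, or None if nothing should be touched.

    `instances` is a list of dicts with hypothetical "id" and "app" keys;
    `opted_out` is a set of application names that have opted out.
    """
    if not in_chaos_window(now):
        return None  # nobody is around to deal with a failed recovery
    candidates = [i for i in instances if i["app"] not in opted_out]
    if not candidates:
        return None  # every application has opted out
    return rng.choice(candidates)
```

The point of the sketch is that both safety checks come before the random choice: outside the window, or with every application opted out, nothing is terminated at all.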

This got me thinking about testing the agility of our teams.

One of the big reasons we do Agile development is so that we can change direction at sprint boundaries if the priorities for delivering particular stories change. By finishing all their work by the end of the sprint, the team is able to change direction immediately.

Some teams have trouble understanding this. They resist breaking large stories into sprint-sized pieces, because they say it will increase the overall elapsed time to implement the change. This overlooks several things:

  • If it takes you six months to implement the change, the customer's needs may have changed by the time you finish.
  • If you try to test six months' worth of code and it doesn't work, you have to wade through all that code to find the error. If you are implementing it in Sprint-sized stories, you only have one Sprint's worth of code to look through.
  • Priorities may change. The Product Owner may need to have the team implement some feature ahead of schedule to keep from losing a customer. If you are two months into a six-month project and you have dozens of modules open, it is very difficult to change direction.

For teams that cling to elapsed time as the only viable metric, I would propose engaging the Product Owner in the following exercise:

If you have several epics that have similar priorities, mix them up each Sprint. If the first epic has stories A, B, and C, and the second epic has stories P, Q, and R, and the team is currently working on story A, they will expect that in the next sprint they will be working on story B, and in the next, story C. Instead, have them work on story A, then P, then B, then Q, etc.
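For a Product Owner who wants to see the resulting order written out, the mixing can be sketched as a simple round-robin over the epics. This is only an illustration: `interleave_epics` is a made-up helper, and it assumes each epic is just an ordered list of story names.

```python
from itertools import chain, zip_longest


def interleave_epics(*epics):
    """Take one story from each epic in turn until every epic is exhausted."""
    missing = object()  # placeholder for epics that run out of stories early
    rounds = zip_longest(*epics, fillvalue=missing)
    return [story for story in chain.from_iterable(rounds) if story is not missing]


# One story per sprint, alternating between the two epics from the example.
print(interleave_epics(["A", "B", "C"], ["P", "Q", "R"]))
# prints ['A', 'P', 'B', 'Q', 'C', 'R']
```

If the epics have different numbers of stories, the longer one simply continues on its own once the shorter one runs out.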

This will make transparent how agile they are, and will help get them out of the habit of assuming that if they don't finish a story in a sprint, they can just roll it over to the next with no consequences.

Of course, just as Netflix runs Chaos Monkey only during weekdays when people are present, you would want to be careful about how you do this exercise:

  • Don't do it during a deadline crunch.
  • Let the team know a sprint or two in advance that you are going to do this, and that you expect them to be able to handle it.
  • Discuss in the Retrospective how well this worked, and what they can do to make it work better.
  • A few stubborn teams may say that this is stupid and a waste of their time. You may have to remind them that the Product Owner is responsible for the priority of stories in the backlog, and the team is responsible for committing only to what they can accomplish in a Sprint.

Even teams that already buy into the idea of being able to change direction at sprint boundaries may discover impediments to doing so that they didn't see before.

Just as Netflix exercises their recovery software so they know it will work when they get a real failure, teams should exercise their agility regularly, so that when a critical customer demand comes along, they are practiced and ready for it.