Blue Flower

It has long been known in engineering circles that much can be learned from failure. Claude Albert Claremont, in his 1937 book on bridge building, "Spanning Space," wrote:

The history of engineering is really the history of breakages, and of learning from those breakages. I was taught at college "the engineer learns most on the scrapheap."

In one of my past lives, I was involved in performance car rallying. This involved racing in beefed-up cars over forest logging roads. In the ’70s, there was a driver named John Buffum. His day job was running a car dealership in Vermont, but he was also the top rally driver in the U.S. and had factory support from British Leyland, who supplied him with Triumph TR-7s, like this one:

John Buffum, Libre Racing © 1979, Lynn Grant

He was very fast, but he crashed a lot. Other drivers nicknamed him “Stuff 'em Buffum,” because he stuffed his car into the ditch so often. Each time, British Leyland would ship him a new TR-7 for the next race.

As time went on, he stopped crashing, but he was still very fast. Since he had crashed so many times, he knew exactly what the car felt like when it was right on the edge, so he knew when to back off. Other drivers who hadn’t had the luxury of getting a new car every race didn’t have this experience, so they had to be more cautious, and were thus slower. Buffum went on to win 11 National Pro Rally Championship titles and 117 Pro Rallies.

Tony Dismukes, a martial artist whose blog, BJJ Contemplations, I follow said this about practicing failure:

On the mats we have the opportunity to fail over and over and over again. This is the only way to learn the limits of our techniques and of ourselves. As Rener and Ryron Gracie are fond of saying: every technique can work some of the time, no technique works all the time. Only by testing our techniques to failure can we learn exactly when and how much we can rely on each one. Only by testing them to failure can we truly understand which details are crucial for success and why. Only by testing ourselves to failure can we understand exactly where our personal limitations are and begin to learn how to improve upon them.

Engineers, racers, and martial artists all see the benefits of learning from failure, but too often in software development we consider failure a luxury we cannot afford.

Much of this is because we are always under the gun, trying to develop software to meet some too-early deadline. This causes us to stick to things that worked in the past, rather than trying something new, something that, if it works out, could make the team much more efficient or, even if it fails, might teach us valuable lessons.

In his book, Kanban: Successful Evolutionary Change for Your Technology Business, David J. Anderson wrote:

You need slack to enable continuous improvement. You need to balance demand against throughput and limit the quantity of work-in-progress to enable slack.

Too often, we completely fill our release schedules with work, so that any failure that delays something by a sprint is a catastrophe. If we allowed ourselves enough slack to have experiments that failed from time to time, those failures would be made up for by the improved efficiency that comes from continuous improvement.

At most companies, there seems to be a never-ending supply of meetings to attend, and complaining about their sheer volume is popular water-cooler conversation. We may not be able to reduce the number of meetings, but can we make them more effective (and thus, shorter)? Certainly!

For example, many people still start and end their meetings on the hour. This ensures that the people in the meeting have to wait for everyone who had a previous meeting that ended on the hour to arrive. And if that meeting slopped over its end time by a few minutes, the delay is even more. If you start your meetings five minutes after the hour and end them five minutes before the hour, it gives people time to get from one conference room to another, and maybe even stop for a coffee or a bathroom break.

And about meetings that go over their scheduled time: When you do that, you are taking time that does not belong to you. It may be convenient for you to extend the meeting on-the-fly, but if the attendees have subsequent meetings that you are making them late for, you are wasting the time of the people in those meetings, which is terribly rude. On top of that, Agile development is all about delivering on time. If we can't even bring a meeting in on time, how can we hope to deliver software on time?

Another example: Many Scrum teams use what Mike Cohn calls commitment-driven iteration planning. (Agile Estimating and Planning, p158). The team first figures out the number of hours available in the sprint, based on the percentage of their time they expect to be able to spend on stories and any days off that team members have. Then they take stories one at a time, break them into tasks, estimate how long each task will take, and subtract the time from the available hours. They continue to do this until they run out of hours. At that point, their sprint is loaded.

I have been in sprint planning meetings where figuring out the days off can take 15 to 20 minutes. The team goes around the table, with each team member saying something like "OK, I was going to take a few days off, maybe this sprint, or maybe in July. Or maybe I could take them in April." Eventually, after a two- or three-minute monologue, the team member figures out how may days he or she is going to be out during the sprint. Then they move on to the next person.

It doesn't have to be this way. They all know that they are going to be asked about days off in the sprint; this happens every planning meeting. Couldn't they figure this out in advance, rather than making everyone sit around twiddling their thumbs while they muse about when it would be good to take vacation? In some cases, there are coverage issues. This can be handled by announcing in the daily standup that you would like to take some time off in the next sprint, and if there is a problem, working it out before the planning meeting.

And in spite of all that has been written in the literature about not keeping a room full of people waiting while a poor typist updates a story in the sprint-tracking tool, we are still doing it regularly. Write it down and do it later!

I have seen lots of stories in the Agile tracking tool that refer to issues in the problem tracking system, or to documents in documentation repositories. It is frequently possible to get a URL for these issues or documents, and this should be put in the story, rather than just saying "Issue 12345" or "the code spec is in the repository under this project". If you make people search for information that you must of had a link to at the time you were looking at it, you waste the time of everyone who has to look at your story.

Wasting people's time in meetings isn't just rude; it severely impacts the flow of the meeting. While you are searching through the document repository to try to find the referenced code specification, the rest of the team starts checking their phone for messages, or talking about last night's ballgame, or they decide this would be a good time to get a coffee or take a bathroom break. It may take you five or ten minutes to get the meeting back on track.

It is easy to gripe about meetings, but it is not much harder to make them a lot less wasteful, which will go a long way towards making them less irritating.

One oft-mentioned feature of Lean manufacturing is the andon light or the andon cord. The idea is that any employee on the assembly line who encounters a problem pulls the andon cord, the line is stopped, and the light comes on to indicate where the problem is.

By the way, andon is the Japanese word for paper lantern, which has apparently been generalized to mean any lantern. Here are a couple of good references about the history of andon lanterns:

Although andon lights are frequently mentioned in the Agile development literature, some of their most important points are sometimes glossed over.

In A Study of the Toyota Production System, Shigeo Shingo, an industrial engineer noted in connection with Toyota’s SMED (Single Minute Exchange of Dies) program that reduced setup times for punch presses from many hours to less than a minute, said:

The andon is a visual control that communicates important information and signals the need for immediate action by supervisor. There are some managers who believe that a variety of production problems can be overcome by implementing Toyota’s visual control system. At Toyota, however, the most important issue is not how quickly personnel are alerted to a problem, but what solutions are implemented. Makeshift or temporary measures, although they may restore the operation to normal most quickly, are not appropriate.

The key point is that each time the line is stopped because of the andon, the team strives to make sure that the same error does not happen again. Shingo states this more forcefully when he says “At Toyota, there is only one reason to stop the line—to ensure that it won’t have to stop again.”

The andon lights used by continuous integration teams come close to this philosophy. The light comes on when a build fails, or the automated tests that run as part of the build fail. Everyone on the team stops what they are doing and works on fixing the build.

What is missing sometimes is the idea of making sure that it doesn’t happen again, perhaps by implementing a Poka-yoke solution that will prevent the error from happening again.

When you read about the Toyota production system, what is striking is the commitment to doing whatever it takes to eliminate waste and errors, even at the expense of some short-team pain.

This is difficult to do in a mature company that is trying to adapt to Agile development, because it is a major change, and various departments are not used to working closely together. But it is essential to eliminate waste.

Chaos Monkey is tool developed by Netflix to test the resiliency of their servers on the Amazon cloud when faced with failures. It periodically a terminates a random virtual machine that is running their application. Their automated error recovery is supposed to spin up a new virtual machine to replace the one that failed, and do so in a manner that appears seamless to customers.

Rather than just implementing the error recovery code, testing it once, and assuming that it will do the job, they are constantly testing it, figuring that if it really can seamlessly recover from failures, there should be no problem with randomly blowing away virtual machines.

Realizing that there is the possibility that the recovery could fail, they run Chaos Monkey between 9 AM and 3 PM on weekdays, so if a problem does occur, there will be people present who can deal with it. They also have a way for applications that they know are not ready for this to opt out.

This got me thinking about testing the agility of our teams.

One of the big reasons we do Agile development is so that we can change direction at sprint boundaries, if the priorities for delivering particularly stories changes. By finishing all their work by the end of the sprint, the team is able to change direction immediately.

Some teams have trouble understanding this. They resist breaking large stories into sprint-sized pieces, because they say it will increase the overall elapsed time to implement the change. This overlooks several things:

  • If it takes you six months to implement the change, the customer's needs may have changed by the time you finish.
  • If you try to test six-months-worth of coding and it doesn't work, you have to wade through all that code to find the error. If you are implementing it in Sprint-sized stories, you only have one-Sprint's-worth of code to look through.
  • Priorities may change. The Product Owner may need to have the team implement some feature ahead of time to keep from losing a customer. If you are two months into a six-month product and you have dozens of modules open, it is very difficult to change direction.

For teams that cling to elapsed time as the only viable metric, I would propose engaging the Product Owner in the following exercise:

If you have several epics that have similar priorities, mix them up each Sprint. If the first epic has stories A, B, and C, and the second epic has stories P, Q, and R, and the team is currently working on story A, they will expect that in the next sprint they will be working on story B, and in the next, story C. Instead, have them work on story A, then P, then B, then Q, etc.

This will make transparent how agile they are, and will help get them out of the habit of assuming that if they don't finish a story in a sprint, they can just roll it over to the next with no consequences.

Of course, just as Netflix runs Chaos Monkey only during weekdays when people are present, you would want to be careful about how you do this exercise:

  • Don't do it during a deadline crunch.
  • Let the team know a sprint or two in advance that you are going to be doing this, and that you expect them to be able to do this.
  • Discuss in the Retrospective how well this worked, and what they can do to make it work better.
  • A few stubborn teams may say that this is stupid and a waste of their time. You may have to remind them that the Product Owner is responsible for the priority of stories in the backlog, and the team is responsible for committing only to what they can accomplish a Sprint.

Even teams that already buy into the idea of being able to change direction at sprint boundaries may discover impediments to doing so that they didn't see before.

Just as Netflix exercises their recovery software so they know it will work when they get a real failure, teams should exercise their agility regularly, so that when a critical customer demand comes along, they are practiced and ready for it.

Poka-yoke (the final "e" is pronounced like "eh?" in English) is the Japanese term for "error proofing", formalized by industrial engineer Shiego Shingo as part of the Toyota Production System. (He is said to have picked the term "error proofing" rather than "fool proofing" [baka-yoke] to underscore that the problem was not foolish workers, but the fact that everyone makes mistakes from time to time when given a chance.)

In the Toyota Production System, poka-yoke deals with designing vehicles and their assembly processes in such a way that it is difficult to assemble them incorrectly.

Here is an example from the 1940s of the need for poka-yoke: My father worked at Kaiser-Frazer when they used to make a car called the Frazer. It had "FRAZER" spelled out in individual chrome letters on the front of the hood, like this.

Posts that stuck out the back of each letter went through holes in the hood to hold the letters in place. The positions of the posts were standardized, making all the letters interchangeable. Sometimes, when an assembly worker was having a bad day, he would end up making ZEFFERs instead of FRAZERs. Automakers subsequently changed the letters so they could not be interchanged, either by putting the posts in a different position on each letter, or by making the logo a single unit, like this.

A more modern example is the floppy disk drives that used to be in PCs. They used an unkeyed four-pin power connector that could be installed the right way, which would make the drive work, or the wrong way, which would make it go up in smoke. You had to remember that the red wire went away from the data connector, unless you were working with the odd brand where the red wire had to go toward the data connector. (I fried several floppy disk drives this way.) That was the last unkeyed connector I remember seeing on a PC, so the PC industry evidently adopted poka-yoke.

There are all kinds of modern examples, like polarized power plugs, and IKEA furniture, which is difficult to assemble wrong, and the same concept applies to software development.

For example, suppose you have a routine that performs several different functions. (That, in itself, is probably a violation of the Single Responsibility Principle, but suppose there is a good reason for it to be that way.) Some of the functions require few parameters, while others require many. Callers have to remember to pass the right number of parameters, including dummy parameters. So you get something like this:

CALL BADRTN,(A,B,,,,,,,C,,,D),MF=(E,PARMLIST)

If you miscount the commas, you have a problem. It would be much less error-prone to have multiple entry points, each with a fixed number of parameters.

The same concept applies to development tools. In one build system I worked with, you have to type in the names of all the modules that are part of your change. If the build fails because you left out one of the changed modules, you have to re-submit the build request, again typing in the names all the modules. If you have 30 or 40 modules in your change, you might mistype or leave out one of the names you got right in the first request, causing the build to fail again. If you could just call up the first request and say you wanted to add a module, there would be much less chance of error.

Another case is anywhere that you have to enter the same information into two different places. Eventually, you will forget to update one of them, or will mistype the information while entering it the second time. If the systems can be made to talk to each other, this greatly lessens the chance for error.