Towards a Learning from Failure Culture

“Learning from failure” mindset.

(Using Poka-Yoke techniques for early defect detection)

This paper describes the poka-yoke quality approach in general, as well as our particular use of the technique in our localization efforts.

Poka-yoke provides a simple, robust, and painless way for us to detect defects early in our localization efforts.

Defects result from allowing a mistake to reach the customer, and defects are entirely avoidable. The goal of poka-yoke is to engineer the process so that mistakes can be prevented or immediately detected and corrected.

Steve Maguire (1993): "How could I have automatically detected this bug? How could I have prevented this bug?"

The software tools that most closely resemble poka-yoke devices are programs such as lint that examine the syntax of programs and alert the programmer to possible mistakes.
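A lint-style poka-yoke device can be as small as a pattern scan over source lines. The sketch below is illustrative, not a real linter: it flags `== None` comparisons, which are legal Python syntax but almost always a mistake.

```python
import re

# A toy "lint" device: scan source lines for a pattern that compiles
# fine but is nearly always wrong in Python ("== None" instead of
# "is None"), and report it to the programmer immediately.
SUSPECT = re.compile(r"[=!]=\s*None\b")

def lint_lines(lines):
    """Return (line_number, line) pairs that look like mistakes."""
    return [(i, line.rstrip()) for i, line in enumerate(lines, start=1)
            if SUSPECT.search(line)]

findings = lint_lines([
    "x = compute()",
    "if x == None:",       # flagged: should be "if x is None:"
    "    handle_missing()",
])
```

The check costs nothing at review time because it runs on every build; the mistake is caught the moment it is typed, not weeks later.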

Some areas yield more readily to a poka-yoke approach than to traditional testing.

The sections that follow describe our application of poka-yoke principles to solve a problem that defied a traditional software testing approach.
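A localization poka-yoke of this kind might look like the following sketch. The placeholder-consistency check is my illustration of the idea, not the paper's actual device: a translated string must carry exactly the same format placeholders as its English source, or formatting will fail (or silently drop data) in that locale.

```python
import re

# Hypothetical localization poka-yoke: compare the set of {placeholders}
# in each translation against the source string, so a broken translation
# is caught at build time rather than by a customer in that locale.
PLACEHOLDER = re.compile(r"\{[^{}]*\}")

def placeholder_mismatches(source, translations):
    """Return the locales whose placeholders differ from the source."""
    expected = sorted(PLACEHOLDER.findall(source))
    return [locale for locale, text in translations.items()
            if sorted(PLACEHOLDER.findall(text)) != expected]

bad = placeholder_mismatches(
    "Hello {name}, you have {count} messages",
    {
        "de": "Hallo {name}, Sie haben {count} Nachrichten",  # ok
        "fr": "Bonjour {name}, vous avez des messages",       # {count} lost
    },
)
# bad == ["fr"]
```

Traditional testing struggles here because the defect only appears in one locale with one string; a mechanical check covers every string in every locale on every build.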

(Poka-Yoke: applying mistake proofing to software)

It is well established that the cost of fixing a defect grows 10 to 100 times depending on how long the defect goes unnoticed in the development process.

The point of a poka-yoke is: making a mistake is OK; just don't make the same mistake again and again.

We concentrate on more pressing issues, and let poka-yokes catch common and well understood project issues.

Instead of sending multiple bold-font emails, create a poka-yoke to ensure things don't break again.

Whenever there is something to warn people about, instead of writing long emails and wiki documents, pause and introspect. Ask two questions: Can I redesign the system or process so that the mistake cannot be made? If not, can I put a check in software so that the mistake is caught quickly when it occurs? If yes, build the poka-yoke.
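For instance, a rule that would otherwise live in a reminder email ("every new order status needs a handler!") can be encoded as a fail-fast check. The enum and handler table below are hypothetical, a minimal sketch of the idea:

```python
from enum import Enum

# Hypothetical domain: order statuses and the actions taken for each.
class OrderStatus(Enum):
    NEW = "new"
    PAID = "paid"
    SHIPPED = "shipped"

HANDLERS = {
    OrderStatus.NEW: lambda order: "queue for payment",
    OrderStatus.PAID: lambda order: "queue for shipping",
    OrderStatus.SHIPPED: lambda order: "notify customer",
}

def check_handlers_exhaustive():
    """Poka-yoke: fail loudly if any status has no handler.

    Run at startup or in CI, this replaces the reminder email: the
    moment someone adds a status without a handler, the build breaks.
    """
    missing = [s for s in OrderStatus if s not in HANDLERS]
    assert not missing, f"No handler for: {missing}"

check_handlers_exhaustive()
```

The check is a few lines once, versus a warning email that must be re-sent to every new team member forever.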

(The Antifragile Organization — ACM)

We are awakened in the middle of the night if any part of the code breaks and impacts customers.

Each failure is an opportunity to learn, generating these questions: "How could the failure have been detected more quickly?" "How can the system be more resilient to this type of failure?" "How can this failure be induced on a regular basis?" The result is that each failure makes the system more robust and resilient, analogous to the experience a warrior gains in each battle to make him stronger and fiercer in the next. The system becomes better the more times and ways it fails.

A blameless culture is fostered.

As an organization, Netflix optimizes for innovation and velocity, and it accepts that mistakes will sometimes occur, using each one as an opportunity to learn. A commonly overheard saying at Netflix is, "If we're not making any mistakes, it means we're not moving quickly enough."

Mistakes are not a bad thing, unless the same mistakes are made over and over again.

(Resilience Engineering: Learning to Embrace Failure — ACM)

Me -> Improve the resilience of your software development process.

Resilience engineering is a familiar concept in high-risk industries such as aviation and health care, and now it’s being adopted by large-scale Web operations as well.

In the early 2000s, Amazon created GameDay, a program designed to increase resilience by purposely injecting major failures into critical systems semi-regularly to discover flaws and subtle dependencies.
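A GameDay-style drill can be sketched as deliberate failure injection around a dependency call. The names, failure rate, and wrapper here are illustrative, not Amazon's actual tooling: the point is that callers must prove they survive an outage before a real one happens.

```python
import random

def inject_failures(func, rate, rng=random.random):
    """Wrap func so it fails at the given rate (a toy GameDay drill)."""
    def wrapper(*args, **kwargs):
        if rng() < rate:
            raise ConnectionError("injected failure (GameDay drill)")
        return func(*args, **kwargs)
    return wrapper

def fetch_profile(user_id):
    # Stand-in for a real remote call.
    return {"id": user_id}

# Half of all calls now fail on purpose.
flaky_fetch = inject_failures(fetch_profile, rate=0.5)

def get_profile_or_default(user_id):
    """A resilient caller degrades gracefully instead of crashing."""
    try:
        return flaky_fetch(user_id)
    except ConnectionError:
        return {"id": user_id, "degraded": True}
```

Running with injection enabled on a schedule turns "what if the profile service goes down?" from a speculative email thread into a regularly rehearsed event.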

Me -> The article says failures are both inevitable and unpredictable. I would argue failures are inevitable but not unpredictable: other developers before you, probably in the same company, have likely suffered the same kinds of defects.

You also get an opportunity to gain confidence that the system you've built, and the organization that's been built up around it, are actually pretty resilient.

(Uprooting Software Defects at the Source — ACM)

A May 2002 report prepared for the National Institute of Standards and Technology (NIST) estimates the annual cost of software defects in the United States as $59.5 billion. Each defect that goes undetected until after a product has shipped can cost software producers many tens of thousands of dollars to address and patch. The cost suffered by the users of that product is often orders of magnitude higher. The 2002 NIST report estimates that "feasible" improvements to testing infrastructures could reduce the annual cost of software defects in the United States by $22.2 billion.

The specific improvements cited in NIST’s report fall into two categories: (1) detecting errors earlier in the software development cycle, closer to the points where they are introduced, and (2) reducing the cost of fixing defects by locating the root causes of bugs faster and with more precision.

Given this cost structure, it is not difficult to see the value of automated source code analysis. Source code analysis can identify defects at compile time as soon as the code is written. In addition, because the detected defects are derived from the code itself, source code analyzers can pinpoint the location and root cause of each error.

There are large classes of critical defects that can be detected automatically at compile time and go beyond the syntactic errors that traditional compilers catch.
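One such class can be shown with Python's standard `ast` module. The sketch below flags mutable default arguments: the code compiles cleanly, so no syntax checker objects, yet the construct is a well-known source of bugs. Because the finding comes from the code itself, the analyzer can report the exact line and function.

```python
import ast

def find_mutable_defaults(source):
    """Return (lineno, function_name) for each mutable default argument.

    A default like `items=[]` is evaluated once at definition time and
    shared across calls, which is legal syntax but rarely intended.
    """
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            for default in node.args.defaults:
                if isinstance(default, (ast.List, ast.Dict, ast.Set)):
                    findings.append((node.lineno, node.name))
    return findings

code = """
def good(items=None):
    pass

def bad(items=[]):
    items.append(1)
"""
# find_mutable_defaults(code) reports line 5, function 'bad'
```

The same pattern scales to deeper defects (null dereferences, resource leaks) in commercial analyzers; the point is that the check runs at compile time, before any test is written.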

The goal is to make source code analysis an industry best practice, just as the automated test suite has been for the past ten years.

Providing a tool that acts as a layer of insulation against programmer errors can enable a company to prevent hundreds, if not thousands, of critical software defects from reaching production.