The FAQ on Failures, Part One

Contents of Part One

  1. What is an engineering failure?
  2. What are some examples of engineering failures?
  3. What lessons can be learned from engineering failures?
  4. How can failures be avoided?
  5. Can engineered systems be "safe" even in the presence of failures?
  6. Oh GOD! I've had a failure! What do I DO?

Contents of Part Two

Aircraft Failures, by Tom Speer
Apollo 1 Accident, by Tim Nye
Mars Climate Orbiter Failure, by Ron Graham
THERAC-25 Accident, by Tim Fulcher and Ron Graham
Union Carbide Bhopal Accident, by Keith Dykes and Ron Graham
USS Thresher Failure, by John Tauxe, Timothy Reeves and Richard Rustad
The Code of Abibarshim, on the (sort of) lighter side


What is an engineering failure?

For purposes of this document, an engineered system fails when it stops working. (Usually this means it broke, or broke down, or shut down.) A "failure" should not be mistaken for a "malfunction," after which the system may work properly the next time you turn it on. (In the case of structures, however, it's pretty difficult to mistake the two.) As far as malfunctions are concerned, though, one should also recognize that "malfunction" + "loss of opportunity" = "failure" even if the system does work properly next time it's used.

There are diverse opinions about when you call a failure an "engineering failure" and when you call it something else. This goes to fixing blame. For purposes of this document, we are not talking about "a failure of the engineers." A few other related working definitions:

Definitions of Terms

  TERM          DEFINITION
  risk (n.)     "the chance of something going wrong," or the probability of a specific consequence happening to a specific exposed population
  hazard (n.)   what happens if something does go wrong
  murphy (n.)   the guy who says something will go wrong
  bug (n.)      what makes software not work as advertised (NOTE: this is not the same as a "missing feature")

What are some examples of engineering failures?

Whenever the subject comes up in conversations among engineers, the following examples are likely to be mentioned prominently:

Tacoma Narrows Bridge, Tacoma WA. Suspension bridge failed after resonating in torsion for some time under wind loading. Puget Sound inlet acted as a wind tunnel for load purposes. There is a spectacular film of this failure available -- the film was taken as part of a monitoring effort by local authorities, who were not given enough time to gather all the data they wanted. No fatalities, except for one dog. Subsequent design includes an additional box beam, and even yet it "gallops," hence the moniker "Galloping Gertie."

Hyatt Regency skywalk, Kansas City MO. Two-level catwalk failed under live load, causing fatalities. The upper catwalk supported both itself and the lower catwalk, by means of rods offset from one another. The original design called for nuts where they could not actually be installed: in that (impossible) design each nut helped support the weight of only one floor, because a single rod passed through the upper floor and carried the load of the lower floor directly, bypassing the nut holding up the upper floor. In the installed design, the nut under the upper floor had to support the weight of the upper floor and the weight of the lower floor as well; the nut and rod were undersized to handle this load. Live load may have caused harmonic oscillation by dancing, but even the static load of people exceeded design. Nuts tore through overloaded members. Blame and liability for failure were "spread around," but some licenses were still lost.

THERAC-25 cancer irradiation device. Problems with software design led to radiation overdoses at some clinics. No safety prompts were designed, and device was essentially run open-loop. The failure itself had nothing to do with these points -- it was based on an obscure bug which required very fast data entry at very specific times.

Space Shuttle Challenger, January 1986. Solid-fueled booster motor leaked combustion gases due to failure of a pressure seal (i.e., O-ring), leading to explosion of liquid fuel tanks and seven fatalities. O-ring had design problems that went unattended. Launch constraints were waived at the expense of flight safety. Safety margins were assumed based on previous successes.

Union Carbide piping systems failure, Bhopal India. Thousands killed and injured as a result of toxic vapor leak. Several safety systems were out of service and plant understaffed due to costs.

West Gate Bridge, Melbourne Australia. Two sections forming halves of deck didn't fit together well. Holding bolts at end of bridge were loosened to allow for sections to be joined, and these bolts failed under environmental loads as sections were in process of being joined. One section thus collapsed, resulting in several fatalities. A contributor writes:

My father worked as a safety officer in the nearby glassworks and was one of the first on the scene. He worked there for several days finding survivors (and more often those who didn't)... The foreman made his way down the lift to tell those in charge that [the failed section] was going to fall, and he wanted his team off the bridge. He had just left the lift cage when the span groaned and fell. By some providence he was blown out from beneath it by the air blast. And watched as most of his crew rode the span to the ground.

Quebec Bridge, above Quebec City Canada. Southern cantilever span failed during construction and fell into St. Lawrence River, killing 75 in 1907. Faulty design and inadequate supervision listed as cause. Specifically, the span was lengthened without recalculation of stresses. A new center span fell while being hoisted and killed 13 in 1916. A defective casting caused the second failure. The Quebec Bridge is thought to be the source of Canada's Iron Ring ceremony.

USS Thresher submarine sinking. May have sunk as a result of weight distribution outside of design due to heavy sonar in nose. A water inlet rupture (due to poor joint design) led to flooding faster than ballast blowers could compensate, resulting in 129 fatalities. A tape recording of the sinking is available.

Molasses tank failure, Boston MA, January 1919. A 15-meter high tank burst unexpectedly, dropping two million gallons of molasses (in a 10-meter high wall, initially) into the streets, killing 21 and injuring 150.

Patriot missile radar system, used in Operation Desert Storm. A software roundoff error grew monotonically with time in continuous operation, becoming a concern after about eight hours. This was corrected, but not before a Scud missile killed 28 US Army soldiers.

Silver Bridge, Gallipolis OH, 1967. Failed as a result of lack of redundant support in design -- one support failed under heavy rush hour load (including trucks), and bridge subsequently collapsed.

Most of the memorable failures are of structures, but by no means all. All of them involve spectacular collapses or explosions and/or fatalities.
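
The Patriot entry above turns on how a tiny representation error grows with running time. As a rough sketch only: commonly cited analyses of the incident (not this FAQ) report that the system counted time in tenths of a second and stored 0.1 in a 24-bit fixed-point register, chopping its non-terminating binary expansion. The 23 fractional bits, the constant names and the function names below are assumptions made for illustration.

```python
import math
from fractions import Fraction

# Assumed for illustration: 0.1 s ticks stored with 23 binary fraction bits,
# chopped (truncated) rather than rounded.
TICK_SECONDS = Fraction(1, 10)
FRACTION_BITS = 23


def chop(value: Fraction, bits: int) -> Fraction:
    """Truncate value to the given number of binary fraction bits."""
    scale = 2 ** bits
    return Fraction(math.floor(value * scale), scale)


def clock_drift_seconds(hours_running, bits: int = FRACTION_BITS) -> float:
    """Accumulated timing error after counting 0.1 s ticks for hours_running hours."""
    per_tick_error = TICK_SECONDS - chop(TICK_SECONDS, bits)
    ticks = Fraction(hours_running) * 3600 * 10
    return float(per_tick_error * ticks)


if __name__ == "__main__":
    for hours in (1, 8, 20, 100):
        drift = clock_drift_seconds(hours)
        print(f"{hours:>3} h running: clock drift of about {drift:.4f} s")
```

Under these assumptions the drift is strictly proportional to running time: each tick loses a little under one ten-millionth of a second, which adds up to roughly a third of a second after 100 hours. Against a target moving at well over a kilometer per second, that is a tracking error of hundreds of meters.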


What lessons can be learned from engineering failures?

Blame notwithstanding, much can be learned from engineering failures. Most obviously, they inevitably lead to design improvements in the engineered system or structure. But side benefits may include safety precautions and enlightened management behavior, as well as research and experimentation into the causes of failures.

Most of the failures above involved a lack of checks in the design and implementation of whatever failed. Such checks would have cost money, or time, or both. So one might conclude that corners were cut in most cases, especially since in nearly all cases design flaws either were detected (e.g. Challenger) or could have been detected (e.g. Hyatt skywalk) before failure, if not before being put into service.

In many cases, the answer to the question "what caused the failure?" did not become clear until the engineers went looking for it. "If it ain't broke, don't fix it" could lead to a team of engineers trying to figure out why it did break, instead of theorizing reasons why it might.

Petroski, in Design Paradigms, has several interesting things to say about root causes of failures:

Petroski reminds us that one failure will disprove any hypothesis. :-)


How can failures be avoided?

DESIGN

Build redundancy into design. How much is a function of the cost of a failure. (Although in many cases we look only at the cost of the redundancy itself.) An example would be FAA regulations, which allow for no single-point total failures. Redundancy is also a function of the reliability needed from the engineered system and the availability of spares, as seen in the above table.

Redundancy is not complete unless redundant systems are functionally isolated from one another. Otherwise, the failure of one could be partially responsible for the failure of another.
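
As a rough illustration of why isolation matters, the sketch below compares redundant units that fail independently with units that share a common cause of failure. It uses a simple beta-factor style model, which is an assumption introduced here and not something taken from this FAQ: a fraction `beta` of each unit's failure probability is attributed to a shared cause that defeats every unit at once.

```python
def system_failure_probability(q: float, n: int, beta: float = 0.0) -> float:
    """Probability that all n redundant units fail.

    q:    failure probability of one unit over the period of interest
    n:    number of redundant units (any one survivor keeps the system up)
    beta: assumed fraction of q due to a shared cause; 0 means fully isolated
    """
    if not (0.0 <= q <= 1.0 and 0.0 <= beta <= 1.0 and n >= 1):
        raise ValueError("need 0 <= q <= 1, 0 <= beta <= 1 and n >= 1")
    common = beta * q                      # one event that defeats every unit
    independent = ((1.0 - beta) * q) ** n  # every unit fails on its own
    # The system fails if either mechanism occurs; the two mechanisms are
    # treated as independent of each other.
    return 1.0 - (1.0 - common) * (1.0 - independent)
```

With q = 0.01 and three units, the fully isolated case (beta = 0) gives about one chance in a million, while beta = 0.1 gives about one in a thousand. In this model, adding more units barely helps once the shared cause dominates.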

Make use of spares when the components in question

These ideas go together surprisingly often.

Be aware of details, such as (for structures) corners, connections and reinforcements. In those instances there are stress concentrations. A structure has (in general) a safety factor: that factor is only as good as the weakest part of the structure.
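
The weakest-part point can be made concrete with a small sketch. The part names, stresses and stress concentration factors below are invented for illustration; the only idea taken from the text is that the safety factor of the structure is the minimum over its parts, with local stress raised at details such as corners and connections.

```python
from dataclasses import dataclass


@dataclass
class Part:
    name: str
    strength: float             # allowable stress, e.g. in MPa
    nominal_stress: float       # stress away from any detail, same units
    concentration: float = 1.0  # stress concentration factor at the detail

    def safety_factor(self) -> float:
        """Strength divided by the locally amplified stress."""
        return self.strength / (self.nominal_stress * self.concentration)


def structure_safety_factor(parts):
    """Return (safety factor, governing part) for the whole structure."""
    weakest = min(parts, key=lambda p: p.safety_factor())
    return weakest.safety_factor(), weakest


if __name__ == "__main__":
    parts = [  # hypothetical numbers
        Part("beam midspan", strength=250.0, nominal_stress=100.0),
        Part("bolted connection", strength=250.0, nominal_stress=60.0,
             concentration=3.0),
    ]
    factor, governing = structure_safety_factor(parts)
    print(f"governing part: {governing.name}, safety factor {factor:.2f}")
```

In this made-up case the connection carries less nominal stress than the beam, yet it governs the design once the concentration factor is applied.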

Standardize suppliers for devices where possible. This minimizes the number of interfaces to be dealt with by end users.

Watch out for problems of scale, as noted by Galileo. Scaling is not limited to geometry, either -- consider also how your design may change when moved from static/steady-state to dynamic/transient; consider environmental extremes as well.

There are analogous phenomena in other disciplines. For instance, in electrical or piping systems, interfaces have to be considered as prime candidates for failures. As seen in the case of Apollo 1, even the proximity of systems that might act at cross-purposes to one another must be considered.

OPERATION

If people are in the loop, conduct massive manned tests in varying environments. Even if not, testing is your last check. Such testing must consider as many aspects of expected use, environmental conditions and repeatability as possible.

If operator error is a possibility, training and retraining are a solution. Simplicity of interfaces is a necessity: according to Low ("What Made Apollo a Success"), only 100 wires connected the Apollo module to the Saturn booster.

Software-governed systems are among those particularly sensitive to intentional sabotage; but more importantly, steps must be taken to avoid bugs. Among things that can be done in this area are (again) redundancy in algorithms and the use of checksums or other sanity checks.
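
A minimal sketch of the checksum and sanity-check idea follows. The message layout (payload followed by a 4-byte CRC-32) and the setpoint limits are hypothetical; `zlib.crc32` is the Python standard-library CRC-32 routine.

```python
import struct
import zlib


def pack(payload: bytes) -> bytes:
    """Append a CRC-32 of the payload (big-endian, 4 bytes)."""
    return payload + struct.pack(">I", zlib.crc32(payload))


def unpack(message: bytes) -> bytes:
    """Return the payload, or raise ValueError if the checksum does not match."""
    if len(message) < 4:
        raise ValueError("message too short to hold a checksum")
    payload = message[:-4]
    (received,) = struct.unpack(">I", message[-4:])
    if zlib.crc32(payload) != received:
        raise ValueError("checksum mismatch: data corrupted in storage or transit")
    return payload


def check_setpoint(value: float, low: float, high: float) -> float:
    """Sanity check: refuse a commanded value outside its sensible range."""
    if not (low <= value <= high):  # also rejects NaN
        raise ValueError(f"setpoint {value} outside [{low}, {high}]")
    return value
```

A checksum only catches corruption of data that was right to begin with; the range check is a separate guard against values that arrive intact but make no physical sense. Neither replaces testing.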

If alarms are to be filtered for any reason, it must be established in advance that the alarms that are disabled are not relevant for the given mode of operation. End users should not disable alarms without checking with the manufacturer or those responsible for repair.

If systems are changed, manuals must change with them. If operating instructions are changed, manuals must reflect all changes. [This seems self-evident -- but how many times do we find that the documentation is an afterthought?]

MANAGEMENT

Exercise controls: inspection of materials, improvement of procedures. The controls are modified whenever

"I did what I was told" is a poor excuse for following a poor course of action. The engineer must avoid (a) throwing work "over the wall" and (b) fear of management reprisal for warnings of potential failure.

Independent verification does not guarantee that errors will be detected. People can screw up in the same way if they look at the problem in the same way. (In this sense, "diversity" is good. Maybe essential.)

There is a tendency in all of us to want to know and do more. Florman calls it "the tasting of new fruit." We will tend to be enlarging or otherwise improving our engineered systems, looking for cost savings or enhanced performance. With this will come unforeseen results. So when implementing improvements, remember all of the above.

MAINTENANCE

On dealing with service of equipment (a common cause of accidents), remember the following:

A maintenance profile consists of the following:

...where a self-test ends in some sort of alarm if a component fails or malfunctions (without user input), and a remote test can be carried out over a phone line.

MATERIALS

Use materials that can withstand the loads and environments associated with operation.

Build in conservatism in materials selection and implementation. A material is typically assigned a strength two standard deviations below its mean value (called 'specified minimum yield strength', for example).
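
The two-standard-deviation rule described above can be sketched as follows. The sample values in the usage note are made up, and real specified minimums come from the governing material standard, not from a handful of test coupons.

```python
import statistics


def conservative_strength(samples, k: float = 2.0) -> float:
    """Mean of measured strengths minus k sample standard deviations."""
    if len(samples) < 2:
        raise ValueError("need at least two test results")
    return statistics.mean(samples) - k * statistics.stdev(samples)
```

For hypothetical test results of 90, 100 and 110 (in any consistent unit), the mean is 100, the sample standard deviation is 10, and the assigned strength is 80. If strengths were normally distributed, only about 2 to 3 percent of specimens would be expected to fall below a value set this way.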

PRODUCTION

Ensure that all equipment is operating correctly and that all personnel are qualified.

Require inspection and test methods that eliminate defective components and structures. Pay particular attention to welds.

Adhere strictly to applicable codes, such as the ASME Boiler and Pressure Vessel Code.


Can engineered systems be "safe" even in the presence of failures?

This subject comes up periodically in discussions among engineers, and almost constantly (for software-governed systems) among software engineers. The short answer for most engineered systems is YES. For software-governed systems, safety is most likely to be ensured by some combination of the following actions, depending on need:

Many of these will protect your system from failure altogether, but they are essential when your system is safety-critical.

For systems which can endanger life or property, there are other actions that can be taken, depending on system and need:

Insurance companies can provide risk assessments that determine the level of protection needed for a given system.

Accidents, since they often have multiple causes, can be more complex than failures. If you are able to remove or minimize the risk of all individual failures, you will then minimize the probability of an accident. Among the primary reasons for system safety to be judged in terms of risk is that the designers and engineers may not have control over conditions that can lead to an accident (e.g. operator alertness, weather, and luck) -- for this reason, risk assessment will attach some probability of failure to single causes, and determine a composite probability of an accident.
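
One way to picture the composite probability is sketched below, under the strong assumption that the individual causes are statistically independent. Real risk assessments use fault trees and have to account for dependence between causes; the function names and the numbers in the usage note are invented for illustration.

```python
import math


def prob_any(probabilities) -> float:
    """P(at least one of several independent events occurs): an OR gate."""
    return 1.0 - math.prod(1.0 - p for p in probabilities)


def prob_all(probabilities) -> float:
    """P(all of several independent events occur together): an AND gate."""
    return math.prod(probabilities)


def accident_probability(primary: float, backup: float, operator: float) -> float:
    """Hypothetical accident: (primary AND backup both fail) OR operator error."""
    return prob_any([prob_all([primary, backup]), operator])
```

For example, with a primary failure probability of 0.01, a backup failure probability of 0.05 and an operator error probability of 0.001, the hardware path contributes 0.0005 and the composite is just under 0.0015. The cause outside the designers' control is the larger contributor here.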


Oh GOD! I've had a failure! What do I DO?

First of all, get hold of yourself. Although the failures cited as examples in this FAQ were all spectacular, and led to loss of life and substantial expense, those are not your garden-variety failures. What most individual engineers will experience in the way of failures will be failures of replaceable components in systems that aren't safety-critical, or of functions that have little hazard, causing down time far more often than injury. As indicated in other sections of the FAQ, failure of more complex or safety-critical systems may be detectable prior to service if the right precautionary steps are followed; or may be containable if not detected in time, again if the right precautionary steps are followed.

If you are involved in development of a system whose failure causes injury or loss of life, however, then you may be the center of attraction in a lengthy failure investigation, if not worse. But you already know that, and that's why the thought of failure gives you the willies. But knowing that your first failure will probably not be like that should help a little.

Once you've gotten hold of yourself, the next thing to do is learn from the failure. You might learn a lot, but that's not where the learning ends. You have to convince your customers, and maybe the world, that

As you can guess, the more complex the system the harder it is to find the lessons; the more spectacular the failure the harder it is to convince anyone you've learned them. The flip side is that determining the cause is the most difficult part. Once you have, then fixes and prevention methods often follow.

In learning from the failure, you may be called upon to abandon preconceived notions either about how your system works, or about the environment it's working in. In Vaughan, The Challenger Launch Decision, we see that NASA and contractor engineers (for instance) were faced with O-rings that had been damaged in rocket test firings and were allowed to fly on the Shuttle anyway. Why was this? Because they had based their estimates of the probability of O-ring failure on three measures that could not account for what had in their minds been a worst-on-worst case scenario. Those three measures are

Experience base. This was limited to some two dozen Shuttle flights before the failure of 51-L, but as pertained to the O-rings, there was a much larger number of test firings from which to gain data as well. None of the test firings completely failed redundant O-rings, and some were made at conditions (of thrust and O-ring alignment error) that exceeded what they could reasonably expect in service, especially in terms of ambient temperature.

Safety margins. Their safety margins were determined both through analysis (high-fidelity computer simulation) and test, and one source verified the other. When O-rings experienced erosion in testing, the cause was isolated and fixed prior to the next flight. Furthermore, there was always a redundant O-ring in service.

Self-limiting phenomena. The O-ring burn-through was a self-limiting phenomenon because hot gas impingement on the joint where the O-ring was placed would stop when pressure equalized in the joint.

Their three measures led to the ability to see risk as acceptable even in the presence of personal concern on the part of the engineers. (I could go on and on. Read the book. It's long, but absolutely fascinating.) Anyway, to those three measures you basically have to add

The worst-case environment for your system. In the case of the Challenger, just a couple of Shuttle flights earlier, NASA had witnessed the coldest liftoff in Cape history. Well, in the case of 51-L it happened again. Designers of buildings have to choose design wind loads on the basis of weather history, and on how long they foresee the building remaining in service. (If the building is only to stay up 20 years, you don't design to the 100-year worst-case wind.) You have to do the same.

Some attention paid to observations. The NASA culture of the time seemed to indicate that individual observations didn't count as "hard data." Well, when it's an expert's eyes, maybe they do. And the observations must count not only for system behavior, but for organizational feeling about the system as well.
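
The trade-off between service life and design event in the wind-load remark above can be put in numbers. The sketch assumes that each year is independent and that a "T-year" event has probability 1/T in any one year. That is the usual textbook idealization, not something stated in this FAQ.

```python
def exceedance_probability(return_period_years: float,
                           service_life_years: int) -> float:
    """Chance of seeing at least one T-year event during the service life."""
    if return_period_years < 1 or service_life_years < 0:
        raise ValueError("need return period >= 1 year and service life >= 0")
    annual = 1.0 / return_period_years
    return 1.0 - (1.0 - annual) ** service_life_years
```

Under these assumptions a 100-year wind has roughly an 18 percent chance of turning up during a 20-year service life, and about a 63 percent chance over 100 years. Whether 18 percent is acceptable depends on what a failure would cost.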

When you are involved in a failure investigation, you may have several options:


References and Resources


Authors

Robert Alheid
Eric W. Anderson
Sami Atallah
Tim Blanton
John Caufield
Joseph I. Chiu
Ted Cochran
Robert Currie
Keith Dykes
earl
Peter Floyd
Maura Gatensby
Ron Graham (Editor)
Brian Gross
George Gumas
Damien Holloway
Joseph L. Jones
Soren LaForce
David Levan
Nancy Leveson
Tim Nye
Kevin O'Connell
Jim Petroski
Timothy R. Reeves
Al Rosenfield
Richard H. Rustad
Russell W. Schmidt
Ian Smith
Tom Speer
John Stevens
John Tauxe
Andy Weilert
Eric Wiersma
Joseph Wilson
Christopher Wright
Ken Zagzebski

[Image: The Challenger failure, 1986]

Petroski, To Engineer is Human

Florman, The Existential Pleasures of Engineering

Levi and Salvadori, Why Buildings Fall Down

Kletz, What Went Wrong?

Casey, Set Phasers on Stun

Lovell and Kluger, Apollo 13

Peterson, Fatal Defect

Vaughan, The Challenger Launch Decision