The FAQ on Failures, Part One
- What is an engineering failure?
- What are some examples of engineering failures?
- What lessons can be learned from engineering failures?
- How can failures be avoided?
- Can engineered systems be "safe" even in the presence of failures?
- Oh GOD! I've had a failure! What do I DO?
Aircraft Failures, by Tom Speer
Apollo 1 Accident, by Tim Nye
Mars Climate Orbiter Failure, by Ron Graham
THERAC-25 Accident, by Tim Fulcher and Ron Graham
Union Carbide Bhopal Accident, by Keith Dykes and Ron Graham
USS Thresher Failure, by John Tauxe, Timothy Reeves and Richard Rustad
The Code of Abibarshim, on the (sort of) lighter side
What is an engineering failure?
For purposes of this document, an engineered system fails when it
stops working. (Usually this means it broke, or broke down, or
shut down.) A "failure" should not be mistaken for a "malfunction,"
in which case the system may work properly next time you turn it on.
(In the case of structures, however, it's pretty difficult to mistake
the two.) As far as malfunctions are concerned, though, one should
also recognize that "malfunction" + "loss of opportunity" = "failure"
even if the system does work properly next time it's used.
There are diverse opinions about when you call
a failure an "engineering failure" and when you call it something else.
This goes to fixing blame. For purposes of this document, we are not
talking about "a failure of the engineers." A few other related
working definitions:
| Definitions of Terms | |
| TERM | DEFINITION |
| risk (N) | "the chance of something going wrong," or the probability of a specific consequence happening to a specific exposed population |
| hazard (N) | what happens if something does go wrong |
| murphy (N) | the guy who says something will go wrong |
| bug (N) | what makes software not work as advertised (NOTE: this is not the same as a "missing feature") |
What are some examples of engineering failures?
Whenever the subject comes up in conversations among engineers,
the following examples are likely to be mentioned prominently:
Tacoma Narrows Bridge, Tacoma WA. Suspension bridge failed after resonating in torsion for some time under wind loading. Puget Sound inlet acted as a wind tunnel for load purposes. There is a spectacular film of this failure available -- the film was taken as part of a monitoring effort by local authorities, who were not given enough time to gather all the data they wanted. No fatalities, except for one dog. The bridge "galloped" in the wind from the time it opened, hence the moniker "Galloping Gertie"; the replacement bridge was built with a much deeper, stiffer deck structure.
Hyatt Regency skywalk, Kansas City MO. Two-level catwalk failed under live load, causing fatalities. The original design called for nuts where they could not actually be installed: in that (impossible) design each nut supported the weight of only one floor, because the rod passing through the upper floor carried the load of the lower floor, and the nut holding up the upper floor did not. As installed, the upper catwalk supported itself and the lower catwalk by means of rods offset from one another, so the nut under the upper floor had to support the weight of both floors; the nut and rod were undersized for this load. Live load may have included harmonic oscillation caused by dancing, but even the static load of people exceeded design. Nuts tore through overloaded members. Blame and liability for failure were "spread around," but some licenses were still lost.
THERAC-25 cancer irradiation device. Problems with software design led to radiation overdoses at some clinics. No safety prompts were designed, and device was essentially run open-loop. The failure itself had nothing to do with these points -- it was based on an obscure bug which required very fast data entry at very specific times.
Space Shuttle Challenger, January 1986. Solid-fueled booster motor leaked combustion gases due to failure of a pressure seal (i.e., O-ring), leading to explosion of liquid fuel tanks and seven fatalities. O-ring had design problems that went unattended. Launch constraints were waived at the expense of flight safety. Safety margins were assumed based on previous successes.
Union Carbide piping systems failure, Bhopal India. Thousands killed and injured as a result of toxic vapor leak. Several safety systems were out of service and plant understaffed due to costs.
West Gate Bridge, Melbourne Australia. Two sections forming halves of deck didn't fit together well. Holding bolts at end of bridge were loosened to allow for sections to be joined, and these bolts failed under environmental loads as sections were in process of being joined. One section thus collapsed, resulting in several fatalities. A contributor writes:
My father worked as a safety officer in the nearby glassworks and was one of the first on the scene. He worked there for several days finding survivors (and more often those who didn't)... The foreman made his way down the lift to tell those in charge that [the failed section] was going to fall, and he wanted his team off the bridge. He had just left the lift cage when the span groaned and fell. By some providence he was blown out from beneath it by the air blast. And watched as most of his crew rode the span to the ground.
Quebec Bridge, above Quebec City Canada. Southern cantilever span failed during construction and fell into St. Lawrence River, killing 75 in 1907. Faulty design and inadequate supervision listed as cause. Specifically, the span was lengthened without recalculation of stresses. A new center span fell while being hoisted and killed 13 in 1916. A defective casting caused the second failure. The Quebec Bridge is thought to be the source of Canada's Iron Ring ceremony.
USS Thresher submarine sinking. May have sunk as a result of weight distribution outside of design due to heavy sonar in nose. A water inlet rupture (due to poor joint design) led to flooding faster than ballast blowers could compensate, resulting in the loss of all 129 aboard. A tape recording of the sinking is available.
Molasses tank failure, Boston MA, January 1919. A 15-meter high tank burst unexpectedly, dropping two million gallons of molasses (in a 10-meter high wall, initially) into the streets, killing 21 and injuring 150.
Patriot missile radar system, used in Operation Desert Storm. A software roundoff error in the system's timekeeping grew monotonically with continuous operation, becoming noticeable after about eight hours. This was corrected, but not before a Scud missile killed 28 US Army soldiers.
Silver Bridge, Gallipolis OH, 1967. Failed as a result of lack of redundant support in design -- one support failed under heavy rush hour load (including trucks), and bridge subsequently collapsed.
Most of the memorable failures are of structures, but by no means all. All of them involve spectacular collapses or explosions and/or fatalities.
What lessons can be learned from engineering failures?
Blame notwithstanding, much can be learned from engineering failures. Most obviously, they inevitably lead to design improvements in the engineered system or structure. But side benefits may include safety precautions and enlightened management behavior, as well as research and experimentation into the causes of failures.
Most of the failures above involved a lack of checks in the design and implementation of whatever failed. Such checks would have cost money, or time, or both. So one might conclude that corners were cut in most cases, especially since in nearly all cases design flaws either were detected (e.g. Challenger) or could have been detected (e.g. Hyatt skywalk) before failure, if not before being put into service.
In many cases, the answer to the question "what caused the failure?" did not become clear until the engineers went looking for it. "If it ain't broke, don't fix it" could lead to a team of engineers trying to figure out why it did break, instead of theorizing reasons why it might.
Petroski, in Design Paradigms, has several interesting things to say about root causes of failures:
- Galileo first wrote on scale effects in "Dialogues Concerning Two New Sciences," a work on dynamics and strength of materials, in 1638. He showed, using beams, that pure geometric reasoning is insufficient in scaling up structures.
- William Fairbairn, in "Treatise on Iron Shipbuilding" (1865), showed that ships at sea can be supported alternately by one wave or by two, and thus can fail either in deck or hull with about equal likelihood.
- Improvements in building methods and materials led to a rediscovery of hull failure modes, as cracks would form in Liberty Ships due to welding-induced embrittlement of the hull, with those cracks propagating due to lack of rivets (which were present in previous designs) that would interrupt crack paths.
- Analysis of pencil leads showed that what initiates the failure (in this case axial tensile stress) may not be the same thing as what propagates it (e.g. tensile and shear).
Petroski reminds us that one failure will disprove any hypothesis. :-)
How can failures be avoided?
DESIGN
Build redundancy into design. How much redundancy is a function of the cost of a failure. (Although in many cases we look only at the cost of the redundancy itself.) An example would be FAA regulations, which allow for no single-point total failures. Redundancy is also a function of reliability needed from the engineered system and availability of spares, as seen in the above table.
Redundancy is not complete unless redundant systems are functionally isolated from one another. Otherwise, the failure of one could be partially responsible for the failure of another.
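To put rough numbers on why isolation matters, here is a minimal sketch. It assumes every unit has the same failure probability, and it borrows a simple "beta-factor" model in which a fraction beta of each unit's failures comes from a cause shared by all units. The function names, the model choice and the numbers are illustrative assumptions, not part of the FAQ.

```python
def independent_failure(p_unit: float, n_units: int) -> float:
    """Probability that all n redundant units fail, assuming the
    failures are fully independent of one another."""
    return p_unit ** n_units


def common_cause_failure(p_unit: float, n_units: int, beta: float) -> float:
    """Probability of losing all n units under a simple beta-factor
    model (an assumption made here for illustration).

    A fraction `beta` of each unit's failure probability is treated as
    a shared cause that takes out every unit at once; the remainder is
    treated as independent.
    """
    p_common = beta * p_unit            # one event, fails everything
    p_indep = (1.0 - beta) * p_unit     # per-unit independent part
    # System fails if the common cause strikes OR all units fail on their own.
    return 1.0 - (1.0 - p_common) * (1.0 - p_indep ** n_units)


if __name__ == "__main__":
    p = 0.01  # hypothetical per-unit failure probability
    print("isolated pair:    ", independent_failure(p, 2))
    print("10% shared cause: ", common_cause_failure(p, 2, beta=0.10))
```

With these made-up numbers, a 10% shared cause makes the redundant pair roughly ten times more likely to fail than a truly isolated pair, which is the point of the paragraph above.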
Make use of spares when the components in question
- are inexpensive relative to system price
- fail often
- fail independently of other components
- can be replaced easily (with no special skills)
- are easy to store
These ideas go together surprisingly often.
Be aware of details, such as (for structures) corners, connections and reinforcements. In those instances there are stress concentrations. A structure has (in general) a safety factor: that factor is only as good as the weakest part of the structure.
Standardize suppliers for devices where possible. This minimizes the number of interfaces to be dealt with by end users.
Watch out for problems of scale, as noted by Galileo. Scaling is not limited to geometry, either -- consider also how your design may change when moved from static/steady-state to dynamic/transient; consider environmental extremes as well.
There are analogous phenomena in other disciplines. For instance, in electrical or piping systems, interfaces have to be considered as prime candidates for failures. As seen in the case of Apollo 1, even the proximity of systems that might act at cross-purposes to one another must be considered.
OPERATION
If people are in the loop, conduct massive manned tests in varying environments. Even if not, testing is your last check. Such testing must consider as many aspects of expected use, environmental conditions and repeatability as possible.
If operator error is a possibility, training and retraining is a solution. Simplicity of interfaces is a necessity: according to Low ("What Made Apollo a Success"), only 100 wires connected the Apollo module to the Saturn booster.
Software-governed systems are among those particularly sensitive to intentional sabotage; but more importantly, steps must be taken to avoid bugs. Among things that can be done in this area are (again) redundancy in algorithms and the use of checksums or other sanity checks.
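As one small example of a checksum-style sanity check, the sketch below appends a CRC-32 to a message and refuses to accept the message if the check fails. It uses Python's standard `zlib.crc32`; the framing (4 trailing bytes, big-endian) and the function names `pack` and `unpack` are assumptions made for illustration only.

```python
import zlib


def pack(payload: bytes) -> bytes:
    """Return the payload followed by its 4-byte big-endian CRC-32."""
    crc = zlib.crc32(payload) & 0xFFFFFFFF
    return payload + crc.to_bytes(4, "big")


def unpack(frame: bytes) -> bytes:
    """Return the payload if its CRC-32 matches; raise ValueError if not."""
    if len(frame) < 4:
        raise ValueError("frame too short to contain a checksum")
    payload = frame[:-4]
    stored = int.from_bytes(frame[-4:], "big")
    if zlib.crc32(payload) & 0xFFFFFFFF != stored:
        raise ValueError("checksum mismatch: data corrupted")
    return payload


if __name__ == "__main__":
    frame = pack(b"setpoint=200")
    print(unpack(frame))                          # intact frame is accepted
    damaged = bytes([frame[0] ^ 0x01]) + frame[1:]  # flip one bit
    try:
        unpack(damaged)
    except ValueError as err:
        print("rejected:", err)
```

A checksum only catches corruption of data in storage or transit. It says nothing about whether the value was sensible to begin with, so range checks on inputs are still needed.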
If alarms are to be filtered for any reason, it must be established in advance that the alarms that are disabled are not relevant for the given mode of operation. End users should not disable alarms without checking with the manufacturer or those responsible for repair.
If systems are changed, manuals must change with them. If operating instructions are changed, manuals must reflect all changes. [This seems self-evident -- but how many times do we find that the documentation is an afterthought?]
MANAGEMENT
Exercise controls: inspection of materials, improvement of procedures. The controls are modified whenever
- the materials change
- new failure mechanisms are discovered
- new requirements are set
- new analysis/testing/detection tools are developed
"I did what I was told" is a poor excuse for following a poor course of action. The engineer must avoid (a) throwing work "over the wall" and (b) fear of management reprisal for warnings of potential failure.
Independent verification does not guarantee that errors will be detected. People can screw up in the same way if they look at the problem in the same way. (In this sense, "diversity" is good. Maybe essential.)
There is a tendency in all of us to want to know and do more. Florman calls it "the tasting of new fruit." We will tend to be enlarging or otherwise improving our engineered systems, looking for cost savings or enhanced performance. With this will come unforeseen results. So when implementing improvements, remember all of the above.
MAINTENANCE
On dealing with service of equipment (a common cause of accidents), remember the following:
- remove energy inputs (e.g. electrical, hydraulic, steam, etc.)
- release or contain energy stored in the system (e.g. capacitors, high-pressure hoses, springs, etc.)
- disable any control system, or any device governed by the control system, so as to minimize danger from external inputs to equipment states
A maintenance profile consists of the following:
- ease of maintenance
- maintenance cost v. replacement cost
- separation of components having different reliability
- probability of component failure
- self-test and remote-test capability
...where a self-test ends in some sort of alarm if a component fails or malfunctions (without user input), and a remote test can be carried out over a phone line.
MATERIALS
Use materials that can withstand the loads and environments associated with operation.
Build in conservatism in materials selection and implementation. A material is typically assigned a strength two standard deviations below its mean value (called 'specified minimum yield strength', for example).
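The "two standard deviations below the mean" rule of thumb is easy to work through. The sketch below applies it to a handful of made-up test results; the function name and the sample values are hypothetical, and real material specifications define their minimum strengths through their own procedures rather than this shortcut.

```python
import statistics


def conservative_strength(samples, k=2.0):
    """Mean of the measured strengths minus k sample standard deviations."""
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)  # sample (n - 1) standard deviation
    return mean - k * stdev


if __name__ == "__main__":
    # Hypothetical yield strengths from five test coupons, in MPa.
    coupons = [250, 260, 255, 245, 265]
    print(conservative_strength(coupons))
```

For these five values the mean is 255 MPa and the design value comes out near 239 MPa, so the scatter in the test data costs about 16 MPa of usable strength.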
PRODUCTION
Ensure that all equipment is operating correctly and that all personnel are qualified.
Require inspection and test methods that eliminate defective components and structures. Pay particular attention to welds.
Adhere strictly to applicable codes, such as the ASME Boiler and Pressure Vessel Code.
Can engineered systems be "safe" even in the presence of failures?
This subject comes up periodically in discussions among engineers, and almost constantly (for software-governed systems) among software engineers. The short answer for most engineered systems is YES. For software-governed systems, safety is most likely to be ensured by some combination of the following actions, depending on need:
- Provide independent development and support staff.
- Shield adequately against electromagnetic interference (EMI).
- Provide complete, timely and readable system diagnostics.
- Guard against accumulated round-off error.
- Provide a backup human-in-the-loop control option.
- Protect the system against power failure.
- Provide sufficient duty cycle to accommodate multiple tasks if they're all essential.
Many of these will protect your system from failure altogether, but they are essential when your system is safety-critical.
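To illustrate the round-off item in the list above: a clock that adds a chopped binary approximation of 0.1 second on every tick drifts further from true time the longer it runs. The sketch below uses exact fractions so the drift itself is computed without error. The 23-bit fraction width and the function names are assumptions chosen for illustration, not a description of any particular system.

```python
from fractions import Fraction


def truncated_tenth(frac_bits: int) -> Fraction:
    """0.1 chopped (not rounded) to `frac_bits` binary fractional digits."""
    scale = 1 << frac_bits
    return Fraction(scale // 10, scale)  # floor(0.1 * 2**bits) / 2**bits


def clock_drift(ticks: int, frac_bits: int) -> float:
    """Seconds lost after `ticks` ticks of nominally 0.1 s each."""
    true_time = Fraction(ticks, 10)
    accumulated = ticks * truncated_tenth(frac_bits)
    return float(true_time - accumulated)


def elapsed_seconds(ticks: int) -> Fraction:
    """The guard: keep an exact integer tick count and convert once."""
    return Fraction(ticks, 10)


if __name__ == "__main__":
    for hours in (1, 8, 100):
        ticks = hours * 3600 * 10
        print(hours, "h:", clock_drift(ticks, 23), "s of drift")
```

The drift grows in direct proportion to running time. Counting integer ticks and converting once, as in `elapsed_seconds`, avoids the accumulation entirely.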
For systems which can endanger life or property, there are other actions that can be taken, depending on system and need:
- Provide redundant fastening to prevent collapses.
- Provide containment of explosions or hazardous flows.
- Provide adequate protective shields.
- Provide sufficient alarms.
- Provide a buffer zone between failing system and nearby populated areas.
- Ensure that dangerous releases or shrapnel dissipate or otherwise reach a low-energy state quickly.
Insurance companies can provide risk assessments that determine the level of protection needed for a given system.
Accidents, since they often have multiple causes, can be more complex than failures. If you are able to remove or minimize the risk of all individual failures, you will then minimize the probability of an accident. Among the primary reasons for system safety to be judged in terms of risk is that the designers and engineers may not have control over conditions that can lead to an accident (e.g. operator alertness, weather, and luck) -- for this reason, risk assessment will attach some probability of failure to single causes, and determine a composite probability of an accident.
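A minimal sketch of how single-cause probabilities can be rolled up into a composite accident probability. It assumes the causes are statistically independent, which a real risk assessment would have to justify. The gate names, the scenario and the numbers are all hypothetical.

```python
from math import prod


def p_any(probabilities):
    """OR gate: at least one of several independent events occurs."""
    return 1.0 - prod(1.0 - p for p in probabilities)


def p_all(probabilities):
    """AND gate: every one of several independent events occurs."""
    return prod(probabilities)


if __name__ == "__main__":
    # Hypothetical scenario: an accident happens if a valve ruptures,
    # OR if the operator is inattentive AND the alarm is disabled.
    p_valve_rupture = 0.001
    p_operator_inattentive = 0.05
    p_alarm_disabled = 0.10

    p_accident = p_any([
        p_valve_rupture,
        p_all([p_operator_inattentive, p_alarm_disabled]),
    ])
    print(p_accident)
```

With these made-up inputs the composite comes out at about 0.006, and most of it comes from the operator-plus-alarm path rather than the hardware. That is the sort of thing this bookkeeping is meant to reveal.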
Oh GOD! I've had a failure! What do I DO?
First of all, get hold of yourself. Although the failures cited as examples in this FAQ were all spectacular, and led to loss of life and substantial expense, those are not your garden-variety failures. What most individual engineers will experience in the way of failures will be failures of replaceable components in systems that aren't safety-critical, or of functions that have little hazard, causing down time far more often than injury. As indicated in other sections of the FAQ, failure of more complex or safety-critical systems may be detectable prior to service if the right precautionary steps are followed; or may be containable if not detected in time, again if the right precautionary steps are followed.
If you are involved in development of a system whose failure causes injury or loss of life, however, then you may be the center of attraction in a lengthy failure investigation, if not worse. But you already know that, and that's why the thought of failure gives you the willies. But knowing that your first failure will probably not be like that should help a little.
Once you've gotten hold of yourself, the next thing to do is learn from the failure. You might learn a lot, but that's not where the learning ends. You have to convince your customers, and maybe the world, that
- You know what caused the failure.
- You know how to fix it. (Or better yet, you have fixed it already.)
- You know how to keep it from happening again.
As you can guess, the more complex the system the harder it is to find the lessons; the more spectacular the failure the harder it is to convince anyone you've learned them. The flip side is that determining the cause is the most difficult part. Once you have done that, fixes and prevention methods often follow.
In learning from the failure, you may be called upon to abandon preconceived notions either about how your system works, or about the environment it's working in. In Vaughan, The Challenger Launch Decision, we see that NASA and contractor engineers (for instance) were faced with O-rings that had been damaged in rocket test firings and were allowed to fly on the Shuttle anyway. Why was this? Because they had based their estimates of the probability of O-ring failure on three measures that could not account for what had in their minds been a worst-on-worst case scenario. Those three measures are
Experience base. This was limited to some two dozen Shuttle flights before the failure of 51-L, but as pertained to the O-rings, there was also a much larger number of test firings from which to gain data. None of the test firings completely failed redundant O-rings, and some were made at conditions (of thrust and O-ring alignment error) that exceeded what they could reasonably expect in service, especially in terms of ambient temperature.
Safety margins. Their safety margins were determined both through analysis (high-fidelity computer simulation) and test, and one source verified the other. When O-rings experienced erosion in testing, the cause was isolated and fixed prior to the next flight. Furthermore, there was always a redundant O-ring in service.
Self-limiting phenomena. The O-ring burn-through was a self-limiting phenomenon because hot gas impingement on the joint where the O-ring was placed would stop when pressure equalized in the joint.
Their three measures led to the ability to see risk as acceptable even in the presence of personal concern on the part of the engineers. (I could go on and on. Read the book. It's long, but absolutely fascinating.) Anyway, to those three measures you basically have to add
The worst-case environment for your system. In the case of the Challenger, just a couple of Shuttle flights earlier NASA had witnessed the coldest liftoff in Cape history. Well, in the case of 51-L it happened again. Designers of buildings have to choose design wind loads on the basis of weather history, and on how long they foresee the building remaining in service. (If the building is only to stay up 20 years, you don't design to the 100-year worst-case wind.) You have to do the same.
Some attention paid to observations. The NASA culture of the time seemed to indicate that individual observations didn't count as "hard data." Well, when they come from an expert's eyes, maybe they do. And the observations must count not only for system behavior, but for organizational feeling about the system as well.
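The wind-load remark above can be made concrete. Assume each year is independent and the "T-year" event has a 1/T chance of being exceeded in any one year (a common simplification, not something this FAQ states). The chance of meeting that event at least once in an n-year service life is then 1 - (1 - 1/T)^n. The function name below is made up for illustration.

```python
def exceedance_probability(return_period_years: float,
                           service_life_years: int) -> float:
    """Chance of at least one exceedance of the T-year event during the
    service life, assuming independent years with probability 1/T each."""
    p_per_year = 1.0 / return_period_years
    return 1.0 - (1.0 - p_per_year) ** service_life_years


if __name__ == "__main__":
    # The FAQ's example: a 20-year building versus the 100-year wind.
    print(exceedance_probability(100, 20))
    # Same wind, but a building meant to stand for 100 years.
    print(exceedance_probability(100, 100))
```

Under these assumptions the 20-year building has roughly an 18% chance of seeing the 100-year wind, while a 100-year building has about a 63% chance. That is the trade the designer weighs against the cost of designing for the bigger load.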
When you are involved in a failure investigation, you may have several options:
- Visual Examination of surface conditions
- Non-destructive Testing to locate subsurface cracking
- Fractographic Examination of the topography of surface features
- Metallographic Examination to locate imperfections in machining, forging, welding, casting, brazing, or fastening
- Mechanical Testing of hardness, heat treating, wear resistance, or tensile strength
- Chemical Analysis to verify that the component material is what was specified, or to find surface conditions in small localized areas
References and Resources
- "Structural Failures: Modes, Causes, Responsibilities." ASCE National Meeting on Structural Engineering, Cleveland OH, 1972.
- Casey, Set Phasers on Stun (Editor's Choice)
- Dickson, et al. Failure Analysis: Techniques and Applications, ASM International, Materials Park, OH, 1992.
- Esakul, et al. Handbook of Case Histories in Failure Analysis, ASM International, Materials Park, OH, 2 vols., 1992.
- Florman, The Existential Pleasures of Engineering (Editor's Choice)
- Hertzberg, Deformation and Fracture Mechanics of Engineering Materials
- Jones, Engineering Materials 3: Materials Failure Analysis -- Case Studies and Design Implications
- Kaminetzky, Design and Construction Failures
- Kepner, The New Rational Manager
- Kletz, What Went Wrong?
- Kletz, An Engineer's View of Human Error
- Kletz, Lessons from Disaster: How Organizations Have No Memory and Accidents Recur
- Levy and Salvadori, Why Buildings Fall Down
- Lorenzo, A Manager's Guide to Reducing Human Errors: Improving Human Performance in the Chemical Industry. Chemical Manufacturers Assn., 1990.
- Lovell and Kluger, Apollo 13 (Editor's Choice)
- Murray and Cox, Apollo: The Race to the Moon
- Nishida, Failure Analysis in Engineering Applications
- Peterson, Fatal Defect: Chasing Killer Computer Bugs (Editor's Choice)
- Petroski, To Engineer is Human (Editor's Choice)
- Petroski, Design Paradigms: Case Histories of Error and Judgment in Engineering (Editor's Choice)
- Rogers, et al. "Report to the President by the Presidential Commission on the Space Shuttle Challenger Accident." Washington DC, June 1986.
- Vaughan, The Challenger Launch Decision (Editor's Choice)
- Witherell, Engineering Failure Analysis
- Case study catalog (Carleton)
- Tacoma Narrows Bridge history (Washington State DOT)
- Tacoma Narrows movie with the failure actually explained
- comp.risks digest (where VL.IS. = volume #.issue #.)
- computer failures (Editor's Choice)
Authors
Robert Alheid
Eric W. Anderson
Sami Atallah
Tim Blanton
John Caufield
Joseph I. Chiu
Ted Cochran
Robert Currie
Keith Dykes
earl
Peter Floyd
Maura Gatensby
Ron Graham (Editor)
Brian Gross
George Gumas
Damien Holloway
Joseph L. Jones
Soren LaForce
David Levan
Nancy Leveson
Tim Nye
Kevin O'Connell
Jim Petroski
Timothy R. Reeves
Al Rosenfield
Richard H. Rustad
Russell W. Schmidt
Ian Smith
Tom Speer
John Stevens
John Tauxe
Andy Weilert
Eric Wiersma
Joseph Wilson
Christopher Wright
Ken Zagzebski

The Challenger failure, 1986








