
A Catch in Time

Systems Give Warning Before They Fail. The Trick Is to Listen, and Learn.

Author Notes

James R. Chiles is a lecturer on technology and safety. He is also the author of Inviting Disaster: Lessons From the Edge of Technology (HarperBusiness, 2001), the subject of a recent series on The History Channel. Contact him at chiles@invitingdisaster.com.

Mechanical Engineering 126(03), 36-38 (Mar 01, 2004) (3 pages) doi:10.1115/1.2004-MAR-3

This article reviews one principle that holds true across two centuries of human experience: technological disasters usually don't come like bolts from the blue. Instead, little malfunctions and errors link up beforehand, over weeks and months. Usually, these are early warning signs, called precursors, that offer time for those at the scene to stop the chain of events. Catching system fractures happens in several stages. Alert employees notice early problems. They decide that the potential problem is serious and that preventing a disaster will need management support. In organizations like Southwest Airlines and Naval Reactors, some employees have shown that safety, excellence, and production can all fit together. For the rest of us who are not quite there yet, good fracture awareness, among front line employees as well as top decision makers, is an important step in that direction.

One principle that holds true across two centuries of human experience on the machine frontier is that technological disasters usually don't come like bolts from the blue. Instead, little malfunctions and errors link up beforehand, over weeks and months. Usually, these are early warning signs, called precursors, that offer time for those at the scene to stop the chain of events.

Although it seems unbelievable to outsiders that anyone at the site would have ignored the precursors (knowing as we do afterward that something terrible really did happen), there had to be powerful reasons why nothing was done. Perhaps nobody wanted to take the career risk of raising the issue, or nobody even noticed the significance of the precursors among the daily background noise of routine problems.

One employee who saw his duty and did it was Jerry Gonsalves. In the spring of 1960, Gonsalves was working for the Cape Canaveral operations of the Glenn L. Martin Co. as a quality assurance supervisor. At the time, Martin was perfecting the Titan intercontinental ballistic missile for the Air Force.

When he came to work on the afternoon of March 31, 1960, Gonsalves found a pile of parts on his desk along with some missile blueprints. The parts had been removed in the course of oxygen tank modification work just finished on a Titan 1 missile that was ready for a test launch. His job was to match the leftover parts against the plans, then pass them along for further quality checks.

When Gonsalves counted the parts, they didn't quite match the plans. After an hour of double-checking, he decided one bolt was missing from the pile of leftovers. After finding out that none of the technicians had accidentally left the job with the bolt still in his pocket, he became convinced the stray bolt had rolled from sight down the curve of the oxygen tank, falling into an outlet at the bottom. This outlet led to the oxygen pump.

The possibility concerned him because, if the pump's metal impeller struck a bolt in the presence of liquid oxygen, the result would be a fireball that would certainly destroy the missile and probably the launch complex as well. And accidents were happening: a Titan had blown up on December 12 of the previous year, destroying the pad just after launch.

Gonsalves went to his supervisor, who suggested he fly to the Martin assembly plant in Denver right away and see whether the description of leftover parts might have been off by one bolt, which would explain the discrepancy. There wasn't much time because the Titan was on Launch Complex 15 at Cape Canaveral.

The next day, Gonsalves examined brand-new Titan missiles on the Martin assembly line in Colorado. Nothing explained the bolt's absence, so he sent a telegram from Denver saying his group needed to send someone into the empty LOX tank and hunt for the bolt. The most that managers were willing to do was have someone open an inspection hole at the top and peer into the tank, using binoculars and a light. Gonsalves protested, saying nobody would see the bolt that way if it had fallen into the outlet of the tank.

Gonsalves, back at the Cape on Friday, refused to put his name on the final papers needed for launch. An Air Force officer overheard the disagreement and walked up to ask what was going on. After hearing the details, the officer stopped the launch and ordered the tank drained and inspected. A manager suggested that Gonsalves not come to work the next day, a Saturday, when the Air Force would investigate the tank.

Gonsalves had no regrets: He had been hired to ensure quality and would not back down, even though he had two young children and a wife to support, and no immediate prospects of another job in the field of quality assurance engineering.

Gonsalves took a friend fishing the next day. He even caught a sailfish on the trip, the first time he'd landed one. Still better, he heard upon returning home that the Air Force had found the missing bolt in a pipe leading from the tank. The Air Force's senior officer on the project sent Gonsalves a commendation letter the following October.

Jerry Gonsalves knew from a single, subtle precursor that his corner of the Air Force test program was facing a fire and explosion, the kind of thing now called an imminent catastrophic event.

For an example that helps explain the role of precursors in the huge variety of disasters (and in the much greater number of close calls, mostly undocumented), consider how a piece of metal breaks over time. Under stress, cracks begin to grow out of tiny manufacturing flaws, corrosion, and damage during use. Then, at a critical point, a metal fracture spreads like a gunshot and the piece fails completely.
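To make that picture concrete, here is a minimal sketch, in Python, of fatigue crack growth modeled with the standard Paris-law relation: the crack creeps along almost imperceptibly for a long stretch of load cycles, then races to failure once the stress intensity reaches the material's fracture toughness. The material constants and stress values below are purely illustrative assumptions, not data from the article.

# A minimal sketch of how a fatigue crack grows slowly, then fails suddenly.
# Paris-law crack growth is a standard fracture-mechanics model; the constants
# below (C, m, K_Ic, stress range) are illustrative values, not article data.

import math

C = 1e-11            # Paris-law coefficient, m/cycle per (MPa*sqrt(m))^m (illustrative)
m = 3.0              # Paris-law exponent (illustrative)
Y = 1.0              # geometry factor (assumed)
delta_sigma = 100.0  # cyclic stress range, MPa (illustrative)
K_Ic = 60.0          # fracture toughness, MPa*sqrt(m) (illustrative)

a = 0.001            # initial flaw size: a 1 mm crack from a manufacturing defect, in meters
cycles = 0
step = 1000          # integrate crack growth in blocks of 1,000 load cycles

while True:
    K_max = Y * delta_sigma * math.sqrt(math.pi * a)   # stress intensity at peak load
    if K_max >= K_Ic:                                  # critical point: fast fracture
        print(f"Fast fracture after ~{cycles:,} cycles at crack size {a * 1000:.1f} mm")
        break
    da_dN = C * (K_max ** m)                           # Paris law: growth per cycle
    a += da_dN * step
    cycles += step

Run with different flaw sizes or stress ranges, the same pattern holds: a long, quiet period of slow growth, then sudden failure. The quiet period is where the precursors live.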

As with metal, weak points exist in virtually all systems. But instead of slag inclusions, nicks, and stress-corrosion cracks that we find in metal, weak points in a system are made up of human errors and machine malfunctions. No system is entirely free of weak points, so a good system is one in which people catch these incipient "system fractures" early, before a chain of them can link up to generate a catastrophe.

Southwest Airlines, which has never had a fatal crash in more than three decades and thousands of short-hop, time-critical flights, is one such "fracture-aware" company, and it is a profitable one.

Companies and agencies that deal in high-power, complex machines should know that technological disasters have a long-lasting cost. They should learn from the best organizations about how to catch system fractures early.

Besides the obvious toll in deaths, damage, insurance cost increases, and months of business interruption, in some cases so much public mistrust follows that an entire branch of industry is cut off. Crashes of two Comet airliners in 1954 halted British jet airliner manufacturing for so long that American airplane makers took over the market. The scale of devastation can be enormous: a series of dam failures in China on the night of Aug. 7, 1975, killed more than 26,000 people. Incident costs ran well over $4 billion at the Three Mile Island Unit 2 partial meltdown.

The Titan 1 missile in the air (above) and on the pad (previous page): A missing bolt led a quality assurance engineer to correct a possibly disastrous problem in the liquid oxygen tank.


Catching system fractures happens in several stages. Alert employees notice early problems. They decide that the potential problem is serious and that preventing a disaster will need management support.

They put together a message to management arguing for attention to the problem. Before this warning memo is fired off, however, they should consider running it past colleagues who can pose the kind of tough questions sure to come from managers who see it as an imaginary problem. The military calls this probing, questioning process a "murder board."

Ideally, management reaches some kind of decision, rather than deferring and delaying. The decision might be to authorize a remedial action to fix the system in question, but managers could also conclude that "it ain't broke." A decision not to fix something can be as legitimate as a decision to take action, if supported by facts and something like a failure mode effects analysis. Not all signs of trouble need extraordinary action.
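As one hedged illustration of what "supported by facts" can look like, the sketch below uses the common FMEA convention of a risk priority number: severity times occurrence times detection, each scored from 1 to 10. The failure modes and scores are hypothetical examples, not data from the article or any real program.

# A minimal FMEA-style ranking sketch. The risk priority number (RPN) follows
# the common convention severity x occurrence x detection, each scored 1-10.
# The failure modes and scores below are hypothetical, not from the article.

failure_modes = [
    # (description, severity, occurrence, detection)
    ("Foreign object left in LOX tank", 10, 3, 7),
    ("Valve actuator drifts out of calibration", 6, 5, 4),
    ("Paperwork mismatch on removed parts", 4, 6, 2),
]

def rpn(severity, occurrence, detection):
    """Risk priority number: higher means the failure mode deserves attention first."""
    return severity * occurrence * detection

# Rank the candidate problems so the "fix" versus "it ain't broke" call is explicit.
ranked = sorted(failure_modes, key=lambda fm: rpn(*fm[1:]), reverse=True)

for description, s, o, d in ranked:
    print(f"RPN {rpn(s, o, d):4d}  {description}")

A low-scoring item can legitimately be left alone and revisited later; the point is that the "it ain't broke" decision gets written down along with the reasoning behind it.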

Finally, any lessons should go out to others who need to know. This step is often neglected by organizations that treat each mishap as an isolated case, never to be repeated. Good "lesson harvesting" might take the form of a bulletin to similar plants, or a change in manufacturing, or improved training.

One of the few fields with good tracking of incipient system fractures is aviation, with the Aviation Safety Reporting System. I drew on the most instructive close-call accounts I could locate when writing my book Inviting Disaster: Lessons From the Edge of Technology, but am always looking for more.

Leaders can do much to put front line employees in a "fracture aware" frame of mind. Are workers encouraged to pass along problems they notice, such as bolt holes not lining up on a steel-frame construction project? Do workers see evidence of solid followup by management about their concerns? Encouragement for thorough problem reporting should extend to the employees' own errors. If workers are punished for fessing up to their own mistakes, they are pretty likely to cover them up, possibly with grave consequences later.

Technological crises are inevitable in the course of all major projects, so knowing how to work through them is a valued skill. Good decision making is what we're looking for, not perfect decisions.

While he was head of the vaunted Naval Reactors organization, Admiral Hyman Rickover treated each crisis during manufacturing or testing as an opportunity, a chance to ferret out problems before a nuclear submarine went off to sea. He insisted on riding aboard each new submarine in its sea trials, so if something went wrong during pressure tests, he would be there to share the consequences of poor work or bad decisions.

While few of us enjoy troubleshooting incipient system fractures, such work is certain to be at the core of what the most valued project leaders do in our automated future. Computers are sure to take over many routine operations that people perform now, but troubleshooting is a uniquely human task and a good place for a career. "No-brainer" jobs don't have much of a future given the pace of automation, so snag one of the "brainer" jobs.

Organizations like Southwest Airlines and Naval Reactors, and some employees like Jerry Gonsalves, have shown that safety, excellence, and production can all fit together. For the rest of us who aren't quite there yet, good fracture awareness, among front line employees as well as top decision makers, is an important step in that direction.

Copyright © 2004 by ASME