FAQ on Failures, Part Two: Mars Climate Orbiter Failure
Ron Graham
Reference: Oberg, J. "Why the Mars Probe Went Off Course." IEEE Spectrum, December 1999.
The Mars Climate Orbiter failure was caused by much more than a simple unit-conversion error. Some of the same patterns of organizational behavior seen leading up to the loss of the MCO were also seen in the events prior to the 1986 Space Shuttle Challenger accident.
The MCO's attitude control system included a momentum wheel, which spins like a top and is mounted along the spacecraft pitch axis. When a torque is applied to the momentum wheel, the wheel's spin accelerates -- and momentum is exchanged with the spacecraft, which accelerates about the pitch axis in the opposite direction. Pitch rate and attitude errors are fed back to govern momentum wheel torques, and the resulting momentum exchange compensates for these errors.
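The momentum exchange described above follows from conservation of angular momentum about the pitch axis. Here is a minimal sketch, assuming no external torque acts during the exchange. The function name and all inertia values are hypothetical and chosen only for illustration; they are not MCO data.

```python
def spacecraft_rate_change(wheel_inertia, wheel_rate_change, spacecraft_inertia):
    """Pitch-rate change of the spacecraft body caused by a wheel spin-rate change.

    With no external torque, angular momentum about the pitch axis is conserved:
        wheel_inertia * wheel_rate_change + spacecraft_inertia * body_rate_change = 0
    """
    return -wheel_inertia * wheel_rate_change / spacecraft_inertia


# Hypothetical values, for illustration only (not MCO data).
I_WHEEL = 0.05   # wheel moment of inertia, kg*m^2
I_BODY = 500.0   # spacecraft pitch moment of inertia, kg*m^2

# Speeding the wheel up by 100 rad/s turns the body the opposite way.
body_change = spacecraft_rate_change(I_WHEEL, 100.0, I_BODY)
print(f"body pitch-rate change: {body_change:.4f} rad/s")
```

The negative sign is the whole point: the wheel and the spacecraft body always move in opposite senses, so a small, heavy-spinning wheel can trim the pitch rate of a much larger body.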
The spacecraft, as is usually the case, was not symmetric. Solar arrays on one side of the pitch axis were pushed by solar pressure disturbances (periodic and on the order of 10^-6 to 10^-4 lbf), causing acceleration about the pitch axis. Though the momentum wheel can compensate for these disturbances, its spin rate is limited, so it cannot absorb them indefinitely. From time to time, a "momentum dump" is performed -- the wheel is decelerated and attitude control thrusters take over pitch control.
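To see why dumps have to recur, here is a rough sketch that estimates how long a wheel can absorb a disturbance before it reaches its spin-rate limit. It assumes a constant net disturbance torque, which is a simplification, since the real disturbance is periodic. The wheel inertia, spin limit, and moment arm are hypothetical; only the force magnitude is taken from the range quoted above.

```python
LBF_TO_N = 4.4482216152605  # newtons per pound-force (exact by definition)


def time_to_saturation(wheel_inertia, max_wheel_rate, force_lbf, moment_arm_m):
    """Seconds until the wheel hits its spin-rate limit under a constant torque.

    wheel_inertia   -- kg*m^2
    max_wheel_rate  -- rad/s
    force_lbf       -- disturbance force in pounds-force
    moment_arm_m    -- distance from the pitch axis to the line of force, metres
    """
    capacity = wheel_inertia * max_wheel_rate       # storable momentum, N*m*s
    torque = force_lbf * LBF_TO_N * moment_arm_m    # disturbance torque, N*m
    return capacity / torque


# Hypothetical wheel and geometry; force from the upper end of the quoted range.
seconds = time_to_saturation(0.05, 600.0, 1e-4, 2.0)
print(f"wheel saturates after roughly {seconds / 3600:.1f} hours")
```

Note that the force is converted from lbf to newtons explicitly before it is combined with SI quantities.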
Thrusters are controlled via pulse-width modulation. Since they fire with constant force, the control system instead varies the duration of each firing. Since the thrusters are also asymmetric with respect to the pitch axis, a firing will over-compensate for an attitude error if the pulse width is not correctly calculated.
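As a sketch of the pulse-width idea: with a fixed thrust, the only free variable is how long the thruster fires, so the duration is the required momentum change divided by the torque the thruster produces. The function and its parameters are hypothetical illustrations, not the MCO flight software.

```python
def pulse_width(momentum_change, thrust, moment_arm):
    """Firing duration (s) needed to deliver a given angular momentum change.

    momentum_change -- required change about the pitch axis, N*m*s
    thrust          -- constant thruster force, N
    moment_arm      -- distance from the pitch axis to the thrust line, m
    """
    torque = thrust * moment_arm  # N*m, constant while the thruster fires
    if torque <= 0:
        raise ValueError("thrust and moment arm must be positive")
    return momentum_change / torque


# Hypothetical numbers: remove 3 N*m*s with a 4 N thruster on a 1.5 m arm.
print(pulse_width(3.0, 4.0, 1.5))  # 0.5
```

Every input has to be in consistent units. If the thrust or the momentum figure is off by a constant factor, the pulse width is off by the same factor, and the firing over- or under-compensates.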
This is where the unit-conversion problem came in: since the wrong units were used to represent disturbance forces, thrusters were fired for about four times the duration needed during a dump. Since several dumps were performed, there was some opportunity to catch an error. Here's why the error, though caught, went uncorrected until it was too late:
- Though the mission control staff wasn't sure the attitude control system was working properly, they weren't sure how unsure they were, if you catch my meaning. (Oberg calls this "inadequate robustness.")
- Though there were computer codes available for math modeling of attitude control and trajectory planning, the staff wasn't sure what excessive thruster burns would really do. The flight software was mature, and Oberg suggests that perhaps nobody involved in developing the software was around to watch this flight. If so, nobody on duty fully understood the software.
- There was a failure to heed warnings. A previous Mars probe, Mars Observer, had been lost as recently as 1993, and the Mars environment was known to be harsh, so high precision was needed. Management must share some criticism here -- if there's one thing Challenger should have taught us, it's that we have to "prove it's RIGHT," not "prove it's WRONG." Instead, in this case, management would take no action without conviction that something was wrong, which came too late.
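Returning to the unit mix-up itself, the sketch below shows how a factor of roughly four can arise when a force expressed in pounds-force is used as if it were in newtons. It is an illustration under assumed numbers, not a reconstruction of the actual MCO software; the thrust and impulse values are hypothetical.

```python
LBF_TO_N = 4.4482216152605  # newtons per pound-force (exact by definition)


def burn_duration(impulse_n_s, thrust_n):
    """Firing time (s) to deliver an impulse (N*s) at constant thrust (N)."""
    return impulse_n_s / thrust_n


# Hypothetical values, for illustration only.
thrust_lbf = 5.0   # thruster force as recorded, in pounds-force
impulse = 10.0     # impulse to deliver, in newton-seconds

correct = burn_duration(impulse, thrust_lbf * LBF_TO_N)  # converted to newtons
wrong = burn_duration(impulse, thrust_lbf)               # lbf value passed as newtons

ratio = wrong / correct
print(round(ratio, 2))  # 4.45
```

The error ratio equals the conversion factor regardless of the particular thrust or impulse chosen, which is why a persistent discrepancy of this size is a recognizable signature of a pounds-force versus newtons mix-up.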
