Staggering Failure of the Software Sort

   By chris on April 30th 2008 in /dev/random

Evidently California thought it a good idea to run the radio system behind its air traffic control on Windows servers. Back in 2004 that system failed, though not quite catastrophically:

The radio system shutdown, which lasted more than three hours, left 800 planes in the air without contact to air traffic control, and led to at least five cases where planes came too close to one another

Sometimes systems fail, particularly complex systems. What’s truly shocking to me about this story is this statement:

The failure was ultimately down to a combination of human error and a design glitch in the Windows servers brought in over the past three years to replace the radio system’s original Unix servers, according to the FAA.

The servers are timed to shut down after 49.7 days of use in order to prevent a data overload, a union official told the LA Times. To avoid this automatic shutdown, technicians are required to restart the system manually every 30 days. An improperly trained employee failed to reset the system, leading it to shut down without warning, the official said. Backup systems failed because of a software failure….

To be sure, there was human error as a component of this failure, but it doesn’t sound to me like it was with the “improperly trained employee” (also known as “the scapegoat”). Looks like it lies with whoever designed and developed this system, and with the fact that such an incredibly important, mission-critical system needs to be manually rebooted. Manually rebooted because the auto-reboot doesn’t work properly. An auto-reboot that’s needed because the system itself doesn’t work properly. I say to you: wtf? Seriously?

Incidentally, the magic number is 49.7 days (aka 2^32 milliseconds, the point at which a 32-bit millisecond counter rolls over) because of a design flaw in Windows:

After 49.7 days of continuous operation, your Windows-based computer may stop responding (hang).
This problem can occur because of a timing algorithm in the Vtdapi.vxd file.
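
The 49.7-day figure is easy to check by hand, and the failure mode is easy to reproduce. What follows is only a rough sketch in Python of how a 32-bit millisecond counter wraps around; it is not the actual Windows timing code, and the function names are made up for illustration.

```python
# Illustrative sketch only: models a 32-bit millisecond tick counter.
# This is NOT the real Vtdapi.vxd logic; all names here are hypothetical.
TICK_MODULUS = 2 ** 32             # a 32-bit counter holds 0 .. 2**32 - 1
MS_PER_DAY = 24 * 60 * 60 * 1000   # milliseconds in one day


def days_until_wrap():
    """Days of uptime before a 32-bit millisecond counter rolls over."""
    return TICK_MODULUS / MS_PER_DAY


def tick_count(uptime_ms):
    """Hypothetical counter: the true uptime truncated to 32 bits."""
    return uptime_ms % TICK_MODULUS


def naive_elapsed(start_tick, now_tick):
    """Plain subtraction; goes negative once the counter has wrapped."""
    return now_tick - start_tick


def wrap_safe_elapsed(start_tick, now_tick):
    """Modular subtraction; correct for intervals shorter than one wrap."""
    return (now_tick - start_tick) % TICK_MODULUS


if __name__ == "__main__":
    print(f"counter wraps after {days_until_wrap():.1f} days")
    start = tick_count(TICK_MODULUS - 1000)  # one second before rollover
    now = tick_count(TICK_MODULUS + 500)     # half a second after rollover
    print("naive elapsed ms:    ", naive_elapsed(start, now))
    print("wrap-safe elapsed ms:", wrap_safe_elapsed(start, now))
```

Any code that compares tick values with plain subtraction sees time run backwards at the rollover, which is the kind of thing that can hang a timing loop. Doing the subtraction modulo 2^32 is one common way around it, assuming the interval being measured is shorter than a full wrap.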

States the article:

The FAA is now planning to institute a second workaround - an alert that will warn controllers well before the software shuts down.

An alert that’s needed because the manual reboot that’s needed for the auto-reboot that’s needed for the system that doesn’t work, doesn’t work.

To steal Uncov’s tagline: “epic fail.”

I wonder if they’ve upgraded since? Perhaps to a monkey that now watches for the alarm that….
