itNews

This is interesting

October 1, 2004 7:51:15.938

Remember that problem in the western US with air traffic about a month ago? TechWorld has a story on it, and the problem seems to be technology and process related:

The failure was ultimately down to a combination of human error and a design glitch in the Windows servers brought in over the past three years to replace the radio system's original Unix servers, according to the FAA.

The servers are timed to shut down after 49.7 days of use in order to prevent a data overload, a union official told the LA Times. To avoid this automatic shutdown, technicians are required to restart the system manually every 30 days. An improperly trained employee failed to reset the system, leading it to shut down without warning, the official said. Backup systems failed because of a software failure, according to a report in The New York Times.

I'm with Strongly Typed on one thing here - the above is just screaming for more information - what the heck is a "data overload"? Even so, I think we can see the outlines of the problem - using MS Windows for a critical service.

In an office environment, reboots may be a pain, but they can be done relatively easily (if a file server is unavailable for a few minutes at 2 am, few workers are going to care) - and if someone forgets to reboot said server and it crashes, it's likely to be more of an irritation than a life threatening problem. Not so in an air traffic control situation. What was the fallout from that?

The radio system shutdown, which lasted more than three hours, left 800 planes in the air without contact to air traffic control, and led to at least five cases where planes came too close to one another, according to comments by the Federal Aviation Administration reported in the LA Times and The New York Times. Air traffic controllers were reduced to using personal mobile phones to pass on warnings to controllers at other facilities, and watched close calls without being able to alert pilots, according to the LA Times report.

That's a pretty high level of risk to accept from a system that is - according to the story - known to fail catastrophically on a known interval. Now, it's not like the server running this blog is a "critical" system - but I will point out that typing "uptime" at the console prompt yields an answer of 313 1/2 days (which dates back to the last power outage, before IT installed a generator). You think maybe the FAA should have insisted on a system that didn't need the addition of a "reboot on a regular schedule" process? Here's the money quote:

Soon after installation, however, the FAA discovered that the system design could lead to a radio system shutdown, and put the maintenance procedure into place as a workaround, the LA Times said. The FAA reportedly said it has been working on a permanent fix but has only eliminated the problem in Seattle. The FAA is now planning to institute a second workaround - an alert that will warn controllers well before the software shuts down.

The shutdown is intended to keep the system from becoming overloaded with data and potentially giving controllers wrong information about flights, according to a software analyst cited by the LA Times.

Microsoft told Techworld it was aware of the reports but was not immediately able to comment.

I think I'd say "no comment" if I were in their shoes as well...

Update: I got a link to this MS article in the comments, pointing out that Win 95/98 systems may hang after 49.7 days (which happens to be the time interval given in the air traffic story). So.... are they really running an air traffic control system on 95/98? Seems too coincidental to me.

Comments

[Adam Vandenberg] October 1, 2004 10:03:54.000

Who knows, but there's this:

http://support.microsoft.com/default.aspx?scid=http://support.microsoft.com:80/support/kb/articles/q216/6/41.asp&NoWebContent=1
(They're running Windows 95 or 98 instead of at least NT 4.0? What the heck?)

No, they're not that stupid

[Shane King] October 1, 2004 20:36:24.000

They're not stupid enough to run a system on Windows 95. The problem is in some of the application code, I'd imagine - it would be storing a timer value in a 32-bit unsigned int. Which is fine, except that when the timer counts milliseconds, it wraps around after ~49.7 days (2^32 ms). Some application programmer has managed to replicate the bug from Windows 95, hence providing the airlines with a case of back to the future.
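
The 49.7-day figure can be checked with a little arithmetic. What follows is a minimal sketch, assuming an unsigned 32-bit counter; the function name days_until_wrap is hypothetical and nothing here is taken from the FAA system itself. It suggests the counter is in 1 ms units, since a counter that advanced once every 10 ms would only wrap after roughly 497 days.

```python
# Illustrative arithmetic only; not taken from any real ATC or Windows code.
MS_PER_DAY = 24 * 60 * 60 * 1000  # 86,400,000 milliseconds in one day


def days_until_wrap(bits: int, tick_ms: int) -> float:
    """Days before an unsigned counter of `bits` bits wraps back to zero,
    assuming it advances by one every `tick_ms` milliseconds."""
    return (2 ** bits) * tick_ms / MS_PER_DAY


one_ms = days_until_wrap(32, 1)    # counter holds milliseconds
ten_ms = days_until_wrap(32, 10)   # counter holds 10 ms ticks

print(f"32-bit counter, 1 ms units:  {one_ms:.1f} days")
print(f"32-bit counter, 10 ms units: {ten_ms:.1f} days")
```

The first line of output matches the interval quoted in the story; the second does not.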

[] October 2, 2004 10:45:11.000

Most likely, a custom app running on the server was not using the deprecated GetTickCount function properly....

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/sysinfo/base/gettickcount.asp

The elapsed time is stored as a DWORD value. Therefore, the time will wrap around to zero if the system is run continuously for 49.7 days.
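
A common way to live with a wrapping tick counter is to never compare raw readings, and instead compute the elapsed time by unsigned subtraction. The sketch below is an assumption about what "using it properly" could look like, not the actual fix applied to any system; the names MASK32, elapsed_ms and naive_elapsed_ms are hypothetical. Python integers do not overflow, so the mask stands in for 32-bit DWORD arithmetic.

```python
MASK32 = 0xFFFFFFFF  # emulates 32-bit unsigned (DWORD) arithmetic


def elapsed_ms(start_tick: int, now_tick: int) -> int:
    """Milliseconds between two 32-bit tick readings.

    Correct across a single wraparound, provided the real interval
    is shorter than 2**32 ms (about 49.7 days)."""
    return (now_tick - start_tick) & MASK32


def naive_elapsed_ms(start_tick: int, now_tick: int) -> int:
    """Plain subtraction: goes wrong once the counter has wrapped."""
    return now_tick - start_tick


# Hypothetical readings: one taken 256 ms before the wrap, one 256 ms after.
start = 0xFFFFFF00
now = 0x00000100

print(elapsed_ms(start, now))        # wrap-safe result
print(naive_elapsed_ms(start, now))  # large negative number
```

With the readings above, the masked subtraction reports 512 ms, while the naive version reports a negative interval, which is the kind of result that can make a timeout never fire or fire immediately.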

[] October 3, 2004 19:13:51.276

I'm not sure how you can draw the conclusion that a programmer error leads to the 'outlines of the problem - using MS Windows for a critical service'.

As the SDK documentation suggests, SystemTime should be used as an alternative.

Also, I'd debate whether anyone ever considered that Windows 95 or Windows 98, as client operating systems, should be used to run critical services.

Truth to be told...

[Jonas Galvez] October 4, 2004 5:48:32.477

...it's been more than six months since the last time I shut down my Win2k box. It's rock stable :) But, yeah, WTF, running an ATC system on Win95/98. I'm usually OK with flying but a new fear is born.
