While functioning software scales productivity, malfunctioning software scales chaos and havoc. Naturally, some bugs cause worse problems than others. Here are seven of the worst and most spectacular software bugs from the decades of software development, in no particular order. Use these cautionary tales as motivation to test properly!
St Mary’s Mercy Hospital in Michigan, USA wrongfully informed the authorities and Social Security that 8,500 of its patients had passed away between October 25 and December 11 of 2002. A spokeswoman for the hospital explained that an event code in the patient-management software had been wrongly mapped: code 20 for “expired” was used instead of 01 for “discharged”. Needless to say, a plethora of legal, billing and insurance issues followed the incident.
The state of California had been asked in 2011 to reduce its prison population by 33,000, with preference given to non-violent offenders. Again, a mapping error occurred, reversing the preference criteria and instead giving non-revocable parole to approximately 450 violent inmates, exempting them from having to report to parole officers after their release.
The market-making and trading firm Knight Capital Group was using a trading algorithm that, during a single 30-minute period, inexplicably defied sound economic strategy by buying high and selling low. The bug cost the company $440 million, and its stock dropped 62 percent in a single day. The company declined to describe the issue in detail, referring to it only as a “large bug, infecting its market-making software”.
NASA launched the Mars Climate Orbiter at the end of 1998, but due to an error in the ground-based software, the orbiter was lost 286 days into its mission. Its trajectory had been calculated incorrectly, in large part because different programming teams used different units of measurement. As a result, the modeled thrust was off by a factor of almost 4.5, putting the spacecraft on the wrong entry point into the Martian atmosphere, where the $327 million orbiter disintegrated.
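The mismatch is easy to reproduce. The sketch below is an illustration of the general failure mode, not NASA's actual code: one component reports impulse in pound-force seconds while its consumer assumes newton-seconds, and the missing conversion is exactly the factor of roughly 4.45 mentioned above. The function names and the impulse value are hypothetical.

```python
# Illustrative sketch of a unit-mismatch bug (hypothetical names/values,
# not the actual Mars Climate Orbiter ground software).
LBF_S_TO_N_S = 4.44822  # 1 pound-force second = 4.44822 newton-seconds

def thruster_impulse_buggy(impulse_lbf_s: float) -> float:
    # Bug: returns the raw imperial value; the caller assumes N*s.
    return impulse_lbf_s

def thruster_impulse_fixed(impulse_lbf_s: float) -> float:
    # Fix: convert at the interface boundary.
    return impulse_lbf_s * LBF_S_TO_N_S

impulse = 100.0  # a hypothetical maneuver, in lbf*s
buggy = thruster_impulse_buggy(impulse)
fixed = thruster_impulse_fixed(impulse)
print(f"trajectory model is off by a factor of {fixed / buggy:.2f}")
```

A common defense against this class of bug is to carry units in the type system or variable names (`impulse_n_s`, not `impulse`), so a missing conversion fails review or compilation rather than a mission.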
During the Gulf War and Operation Desert Storm in 1991, the Patriot Missile System failed to track and intercept a Scud missile aimed at an American base. The software had a timing issue that caused a clock error to grow continuously until system reboot, with noticeable loss of detection accuracy after approximately 8 hours. On the date of the incident, the system had been operating continuously for more than 100 hours, resulting in such a large delay that the software was actually looking for the missile in the wrong place. The Scud missile hit the American barracks, killing 28 and injuring over 100.
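The arithmetic behind the drift can be sketched in a few lines. This follows the commonly cited analysis of the incident (the numbers are the widely quoted ones, not taken from the actual Patriot source code): 0.1 seconds has no finite binary representation, so chopping it to fit the system's 24-bit register introduced a tiny error on every clock tick, which accumulated over 100 hours of uptime.

```python
import math

# Commonly cited analysis of the Patriot clock drift (illustrative,
# not derived from the actual weapon-system source code).
TICKS_PER_SECOND = 10    # the clock counted tenths of a second
UPTIME_HOURS = 100

# 0.1 has an infinite binary expansion; truncating it after 23
# fractional bits (as the 24-bit register is usually described as
# doing) loses a small amount on every tick:
stored = math.floor(0.1 * 2**23) / 2**23
error_per_tick = 0.1 - stored            # about 9.5e-8 seconds

ticks = UPTIME_HOURS * 3600 * TICKS_PER_SECOND
drift = ticks * error_per_tick           # about 0.34 s after 100 hours

# A Scud travels at roughly 1,676 m/s, so the tracking window ends
# up offset by several hundred meters:
offset_m = drift * 1676
print(f"clock drift: {drift:.4f} s, tracking offset: {offset_m:.0f} m")
```

A third of a second sounds negligible until it is multiplied by the speed of the target, which is why accumulating absolute time from a truncated constant, rather than counting exact integer ticks, was so dangerous here.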
During the mid-1980s, a medical radiation therapy device called the Therac-25 was used in hospitals. The machine could operate in two modes: a low-power electron beam, or X-ray mode, which was much more powerful and only supposed to function with a metal target placed between the device and the patient. The previous model, the Therac-20, had an electromechanical interlock to ensure that this metal target was in place, but for the Therac-25 it was replaced with a software lock. Due to a particular kind of bug called a “race condition”, a device operator typing quickly could bypass the software check and mistakenly administer lethal radiation doses to the patient. The issue resulted in at least five deaths.
US long-distance callers using AT&T on January 15th, 1990 found that no calls were going through. The long-distance switch network spent nine hours rebooting in an endless cycle, at first leading the company to believe it was being hacked. The issue turned out to be software-related: a timing flaw led to cascading errors.
Any malfunctioning switch was designed to reboot, which took four to six seconds; during this time its calls were rerouted to other switches. The rebooted switch would then signal that it was back online, so that it could begin routing calls again.
The issue occurred when multiple messages arrived during the reboot period, a consequence of AT&T having tweaked the code to speed up the process. The switch interpreted this as a sign of faulty hardware and, as a safety measure, rebooted itself as well. Naturally, while that replacement switch was down, it too received the same conflicting signals. Cascading messages between switches now made them reset each other indefinitely.
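The feedback loop described above can be captured in a toy simulation. This is my own drastic simplification, not AT&T's actual 4ESS switch code: each recovered switch announces itself with two rapid messages, the second of which arrives while the peer is still processing the first, and the buggy logic treats that as a hardware fault and reboots the peer, restarting the cycle.

```python
# Toy model of the mutual-reset feedback loop (an illustrative
# simplification, not AT&T's actual switching software).
def simulate(rounds: int) -> dict:
    reboots = {"A": 0, "B": 0}
    recovering = "A"                 # switch A reboots first
    for _ in range(rounds):
        peer = "B" if recovering == "A" else "A"
        # The recovered switch sends two back-to-back status messages;
        # the second lands while the peer is handling the first, which
        # the buggy code reads as a hardware fault -> peer reboots.
        reboots[peer] += 1
        recovering = peer            # and the cycle repeats
    return reboots

print(simulate(10))                  # neither switch ever stabilizes
```

The simulation never reaches a fixed point on its own, which mirrors why the real network needed an external intervention (reducing message load) rather than simply waiting for the switches to settle.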
The problem was eventually solved by AT&T reducing the message load on the network, allowing all 114 switches to recover. The incident cost an estimated $60 million in lost revenue.