Killed by a Machine: The Therac-25 (hackaday.com)
89 points by paraknight on Aug 1, 2016 | hide | past | favorite | 40 comments


My uncle was a radiation oncologist who worked with machines like this one. When I was a young child in the early 90s, he took me to the hospital where he worked to watch him and their staff physicist calibrate a linear accelerator - an experience I remember vividly. A very long checklist culminated in them irradiating a plate of acrylic, which I got to keep. The plate was about 20cm x 30cm x 4cm and aligned such that the beam would strike the 4cm face and travel down the 20cm side, fanning out in the process. The result looked like a thinner version of this: https://en.wikipedia.org/wiki/Lichtenberg_figure#/media/File...

I asked him how they could possibly use such machines on humans, given what I had just watched it do to a 1kg plate of acrylic. He told me that they hit the plate with way more energy than they ever would a human. That prompted my followup question: "uncle, what happens if you accidentally hit the wrong button?" He told me that accidents like that used to happen, but that the machines they used had special computers to keep the patient safe even if he made a mistake. "But uncle, what happens if the computer makes a mistake!?" I had no idea what a bug, or for that matter code, was. He didn't have a good answer beyond "the computer can't make mistakes like we do." Having played with computers enough by then to know that his statement wasn't entirely true, I ended that outing wanting to know more. I started obsessing over what would happen to people if computers controlling things like that linear accelerator, or even the elevator in my dad's office building, made a mistake.

Incidentally, my uncle was the one who got me interested in science, and that trip to the hospital got me using computers for something other than games. Fast forward 24 years and...well...part of what I do is work on provably correct systems.


Are there any statistics on the safety/reliability of software controlled vs. hardware controlled devices?

I work in software engineering, so I'm exposed daily to broken software controls, and I'm gradually becoming more of a "grumpy old man" longing for the good old days when machines (be it cameras, cars, watches, medical equipment or anything else, really) could be "debugged" by following levers, wires, physical stops and hoses. I feel much more comfortable with that than with complex computer systems.

I would like to know whether my fears are grounded in actual fact, or whether I'm just riling myself up over nothing.


There are but I don't know where to point you :-(

I spent over 10 years writing control software for medical devices. In classifying the criticality of a device, the FDA from the beginning put any device with software into a higher category. They have since refined their classifications, so you probably won't find that guidance on their site anymore, but 15 years ago it was pretty eye opening to me that two devices of the same type, one with and one without software, could be automatically in different safety categories.

So, no, it's not just you :-)


I don't remember the details at this point, but somewhere in the IEC 62304 standard, there is language that suggests that you have to assume 100% failure for software components.

I think it's in regard to risk controls, but I'm not certain of that.

Edit: Just looked it up: "If the HAZARD could arise from the failure of the SOFTWARE SYSTEM to behave as specified, the probability of such failure shall be assumed to be 100%"


Yes, this is to avoid the problem many an FMEA (Failure Modes and Effects Analysis) meeting agonizes over: "it's possible for the software to fail in this way, but how do we quantify the probability of it happening so we can calculate the risk factor?" By assuming that if it can happen, it will, you remove any handwaving from the equation.
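The effect of that rule on a risk calculation can be sketched with a toy example. The scoring scheme, numbers, and function name below are illustrative only, not taken from IEC 62304 or any real FMEA worksheet:

```python
# Toy sketch of how the "assume 100% failure" rule changes an
# FMEA-style risk score. Scoring scheme and names are invented.

def risk_score(severity: int, probability: float, is_software: bool) -> float:
    """Return a simple severity x probability risk score."""
    if is_software:
        # Per the quoted IEC 62304 language: if the software can fail
        # this way, assume it will. No arguing over failure rates.
        probability = 1.0
    return severity * probability

# A hardware interlock with an estimated 1-in-10,000 failure rate vs.
# the same function implemented in software:
hw = risk_score(severity=10, probability=1e-4, is_software=False)
sw = risk_score(severity=10, probability=1e-4, is_software=True)
print(sw > hw)  # True: the software risk must be designed out, not argued away
```

This is exactly the handwaving-removal described above: the software path cannot be argued down to an acceptable probability, so the hazard has to be mitigated by design.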


I think the language has the side-effect that you describe, and I think that it is beneficial, but I don't know if that is the reason the language was added to the IEC 62304 standard.

Thank you for a helpful observation. It had not occurred to me.


I have no stats, but I tend to agree with the famous Bob Pease, who once said "My favorite programming language is... solder!" [https://en.wikipedia.org/wiki/Bob_Pease]


Actually, that was Steve Ciarcia https://en.wikipedia.org/wiki/Steve_Ciarcia, not Bob Pease.


Purely hardware-controlled vs. purely software-controlled devices are somewhat difficult to evaluate (partly because purely X-controlled is somewhat of a rarity).

The "trick", or rather, the true test of a design when it comes to this is judiciously balancing the advantages that each of these categories bring you. You generally want to use "hardware" means in order to nullify the important risks of programmable logic (i.e. the major consequences of a blunder in the logic that was programmed on the device), and use programmable logic for those sections that require flexibility, ease of programming, potential fixes or enhancements and so on.

Real-life example from a device I worked on: the linear actuator (that was essentially sticking needles into people's brains) had a physical barrier past the safe distance limit. Literally a big hunk of metal that the actuator could not be moved past.

Of course, there was a limit in software as well (the software would refuse to move the motor past the safe limit). However, with the physical barrier in place, it protected the motor more than anything else.
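The layered limit described above can be sketched roughly like this. The limit value and all names are hypothetical, not from the actual device:

```python
# Hypothetical sketch of a software travel limit layered under a
# physical stop. The limit value and names are made up for illustration.

SAFE_LIMIT_MM = 50.0  # software limit; the metal stop sits just past it

def request_move(target_mm: float) -> float:
    """Refuse any commanded position beyond the software limit.

    Even if this check were bypassed by a bug or a glitched bus, the
    physical barrier still bounds the actuator's travel.
    """
    if target_mm > SAFE_LIMIT_MM:
        raise ValueError(f"refusing move to {target_mm} mm: past safe limit")
    return target_mm
```

The point of the design is that this check is the first line of defense, not the last one.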

It's worth noting that design decisions in these fields are done not so much based on the amount of brokenness in existing implementations (to put it bluntly, there's about as much broken hardware as there is broken software), but based on the amount of risk that a solution introduces. Many regulating bodies (e.g. FDA) will classify your device in a higher risk class if there's programmable logic in it, simply because software tends to be harder to write and test reliably (not only because of bad programmers, but also because of a lack of standards, or at least consensus and metrics on testing).

Edit: I can't quote numbers right now; there is data that shows the risks involved in purely software-based approaches, and it's easy to see why even in the example above. A software-enforced limit on the distance of movement can fail due to a bunch of reasons, not all of them bugs. Bugs are one thing, but the system could also fail due to a broken connection between the motor driver and the CPU, due to a bug in the hardware implementation of the motor driver itself, due to glitches on the bus and so on.

There are software mitigations for all of these cases, too (e.g. you continuously monitor the position of the motor; if the motor still moves after you tried to stop it, you reset the system, and the hardware is wired so that all power to the motor is cut when you come out of reset), but nothing is as efficient as making sure that the motor just won't be able to move the load past a certain distance by placing a physical barrier in its way. By an uncanny chain of events, maybe all the software mitigations can fail; nothing, however, can make a big hunk of metal disappear into thin air.
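The monitoring mitigation mentioned above might look something like this in outline. All names are hypothetical, and the crucial part (the wiring that cuts motor power while coming out of reset) is hardware, not shown:

```python
# Sketch of the software mitigation described above: keep monitoring
# the motor after a stop command, and trip a system reset if it still
# moves. The hardware is assumed to cut motor power during reset.

class MotorWatchdog:
    def __init__(self, read_position, trigger_reset, tolerance_mm=0.05):
        self.read_position = read_position  # encoder readback callback
        self.trigger_reset = trigger_reset  # pulls the hardware reset line
        self.tolerance_mm = tolerance_mm
        self.stop_position = None

    def command_stop(self):
        # Record where the motor was when we told it to stop.
        self.stop_position = self.read_position()

    def poll(self):
        """Call periodically: reset the system if the motor crept on."""
        if self.stop_position is None:
            return
        if abs(self.read_position() - self.stop_position) > self.tolerance_mm:
            self.trigger_reset()
```

Note that this whole mechanism can itself fail (bad encoder, stuck reset line), which is the commenter's argument for the hunk of metal.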


Do you have a 'big hunk of barrier presence' monitor though? Otherwise, when someone removes the physical stop during maintenance and forgets to reinstall it, and nobody notices for a long time because the software safeties are all working as intended, until...!

Obviously, you can't go down the rabbithole forever, and this is a bit tongue-in-cheek, but IIRC similar circumstances have occurred.

The San Salvador medical irradiation facility accident[1] is largely a tale of gradual failure of safety features and interlocks, combined with the misunderstanding that they were still adequate or safe, until they weren't.

[1] http://www-pub.iaea.org/MTCD/publications/PDF/Pub847_web.pdf


> when someone removes the physical stop during maintenance and forgets to reinstall it

That wouldn't be seen as the fault of the device. Software bugs will still show up when the product is used as directed. Ideally, though, the device wouldn't be able to be used without the physical stop.


Indeed, there is no way to remove the physical stop.

Furthermore, the design process is shaped so that it leads to such designs. If the first iteration of the design had included a removable physical stop, it would have been caught in the risk analysis. Most mission-critical fields have standards that specify this sort of stuff (IEC 60601 for medical devices, DO-254 and DO-178C for avionics etc.)

When designing these kinds of devices, whether it's the fault of the device or not is not too relevant in general. You have to mitigate any kind of unacceptable risk (i.e. things that lead to injury or death).

There are certain common-sense exceptions here. For instance, the device isn't expected to operate properly outside its specified operation conditions, but you have to clearly state what those conditions are (altitude, temperature, humidity, presence or absence of liquids etc.) and put warning labels regarding their breach in the manual. Similarly, the level of mitigation is often specified by standards. E.g. for medical devices, IEC 60601 specifies insulation requirements that will protect against the kind of shocks you could get from a faulty network, but not if the device is struck by lightning or strapped to an electric chair. IEC 61010 (Safety requirements for electrical equipment for measurement, control and laboratory use) similarly includes provisions for the kind of protection you would need in equipment that falls under a specific type of use (e.g. here on Earth, not up there in space).


There's no way to remove it, the big hunk of metal is part of the actuator's frame. Basically, the only way to remove it is to cut it off with a chainsaw :-D.


> Basically, the only way to remove it is to cut it off with a chainsaw

Chainsaws cut wood, I think you meant an angle grinder or a blow-torch.


Amazing that the machine involved in the San Salvador incident was made by the same company that made the Therac. Seems as though they don't have the knack for making foolproof systems.

Apparently the organization continues to exist under the name "Nordion".


I see it as just more checks and balances. Hardware blocks and software blocks have vastly different failure scenarios which complement each other well.


> It’s important to note that all the testing to this date had been performed slowly and carefully, as one would expect.

Ah, so if I click these angular nav-pills very quickly in series without waiting for the page to load, something unexpected might happen?

What is annoying in one context may be deadly in another.

----

If the "Sentinel Event" policy had been in place at the time, perhaps these deaths would have been prevented.

A sentinel event is any event that either leads to death or serious permanent injury -OR- could have led to death or serious permanent injury.

1. https://www.jointcommission.org/sentinel_event_policy_and_pr...


When you are literally a laser death cancer beam, it's probably best to assume that the hardware the software is controlling is actively trying to resist you and that the software controlling the hardware is actively trying to kill a patient.


Strangely enough, nobody is bringing up the argument of 'But how many people were saved by the existence of Therac-25?'

Yet, when a poorly designed self-driving car kills people, many rush to point out (correctly or not) all the lives that were saved by the technology.

The valley seems to have an entirely different level of regard for the consequences of their work than engineers and doctors do.


I think that might merely be because self driving cars are in the future, and Therac-25 is in the past.

Also, it's easy to imagine self driving cars that are pretty good, but still occasionally kill people, yet at a rate much lower than human drivers.

For the Therac-25 on the other hand, it's not so complicated. Why it failed and how to fix it is already well known.


It is indeed in the past - unlike the claims about how self-driving technology is currently safer than human drivers (which is not at all clear yet), we can definitely say that the Therac-25 saved more lives than it ended. Yet, it is also a case study in how not to build systems.

It's easy to imagine a medical device that's pretty good, but still occasionally kills people because of flaws in its design. The image is fairly terrifying, actually.

We could also say the same about the fatal Tesla accident - we know why it failed, and how to fix it... And observe how quickly large parts of the tech community (I would expect no less of the vendor) were to blame the human.


If anyone is interested in how modern versions of these machines work, the annual conference for the American Association of Physicists in Medicine (AAPM) is currently proceeding at the Walter E. Washington Convention Center in Washington, DC, and will continue until Thursday, August 4th. It's the largest international gathering of medical physicists in the world. If anyone is in the area and interested, you shouldn't have much trouble sneaking in and taking a stroll through the vendor area where you can see examples of the newest technologies in imaging and therapy physics. Just dress business casual and you'll blend right in with the other couple thousand physicists.


Example story "On March 21, 1986, a patient in Tyler, Texas was scheduled to receive his 9th Therac-25 treatment. He was prescribed 180 rads to a small tumor on his back. When the machine turned on, he felt heat and pain, which was unexpected as radiation therapy is usually a painless process. The Therac-25 itself also started buzzing in an unusual way. The patient began to get up off the treatment table when he was hit by a second pulse of radiation. This time he did get up and began banging on the door for help. He received a massive overdose. He was hospitalized for radiation sickness, and died 5 months later."


This story was covered in the book 'Set Phasers On Stun' (https://www.amazon.com/Set-Phasers-Stun-Design-Technology/dp...). It was required reading in our Introduction to Engineering class in first year.

As an engineering student it was meant to imprint one thing upon us, but talk about a dark introduction to the word 'responsibility'...


Before I switched to CS, I was in an EE program. Our intro to engineering class was filled with stuff like the outdoor decorative fountain that killed six people one at a time, and the transformer box that was used in a game of hide + seek (the lock was missing).

In the case of the fountain, several couples had had a night out and decided to splash in the fountain. The first two knocked a power conduit for one of the pumps loose, electrocuting themselves. The others died as they each entered the fountain to rescue the others. The system lacked a GFCI protector.


My introduction to computing course had us read about the Therac-25 and write an essay about it.

While code we write doesn't always have such dire consequences, it was an eye opener to me as a freshman in college. It definitely made designing software/hardware to never fail a much higher priority for me.


"The VT-100 console used to enter Therac-25 prescriptions allowed cursor movement via cursor up and down keys. If the user selected X-ray mode, the machine would begin setting up the machine for high-powered X-rays. This process took about 8 seconds. If the user switched to Electron mode within those 8 seconds, the turntable would not switch over to the correct position, leaving the turntable in an unknown state. ... It’s important to note that all the testing to this date had been performed slowly and carefully, as one would expect. Due to the nature of this bug, that sort of testing would never have identified the culprit. It took someone who was familiar with the machine – who worked with the data entry system every day, before the error was found."
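The failure class in that quote can be modeled with a tiny state machine. This is an illustration of the bug pattern only, not the actual Therac-25 code; all names are invented:

```python
# Toy model of the race described above: a fast edit updates the
# selected mode, but the slow turntable setup for the previous
# selection is never redone, so the hardware no longer matches the
# screen. Purely illustrative; not the real Therac-25 logic.

class Machine:
    def __init__(self):
        self.selected = None   # what the operator's console shows
        self.turntable = None  # what the hardware is actually set up for

    def select(self, mode):
        self.selected = mode   # an ~8-second hardware setup begins...

    def setup_complete(self):
        self.turntable = self.selected  # ...and eventually finishes

    def quick_edit(self, mode):
        # BUG: an edit within the setup window changes the selection
        # without restarting the turntable setup.
        self.selected = mode

    def consistent(self):
        return self.selected == self.turntable

m = Machine()
m.select("xray")
m.setup_complete()        # slow, careful operation: everything matches
m.select("xray")
m.quick_edit("electron")  # edit lands inside the setup window
print(m.consistent())     # False: console and turntable now disagree
```

Slow, careful testing only ever exercises the `setup_complete()` path, which is why the bug survived until a fast, practiced operator hit the window.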


These stories are horrific.


Yes. They also should be required study material for anybody who writes software interacting with the real world. Humans are fragile. I wrote a cad/cam system for a lathe/mill combo; in all the years that that system was out there, we found one bug that made it out past my desk, and that only happened because some idiot decided to demo a new feature to a prospect to try to close a sale.

The fact that a simple error could cause someone to lose a limb or die does wonders for your focus.


Doesn't it also prevent focus?


For me it definitely doesn't. It became part of the release cycle: analyze each and every change to make sure it was safe, and maintain a whole bunch of code that was tested to the max but never ever changed, which would shut down the machine if it ever got outside of its expected envelope.

This is actually quite a tricky thing to do right, because to be able to jog the machine out of a shut-down like that you have to re-enable it in a potentially unsafe situation. For each and every little challenge like that we found a good solution, but some of those were real head-scratchers.

A really nasty one that I recall was that when you power up a bunch of latches they can come up in an undefined state, so the decision was made to include a detector for that undefined state which first would have to be cleared before the output of the latches was allowed to influence the motors.

This worked well in practice, but given the restrictions of the machine this was all done on, it took a bit of thinking. The solution we settled on was a magic sequence output on the parallel port indicating the system had successfully reset, after which the relay powering the motor drivers would engage. Until that relay triggered, everything else was ignored.

This was already important enough with stepper drivers, but once we switched to servos for some of the more demanding applications it became crucial to safe operation that the drivers would never ever be energized with faulty inputs. A servo driving a ball-screw will happily wreck itself, the machine it is bolted on to and anything standing in-between (including the operator) if it suddenly gets driven to -10 or +10 V and naturally, you'd always get one of those two, never the safe '0'.
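The interlock described in this comment can be sketched abstractly. The byte values and class names below are invented stand-ins for the real parallel port and relay:

```python
# Abstract sketch of the power-up interlock described above: motor
# power stays off until the software proves it completed a clean reset
# by emitting a known magic sequence. Byte values and names invented.

MAGIC = (0xA5, 0x5A, 0xC3)  # pattern unlikely to appear from random latches

class MotorPowerRelay:
    def __init__(self):
        self.engaged = False
        self._last = []

    def observe(self, byte):
        """Watch bytes on the port; engage only after the full sequence."""
        self._last = (self._last + [byte])[-len(MAGIC):]
        if tuple(self._last) == MAGIC:
            self.engaged = True  # reset proven complete: allow motor power

relay = MotorPowerRelay()
relay.observe(0x3F)      # undefined latch garbage at power-up: ignored
for b in MAGIC:
    relay.observe(b)     # software emits the sequence after its reset
print(relay.engaged)     # True
```

The design choice here is the direction of the default: random garbage cannot engage the relay, so an undefined latch state at power-up fails safe.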


Is there also a way to make sure one guy isn't responsible? You know the way they credited Petrov with saving us from nuclear war?


Well, that can cut both ways. In Petrov's case it worked out well, but 'ownership' of a problem (and the associated responsibility) in smaller settings can also work to bring out the best in people.

Separation of duties is a good principle, and wherever possible you should use it.

https://en.wikipedia.org/wiki/Separation_of_duties

But in the case of a single tech guy in a company there isn't a whole lot you can do in that direction, so it is best to clearly assign ownership and make sure that the people involved realize full well the consequences of a fuck-up.


As Therac-25 will perpetually tell, when you code your software, best do it well.


"Killed by software" would've been a much more faithful title!


"It’s important to note that while the software was the lynch pin in the Therac-25, it wasn’t the root cause. The entire system design was the real problem. Safety-critical loads were placed upon a computer system that was not designed to control them"

They wrote code that depended on hardware controls, didn't document their reliance on the hardware controls and killed a bunch of people. DOCUMENT ALL YOUR DEPENDENCIES!!!


> They wrote code that depended on hardware controls

...similar to Toyota's electronic throttle control system design.

Also, if you rely on hardware features for safety, it's still ok/good to design the software as if it didn't depend on those features wherever possible.


It all depends on the specifics. How much complexity will software checks add? In a lot of cases the lifetime cost of spec'ing a hard failsafe may be cheaper than designing, implementing and supporting a soft one. Then there's the whole issue of software false positives/negatives vs. hardware failures.

At some level you can no longer abstract away the situation and the system has to perform as a system. You can write soft controls all you want, but if hard controls are present in the spec and are not likely to change, there comes a point where chasing down every little problem and checking for every possible error is no longer cost effective. Nobody cares if you write Mars rover tier code for a bulldozer, and spending the resources doing so is wasteful if your competitors aren't also spending them. Obviously you can write total crap that's outside the acceptable range toward the other end.

If you're designing software to control a widget that moves and does stuff, and it has a hard switch to prevent your widget's equivalent of an out-of-battery detonation, you have to strike a balance between relying on it and introducing complexity from soft controls. Every aspect of the system is involved in making that determination.

If you've got a hard switch in most cases you may as well write a five line timeout controlled while loop that tries to perform the action and waits and tries again if there's no feedback indicating the action was performed. As long as nobody removes that switch and the code's dependency on it is very obviously documented then that simple unsafe code is probably better than more complex code that performs a redundant (redundant because you have a hard switch) check because the more complex code has more going on.
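Such a loop, with made-up names for the action and its feedback, might look like:

```python
# Sketch of the timeout-controlled retry loop described above: try the
# action, wait for feedback, retry on timeout, and otherwise fall back
# on the hard switch (and the operator). All names are hypothetical.

import time

def actuate_with_retry(perform, feedback_ok, timeout_s=1.0, retries=3):
    """Attempt the action until feedback confirms it, up to `retries` times."""
    for _ in range(retries):
        perform()
        deadline = time.monotonic() + timeout_s
        while time.monotonic() < deadline:
            if feedback_ok():
                return True   # action confirmed by feedback
            time.sleep(0.01)
    return False  # never confirmed: the hard switch is the backstop
```

As the comment says, this is only acceptable because the hard switch exists and the code's reliance on it is documented; remove the switch and this simple loop becomes the hazard.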


I mean hey, if I'm writing code for a machine that might kill somebody if a hardware interlock is removed, I'm going to perform a software check for a valid state before I take the potentially fatal action, and if that's a problem for the organization that's paying me to write that code, I'm going to resign and find work elsewhere, because that outcome strikes me as strictly preferable to having someone's death on my conscience. I suppose it's arguable either way, though.


> DOCUMENT ALL YOUR DEPENDENCIES!!!

Still doesn't make a difference if there is no place for the documents to live.



