Everything Breaks, All the Time.

jng · on June 9, 2011

This guy is wrong, plain and simple. Unforeseeable bit-copying errors caused by cosmic rays and similar circumstances, which do exist, probably account for less than one out of every billion actual bugs/crashes experienced out there. When a bit in memory is toggled by a cosmic ray and it happens to hit program memory, it will crash your program with significant frequency.

Most bugs you experience daily (Word or your favorite game/app crashing, etc...) are caused by actual software errors in overly complex systems with many dependencies. Multithreaded-coding errors account for many of those, but not all of them: complex systems with many layers and dependencies can hide obscure behaviors that cause crashes on a given machine if, for example, you have a weird combination of disk drivers, file system code, and an antivirus hooking and acting on every filesystem read or write.

When, for example, a Word plug-in makes a call to Word's object model, this goes through easily 10 software layers until it reaches its target, some of these layers being configured via the flaky Windows registry, others going through jumps from VM-based to native code using weird "marshalling" techniques, etc... in these cases, you may encounter buggy behavior in any one of the 10 layers, or in a combination of two of them, even if it seems like you are just incrementing a simple counter.

Most of the time, though, bugs are caused by the app's own code (your own code): careless code, dangerous practices, lack of solid control-flow design, etc... if you write really good code, it's unlikely you will have many support issues. Only if you are working in some problem-prone area: plug-ins to other complex, often poorly-designed products, code pushing graphics drivers to the max, etc... where you get into "complex system" behavior.

Even if you use multithreading, if you control all the code, you can write very solid code. If your multithreaded code is perfect, it won't crash. Although it can uncover bugs in third-party libraries, etc... which is why I tend to write only "worker threads" with no third-party dependency if multithreading is required.
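To make the "worker threads with no third-party dependency" idea concrete, here is a minimal sketch using only the Python standard library. The names `worker` and `run_squares` are made up for illustration, and squaring numbers is just a stand-in for real work; the point is that the threads share nothing except two thread-safe queues.

```python
import queue
import threading


def worker(tasks: queue.Queue, results: queue.Queue) -> None:
    """Pull numbers off `tasks` and push their squares onto `results`."""
    while True:
        item = tasks.get()
        if item is None:  # sentinel value: no more work for this thread
            break
        # Pure computation: no shared mutable state, no third-party calls.
        results.put(item * item)


def run_squares(values, n_workers=2):
    """Hypothetical driver: fan `values` out to workers, collect sorted results."""
    tasks = queue.Queue()
    results = queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(tasks, results))
        for _ in range(n_workers)
    ]
    for t in threads:
        t.start()
    for v in values:
        tasks.put(v)
    for _ in threads:
        tasks.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    # All producers have exited, so draining until empty is safe here.
    out = []
    while not results.empty():
        out.append(results.get())
    return sorted(out)
```

All communication goes through `queue.Queue`, which does its own locking, so there is no hand-rolled synchronization to get wrong.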

And I think it's very dangerous to teach novice programmers to think that the bug is probably somewhere else.

gwc · on June 9, 2011

His point makes a lot more sense in the context it was originally intended. He's not making a point about programming or debugging in general; he's specifically discussing tech support as a one-man indie game shop. In particular, it's all about the cost-benefit tradeoff. In his words (taken from the first post in the series - http://jeff-vogel.blogspot.com/2011/06/seven-tips-for-giving...):

But at the same time, as a small developer, you have very little time to spare for support. Time spent getting the game working for one person is time not spent making a new game for everyone. You will need to develop a sense of when the time lost helping a person is not worth it, either because you won't be able to solve their problem or because they will not be able to implement the fix you provide.

...

Remember: It's only worth the time to do tech support if you have the chance to, in a reasonable amount of time, fix a problem and make a loyal customer. If you realize that, at the end of the road, you aren't going to end with a happy person and a working product, end the conversation as quickly and pleasantly as possible.

In that context, I think his approach is very rational. If you pushed him, he'd probably agree that more often than not the issue is in his code (even if it's just a question of inadequate error handling). However, if the problem is only seen by a single user and will be a significant investment to try and fix, then it's simply not worth the time when he could be working on a new game, a port, or even a different problem that has been seen by multiple users.

tsewlliw · on June 9, 2011

I don't get this. It's so often a bug, and so many people don't report bugs, that this strategy of telling people to reboot or reinstall or redownload just perpetuates these voodoo-style fixes.

I'm not saying fix everything always immediately, but don't write people off as victims of cosmic rays just because you can't repro in 30 seconds or don't see the bug where you'd expect in the code.

jodrellblank · on June 9, 2011

He didn't say "cosmic rays" anywhere in the article.

voodoo-style fixes

They're not voodoo, they're sledgehammer-to-smash-a-nut fixes. A reboot reinitializes every part of your system into a mostly-known-good state. If you knew what was broken, you could say "restart this service" or "reinitialize this driver like that", but a reboot gets all of it.

If you actually stabbed a doll with a pin and your program started working, that would be ... scary.

tsewlliw · on June 9, 2011

His actual criteria for taking the time to find a bug are reasonable, but I take issue with the assertion that it's not a bug in code he wrote most of the time.

wccrawford · on June 9, 2011

Only checking for a bug reminds me of the Intel bug that they claimed would hardly ever happen, but turned out to happen a LOT.

I don't ignore bugs. I follow the same first step, and send the standard list of things to try like reinstalling, rebooting, etc. But if they still have it, I always look into it. Almost every time it's been a real bug. Some were really hard to track down, but would have caused a lot of grief later. I was always glad I did it.

Quarrelsome · on June 9, 2011

Have you encountered a hardware defect yet? If/when you do, it represents a lot of technically dead time that was spent looking at code. I'm not saying either premise is right but I can appreciate his philosophy here.

For the record, I killed four weeks digging through code and running tests and it turned out that temperatures in winter coupled with some bad soldering was the cause of the issue. D:

JoeAltmaier · on June 9, 2011

Yes, you have to track them down; yes, it takes forever for the hard ones. But most of the time it isn't hardware. Most of the time it's my own bug and I can fix it.

It definitely takes some experience to be good at debugging. I guess that's why all the emphasis on development environments these days, where the hard stuff is being debugged by someone else and I can work on my app-level stuff in peace.

wccrawford · on June 9, 2011

I have definitely encountered bugs that I could not track down, which very well might have been a hardware problem. I was eventually forced to give up on it. (And I've gotten better at giving up earlier when something is going to be impossible to find.)

But I have never said, 'I won't look into a bug until multiple people have it.'

k33n · on June 9, 2011

This fact drives me insane. It's theoretically possible to write a perfect piece of software that can never fall down, break, blow up, etc. But it's pretty much impossible in practice unless you have near-unlimited resources (NASA in the '60s), and even with that you still might fail (Microsoft).

synnik · on June 9, 2011

There are two completely different conclusions that I would draw from his facts:

1) Most bugs are in code. But it might not be your code. Your code layers itself on top of many other layers of code that are outside of your control. Learning to deal with that will make a difference in your work.

2) Know how everything works. I am always shocked at people who claim to be web developers but don't even understand how an HTTP request/response works, much less what the browser does with the results. It is one of my interview questions for tech folk: I ask them to explain to me exactly what happens on the server when a browser sends it a request. Few people can give much detail here. Most can only give a generic explanation of the actions taken, if that.

rickdale · on June 9, 2011

My biz partner has a GPS system from Garmin. He lives in the central time zone, but works in the eastern time zone. Any time we use the GPS it will always add an hour to our trip when we are in EST.

Programmers aren't perfect. Practice makes permanent.
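One plausible way to get that extra hour (a guess, not something known about the Garmin firmware) is subtracting wall-clock times from two different zones as if they were in the same one. A rough sketch of the difference, using fixed standard-time offsets as a simplifying assumption (no DST handling) and made-up function names:

```python
from datetime import datetime, timedelta, timezone

# Assumed fixed offsets for illustration only; real code would need DST rules.
CENTRAL = timezone(timedelta(hours=-6), "CST")
EASTERN = timezone(timedelta(hours=-5), "EST")


def naive_trip_minutes(depart_local: datetime, arrive_local: datetime) -> float:
    """Buggy: drops the zone info and subtracts the wall-clock readings."""
    delta = arrive_local.replace(tzinfo=None) - depart_local.replace(tzinfo=None)
    return delta.total_seconds() / 60


def trip_minutes(depart_local: datetime, arrive_local: datetime) -> float:
    """Subtracting timezone-aware datetimes compares the underlying instants."""
    return (arrive_local - depart_local).total_seconds() / 60


# Hypothetical trip: leave at 8:00 Central, arrive at 10:30 Eastern.
depart = datetime(2011, 1, 10, 8, 0, tzinfo=CENTRAL)
arrive = datetime(2011, 1, 10, 10, 30, tzinfo=EASTERN)

real = trip_minutes(depart, arrive)         # elapsed time on the road
wrong = naive_trip_minutes(depart, arrive)  # includes the zone change
```

With these inputs the two functions differ by exactly the 60-minute offset between the zones, which matches the symptom described.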

JoeAltmaier · on June 9, 2011

Ha! And my sister went to Egypt and looked up the GPS distance to home (Iowa): 8000 miles. Off by 50%. Why? The programmer was doing Cartesian distance instead of great-circle. So yes, if she drilled a tunnel through the Earth's core, it was only 8000 miles. :)

T-hawk · on June 10, 2011

No, 8000 miles sounds about right for the great-circle distance on the surface from Iowa to Egypt.

First, the diameter of the Earth is not quite 8000 miles, but 7926. So if anything says more than 7926 (plus maybe the height of a mountain or whatever), it's not calculating a straight line in Cartesian 3d space.

Second, that distance of 7926 miles would be from a point to its antipode. Iowa is not antipodal to Egypt, not even close. The antipode of Iowa is in the Indian Ocean and hundreds of miles from any land. The straight-line distance from Iowa to Egypt through the Earth's sphere would be more like 6000 miles.
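For anyone who wants to check the two distances being argued about, here is a small sketch on a spherical-Earth model. The radius is an approximate mean value, and the coordinates for Des Moines and Cairo are my own stand-ins for "Iowa" and "Egypt", not anything from the original GPS readout.

```python
import math

EARTH_RADIUS_MILES = 3959.0  # approximate mean radius; the Earth is not a perfect sphere


def central_angle(lat1, lon1, lat2, lon2):
    """Angle in radians between two points as seen from the Earth's center (haversine)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * math.asin(math.sqrt(a))


def great_circle_miles(lat1, lon1, lat2, lon2):
    """Distance along the surface: arc length = radius * angle."""
    return EARTH_RADIUS_MILES * central_angle(lat1, lon1, lat2, lon2)


def chord_miles(lat1, lon1, lat2, lon2):
    """Straight-line 'tunnel' distance through the sphere: 2 * R * sin(angle / 2)."""
    return 2 * EARTH_RADIUS_MILES * math.sin(central_angle(lat1, lon1, lat2, lon2) / 2)


# Assumed representative points (degrees north, degrees east).
DES_MOINES = (41.6, -93.6)
CAIRO = (30.0, 31.2)

surface = great_circle_miles(*DES_MOINES, *CAIRO)
tunnel = chord_miles(*DES_MOINES, *CAIRO)
```

Whatever points you pick, the tunnel distance is never longer than the surface distance and can never exceed the diameter, so a reading above the diameter cannot be a straight-line Cartesian result.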

sedachv · on June 9, 2011

Read Jim Gray's "Why Do Computers Stop and What Can Be Done About It?" (http://www.hpl.hp.com/techreports/tandem/TR-85.7.pdf)

Excerpts:

"In the measured period, one out of 132 software faults was a Bohrbug, the rest were Heisenbugs."

"[retry] routines had a 76% success rate in continuing system execution."

Cosmic rays or race conditions, transient bugs are common.
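As a rough illustration of what a retry routine looks like at the application level (this is not the Tandem mechanism from the paper, just a generic sketch), the helper below re-runs an operation when it raises an error you have decided to treat as transient. The names `retry` and `FlakyOperation`, and the choice of `OSError` as the transient error type, are assumptions for the example.

```python
import time


def retry(operation, attempts=3, delay_seconds=0.0, transient=(OSError,)):
    """Call operation(); if it raises a transient error, wait and try again.

    Re-raises the last error once `attempts` tries have all failed.
    Errors not listed in `transient` propagate immediately.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except transient as err:
            last_error = err
            if attempt < attempts:
                time.sleep(delay_seconds)
    raise last_error


class FlakyOperation:
    """Hypothetical stand-in for a Heisenbug: fails a few times, then works."""

    def __init__(self, failures_before_success):
        self.remaining_failures = failures_before_success
        self.calls = 0

    def __call__(self):
        self.calls += 1
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise OSError("transient failure")
        return "ok"
```

A retry like this only helps with faults that really are transient; a deterministic bug (a Bohrbug, in Gray's terms) will fail the same way on every attempt.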