Monday, August 30, 2010

Some thoughts on hunting bugs (Software in General)

I don't tend to complain too much, but it was annoying as hell to track down this bug, so I thought I'd share some of the joy with you =) This post is mainly for software users (not GIMP-specific), as it should give you some insight into how to track down the exact cause of a bug, and so write better reports and make it easier for the developers to fix the bug =)

Developers may also enjoy this post - thinking about this bug and how I tracked it, it is in fact a very funny bug :D



Here is the original report:
Since 2.6.7 or 2.6.8 I am having a persistent crash with GIMP:
When CTRL-left clicking on a layer mask to toggle it, about 30% of the times my GIMP segfaults and badly crashes.
Helpful ain't it? A bug which can't be reproduced in any deterministic way is a nightmare to track down :P I can't blame the reporter (especially now that I know the cause, it was very annoying to find) since it's not his fault that this is how the bug appeared.

Several days later, someone showed up on IRC complaining that GIMP crashes when he alt-clicks a layer. The first thought I had was "yay! sounds similar - maybe he knows why". He provided steps which always caused a crash for him. When I tried them, they didn't reproduce the bug for me...

So he said he'd clear his GIMP configuration directory and restore the files one by one until he found the one which causes it. After some time he came back to us, saying the sessionrc file was the cause of the bug, and he sent the file. I tried his steps and indeed, I managed to reproduce the bug.

"So now you know the cause, that's what you wanted!". Usually, I would have agreed with that sentence, but not in this case - this is a general file, with many settings, and it doesn't contain anything which even seems related.
So using the same technique as above, I removed the settings one by one until I found the one which causes the crash. The one which caused the bug was a setting in the user interface which determines which tab is open. That doesn't even make sense, or as mitch said "What?!!" :P Which tab in the user interface is open should not make a program crash...
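The one-by-one elimination can be sketched roughly like this in Python. Everything here is hypothetical: the setting names are made up, and `crashes` is a stand-in for "start the program with these settings, follow the steps, and see whether it crashes", which in reality I did by hand.

```python
def minimize(settings, crashes):
    """Drop settings one at a time, keeping only those needed for the crash."""
    needed = list(settings)
    for setting in list(settings):
        trial = [s for s in needed if s != setting]
        if crashes(trial):
            # Still crashes without it, so this setting is not involved.
            needed = trial
    return needed


# Hypothetical stand-in for a manual test run: pretend the crash only
# happens when the (made-up) "open-tab" setting is present.
def crashes(settings):
    return "open-tab" in settings


# Made-up setting names, for illustration only.
all_settings = ["window-size", "window-position", "open-tab", "tool-options"]
culprits = minimize(all_settings, crashes)
print(culprits)
```

Whatever is left in `culprits` at the end is the part of the file that actually matters for the bug.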

Now, to the last step of tracking down the exact cause - trying, trying, trying... For 40 minutes I experimented with the user interface, following the instructions the guy on IRC had given - each time adding or removing steps to see which ones affect the bug.
The result was that if a certain tab (the layers tab) is hidden and you restore it later, then clicking somewhere directly after restoring it causes GIMP to crash. Certainly one of the strangest bugs I ever saw...

So what do we learn from this? The 4 "golden rules" of hunting bugs:
  1. Make it Deterministic: There is usually no such thing as a random bug - almost every bug has a series of steps that will always reproduce it. Finding this series of steps is essential in order to continue the debugging process. This may take lots of time, so it's recommended to try reproducing it in an organized way, maybe even writing down what doesn't help reproduce it.
  2. Make it small: Once you have found the series of steps, try to eliminate as many of them as possible, so that you can keep reproducing the bug with a minimal number of operations. If any files are related, try to isolate only the ones that affect the bug. Fewer steps mean fewer things to check.
  3. Make sure it's essential: With the remaining steps, try to replace them by others - if a step can be replaced by other steps which are not related, then it's probably not the direct cause of the bug.
  4. Finally - Make no Assumptions: Bugs don't have to make sense :D Even the weirdest things can be related. Don't say "This can't be the cause" without actually checking. Bugs should make sense eventually, but the logic behind them is not always obvious at first.
Following the 4 steps above should help you find the exact cause of the bug you are reporting/hunting. It will make your reports very, very effective, and you will save the developers lots of time, so they can work on the important stuff ;-)

For those who want to take a look at the specific bug - here it is: https://bugzilla.gnome.org/show_bug.cgi?id=627328

2 comments:

  1. A tip: you didn't have to remove the settings one by one, you could have done a binary search, that is, remove half of them, then depending on outcome, either remove a new half or restore half of what you removed :)

  2. @Martin: True :) I just didn't want to get too detailed about explaining a binary search - trying to keep this series of bug posts as simple as possible.
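
For the curious, here is a rough Python sketch of the binary search Martin describes. It assumes that a single setting is responsible for the crash (with several interacting settings it needs more care). The `crashes` function and the setting names are hypothetical stand-ins for a manual test run.

```python
def bisect_culprit(settings, crashes):
    """Find the one setting that causes the crash by halving the candidates.

    Assumes exactly one setting is responsible and that crashes(settings)
    is True for the full list.
    """
    candidates = list(settings)
    while len(candidates) > 1:
        half = len(candidates) // 2
        first, second = candidates[:half], candidates[half:]
        # Keep whichever half still reproduces the crash.
        candidates = first if crashes(first) else second
    return candidates[0]


tests_run = 0


# Hypothetical stand-in for a manual test: the made-up "setting-11"
# is the culprit. It also counts how many test runs were needed.
def crashes(settings):
    global tests_run
    tests_run += 1
    return "setting-11" in settings


all_settings = [f"setting-{i}" for i in range(16)]
culprit = bisect_culprit(all_settings, crashes)
print(culprit, "found after", tests_run, "test runs")
```

With 16 settings this needs 4 test runs, where removing them one by one could take up to 16.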
