Failure as contribution

Since I've got some time to kill here while I wait for just about every piece of software I need on my box to rebuild, I thought I'd reflect on what I've been doing the past few evenings.  I want to try and show what I mean by "contribution" when I ask my students to contribute to other projects.  At the same time, I want to show how failure can itself be an important source of contribution.

I'm concurrently working on two problems.  First, I'm trying to add JavaScript code to DXR.  Second, I'm trying to integrate a complete callgraph for Mozilla's C++ code.  Both are turning out to be somewhat non-trivial, and are presenting me with the right level of personal challenge and failure in order to provide me with equal parts frustration and teachable moment.

I think a lot about failure.  As a professor taking students into the bowels of a web browser, you have to think about failure.  Failure is what will happen.  It isn't something you can avoid.  It's where you spend the majority of your time when you work at the scale of something like Mozilla.  As I've written elsewhere, failure is data.  Failure is information on the road to getting things done.  You avoid failure at the risk of avoiding success.

Tonight I'm failing:

<@taras> lol
<@taras> humph: it's just not your day today
< humph> I'll pull a teachable moment out of it for my blog/students
< humph> otherwise it's a disaster
< dwitte> humph: lol, that's unfortunate
<@taras> humph: valgrind should tell you where it's going wrong
< dwitte> start with upgrading to 4.3.latest, since i've tested on that
< dwitte> after that, bust out the debugger :)
< dwitte> or valgrind!

I'm failing, but it's not for lack of trying.  I started out by trying to rewrite the Mozilla build system to preprocess JavaScript files such that they end up in the source dir instead of being put in jar files or getting copied to the object directory of a build.  I got pretty far, but then I hit a wall.  My hacks to our build system and preprocessor came to a crashing halt when I hit files that included files that included files (you get the idea).  Why do we even allow includes?  Clearly no one has ever tried to extract a complete set of parsable JavaScript files from all of our .js, .xul, .xml, .xhtml, etc.  I got within inches of the end, only to see that the road curves off to the left for another 10 miles.

Rather than give up, I did what you do in software development, and switched tasks.  At the same time that I was failing on the JavaScript extraction, I saw a familiar plea on irc: "How can I find all the callers of nsXULAppInfo::GetName?"  This is a very common type of question, and it is hard to answer.  One of the things I'm working on is making it easier to get this info.  To that end, I decided to switch over to working on integrating Dan Witte's fantastic work to build a complete Mozilla callgraph.

Luckily Dan has kindly provided some documentation on building and using his tool (it's great to see that the Mozilla static analysis tools group all seem to do this).  I began by building Treehydra and quickly hit a wall.  Even though it built, it couldn't be used to build Mozilla.  I did some checking and noticed that I was failing a bunch of treehydra tests.  I tried rebuilding a few other things to see if I'd missed something, and still couldn't get it to work.  Asking on irc brought no help.  After deciding it was not just something stupid I was doing, I filed a bug.

A little while later, Taras offered to look at my box.  After poking around for a while, he agreed that this was not just me, and represented a bug in Mozilla's JS implementation, SpiderMonkey.  He offered to write a work-around patch for me if I'd help him by finding a regression window.

Regressions are bugs where you suddenly break something that previously worked.  They are accidental side effects of making other changes, and often don't show up for some time.  When you do notice them, you want to go back and figure out where you introduced the bug.  In other words, you want a window of time where it went from 'working' to 'not working'.  Finding regression windows can be time-consuming and, depending on the bug, laborious.

In my case I needed to try various versions of the SpiderMonkey code combined with my Treehydra build, and then run the build tests.  Without version control, this would be a nightmare, as it would require downloading many builds and testing.  However, with version control, and especially Mercurial (which keeps the entire history on your local machine), this is pretty easy.

Mercurial revisions are numbered two ways.  First, there is a local incremental number--an integer.  Second, there is a hash that represents the changeset globally, and is the same on everyone's machine.  To make this job easier, I did what any sane programmer would do, and wrote a quick shell script that took one argument, the local revision to use for my update, and updated/rebuilt/tested.
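
For anyone who wants to try the same thing, here is a rough sketch of what such a script might look like.  It is not the script I actually ran: it's written in Python rather than shell, and BUILD_CMD and TEST_CMD are hypothetical placeholders that you would swap for whatever rebuilds and tests your own tree.  The only real command in it is `hg update -r REV`.

```python
import subprocess
from typing import Callable, List

# Hypothetical placeholders: replace with the real build and test
# commands for your own tree. They are assumptions, not Mozilla's.
BUILD_CMD = ["make"]
TEST_CMD = ["make", "check"]


def steps_for(rev: str) -> List[List[str]]:
    """Commands to run, in order, for one revision."""
    return [
        ["hg", "update", "-r", rev],  # move the working copy to REV
        list(BUILD_CMD),              # rebuild
        list(TEST_CMD),               # run the tests
    ]


def try_revision(rev: str,
                 run: Callable[[List[str]], int] = subprocess.call) -> bool:
    """Update, rebuild and test at REV.

    Returns True only if every step exits with status 0, and stops at
    the first step that fails.
    """
    for cmd in steps_for(rev):
        if run(cmd) != 0:
            return False
    return True


def main(argv: List[str]) -> int:
    """Command-line entry point: takes exactly one argument, the revision."""
    if len(argv) != 2:
        print("usage: try_revision.py REV")
        return 2
    return 0 if try_revision(argv[1]) else 1
```

To use it as a script, end the file with `sys.exit(main(sys.argv))` (after an `import sys`) and run it from inside the repository with the local revision number as its one argument.  The `run` parameter exists only so the logic can be exercised without touching a real checkout.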

Next I started making some guesses.  Talking with people on irc, I guessed that this had worked sometime in the past 1-2 months. I tried going back to September and running my script.  Very quickly I found a revision where all the tests passed.  After that it was a matter of bisecting the revisions between my current (failing) revision and the passing one some months back.  After a dozen attempts, I had it: here was the last time it had worked.
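
The bisecting step is just a binary search over revision numbers.  Here is a minimal sketch in Python, assuming the local revision numbers between the two endpoints form a roughly linear history and that the tests flip from passing to failing exactly once in that range.  The function name and the `passes` callback are made up for illustration; in practice `passes` would be whatever runs your update/rebuild/test step and reports success.

```python
from typing import Callable, Tuple


def find_regression_window(good: int, bad: int,
                           passes: Callable[[int], bool]) -> Tuple[int, int]:
    """Narrow (good, bad) down to two adjacent revisions.

    `good` is a revision known to pass and `bad` a later one known to
    fail. Returns (last_passing, first_failing).
    """
    if good >= bad:
        raise ValueError("good revision must come before bad revision")
    while bad - good > 1:
        mid = (good + bad) // 2
        if passes(mid):
            good = mid   # still working here, so the break is later
        else:
            bad = mid    # already broken here, so the break is earlier
    return good, bad
```

Each attempt halves the range, so a dozen attempts is enough to cover roughly four thousand revisions (2^12 = 4096).  Mercurial also ships an `hg bisect` command that does this bookkeeping for you, marking revisions with `--good` and `--bad`.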

At the end of all this, I still have a broken build (it finished while I was writing this).  I haven't fixed my problem yet.  However, my problem turned out to be a bug that had been silently introduced a few months ago.  Someone needed to fail for it to get found and fixed.  Someone needed to do the work of figuring out when it got introduced.  Someone also needs to fix it.  But the fix only comes after it has been identified.

I tell this story so as to encourage my students (and all new contributors) to not lose heart when they fail at things.  Your failure may be pointing at a much larger issue.  In an open project like Mozilla, there are no personal bugs.  The community owns them.  You become part of that community when you contribute to finding and fixing them, and failure is how you get there.  You can't contribute without failure.
