How to crash like you mean it

I lost most of a day today trying to track down a bug in our Pointer Lock (nee Mouse Lock) implementation. We're literally within inches of being done this patch, and to be honest, I need to get it done so I can focus on other projects. However, as is the case when programming, being close simply means you use smaller units to measure your distance, not that you reduce the number of steps. I'm hopeful that we'll have this thing back into review by mid-week. But first a digression while we fix a newly discovered crash bug.

Raymond, Steven, Diogo and myself have been collaborating on the final implementation details and the tests. Today I landed the bulk of the tests in my branch, which Raymond and Steven had written and/or fixed over the past few weeks. I was keen to see how things stood, since Diogo and I had also rewritten some key parts of our code this week. I ran the tests and 5 or 6 tests in, my browser crashed. I hadn't even had coffee yet (heya, Monday!).

Steven was online and quickly offered to fix the tests. But I told him what I'm going to tell you right now: no web page (our tests are just HTML, CSS, and JavaScript--simple web pages) should ever be able to crash the browser. Never. Ever. If you can crash the browser with JavaScript, it's a browser bug, not (only) your bug. I talk to lots of good web developers who all seem to think they've done something wrong when the browser crashes. It's never the web developer's fault, because nothing in that page should be capable of causing the crash. That isn't to say that their code doesn't do something odd, unique, or incorrect. Doing something wrong is not the same as crashing. Remember that. If you can crash the browser, you should report the bug--it might even be worth something, if it's an exploitable security issue (not all crashes are).

Unfortunately reporting this crash just meant swiveling around in my chair, so as to face the reflection in my second monitor. The first thing you need when you crash is to know where you're crashing, often called a 'stack trace,' or just a 'stack'. There are various ways to accomplish this, depending on the circumstances under which you crash. Your standard Firefox browser keeps track of these via about:crashes (i.e, type that into your address bar). Here you can see a list of crashes you've had (I have lots, maybe you have none). If you click any of your crashes, you'll be taken to Mozilla's Crash Server, which will, among other things, allow you to see your crash stack.

This isn't possible in my case. I'm running a local debug build that I made, which isn't hooked up to the crash server. However, because I've made this build, and because it has debug symbols already (you want to make debug builds while you're working on bugs for this very reason), I can simply attach my debugger to the browser before I crash, trigger the bug, and then look at the crash stack.

When you're doing Mochitests, as we are, you can run your tests in a few different ways. If you want to have time to attach your debugger to the running browser before the tests start, and then crash, you can do this:

python ./objdir-debug/_tests/testing/mochitest/runtests.py  
  --test-path=dom/tests/mochitest/pointerlock

This runs the test runner, starting a simple web server and the browser, and points it at just the tests specified in the dom/tests/mochitest/pointerlock directory. NOTE: before you do this, make sure your tests are copied into the object directory (i.e., make -C objdir-debug/dom/tests/mochitest).

Now you can attach your debugger (get your browser's PID using ps or the like, then run gdb and type: attach <pid>). Once connected to your running process, and all symbols loaded, you can tell the debugger to let your browser continue (in gdb, type 'continue'). Run your tests, and if all goes well, you'll crash.

When I crashed this time, my debugger instantly came to life, and I was able to ask for a stack trace (e.g., a 'backtrace' in gdb, using the 'bt' command). The full stack is here, but the top few frames (e.g., function calls) tell the story:

  
#0  ... in nsIFrame::GetStyleContext (this=0x0) at nsIFrame.h:716  
#1  ... in nsIFrame::PresContext (this=0x0) at nsIFrame.h:546  
#2  ... in nsIFrame::GetScreenRect (this=0x0) at /Users/dave/repos/mozilla-central/layout/generic/nsFrame.cpp:4197  
#3  ... in nsEventStateManager::GetMouseCoords (this=0x10033ff20) at /Users/dave/repos/mozilla-central/content/events/src/nsEventStateManager.cpp:4136  
#4  ... in nsEventStateManager::SetPointerLock (this=0x10033ff20, aWidget=0x127d2f240, aElement=0x119c70f80) at /Users/dave/repos/mozilla-central/content/events/src/nsEventStateManager.cpp:4111  
#5  ... in nsDOMMozPointerLock::Lock (this=0x119adc190, aTarget=0x119c71000, aSuccessCallback=0x11b81a680, aFailureCallback=0x11b81a6a0) at /Users/dave/repos/mozilla-central/dom/base/nsDOMMozPointerLock.cpp:293

Here's what this says. At some point nsDOMMozPointerLock::Lock() was called, which is what happens when the user calls navigator.mozPointer.lock() in JavaScript. This in turn called nsEventStateManager::SetPointerLock(), which stores a reference to the locked element on the nsEventStateManager. Part of that process involves figuring out where the mouse pointer is at the time of lock, since the spec says we need to keep track of this position so that we can return the mouse there when we unlock. The call to nsEventStateManager::GetMouseCoords() is where things go off the rails, as it tries to get dimensions for the element's frame in the page--notice nsIFrame::GetScreenRect (this=0x0). The "this=0x0" shows us we are calling a method on a null pointer; in other words, we have no frame for this element.

It's definitely our code that's causing the crash (our C++ code, that is). Now we have to figure out why. For the next little while we tried to isolate the specific test that was causing the issue. That turned out to be a losing battle, since we have so many tests, and since it seemed (initially) like it might be related to the interaction of multiple tests run in a particular order. Was it an OS X only bug? We didn't seem to hit it very often on Linux. The fact that it didn't crash every time was also frustrating.

Reducing it down to a particular test wasn't going to work. We needed to understand why an element would not have a frame. I asked on irc, and instantly got the same answer from two people: display:none. Why does CSS hate me so? If you style an element display:none, it won't have a frame, since it won't be visible. Seems so obvious. But why were we hitting this in our tests? Surely no one would use display:none in...our...tests. Yet here were two cases of it:

  
<iframe id="iframe" style="display: none" ... >  
  
...  
  
<div id="content" style="display: none">  
  <canvas id="canvas" width="150" height="150"></canvas>  
</div>

Many, many tests in Firefox hide content that isn't relevant to the test, and normally it wouldn't matter. But in this case, it made a lot of difference. So first, we need to remove those display:none styles. But we also need to protect against an element not having a frame when lock is called, and losing its frame once locked (display:none style added after being locked). By doing these simple checks, we won't crash in the future, no matter what the web developer does with the display style.

I wanted to write about this for my students who are doing work like this all semester, and will inevitably hit an issue like this in their own code. Crashes can be really useful when you have a strategy for using them to your advantage.