Tinkering with DXR

Dan asked me the other day if I'd set up a DXR instance for Thunderbird. About the same time, I met with Chris and Mohak to discuss his work to package DXR for Fedora--he's getting pretty close. Both of these things pushed me to do some more work on it and try a few experiments.

I started out by creating a new environment for generating my dehydra indexes, documenting as I went. I then built new DXR indexes for mozilla-central (Firefox) and comm-central (Thunderbird). I was interested to see if anything in comm-central would break my code, but it was fine.

Next I decided to create some scripts to generate new indexes. Right now I just have an old index up for people to use as a demo. Ideally we want something that is updated all the time and tracks the changes going into the tree. As I was doing this, I decided to also fix the number one complaint I hear from people: it's too slow. The easiest way to fix this, without spending more time than I have, is to pre-generate all the marked-up source files. Right now, every time you ask for a source file, it gets parsed and marked up. Every time. For large files (I'm looking at you, sqlite3.c), this is a deal breaker. For normal files it is still annoying.

So I wrote a script to create a copy of the source tree that has already been marked up. This took ~4 hours and generated 2.4 GB of HTML. That seemed large, so I rewrote the script to pipe the marked-up source files to gzip as it went, which Apache and the browser will handle just fine. The gzipped tree weighs in at ~400 MB.
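For the curious, the pre-generation step looks roughly like the sketch below. This is not the actual script, just the shape of it: `mark_up` is a made-up stand-in for DXR's real markup code (here it only escapes the text and wraps it in a `<pre>`), and `pregenerate`, the `.html.gz` suffix, and the directory arguments are names I've picked for illustration. It also assumes the output directory lives outside the source tree.

```python
import gzip
import html
import os


def mark_up(source_text):
    # Hypothetical stand-in for the real markup generator:
    # escape the source and wrap it in a minimal HTML page.
    return "<html><body><pre>" + html.escape(source_text) + "</pre></body></html>"


def pregenerate(src_root, out_root):
    """Write one gzipped HTML file per source file, mirroring the tree layout."""
    written = []
    for dirpath, _dirnames, filenames in os.walk(src_root):
        # Recreate the same relative directory under the output root.
        rel_dir = os.path.relpath(dirpath, src_root)
        target_dir = os.path.normpath(os.path.join(out_root, rel_dir))
        os.makedirs(target_dir, exist_ok=True)
        for name in filenames:
            src_path = os.path.join(dirpath, name)
            with open(src_path, encoding="utf-8", errors="replace") as f:
                page = mark_up(f.read())
            # Compress on the way out so the full-size HTML never hits disk.
            out_path = os.path.join(target_dir, name + ".html.gz")
            with gzip.open(out_path, "wt", encoding="utf-8") as out:
                out.write(page)
            written.append(out_path)
    return written
```

Calling `pregenerate("mozilla-central", "marked-up")` would walk the tree once and leave a parallel tree of `.html.gz` files behind. The web server would then need to be told to send those files with a gzip content encoding, which I've left out here.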

Next I wanted to see if I could replace glimpse (yes, I've been here before) with something better licensed for inclusion in Fedora. I decided to finally give Swish-e a try. I found it quite easy to use, and especially liked that I could define FileFilters, such that my gzipped HTML files could be indexed directly.

After building the index, I did some tests and found that the markup produced by my markup generator was breaking the Swish-e HTML parser. A few minutes later the error was found and fixed. So I tried searching. No matter what I did, I only ever got back one hit per page, which made me realize that Swish-e wasn't going to go as far as glimpse and give me the lines themselves: results are documents, not lines in documents. It's not a huge thing: once I have the list of matching files I can grep on a secondary pass. But it meant that it was one more thing I'd have to write. Every time I have to write more, it makes me feel like I should just rewrite the whole thing. But there's no time for that today, so this will have to do.
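The secondary pass doesn't need to be fancy. Here is a minimal sketch of what I have in mind, assuming the search step hands back a plain list of source file paths and that a simple substring match is good enough. `grep_files` and its arguments are names invented for this example, not anything Swish-e or DXR provides.

```python
def grep_files(paths, term):
    """Return (path, line_number, line) for every line containing term."""
    hits = []
    for path in paths:
        # errors="replace" keeps odd encodings from aborting the scan.
        with open(path, encoding="utf-8", errors="replace") as f:
            for lineno, line in enumerate(f, start=1):
                if term in line:
                    hits.append((path, lineno, line.rstrip("\n")))
    return hits
```

Since the index has already narrowed things down to the files that match, this only ever reads a handful of files per query, so a linear scan like this should be cheap enough.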

Next week I'm on vacation, but when I get back I'll set up the mozilla-central and comm-central instances using what I did today. I'll blog the links when they go live.