Tinkering with DXR

Dan asked me the other day if I'd set up a DXR instance for Thunderbird.  About the same time I met with Chris and Mohak to discuss Mohak's work to package DXR for Fedora; he's getting pretty close.  Both of these things pushed me to do some more work on it and try a few experiments.

I started out by creating a new environment for generating my Dehydra indexes, documenting as I went.  I then built a new DXR index for mozilla-central (Firefox) and comm-central (Thunderbird).  I was interested to see if anything in comm-central would break my code, but it was fine.

Next I decided to create some scripts to generate new indexes.  Right now I just have an old index up for people to use as a demo.  Ideally we want something that is updated all the time and tracks the changes going into the tree.  As I was doing this, I decided to also fix the number one complaint I hear from people: it's too slow.  The easiest way to fix this, without spending more time than I have, is to pre-generate all the marked-up source files.  Right now, every time you ask for a source file, it gets parsed and marked-up.  Every time.  For large files (I'm looking at you, sqlite3.c), this is a deal breaker.  For normal files it's still annoying.

So I wrote a script to create a copy of the source tree that has already been marked-up.  This took ~4 hours and generated 2.4 GB of HTML.  That seemed large, so I rewrote the script to pipe each marked-up source file through gzip as it went, which Apache and the browser handle just fine, and the result weighs in at ~400 MB.
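The pre-generation loop looks something like this.  This is just a sketch: the `markup()` function here is a stand-in for DXR's real parse-and-markup step, and the paths are made up.

```python
import gzip
import os


def markup(path):
    """Stand-in for DXR's real parse/markup pass; returns HTML for one file."""
    with open(path, "r", errors="replace") as f:
        return "<pre>%s</pre>" % f.read()  # placeholder only


def pregenerate(src, out):
    """Walk the source tree and write a gzipped, marked-up copy of every file."""
    count = 0
    for root, _dirs, files in os.walk(src):
        for name in files:
            path = os.path.join(root, name)
            dest = os.path.join(out, os.path.relpath(path, src)) + ".html.gz"
            os.makedirs(os.path.dirname(dest), exist_ok=True)
            # gzip as we go, so the tree on disk is already compressed and
            # Apache can serve the files with Content-Encoding: gzip.
            with gzip.open(dest, "wt") as f:
                f.write(markup(path))
            count += 1
    return count
```

The win is that the expensive markup happens once, offline, instead of on every page load.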

Next I wanted to see if I could replace glimpse (yes, I've been here before) with something whose license is better suited to inclusion in Fedora.  I decided to finally give Swish-e a try.  I found it quite easy to use, and especially liked that I could define FileFilters, such that my gzipped HTML files could be indexed directly.
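From memory, the relevant bit of the Swish-e config looks something along these lines; the paths are mine, and you should check the Swish-e documentation for the exact FileFilter syntax:

```
# Index the pre-generated tree of gzipped HTML (path is an example)
IndexDir markup-out
IndexOnly .gz
# Pipe each file through gzip -dc so Swish-e sees plain HTML on stdin
FileFilter .gz "/usr/bin/gzip" "-dc '%p'"
IndexContents HTML* .gz
```

The nice part is that the index is built straight from the compressed files, with no uncompressed copy of the tree needed.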

After building the index, I did some tests and found that the markup produced by my markup generator was breaking the Swish-e HTML parser.  A few minutes later the error was found and fixed.  So I tried searching.  No matter what I did I only ever got back one hit per file, which made me realize that Swish-e wasn't going to go as far as glimpse and give me the lines themselves: its results are documents, not lines in documents.  It's not a huge thing: once I have the list of matching files I can grep them on a secondary pass.  But it meant one more thing I'd have to write.  Every time I have to write more it makes me feel like I should just rewrite the whole thing.  But there's no time for that today, so this will have to do.
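That secondary pass is simple enough.  A minimal sketch, assuming Swish-e hands back a list of file paths (some of them the gzipped HTML from earlier):

```python
import gzip
import re


def grep_hits(files, pattern):
    """Given the documents Swish-e matched, pull out the matching lines."""
    rx = re.compile(pattern)
    hits = []
    for path in files:
        # The pre-generated tree is gzipped, so pick the right opener.
        opener = gzip.open if path.endswith(".gz") else open
        with opener(path, "rt", errors="replace") as f:
            for lineno, line in enumerate(f, 1):
                if rx.search(line):
                    hits.append((path, lineno, line.rstrip("\n")))
    return hits
```

Since Swish-e has already narrowed the search to a handful of files, grepping just those is cheap compared to scanning the whole tree.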

Next week I'm on vacation, but when I get back I'll set up the mozilla-central and comm-central instances using what I did today.  I'll blog the links when they go live.