So, funny story. The other day Chris grabbed me on irc to ask one of his usual innocent questions:
ctyler: so how long do you think it would take to write a drop-in replacement for glimpse?
humph: without looking at the code, no idea.
ctyler: I was thinking less than a week
I'll jump to the end of my story for a second and tell you that, just as I thought, he was wrong. Way wrong. Several orders of magnitude wrong. In fact, it took 10 minutes.
Glimpse is the indexing/query tool used in MXR to do string searches (like this one) . Chris and I are starting to think about packaging the MDRK, and Chris was looking ahead to doing a Fedora package. The trouble with glimpse, at least one of its issues, is that it isn't licensed very well. Sure you can use it for "free," but that's not how we like to work, and that's now how you get into a Linux distro that cares about free and open source software.
Well, I'm more than a bit skeptical about life in general, and Chris' idea of replacing glimpse in a week was too much for me. After laughing audibly at him into my irc client, we set about doing some tests. "How fast would it need to be?"..."No slower than glimpse, I guess"..."How fast is glimpse?"..."Well, you can test it at the command line..." And so we found ourselves at the command line.
Here's how glimpse works. Let's say you want to index and search Mozilla source. You could do this:
- mkdir ~/ff
- cvs -d :pserver:firstname.lastname@example.org:/cvsroot co mozilla/client.mk
- cd ff
- make -f client.mk checkout MOZ_CO_PROJECT=browser
- mkdir ~/ff-index
- echo '/CVS/' > ~/ff-index/.glimpse_exclude
- glimpseindex ~/ff/mozilla -H ~/ff-index
Now you have a source tree and an index of that tree (~11M), and you can start to do queries. Here's how you do that (or more specifically, here's how MXR uses glimpse to do that):
glimpse -i -H ~/ff-index -y -n -e search_string
So, if you wanted to search for XPCOM, you'd do this:
glimpse -i -H ~/ff-index -y -n -e XPCOM
That will spit out line-after-line in the following form:
filepath: line_number: line
The fact that this is something you can do from the command line is great, because it makes it easy to experiment and time it. After playing with this for a while, and getting a sense of the time things take, Chris thought it would be interesting to compare it to doing a recursive grep of the tree instead. I know, funny, right? I stopped laughing when I saw the numbers.
It turns out that after the files are loaded into disk cache, grep actually does really well looking through the tree. So well, in fact, that we decided to hack-up a few tests in order to find its performance floor. Here's a pretty graph (props to the Google Chart API) that helps show you what we found:
What we did was to come-up with 21 queries (simple strings, case sensitive strings, regular expressions, multiple words, etc.) in order to simulate how different people might search through the tree. We then wrote parallel versions of MXR's search (i.e., the wrapper around glimpse), using glimpse and variations on grep. Then we ran our tests 10 times for each of the 21 queries on each search back-end, and averaged the results. This graph shows you how long it took for a query to be run and return HTML results.
At first we used glimpe (red) and a pure recursive grep (green), and found that we were never faster than glimpse, except when glimpse fell down trying to do a complex regex or edge case search query. For example, glimpse can find all instances of 'int' in 0.297 seconds, while grep takes 2.371 seconds. On the whole, glimpse has a lot of jitter, where grep is never as fast, but also never really slow. It's like a dependable, if somewhat plodding, shoe.
After experimenting with variations on egrep, fgrep, etc. it became clear that I/O was grep's Achilles heel: it doesn't matter what you ask grep to do (within reason), it will basically cost you the recursive tree walk. "What if we can get rid of the recursive I/O?" Good idea:
- cd ~/ff
- grep -nRI --exclude=CVS . mozilla > everything
- grep -F 'XPCOM' everything
Now we have a flat file that contains the whole tree, with filename:line_num:line. What happens when you grep everything vs. the recursive tree? I get a smile on my face, that's what:
glimpse search for 'PRInt64' = 1.819 seconds
recursive grep for 'PRInt64' = 1.307 seconds
grep of everything for 'PRInt64' = 0.335 seconds
Regex you say? Allow me:
glimpse search for 'map<[^>]*>' = 5.928 seconds
recursive grep for 'map<[^>]*>' = 1.525 seconds
grep of everything for 'map<[^>]*>' = 0.575 seconds
Can we do better? What if we gzip the everything file and zcat that into grep? Turns out this is a net loss (see graph above). What if we replace any extraneous white space with a single space? That works the best of any solution we tried (see truncated-flat-grep in the graph above), but it breaks the layout of search results. Maybe we can account for this somehow. Also, since MXR caps results at 1001, why shouldn't we stop searching in everything when we hit that too? Thank-you very much, another win!
There's no sense having this much fun and not letting you play along at home. Here is the glimpse version of search, and here is the non-truncated flat-file grep version. It's hard not to love the underdog, especially when it comes from behind and lays a whooping on the favourite.
- These tests were all conducted on a "good" PC running Fedora 8. My initial tests on a MacBook Pro have been not as good, which I'm pinning on my underpowered hard drive. UPDATE: grep on everything on my mbp is similarly fast, so my smile stays.
- We make no assertions about glimpse in general, only in this case, and with this data set. However, this particular case is our case, and we want it to be fast and free to ship to end-users.
- Q: Why doesn't the flat-grep version find matches in the filenames? A: because we build a regexp that takes this into account: /:[^:]:$pattern/
- I <3 grep