Jun 19, 2014. Sphinx Searches Code at Searchcode.com

Searchcode.com is a source code search engine that uses Sphinx. In this blog post, we hear from Ben Boyter about how he has overcome the unique challenges presented by searching source code.

Code search is difficult

Ben wrote about why code search is difficult here. The central challenges boil down to word boundaries and special characters. You can’t rely on spaces alone to delimit terms and you have to index all the special characters. Consider the following:

i++

This should match all of the examples below:

for(i=0; i++; i<100) {
for(i=0;i++;i<100) {
spliti++;
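To make the word-boundary problem concrete, here is a minimal Python sketch. It is an illustration only, not how searchcode or Sphinx works internally, and the function names are made up for this example. It compares matching on whitespace-delimited terms with a plain substring check over the three lines above.

```python
# Illustration only: the helper names below are hypothetical.
QUERY = "i++"
LINES = [
    "for(i=0; i++; i<100) {",
    "for(i=0;i++;i<100) {",
    "spliti++;",
]


def whitespace_match(query, line):
    """Match only when the query equals a whole whitespace-delimited term."""
    return query in line.split()


def substring_match(query, line):
    """Match when the query appears anywhere in the line."""
    return query in line


naive_hits = [whitespace_match(QUERY, line) for line in LINES]
substring_hits = [substring_match(QUERY, line) for line in LINES]

for line, naive, sub in zip(LINES, naive_hits, substring_hits):
    print(f"{line!r}: whitespace={naive} substring={sub}")
```

Splitting on spaces misses every line, because the closest term it produces is "i++;" rather than "i++". A substring scan finds all three, but scanning every document is not practical at index scale, which is why the special characters have to be indexed.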

A quick comparison

The Searchcode team has overcome the challenges associated with searching source code, and the result works quite well. The following examples compare the same query on searchcode, GitHub and code.ohloh.net:

i++ at searchcode
i++ at github
i++ at code.ohloh.net

How it’s done

Let’s hear from Ben about how he does it:

Sphinx has powered the search functionality of searchcode since it began in 2010. It provides the raw searching and faceting functionality over 19 billion lines of source code. Each document has over 6 facets and there are over 40 million documents in the index at any time. Sphinx serves over 500,000 queries a month from this with the average query returning in less than a second.

Searchcode is an unusual beast in that while it doesn’t index as many documents as other large installations, it indexes a lot more data. This is because the average document size is larger and also because of the way source code is delimited. It is also necessary to index special characters. The result of these requirements is that the index, when built, is approximately 3 to 4 times larger than the data being indexed. The special transformations required are accomplished with a thin wrapper on top of Sphinx which modifies the text processing pipeline. This is applied both when Sphinx is indexing and when running queries. The resulting index is over 800 gigabytes in size on disk and when preloaded consumes over 25 gigabytes of RAM.
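The post does not say what the wrapper's transformation actually is. As a rough Python sketch of the general idea only, the code below rewrites each special character as a word-like placeholder token, so that a word-oriented engine can index it, and applies the same function to documents and to queries. The placeholder scheme and the names `transform` and `contains_phrase` are assumptions made for this example, not searchcode's implementation.

```python
import re

# Identifier-like runs, or any single non-space special character.
_TOKEN = re.compile(r"[A-Za-z0-9_]+|[^\sA-Za-z0-9_]")


def transform(text):
    """Rewrite special characters as word-like placeholders (hypothetical scheme).

    "+" becomes "u2b", ";" becomes "u3b", and so on. A real system would need
    placeholders that cannot collide with identifiers found in the code.
    """
    out = []
    for tok in _TOKEN.findall(text):
        if tok[0].isalnum() or tok[0] == "_":
            out.append(tok)
        else:
            out.append("u%x" % ord(tok))
    return " ".join(out)


def contains_phrase(document, query):
    """Apply the same transform to both sides, then look for the query
    tokens as a contiguous run inside the document tokens."""
    doc = transform(document).split()
    qry = transform(query).split()
    if not qry:
        return False
    return any(doc[i:i + len(qry)] == qry for i in range(len(doc) - len(qry) + 1))


print(transform("i++"))
print(contains_phrase("for(i=0;i++;i<100) {", "i++"))
```

With this scheme the two `for` lines produce the same token stream whether or not they contain spaces. A line such as `spliti++;` would still need some form of infix or suffix matching on the identifier, which this sketch does not attempt.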

This is all served by a single quad-core i7 server with 32 gigabytes of RAM. The index is distributed and split into 4 parts. All queries run over network agents, allowing the setup to scale out seamlessly. Because of the size of the index and how long indexing takes, each part is only reindexed every week and a small delta index is used to provide recent updates.
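Sphinx handles the fan-out to agents itself through its distributed index type, so none of this has to be written by hand. Purely to illustrate the scatter-gather idea behind split main parts plus a delta, here is a small Python sketch. The shard callables, the `search_all` name and the rule that later shards override earlier ones are all assumptions for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def search_all(shards, query, limit=10):
    """Query every shard concurrently and merge the hits.

    `shards` is a list of callables taking a query and returning a list of
    (doc_id, score) pairs. By convention in this sketch the delta shard is
    listed last, so its hit replaces an older hit for the same document.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        # pool.map preserves the order of `shards`, so the delta stays last.
        partials = list(pool.map(lambda shard: shard(query), shards))

    merged = {}
    for hits in partials:
        for doc_id, score in hits:
            merged[doc_id] = score  # later shards win

    # Highest score first; ties broken by document id for stable output.
    ranked = sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:limit]


# Stand-in shards returning canned results instead of calling a search daemon.
main_1 = lambda q: [(1, 0.5), (2, 0.9)]
main_2 = lambda q: [(3, 0.7)]
delta = lambda q: [(2, 0.4)]

print(search_all([main_1, main_2, delta], "i++", limit=2))
```

Because each part is an independent unit behind a network call, moving a part to another machine changes where the call goes but not how results are merged.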

Every query on searchcode runs multiple times as a way of improving results and avoiding cache rot. The first run uses the Sphinx ranking mode BM25 and subsequent runs use SPH04. BM25 uses a little less CPU than SPH04, so new queries use it, as return time to the user is important. All subsequent runs happen as an offline asynchronous task which does some further processing and updates the cache, so the next time the query is run the results are more accurate. Commonly run queries are added to the asynchronous queue after the indexes have been rotated to provide fresh search results at all times. Searchcode is currently very CPU bound and, given the resources, could improve search times 4x with very little effort simply by moving each of the Sphinx indexes to individual machines.
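The flow described above can be sketched in Python as a cache with a cheap first pass and a more expensive offline pass. This is a simplified, single-process illustration. The class and method names are hypothetical, and the two search callables stand in for Sphinx queries run with the BM25 and SPH04 rankers.

```python
from collections import deque


class TwoPassSearch:
    """Serve a fast first answer, then refine the cached result offline."""

    def __init__(self, fast_search, accurate_search):
        self.fast_search = fast_search          # stand-in for a BM25-ranked query
        self.accurate_search = accurate_search  # stand-in for an SPH04-ranked query
        self.cache = {}
        self.queue = deque()

    def search(self, query):
        """Return cached results if present, else run the cheap ranker."""
        if query in self.cache:
            return self.cache[query]
        results = self.fast_search(query)
        self.cache[query] = results
        self.queue.append(query)  # schedule the more expensive pass
        return results

    def run_offline_pass(self):
        """Drain the queue and overwrite cache entries with refined results."""
        while self.queue:
            query = self.queue.popleft()
            self.cache[query] = self.accurate_search(query)

    def on_index_rotated(self, common_queries):
        """Re-queue popular queries so their cached results stay fresh."""
        self.queue.extend(common_queries)


engine = TwoPassSearch(lambda q: ["fast:" + q], lambda q: ["accurate:" + q])
first = engine.search("i++")
engine.run_offline_pass()
second = engine.search("i++")
print(first, second)
```

In a real deployment the offline pass would run in a separate worker rather than being called inline, and the cache would need an eviction policy.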

Searchcode updates to the latest stable version of Sphinx for every release. This has happened for every version from 0.9.8 all the way to 2.1.8, which is currently being used. There has never been a single issue with any upgrade, and each upgrade has resolved an issue that was previously encountered. This stability is one of the main reasons for having chosen Sphinx initially.

The only issues encountered with Sphinx to date were some limits on the number of facets, which have been resolved in the latest versions. Any other problem has been due to configuration issues, which were quickly resolved.

In short, Sphinx is an awesome project. It has seamless backwards compatibility, scales up to massive loads and still returns results quickly and accurately. Having since worked with Solr and Xapian, I would still choose Sphinx as searchcode’s indexing solution. I consider Sphinx as the Nginx of the indexing world. It may not have every feature possible, but it’s extremely fast and capable, and the features it does have work for 99% of solutions.



2 Responses to “Sphinx Searches Code at Searchcode.com”

  1. Izer0 says:

    UI needs improvements, it’s confusing. Old Google code search had very good UI with syntax highlighting and other features. For first quick step I suggest to bold founded text, this help more. Good luck guys.

  2. Ben Boyter says:

    Hi Izer0,

    Thanks for that. I have the highlighting almost done and will push it out in the next day or two.

    I will also be looking more at Google Code Search and replicating a lot of the functionality there based on what I can find out.

    Just for the record, it’s actually just one person working on searchcode at the moment, but thanks!
