Major performance fix on file searches

19 November 2005   0 comments   Zope, IssueTrackerProduct


A week ago I ran some ad hoc benchmarks on various suspect functions in the IssueTrackerProduct and came to a clear and simple conclusion: searching is the bottleneck, and within the search it's the searching for file attachments that takes all of the time.

If you're interested and open minded, here are the results of that benchmark. This sparked some thoughts. First I wrote a filename splitter, which isn't rocket science, but I'm proud to say that its use is brilliant. Before, the find-by-file function in the IssueTrackerProduct used a plain old find() test like this:

filename.find('foo') > -1

This is very fast but not very intelligent. For example, it'll match on both foobar.txt and plainfooter.gif, even though neither contains foo as a separate word. So, what I did instead was to create a KeywordIndex and index all the split filenames in that index.
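To make the difference concrete, here's a minimal sketch in plain Python. It is not the actual IssueTrackerProduct code: the function names and the choice of splitting on every non-alphanumeric character are my assumptions for illustration only.

```python
import re


def split_filename(filename):
    """Hypothetical splitter: break a filename into its word-like parts."""
    # Split on anything that is not a letter or digit (dots, dashes,
    # underscores, spaces) and drop the empty strings that result.
    return [part for part in re.split(r"[^A-Za-z0-9]+", filename) if part]


def find_by_substring(filenames, term):
    """The old approach: a plain find() test on the whole filename."""
    return [name for name in filenames if name.find(term) > -1]


def find_by_keyword(filenames, term):
    """The new idea: match only when the term is one of the split parts."""
    return [name for name in filenames if term in split_filename(name)]


files = ["foobar.txt", "plainfooter.gif", "foo.txt", "my-foo_report.pdf"]
substring_hits = find_by_substring(files, "foo")
keyword_hits = find_by_keyword(files, "foo")
```

With these four filenames, the substring test returns all of them, while the keyword test returns only foo.txt and my-foo_report.pdf.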

What you have to remember is that KeywordIndexes are case sensitive, so when I populate the KeywordIndex I have to lowercase everything, and the search term has to be lowercased the same way. But now, let's get to the performance fix.
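The lowercasing trick can be sketched with a toy stand-in for a keyword index. This is plain Python, not Zope's KeywordIndex, and the class name, method names and issue ids are all made up for the example.

```python
import re
from collections import defaultdict


class ToyKeywordIndex:
    """Toy stand-in for a keyword index: maps keyword -> set of issue ids."""

    def __init__(self):
        self._index = defaultdict(set)

    def index_file(self, issue_id, filename):
        # Split the filename into words and lowercase each one before
        # storing it, because lookups are exact, case sensitive matches.
        for word in re.split(r"[^A-Za-z0-9]+", filename):
            if word:
                self._index[word.lower()].add(issue_id)

    def search(self, term):
        # Lowercase the search term too, so "Foo" and "FOO" both hit "foo".
        return sorted(self._index.get(term.lower(), set()))


index = ToyKeywordIndex()
index.index_file("0001", "Screenshot-Foo.PNG")
index.index_file("0002", "foobar.txt")
```

Here index.search("foo") and index.search("FOO") both find issue 0001 only. Each lookup is a single dictionary access rather than a test against every filename, which is roughly why an index search should beat walking through all the files.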

The problem was that before, I used a ZopeFind() to find files by filename, and ZopeFind() is slooooowww compared to a catalog search. Before I switched on the new code that searches the keyword index of split filenames, I did a couple of searches on an issuetracker with about 200 issues and a couple of searches on an issuetracker with more than 10,000 issues. The results can be seen here: itp-benchmark-medium-before.log and itp-benchmark-huge-before.log. Quite terrible, isn't it? Then I switched on the new code that is much more intelligent and hopefully faster. The results can be seen here: itp-benchmark-medium-after.log and itp-benchmark-huge-after.log.



As you can see, that's an enormous increase in speed. For the medium sized issuetracker, that's a 600% increase, and for the huge issuetracker that's a 2000% speed increase. Oops! Maybe this shouldn't be called optimization. Perhaps bug fixing is a better word :)

