I've learned something today. The cPickle module in Python can be sped up with very little effort. I've also learned that there's something even faster than a souped-up cPickle: marshal.

The code in question is CheckoutableTemplates, which saves information about the state of templates in Zope to a file on the file system. The first thing I did was insert a little timer, which looked something like this:

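from time import time

def write2config(*args, **kwargs):
    # Time a single call to the real writer and log how long it took.
    t0 = time()
    result = _write2configPickle(*args, **kwargs)
    elapsed = time() - t0
    debug("_write2configPickle() took %s seconds" % elapsed)
    return result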

I ran it many times over to get some sort of average time for writing a config item to file. The first result was 0.0035016271 seconds.
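The averaging step can be sketched roughly like this. It is only an illustration, not the actual CheckoutableTemplates code: the `timed` decorator, the `average` helper and the `write_item` stand-in function are all hypothetical names made up for the example.

```python
import time
from functools import wraps


def timed(func):
    """Wrap func and record how long each call takes, in seconds."""
    durations = []

    @wraps(func)
    def wrapper(*args, **kwargs):
        t0 = time.time()
        try:
            return func(*args, **kwargs)
        finally:
            # Record the elapsed time even if func raises.
            durations.append(time.time() - t0)

    wrapper.durations = durations
    return wrapper


def average(durations):
    """Mean of the recorded durations, or 0.0 if nothing was recorded."""
    return sum(durations) / len(durations) if durations else 0.0


@timed
def write_item(item):
    # Hypothetical stand-in for the real "write one config item" function.
    return len(repr(item))


for i in range(100):
    write_item({"id": i})

print("average: %.12f seconds over %d calls"
      % (average(write_item.durations), len(write_item.durations)))
```

Keeping the raw durations around, rather than only a running total, makes it easy to look at the spread as well as the mean.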

The second thing I did was rewrite the algorithm that does the writing. I managed to eliminate one avoidable read of the pickled file. Timed in the same fashion, the second result was 0.00175877291 seconds, which is twice as fast already!

Now, the cPickle.dump() function has an optional parameter called proto. I'll let the Pickler docstring explain it:

>>> print cPickle.Pickler.__doc__
Pickler(file, proto=0) -- Create a pickler.

This takes a file-like object for writing a pickle data stream.
The optional proto argument tells the pickler to use the given
protocol; supported protocols are 0, 1, 2.  The default
protocol is 0, to be backwards compatible.  (Protocol 0 is the
only protocol that can be written to a file opened in text
mode and read back successfully.  When using a protocol higher
than 0, make sure the file is opened in binary mode, both when
pickling and unpickling.)

Protocol 1 is more efficient than protocol 0; protocol 2 is
more efficient than protocol 1.

Specifying a negative protocol version selects the highest
protocol version supported.  The higher the protocol used, the
more recent the version of Python needed to read the pickle
produced.

The file parameter must have a write() method that accepts a single
string argument.  It can thus be an open file object, a StringIO
object, or any other custom object that meets this interface.

So I tried writing the pickle file in binary mode with proto=-1, which brought the average time down to 0.000777201219 seconds. That's more than twice as fast as the improved algorithm.
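To make the binary-mode point concrete, here is a small sketch of writing and reading a pickle with the highest available protocol. It assumes nothing about the real CheckoutableTemplates code: `write_config`, `read_config` and the sample `config` data are hypothetical, and the import fallback is only there so the same snippet also runs on Python 3, where cPickle has been folded into pickle.

```python
import os
import shutil
import tempfile

try:
    import cPickle as pickle  # Python 2: the C implementation
except ImportError:
    import pickle  # Python 3: the C accelerator is used automatically


def write_config(path, obj, protocol=-1):
    """Pickle obj to path; a negative protocol selects the highest one."""
    # Binary mode is needed for any protocol higher than 0.
    with open(path, "wb") as f:
        pickle.dump(obj, f, protocol)


def read_config(path):
    """Load a pickled object back, again in binary mode."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Hypothetical sample data, standing in for the real template state.
config = {"templates": ["template_%d" % i for i in range(1000)],
          "revision": 42}

tmpdir = tempfile.mkdtemp()
try:
    text_path = os.path.join(tmpdir, "config_proto0.pickle")
    binary_path = os.path.join(tmpdir, "config_highest.pickle")

    write_config(text_path, config, protocol=0)
    write_config(binary_path, config)  # protocol=-1, the highest supported

    text_size = os.path.getsize(text_path)
    binary_size = os.path.getsize(binary_path)
    restored = read_config(binary_path)
finally:
    shutil.rmtree(tmpdir)  # clean up the temporary files

print("protocol 0: %d bytes" % text_size)
print("highest protocol: %d bytes" % binary_size)
print("round trip intact: %s" % (restored == config))
```

File size is not the same thing as write time, so the speed difference still has to be measured with a timer like the one above.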

Lastly, I had completely forgotten about the marshal module. It basically does what pickle and cPickle do but is much more primitive. This is what Fredrik Lundh writes about the pickle module in the book Python Standard Library:

"It's a bit slower than marshal, but it can handle class instances, shared elements, and recursive data structures, among other things."
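A quick way to see the trade-off described in that quote is to feed both modules a class instance. The sketch below uses a `datetime.date` from the standard library as a stand-in for "a class instance"; that choice, and the sample dictionary, are assumptions made for the example and not something from the book.

```python
import datetime
import marshal

try:
    import cPickle as pickle  # Python 2
except ImportError:
    import pickle  # Python 3

# Hypothetical sample values.
plain_data = {"names": ["index_html", "standard_header"], "count": 2}
instance = datetime.date(2005, 2, 15)

# marshal copes with simple built-in types: dicts, lists, strings, ints.
marshal_plain_ok = marshal.loads(marshal.dumps(plain_data)) == plain_data

# pickle round-trips the class instance as well.
restored = pickle.loads(pickle.dumps(instance, -1))

# marshal refuses the instance: unsupported types raise ValueError.
try:
    marshal.dumps(instance)
    marshal_accepts_instance = True
except ValueError:
    marshal_accepts_instance = False

print("marshal handles plain data: %s" % marshal_plain_ok)
print("pickle restores the instance: %s" % (restored == instance))
print("marshal accepts the instance: %s" % marshal_accepts_instance)
```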

But for my particular problem, all I had to serialize was a simple but long list and a dictionary, so I can use the marshal module without any problems. Rewriting the code to use marshal instead of cPickle gave it another boost, so the fourth and last result was 0.000445931848 seconds. That's a little less than twice as fast as the previous solution. But from the beginning to the end, the time went from 0.00350162718 to 0.000445931848 seconds, which is roughly a factor of 8! Pretty neat, huh?
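In outline, a marshal-based writer looks much the same as a pickle-based one. This is a sketch under assumptions of its own rather than the real code: `write_config_marshal`, `read_config_marshal` and the sample list and dictionary are hypothetical.

```python
import marshal
import os
import shutil
import tempfile


def write_config_marshal(path, obj):
    """Serialize obj (built-in types only) to path with marshal."""
    with open(path, "wb") as f:
        marshal.dump(obj, f)


def read_config_marshal(path):
    """Read back whatever write_config_marshal stored."""
    with open(path, "rb") as f:
        return marshal.load(f)


# Hypothetical data: a long list plus a dictionary, stored together.
items = ["template_%d" % i for i in range(5000)]
settings = {"checkedout": True, "revision": 7}
payload = [items, settings]

tmpdir = tempfile.mkdtemp()
try:
    path = os.path.join(tmpdir, "config.marshal")
    write_config_marshal(path, payload)
    loaded_items, loaded_settings = read_config_marshal(path)
finally:
    shutil.rmtree(tmpdir)  # clean up the temporary file

print("list intact: %s" % (loaded_items == items))
print("dict intact: %s" % (loaded_settings == settings))
```

Storing the list and the dictionary in one container keeps it to a single dump and a single load per file.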

Results (average seconds per write, slowest first):

0.00350162718031        OLD_write2Pickleconfig()
0.00175877291747        _write2Pickleconfig()
0.000777201219039       _write2Pickleconfig(proto=-1)
0.000445931848854       _write2marshalconfig()
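The ratios quoted in the text can be checked with a few lines of arithmetic. The numbers are the four averages from the results list; the stage labels and the `speedups` helper are shorthand made up for this example.

```python
# Average seconds per write, copied from the results list.
timings = [
    ("original", 0.00350162718031),
    ("one read removed", 0.00175877291747),
    ("proto=-1", 0.000777201219039),
    ("marshal", 0.000445931848854),
]


def speedups(results):
    """Return (label, factor vs previous stage, factor vs first stage)."""
    first = results[0][1]
    previous = first
    rows = []
    for label, seconds in results:
        rows.append((label, previous / seconds, first / seconds))
        previous = seconds
    return rows


for label, step, total in speedups(timings):
    print("%-18s x%.2f vs previous, x%.2f overall" % (label, step, total))
```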

Andrew - 15 February 2005
Instead of writing your own timing stuff, why not use the timeit module in the standard library?
Peter - 15 February 2005
Isn't timeit just for strings of code?
In my setup I was able to "listen in" on a running program.
Joe Wreschnig - 15 February 2005
The problem with the marshal module is that the underlying format is subject to change without notice. This can make it useful for passing objects around as strings between Python instances on the same machine, but not very useful for any kind of long-term storage. Mostly it's designed to generate pyc files.
Peter - 15 February 2005
Interesting.
Fortunately my program doesn't need to do long-term storage. I also don't need to open the serialized file for any purpose other than debugging.

