Filename splitter

15 November 2005   4 comments   Zope, Python

I need to create a Zope index for a ZCatalog that is a KeywordIndex. A KeywordIndex holds a list (an array, if you like) of keywords that describe some data. For example, if the data is "Peter is a Swedish Londoner", then the keywords are ("peter", "swedish", "londoner"). But what if the data you want to index is a filename like "NameLog.txt" or "holiday-00412-juli-05.jpg"? I've now quickly written a little something that seems to do a decent job. It splits the filenames (these are filenames only, no paths) by camelCase structure, dot (.), underscore (_), dash (-) and digits.
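To make the splitting rules concrete, here is a minimal sketch of the idea. It is not the script from this post: the function name `split_filename` and both regular expressions are assumptions made for illustration, and it only emits the smallest parts, lowercased, without the longer combined tokens.

```python
import re

# Hypothetical helper, not the original script from this post.
# Separators named in the text: dot, underscore, dash (plus whitespace).
_SEPARATORS = re.compile(r"[._\-\s]+")

# Within each chunk, pick out: runs of digits, runs of capitals that are
# not followed by a lowercase letter (e.g. "QCD"), capitalised words
# (e.g. "Player"), and plain lowercase runs.
_PARTS = re.compile(r"\d+|[A-Z]+(?![a-z])|[A-Z][a-z]+|[a-z]+")


def split_filename(filename):
    """Return unique lowercase keywords in order of first appearance."""
    keywords = []
    for chunk in _SEPARATORS.split(filename):
        for part in _PARTS.findall(chunk):
            part = part.lower()
            if part not in keywords:
                keywords.append(part)
    return keywords


if __name__ == "__main__":
    for name in ("NameLog.txt", "holiday-00412-juli-05.jpg", "QCDPlayer.exe"):
        print(name, split_filename(name))
```

With these assumptions, "NameLog.txt" comes out as name, log, txt, and "QCDPlayer.exe" as qcd, player, exe.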

If you want to play with my little script, have a look at it. If you open the script you'll see that it tests a whole bunch of filenames (taken from the Demo issuetracker), and if you want to see what the result is, here it is:

[Error-04August2005.log, Error, 04, August2005, log, Error-04August2005, 04August2005.log, August, '2005']
[ITPt.rtf, IT, Pt, rtf, 'ITPt']
[Image69.png, Image69, png, Image, '69']
[Image70.png, Image70, png, Image, '70']
[, Issue, Tracker, py, 'IssueTracker']
[, Issue, User, Folder, py, 'IssueUserFolder']
[Request_NameError.txt, Request, Name, Error, txt, Request_NameError, 'NameError.txt']
[Traceback_NameError.txt, Traceback, Name, Error, txt, Traceback_NameError, 'NameError.txt']
[addhrefs-0.8-dev.tgz, addhrefs-0, 8-dev, tgz, addhrefs, 0.8, dev.tgz, 0, '8']
[ajax_bug.bmp, ajax_bug, bmp, ajax, 'bug.bmp']
[catalogEntries.png, catalog, Entries, png, 'catalogEntries']
[demo-icatalog.png, demo-icatalog, png, demo, 'icatalog.png']
[doc2.htm, doc2, htm, doc, '2']
[dummy.dtml, dummy, 'dtml']
[1027_Sample_Chapter.pdf, 1027, Sample, Chapter, pdf, 1027_Sample_Chapter, Chapter.pdf, 'Sample_Chapter.pdf']
[10erBarcode.jpg, 10er, Barcode, jpg, 10erBarcode, 10, 'erBarcode.jpg']
[11111.txt, 11111, 'txt']
[25.gif, 25, 'gif']
[60recicla.gif, 60recicla, gif, 60, 'recicla.gif']
[DQS_certified.gif, DQS_certified, gif, DQS, 'certified.gif']
[Innovations in Behavioral Marketing and Electronic Commerce.doc, Innovations, in , Behavioral, Marketing, and , Electronic, Commerce, doc, 'Innovations in Behavioral Marketing and Electronic Commerce']
[Jason Powers Resume(1).odt, Jason, Powers, Resume, (1).odt, Jason Powers Resume(1), odt, Jason Powers Resume(, 1, ').odt']
[Jason Powers Resume1.odt, Jason, Powers, Resume1, odt, Jason Powers Resume1, Jason Powers Resume, '1']
[Jessica Biel 01.jpg, Jessica, Biel, 01.jpg, Jessica Biel 01, jpg, Jessica Biel , '01']
[LICENSE.txt, LICENSE, 'txt']
[New Text Document.txt, New, Text, Document, txt, 'New Text Document']
[QCDPlayer.exe, QCD, Player, exe, 'QCDPlayer']
[Sample word doc.doc, Sample, word doc.doc, Sample word doc, 'doc']
[, Template, 'java']
[Thumbs.db, Thumbs, 'db']
[WSAD5 Book.pdf, WSA, D5, Book, pdf, WSAD5 Book, WSAD, 5, ' Book.pdf']

The objective of this whole thing is to make it possible to search for files by their parts. Suppose you don't remember the exact name of the file you want to find. You think to yourself, "Was it something about QCD and/or Player? Let's try both!" and you then search for "QCD or Player". Without this filename splitter, the ZCatalog would not be able to find the file. What do you think?
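The effect on searching can be illustrated with a tiny stand-in for a keyword index. This is plain Python, not the ZCatalog API; `build_index`, `search_any` and the hand-written keyword lists are assumptions made for the example.

```python
from collections import defaultdict

# Hand-written, lowercased keywords per file (assumed for illustration;
# in practice a filename splitter would produce them).
SPLIT_KEYWORDS = {
    "QCDPlayer.exe": ["qcd", "player", "exe", "qcdplayer"],
    "catalogEntries.png": ["catalog", "entries", "png", "catalogentries"],
    "Thumbs.db": ["thumbs", "db"],
}

# Without a splitter, the only keyword for each file is its whole name.
UNSPLIT_KEYWORDS = {name: [name.lower()] for name in SPLIT_KEYWORDS}


def build_index(files):
    """Map each keyword to the set of filenames that carry it."""
    index = defaultdict(set)
    for filename, keywords in files.items():
        for keyword in keywords:
            index[keyword].add(filename)
    return index


def search_any(index, terms):
    """OR search: return filenames matching at least one term."""
    found = set()
    for term in terms:
        found |= index.get(term.lower(), set())
    return sorted(found)


if __name__ == "__main__":
    query = ["QCD", "Player"]
    print(search_any(build_index(SPLIT_KEYWORDS), query))    # finds QCDPlayer.exe
    print(search_any(build_index(UNSPLIT_KEYWORDS), query))  # finds nothing
```

An index built from the split keywords matches the "QCD or Player" query, while an index holding only whole filenames returns nothing for it.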

UPDATE: Fixed a bug since the first release so that the filename is split on the . (dot) as well as the - (dash), in case the filename is "some-file.txt".


Calvin Spealman
The only difference I would make is to keep the leading . or maybe replace it with ext: for the extension of the filename, so you can search with keywords for the file type. But that's just me.
Peter Bengtsson
But I am splitting by space so that the extension is always in the list.
Calvin Spealman
Well, yes, the extension is in the list, but I was making my suggestion because then you can differentiate between the extension being in the list and the same token happening to appear in the filename. Maybe it wouldn't come up horribly often, but it seems like it would be a useful and simple addition.
Joerg Baach
Looks like automatic generation of tags. Could be nicely used in a zope based spotlight replacement....