I need to create a Zope index for a ZCatalog, and it should be a KeywordIndex. A KeywordIndex indexes a list (an array, if you like) of keywords that describe some data. For example, if the data is "Peter is a Swedish Londoner", then the keywords are ("peter", "swedish", "londoner"). But what if the data you want to index is a filename like "NameLog.txt" or "holiday-00412-juli-05.jpg"? I've quickly written a little something that seems to do a decent job. It splits the filenames (these are filenames only, no paths) by camelCase structure, dot (.), underscore (_), dash (-) and digits.
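To give a feel for the splitting rules just described, here is a minimal sketch in Python. It is not the code in filenamesplitter.py: the function name `split_filename` and the regular expression are my own assumptions, and unlike the real script it lowercases every keyword and does not keep the larger compound pieces.

```python
import re

# Hypothetical tokenizer, NOT the author's filenamesplitter.py.
# It recognises three kinds of token:
#   1. a run of capitals that is not followed by a lowercase letter ("QCD")
#   2. an optional capital followed by lowercase letters ("Player", "exe")
#   3. a run of digits ("00412")
# Dots, underscores, dashes and spaces match nothing, so they act as separators.
_TOKEN = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")


def split_filename(filename):
    """Return the lowercase keywords found in a bare filename (no path)."""
    keywords = []
    for token in _TOKEN.findall(filename):
        token = token.lower()  # assumption: keywords are stored lowercase
        if token not in keywords:  # keep first occurrence, drop repeats
            keywords.append(token)
    return keywords


print(split_filename("NameLog.txt"))
print(split_filename("holiday-00412-juli-05.jpg"))
```

A list like this could then be returned by whatever attribute or method the KeywordIndex is set up to read from each object.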

If you want to play with my little script, have a look at filenamesplitter.py. If you open that script you'll see that it tests a whole bunch of filenames (taken from the Demo issuetracker). If you want to see what the result is, here it is:

[Error-04August2005.log, Error, 04, August2005, log, Error-04August2005, 04August2005.log, August, '2005']
[ITPt.rtf, IT, Pt, rtf, 'ITPt']
[Image69.png, Image69, png, Image, '69']
[Image70.png, Image70, png, Image, '70']
[IssueTracker.py, Issue, Tracker, py, 'IssueTracker']
[IssueUserFolder.py, Issue, User, Folder, py, 'IssueUserFolder']
[Request_NameError.txt, Request, Name, Error, txt, Request_NameError, 'NameError.txt']
[STATSPAG.jpg, STATSPAG, 'jpg']
[Traceback_NameError.txt, Traceback, Name, Error, txt, Traceback_NameError, 'NameError.txt']
[addhrefs-0.8-dev.tgz, addhrefs-0, 8-dev, tgz, addhrefs, 0.8, dev.tgz, 0, '8']
[ajax_bug.bmp, ajax_bug, bmp, ajax, 'bug.bmp']
[catalogEntries.png, catalog, Entries, png, 'catalogEntries']
[demo-icatalog.png, demo-icatalog, png, demo, 'icatalog.png']
[doc2.htm, doc2, htm, doc, '2']
[dummy.dtml, dummy, 'dtml']
[1027_Sample_Chapter.pdf, 1027, Sample, Chapter, pdf, 1027_Sample_Chapter, Chapter.pdf, 'Sample_Chapter.pdf']
[10erBarcode.jpg, 10er, Barcode, jpg, 10erBarcode, 10, 'erBarcode.jpg']
[11111.txt, 11111, 'txt']
[25.gif, 25, 'gif']
[60recicla.gif, 60recicla, gif, 60, 'recicla.gif']
[DQS_certified.gif, DQS_certified, gif, DQS, 'certified.gif']
[Innovations in Behavioral Marketing and Electronic Commerce.doc, Innovations, in , Behavioral, Marketing, and , Electronic, Commerce, doc, 'Innovations in Behavioral Marketing and Electronic Commerce']
[Jason Powers Resume(1).odt, Jason, Powers, Resume, (1).odt, Jason Powers Resume(1), odt, Jason Powers Resume(, 1, ').odt']
[Jason Powers Resume1.odt, Jason, Powers, Resume1, odt, Jason Powers Resume1, Jason Powers Resume, '1']
[Jessica Biel 01.jpg, Jessica, Biel, 01.jpg, Jessica Biel 01, jpg, Jessica Biel , '01']
[LICENSE.txt, LICENSE, 'txt']
[NOTEPAD.EXE, NOTEPAD, 'EXE']
[New Text Document.txt, New, Text, Document, txt, 'New Text Document']
[QCDPlayer.exe, QCD, Player, exe, 'QCDPlayer']
[Sample word doc.doc, Sample, word doc.doc, Sample word doc, 'doc']
[Template.java, Template, 'java']
[Thumbs.db, Thumbs, 'db']
[WSAD5 Book.pdf, WSA, D5, Book, pdf, WSAD5 Book, WSAD, 5, ' Book.pdf']

The objective of this whole thing is to make it possible to search for files by their parts. Suppose you don't remember the exact name of a file you know you want to find. You think to yourself, "Was it something about QCD and/or Player? Let's try both!" and you search for "QCD or Player". Without this filename splitter, the ZCatalog would not be able to find it. What do you think?
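The search scenario can be tried out without Zope at all. The sketch below uses a plain dictionary as a stand-in for the KeywordIndex; `build_index`, `search_any` and the simplified tokenizer are hypothetical names of mine and say nothing about how ZCatalog works internally.

```python
import re
from collections import defaultdict

# Simplified, hypothetical tokenizer: capital runs, capitalised or lowercase
# words, and digit runs. Everything else is treated as a separator.
_TOKEN = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")


def build_index(filenames):
    """Map each lowercase keyword to the set of filenames containing it."""
    index = defaultdict(set)
    for name in filenames:
        for token in _TOKEN.findall(name):
            index[token.lower()].add(name)
    return index


def search_any(index, *terms):
    """Return filenames matching at least one term (an "or" search)."""
    found = set()
    for term in terms:
        # .get() avoids creating empty entries for unknown terms
        found |= index.get(term.lower(), set())
    return sorted(found)


index = build_index(["QCDPlayer.exe", "IssueTracker.py", "Thumbs.db"])
print(search_any(index, "QCD", "Player"))  # ['QCDPlayer.exe']
```

Had the index held only the whole filename "QCDPlayer.exe" as its single keyword, neither "QCD" nor "Player" would have matched anything.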

UPDATE: Fixed a bug since the first release so that the filename is split on the . (dot) as well as the - (dash), in case the filename is something like "some-file.txt".

Comments

Calvin Spealman

The only difference I would make is to keep the leading . or maybe replace it with ext: for the extension of the filename, so you can search with keywords for the file type. But that's just me.

Peter Bengtsson

But I am splitting by space so that the extension is always in the list.

Calvin Spealman

Well, yes, the extension is in the list, but I was making my suggestion because then you can differentiate between a token that is the extension and the same token happening to appear elsewhere in the filename. Maybe it wouldn't come up horribly often, but it seems like it would be a useful and simple addition.

Joerg Baach

Looks like automatic generation of tags. Could be nicely used in a zope based spotlight replacement....
