
Filename splitter


zcatalog, keywordindex, filenamesplitter, splitter

15th of November 2005

I need to create a KeywordIndex for a Zope ZCatalog. A KeywordIndex holds a list (an array, if you like) of keywords that describe some data. For example, if the data is "Peter is a Swedish Londoner", then the keywords are ("peter", "swedish", "londoner"). But what if the data you want to index is a filename like "NameLog.txt" or "holiday-00412-juli-05.jpg"? I've now quickly written a little something that seems to do a decent job. It splits the filenames (these are filenames only, no paths) on camelCase boundaries, dot (.), underscore (_), dash (-) and digits.
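To make the splitting rules concrete, here is a minimal sketch of the idea. It is not the code in filenamesplitter.py: the function name `split_filename`, the regular expressions, the lowercasing and the de-duplication are all my assumptions for illustration, and it also splits on whitespace. It only handles ASCII letters.

```python
import re


def split_filename(filename):
    """Return lowercase keywords for a bare filename (hypothetical helper)."""
    # Keep the whole filename as the first keyword.
    keywords = [filename]
    # First split on dot, underscore, dash and whitespace.
    for chunk in re.split(r"[._\-\s]+", filename):
        if not chunk:
            continue
        keywords.append(chunk)
        # Then break each chunk into acronyms, camelCase words and digit runs.
        keywords.extend(
            re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", chunk)
        )
    # De-duplicate case-insensitively, keeping first-seen order.
    seen = set()
    result = []
    for keyword in keywords:
        key = keyword.lower()
        if key not in seen:
            seen.add(key)
            result.append(key)
    return result


print(split_filename("QCDPlayer.exe"))
# ['qcdplayer.exe', 'qcdplayer', 'qcd', 'player', 'exe']
print(split_filename("holiday-00412-juli-05.jpg"))
# ['holiday-00412-juli-05.jpg', 'holiday', '00412', 'juli', '05', 'jpg']
```

A list like this is the kind of value a KeywordIndex expects for each object. The exact pieces differ from what my script produces, so treat it as an illustration of the approach only.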

If you want to play with my little script, have a look at filenamesplitter.py. If you open that script you'll see that it tests a whole bunch of filenames (taken from the Demo issuetracker). If you want to see what the result is, here it is:

[Error-04August2005.log, Error, 04, August2005, log, Error-04August2005, 04August2005.log, August, '2005']
[ITPt.rtf, IT, Pt, rtf, 'ITPt']
[Image69.png, Image69, png, Image, '69']
[Image70.png, Image70, png, Image, '70']
[IssueTracker.py, Issue, Tracker, py, 'IssueTracker']
[IssueUserFolder.py, Issue, User, Folder, py, 'IssueUserFolder']
[Request_NameError.txt, Request, Name, Error, txt, Request_NameError, 'NameError.txt']
[STATSPAG.jpg, STATSPAG, 'jpg']
[Traceback_NameError.txt, Traceback, Name, Error, txt, Traceback_NameError, 'NameError.txt']
[addhrefs-0.8-dev.tgz, addhrefs-0, 8-dev, tgz, addhrefs, 0.8, dev.tgz, 0, '8']
[ajax_bug.bmp, ajax_bug, bmp, ajax, 'bug.bmp']
[catalogEntries.png, catalog, Entries, png, 'catalogEntries']
[demo-icatalog.png, demo-icatalog, png, demo, 'icatalog.png']
[doc2.htm, doc2, htm, doc, '2']
[dummy.dtml, dummy, 'dtml']
[1027_Sample_Chapter.pdf, 1027, Sample, Chapter, pdf, 1027_Sample_Chapter, Chapter.pdf, 'Sample_Chapter.pdf']
[10erBarcode.jpg, 10er, Barcode, jpg, 10erBarcode, 10, 'erBarcode.jpg']
[11111.txt, 11111, 'txt']
[25.gif, 25, 'gif']
[60recicla.gif, 60recicla, gif, 60, 'recicla.gif']
[DQS_certified.gif, DQS_certified, gif, DQS, 'certified.gif']
[Innovations in Behavioral Marketing and Electronic Commerce.doc, Innovations, in , Behavioral, Marketing, and , Electronic, Commerce, doc, 'Innovations in Behavioral Marketing and Electronic Commerce']
[Jason Powers Resume(1).odt, Jason, Powers, Resume, (1).odt, Jason Powers Resume(1), odt, Jason Powers Resume(, 1, ').odt']
[Jason Powers Resume1.odt, Jason, Powers, Resume1, odt, Jason Powers Resume1, Jason Powers Resume, '1']
[Jessica Biel 01.jpg, Jessica, Biel, 01.jpg, Jessica Biel 01, jpg, Jessica Biel , '01']
[LICENSE.txt, LICENSE, 'txt']
[NOTEPAD.EXE, NOTEPAD, 'EXE']
[New Text Document.txt, New, Text, Document, txt, 'New Text Document']
[QCDPlayer.exe, QCD, Player, exe, 'QCDPlayer']
[Sample word doc.doc, Sample, word doc.doc, Sample word doc, 'doc']
[Template.java, Template, 'java']
[Thumbs.db, Thumbs, 'db']
[WSAD5 Book.pdf, WSA, D5, Book, pdf, WSAD5 Book, WSAD, 5, ' Book.pdf']

The objective of this whole thing is to make it possible to search for files by their parts. Suppose you don't remember the exact name of a file you know you want to find. You think to yourself, "Was it something about QCD and/or Player? Let's try both!" and you then search for "QCD or Player". Without this filename splitter, the ZCatalog would not be able to find the file, because "QCDPlayer.exe" would only be indexed as a whole. What do you think?
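The search described above can be sketched with a plain dictionary standing in for the KeywordIndex. This does not use the ZCatalog API at all; `keywords_for`, `build_index` and `search_any` are hypothetical names, and the splitter here is a simplified assumption rather than my actual script.

```python
import re
from collections import defaultdict


def keywords_for(filename):
    # Simplified, hypothetical splitter: acronyms, camelCase words, digit runs.
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", filename)
    return {part.lower() for part in parts}


def build_index(filenames):
    # Map each keyword to the set of filenames that contain it.
    index = defaultdict(set)
    for name in filenames:
        for keyword in keywords_for(name):
            index[keyword].add(name)
    return index


def search_any(index, *terms):
    """Return filenames matching at least one term (an OR search)."""
    found = set()
    for term in terms:
        found |= index.get(term.lower(), set())
    return sorted(found)


index = build_index(["QCDPlayer.exe", "NOTEPAD.EXE", "IssueTracker.py"])
print(search_any(index, "QCD", "Player"))  # ['QCDPlayer.exe']
print(search_any(index, "exe"))            # ['NOTEPAD.EXE', 'QCDPlayer.exe']
```

If only the whole filename were stored as a single keyword, neither "QCD" nor "Player" would match anything, which is the problem the splitter is meant to solve.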

UPDATE: Fixed a bug since the first release so that the filename is split on the . (dot) as well as the - (dash), in case the filename is something like "some-file.txt".



Comment

Calvin Spealman - 15th November 2005
The only difference I would make is to keep the leading . or maybe replace it with ext: for the extension of the filename, so you can search with keywords for the file type. But that's just me.
Peter Bengtsson - 15th November 2005
But I am splitting by space so that the extension is always in the list.
Calvin Spealman - 16th November 2005
Well, yes, the extension is in the list, but I was making my suggestion because then you can differentiate between when the extension is in the list and when the same token happens to appear in the filename. Maybe it wouldn't come up horribly often, but it seems like it would be a useful and simple addition.
Joerg Baach - 15th November 2005
Looks like automatic generation of tags. Could be nicely used in a Zope-based Spotlight replacement....