
Playing with Reverend Bayesian

reverend, bayesian, sections

18th of October 2005

I've been playing around with Reverend a bit, trying to get it to guess appropriate "Sections" for issues on the Real issuetracker. I downloaded all 140 issue texts together with their "Sections" attribute, which is a list (often of length 1). From this dataset I looped over each text and its sections (skipping the default section General), something like this:

 data = ({'sections': ['General', 'Installation'],
          'text': "bla bla bla..."},
         {'sections': ['Filter functions'],
          'text': "Lorem ipsum foo bar..."},
         ...)
 for item in data:
     # skip the default section General
     secs = [each for each in item['sections'] if each != 'General']
     for section in secs:
         guesser.train(section, item['text'])

Now, perhaps I should mention how I set up the guesser. Well, I just took the example code from the Divmod homepage:

 from reverend.thomas import Bayes
 guesser = Bayes()

Then, in my big loop I also randomly set aside about 10% of the items for testing the trained Bayesian classifier. I then used these to see if I could guess the section from the text alone. Something like this:

 # here data is the roughly 10% that was set aside for testing
 for item in data:
     results = sorted(guesser.guess(item['text']))
     print "Correct answer", item['sections']
     for section, score in results:
         print section, score
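The "set aside about 10%" step isn't shown above, so here is a minimal sketch of one way to do it. The function name split_holdout, the seed parameter and the made-up sample data are my own assumptions for illustration, not what the original script does; it may well have just flipped a coin per item inside the loop.

```python
import random

def split_holdout(items, fraction=0.1, seed=None):
    """Return (training, testing) with roughly `fraction` of the items held out."""
    rng = random.Random(seed)
    shuffled = list(items)   # copy so the caller's data is left untouched
    rng.shuffle(shuffled)
    # Hold out at least one item so there is always something to test on.
    n_test = max(1, int(round(len(shuffled) * fraction)))
    return shuffled[n_test:], shuffled[:n_test]

# Made-up stand-in for the 140 real issues.
sample = [{'sections': ['Installation'], 'text': 'issue %d' % i}
          for i in range(140)]
training, testing = split_holdout(sample, fraction=0.1, seed=1)
```

You would then train only on training and run the guessing loop only over testing, so the guesser never sees the texts it is tested on.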

To see some sample result output, download these small files: section_classifier_result1.log, section_classifier_result2.log, section_classifier_result3.log
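To put a number on output like that, one could count how often a correct section turns up among the top few guesses. This is only a sketch: hit_rate, its top and skip parameters and the fake_guess stand-in are hypothetical names of mine, and the only thing assumed about the real guesser is that guess(text) gives back (section, score) pairs like in the loop above.

```python
def hit_rate(test_items, guess, top=3, skip=('General',)):
    """Fraction of test items where a correct section is among the `top` guesses."""
    hits = 0
    counted = 0
    for item in test_items:
        wanted = set(item['sections']) - set(skip)
        if not wanted:
            continue  # only the default section, nothing to check
        counted += 1
        # Rank by score, highest first, and keep the best `top` sections.
        ranked = sorted(guess(item['text']), key=lambda pair: pair[1], reverse=True)
        suggested = set(section for section, score in ranked[:top])
        if wanted & suggested:
            hits += 1
    return hits / float(counted) if counted else 0.0

def fake_guess(text):
    # Hypothetical stand-in for guesser.guess, with made-up scores.
    if 'install' in text:
        return [('Installation', 0.9), ('Filter functions', 0.2)]
    return [('Filter functions', 0.6), ('Installation', 0.5)]

tests = [
    {'sections': ['General', 'Installation'], 'text': 'cannot install'},
    {'sections': ['Installation'], 'text': 'something else'},
    {'sections': ['General'], 'text': 'skipped'},
]
rate_top1 = hit_rate(tests, fake_guess, top=1)
rate_top2 = hit_rate(tests, fake_guess, top=2)
```

Comparing the rate for top=1 with the rate for top=3 gives a rough feel for whether the guesser is usable as a list of suggestions even when its first guess is wrong.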

If you want to try the code, download the dataset and run the script like this:

 $ python section_classifier.py 

Conclusion

I guess the results aren't too bad, but on their own they're still quite useless. They would only be good enough as suggestions. What you would need is a much larger training set, and for an issuetracker 140 issues is already quite a lot of training material. Imagine how much worse the suggestions would be when the training material is very sparse. One great thing about Reverend is that it's very fast. I did a quick benchmark on the actual training part of that script and found that in total it took the Bayesian object 0.15 seconds to get trained on 52,000 characters. Bear in mind that this is quite irrelevant, because if performance is an issue you'd probably want to store the trained Bayesian object persistently.
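As a rough sketch of the "time it, then store it" idea, here is a version using only the standard library. Note the assumptions: train_word_counts is a made-up stand-in that just counts words per section (it is not Reverend's algorithm), save_trained and load_trained are hypothetical helpers, and whether a Reverend Bayes object pickles cleanly like this is something I have not verified. If Reverend ships its own way to save and load a trained guesser, that would be the better choice.

```python
import os
import pickle
import tempfile
import time

def train_word_counts(items):
    """Stand-in 'training': count words per section (not Reverend's algorithm)."""
    counts = {}
    for item in items:
        for section in item['sections']:
            bucket = counts.setdefault(section, {})
            for word in item['text'].lower().split():
                bucket[word] = bucket.get(word, 0) + 1
    return counts

def save_trained(obj, path):
    # Write the trained object to disk so it need not be retrained next time.
    with open(path, 'wb') as fp:
        pickle.dump(obj, fp)

def load_trained(path):
    with open(path, 'rb') as fp:
        return pickle.load(fp)

# Made-up sample issues.
items = [{'sections': ['Installation'], 'text': 'Cannot install on Windows'},
         {'sections': ['Filter functions'], 'text': 'Filter breaks on install'}]

start = time.time()
trained = train_word_counts(items)
elapsed = time.time() - start   # seconds spent on the training step only

path = os.path.join(tempfile.mkdtemp(), 'trained.pickle')
save_trained(trained, path)
restored = load_trained(path)   # same data back, without retraining
```

With something like this the training cost is paid once, and each later run only pays for loading the file.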

