I've been playing around with Reverend a bit, trying to get it to guess appropriate "Sections" for issues on the Real issuetracker. I downloaded all 140 issue texts together with their "Sections" attribute, which is a list (often of length 1). From this dataset I looped over each text and the sections within it (skipping the default section General), something like this:

data = ({'sections': ['General', 'Installation'],
         'text': "bla bla bla..."},
        {'sections': ['Filter functions'],
         'text': "Lorem ipsum foo bar..."})
for item in data:
    # skip the default section 'General'
    secs = [each for each in item['sections'] if each != 'General']
    for section in secs:
        guesser.train(section, item['text'])

Now, perhaps I should mention how I set up the guesser. Well, I just took the example code from the Divmod homepage:

from reverend.thomas import Bayes
guesser = Bayes()

Then, in my big loop I also randomly set aside about 10% of the issues for testing the trained Bayesian classifier. I then used these to see if it could guess the section based on the text alone. Something like this:

for item in data:  # in my case, only the ~10% set aside for testing
    results = sorted(guesser.guess(item['text']))
    print "Correct answer", item['sections']
    for section, score in results:
        print section, score
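
The random 10% hold-out and the check against the correct answer could be sketched roughly like this. It is only a sketch: split_holdout, top_sections and is_hit are names made up for illustration (they are not part of Reverend), and it assumes that guesser.guess() gives back (section, score) pairs as in the loop above.

```python
import random


def split_holdout(items, fraction=0.1, seed=None):
    """Shuffle a copy of items and return (training, held_out).

    Roughly `fraction` of the items (at least one) end up in held_out.
    """
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = max(1, int(round(len(shuffled) * fraction)))
    return shuffled[n_test:], shuffled[:n_test]


def top_sections(results, n=3):
    """Names of the n highest-scoring sections from (section, score) pairs."""
    ranked = sorted(results, key=lambda pair: pair[1], reverse=True)
    return [section for section, score in ranked[:n]]


def is_hit(results, correct_sections, n=3):
    """True if any of the top n guesses is a correct, non-General section."""
    wanted = set(correct_sections) - set(['General'])
    return any(section in wanted for section in top_sections(results, n))
```

With this you would train only on the first list returned by split_holdout(data), then call is_hit(guesser.guess(item['text']), item['sections']) for each item in the second list. The share of hits gives a rough idea of how useful the top few suggestions are.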

To see some sample result output, download these small files: section_classifier_result1.log, section_classifier_result2.log, section_classifier_result3.log

If you want to try the code you have to download the dataset and just use it like this:

$ python section_classifier.py 


I guess the results aren't too bad, but they're still quite useless; they would only be good enough as suggestions. What you'd need is a much larger training set, and for an issuetracker application 140 issues is already quite a lot of training material. Imagine how much worse the suggestions would be when the training material is very sparse. One great thing about Reverend is that it's very fast. I did a quick benchmark on the actual training part of the script and found that in total it took the Bayesian object 0.15 seconds to get trained on 52,000 characters. Bear in mind that this is quite irrelevant, because if performance is an issue you'd probably want to store the trained Bayesian object persistently.
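
For reference, a timing like the one above can be taken with a small helper along these lines. This is a sketch rather than the exact benchmark code: time_training is a hypothetical name, and it only assumes that the guesser has the train(section, text) method used earlier.

```python
import time


def time_training(guesser, items, skip='General'):
    """Train guesser on every non-skipped section of every item.

    Returns (elapsed_seconds, chars), where chars is the total number
    of characters passed to guesser.train().
    """
    start = time.time()
    chars = 0
    for item in items:
        for section in item['sections']:
            if section == skip:
                continue
            guesser.train(section, item['text'])
            chars += len(item['text'])
    return time.time() - start, chars
```

Calling time_training(guesser, data) on a fresh Bayes() object gives the elapsed time in seconds and the amount of text fed in. As for persistence, the Bayes object has, as far as I can tell, save() and load() methods that take a file name, but check that against your version of Reverend.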

