I've been playing around with Reverend a bit, trying to get it to correctly guess appropriate "Sections" for issues on the Real issuetracker. What I did was download all 140 issue texts and their "Sections" attribute, which is a list (often of length 1). From this dataset I looped over each text and the sections within it (skipping the default section General), so something like this:


data = ({'sections':['General','Installation'],
         'text':"bla bla bla..."},
        {'sections':['Filter functions'],
         'text':"Lorem ipsum foo bar..."},
        ...)
for item in data:
    # skip the default section General
    secs = [each for each in item['sections'] if each != 'General']
    for section in secs:
        guesser.train(section, item['text'])

Now, perhaps I should mention how I set up the guesser. Well, I just took the example code from the Divmod homepage:


from reverend.thomas import Bayes
guesser = Bayes()

Then, in my big loop I also randomly set aside about 10% for sample testing on the trained Bayesian classifier. This I then used to see if I could guess the section based on the text alone. Something like this:


for item in data:
    results = sorted(guesser.guess(item['text']))
    print "Correct answer", item['sections']
    for section, score in results:
        print section, score

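To tie the pieces together, here's a minimal sketch of how that 10% holdout could look, reusing the guesser and data from above; the train_data/test_data names and the random.random() threshold are just my illustration here, not the exact code in the script:


import random

# randomly hold out roughly 10% of the items for testing
train_data, test_data = [], []
for item in data:
    if random.random() < 0.1:
        test_data.append(item)
    else:
        train_data.append(item)

# train on the ~90%
for item in train_data:
    for section in [each for each in item['sections'] if each != 'General']:
        guesser.train(section, item['text'])

# guess sections for the held-out ~10%
for item in test_data:
    results = sorted(guesser.guess(item['text']))
    print "Correct answer", item['sections']
    for section, score in results:
        print section, score
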
To see some sample result output, download these small files: section_classifier_result1.log, section_classifier_result2.log, section_classifier_result3.log

If you want to try the code you have to download the dataset and just use it like this:


$ python section_classifier.py 

Conclusion

I guess the results aren't too bad, but still quite useless; they would only be good enough as suggestions. What you would need is a much larger training set, and for an issuetracker application 140 issues is already quite a lot of training material. Imagine how much worse the suggestions would be when the training material is very sparse. One great thing about Reverend is that it's very fast. I did a quick benchmark on the actual training part of that script and found that in total it took the Bayesian object 0.15 seconds to get trained on 52,000 characters. Bear in mind that this is quite irrelevant, because if performance is an issue you'd probably want to store the trained Bayesian object persistently.
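
For what it's worth, persisting it could be as simple as pickling the trained object to disk. A rough sketch (I haven't checked that the Bayes object pickles cleanly, and the filename is made up):


import cPickle as pickle

# save the trained classifier so it doesn't have to be retrained on every run
f = open('section_guesser.pickle', 'wb')
pickle.dump(guesser, f)
f.close()

# ...and later, load it back instead of training from scratch
f = open('section_guesser.pickle', 'rb')
guesser = pickle.load(f)
f.close()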
