However, the morning was given over to working on and improving our Project Gutenberg algorithm. I never explained my version of it, so:
For my first version, I did what I considered the obvious: I trained the program by taking the training books, making a hash for each category, and counting all the meaningful words. (We were given a list of "stopwords" that we could discount, like "and," "the," and "Gutenberg." Part of me rebelled against that since, as we know from Franco Moretti, you can sometimes learn a lot by looking at "the" vs. "a." Lit-nerd out.) I then took the top few words for each category and looked for them in the books-to-be-tested.
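For the curious, here is a rough Python sketch of that first approach. It is not my actual code: the function names, the tiny stopword set, and the default of three top words are all made up for illustration.

```python
import re
from collections import Counter

# Hypothetical stand-in for the stopword list we were given.
STOPWORDS = {"and", "the", "gutenberg"}

def tokenize(text):
    """Lowercase the text, split it into words, and drop stopwords."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]

def train(books_by_category, top_n=3):
    """Count words per category and keep the top_n most common for each."""
    top_words = {}
    for category, books in books_by_category.items():
        counts = Counter()
        for text in books:
            counts.update(tokenize(text))
        top_words[category] = [word for word, _ in counts.most_common(top_n)]
    return top_words

def classify(text, top_words):
    """Pick the category whose top words show up most often in the text."""
    counts = Counter(tokenize(text))
    scores = {
        category: sum(counts[word] for word in words)
        for category, words in top_words.items()
    }
    return max(scores, key=scores.get)
```

You would call `train` on a dict mapping each category to a list of book texts, then hand the result to `classify` along with a book to be tested.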
But since I was already done with that version on Thursday, I decided to spend my time looking at the TF-IDF version, where you
1) go through a text and count each time a word appears for that text (the TF or term frequency); and
2) go through a set of texts and count how many texts have that word, then take the log of the ratio of total texts to texts containing the word (the IDF or inverse document frequency).
Weird, right? Anyway, multiplying the two gives you a number that tells you how important a word is to one text compared with the rest of the set. So you can use that word score to help judge what genre a book is in.
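The two steps can be sketched in Python like so. Again, this is an illustration rather than what I actually wrote: the function names are invented, each text is assumed to be already split into a list of words, and TF is the raw count (a common variant divides by the length of the text).

```python
import math
from collections import Counter

def term_frequencies(words):
    """Step 1: count how often each word appears in one text (TF)."""
    return Counter(words)

def inverse_document_frequencies(docs):
    """Step 2: log of (number of texts / number of texts containing the word)."""
    doc_counts = Counter()
    for words in docs:
        doc_counts.update(set(words))  # count each word once per text
    return {word: math.log(len(docs) / df) for word, df in doc_counts.items()}

def tf_idf(docs):
    """Multiply TF by IDF for every word in every text."""
    idf = inverse_document_frequencies(docs)
    return [
        {word: tf * idf[word] for word, tf in term_frequencies(words).items()}
        for words in docs
    ]
```

Note that a word appearing in every text gets an IDF of log(1) = 0, so its score is zero no matter how often it shows up. That is how TF-IDF plays down words like "the" without needing a stopword list.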
Which was a long and interesting trip that resulted in a score that wasn't too much higher. Ah well.
After that, we did more database stuff for our puppy program.
Also, ping pong. Am I getting better? Well, I'm not getting worse.