Friday, September 5, 2014

MakerSquare Day 7: Puppies and databases--two great flavors! (2/3)

I have added a "puppy" label, since much of this week has been dedicated to that puppy project, which has steadily grown: first we did it; then we added some new functions; and now we're really going deep into databases.

However, the morning was given over to improving our Project Gutenberg algorithm. I never explained my version of it, so:

For my first version, I did what I considered the obvious: I trained the program by taking the training books, making a hash for each category, and counting all the meaningful words. (We were given a list of "stopwords" that we could discount, like "and," "the," and "Gutenberg." Part of me rebelled against that since, as we know from Franco Moretti, you can sometimes learn a lot by looking at "the" vs. "a." Lit-nerd out.) I then took the top few words for each category and looked for them in the books-to-be-tested.
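For the curious, here is roughly what that first version looks like. This is a minimal sketch in Python, not my actual class code: the stopword list is a tiny stand-in for the one we were given, and the names (`tokenize`, `train`, `classify`, `top_n`) are made up for illustration.

```python
import re
from collections import Counter

# Tiny stand-in for the stopword list we were given (assumption).
STOPWORDS = {"a", "and", "the", "of", "to", "gutenberg"}


def tokenize(text):
    """Lowercase the text, split it into words, and drop stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in STOPWORDS]


def train(books_by_category, top_n=3):
    """Count the meaningful words per category and keep the top few."""
    top_words = {}
    for category, books in books_by_category.items():
        counts = Counter()
        for text in books:
            counts.update(tokenize(text))
        top_words[category] = [word for word, _ in counts.most_common(top_n)]
    return top_words


def classify(text, top_words):
    """Pick the category whose top words show up most often in the text."""
    counts = Counter(tokenize(text))
    scores = {
        category: sum(counts[word] for word in words)
        for category, words in top_words.items()
    }
    return max(scores, key=scores.get)


training = {
    "astronomy": ["the star and the planet orbit the star", "a planet and a star"],
    "cooking": ["stir the butter and the flour", "butter the pan and add flour"],
}
top_words = train(training)
guess = classify("the orbit of a planet around a star", top_words)
```

With real books you would want a much larger `top_n` than 3; the toy value just keeps the example readable.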

But since I was already done with that version on Thursday, I decided to spend my time looking at the TF-IDF version, where you

1) go through a text and count each time a word appears for that text (the TF or term frequency); and

2) go through a set of texts and count how many texts have that word--and then take the log of the ratio of the total number of texts to that count (the IDF or inverse document frequency).

Weird, right? Anyway, multiplying the two gives you a number that tells you how important a word is to one text within a set of documents. So you can use that word score to help judge what genre a book is in.
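The two steps above can be sketched in a few lines of Python. This is an illustration under assumptions, not the code I wrote: it uses raw counts for TF and a plain natural log for IDF (real libraries often normalize or smooth both), and the function names are invented.

```python
import math
from collections import Counter


def term_frequencies(words):
    """TF: how many times each word appears in one text (raw counts)."""
    return Counter(words)


def inverse_document_frequencies(documents):
    """IDF: log of (number of texts / number of texts containing the word)."""
    total = len(documents)
    texts_with_word = Counter()
    for words in documents:
        texts_with_word.update(set(words))  # count each text once per word
    return {word: math.log(total / n) for word, n in texts_with_word.items()}


def tf_idf(documents):
    """Return one {word: score} dict per text, where score = TF * IDF."""
    idf = inverse_document_frequencies(documents)
    return [
        {word: count * idf[word] for word, count in term_frequencies(words).items()}
        for words in documents
    ]


# Each text is already split into words (assumption: tokenizing happens elsewhere).
docs = [
    ["whale", "sea", "whale"],
    ["sea", "ship"],
    ["sea", "love"],
]
scores = tf_idf(docs)
```

Notice that a word appearing in every text ("sea" here) gets an IDF of log(1) = 0, so it scores zero everywhere, while "whale" scores high in the one text that uses it. That is the whole trick: common-everywhere words drop out on their own.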

Which was a long and interesting trip that resulted in a score that wasn't too much higher. Ah well.

After that, we did more database stuff for our puppy program.

Also, ping pong. Am I getting better? Well, I'm not getting worse.
