Saturday, December 20, 2014

Word frequency counter in Elixir

Before I got into web/software dev seriously, I was pretty serious about literary analysis. (Ask me sometime about how pregnancy and history work in Edgar Rice Burroughs's Princess of Mars and Ursula K. Le Guin's Left Hand of Darkness.) And these two loves came together a few years ago in a new-ish field called "digital humanities": the crunching of literary data with computers.

For instance, we could go through an entire writer's work and see what words got used most often--or not at all. (Fun trivia for nerds: Lovecraft uses the word "squamous" only once, which is funny because parodies of Lovecraft love that word.)

Which is a long intro to explain why I like writing word-frequency counters in new programming languages. So, to count words in Elixir, you could use this:
  1. defmodule Words do
  2. @doc """
  3. Count the number of words in the sentence.
  4. Words are compared case-insensitively.
  5. """
  6. @spec count(String.t) :: map()
  7. def count(sentence) do
  8. sentence
  9. |> prep 
  10. |> count_words
  11. end

  12. defp prep(sentence) do
  13. sentence
  14. |> String.replace(~r/([^\w-]|_)+/u, " ")
  15. |> String.downcase
  16. |> String.split
  17. end

  18. defp count_words(words) do
  19. Enum.reduce(words, Map.new, 
  20. fn(word, map) ->
  21. Map.update(map, word, 1, &(&1 + 1))
  22. end)
  23. end

  24. end
Commentary: @doc and """ are for heredocs. Now if I type "h count" into the terminal, I'll get back that info.

You'll also note two things: (1) the program is written with two helper functions, in classic modular fashion (and these functions are defined with defp, which makes them private functions, only call-able by functions within the module); (2) Elixir uses pipes (|>) as a way of handling and handing off data. And I love pipes. 

Check out prep, a pretty straightforward way to prep a sentence for counting (with line numbers to help follow): 
(14) it takes the sentence; 
(15) runs it through a regex replacer to get rid of anything that isn't a word; 
(16) then runs that new string of just letters through the downcase function; 
(17) then runs that newly downcased string through a split function, which works like all split functions seem to work, taking a string and returning a list of strings. 

Now, if I wasn't piping, I would have to include the parameter, like
String.split(sentence)
But when piping, the first parameter is assumed to be whatever is piped in. Now, without piping, I could write this sequence of functions pretty easily, and it would look like this:

String.split (String.downcase (String.replace(sentence, ~r/([^\w-]|_)+/u, " ")))

Which I can read, but which is a little less intuitive, because you have to read it backwards, with every left-side function taking as parameter the output of the right-side function. Yuck.

Then we get to the heart of the word counter program, the count_words function. This function is doing something interesting--and wasn't my first version of this.

My first version:
  1. defp count_words([], acc), do: acc 
  2. defp count_words([head | tail], acc) do 
  3. quantity = Map.get(acc, head, 0) 
  4. acc = Map.put(acc, head, quantity + 1) 
  5. count_words(tail, acc)
  6. end
My first version was a fairly standard tail recursive function that went through the list, calling itself until the list is empty. When the list is empty--i.e., when line 1 is called because the first parameter matches the empty list []--it returns the accumulator. If the list is not empty, it processes the word in a pretty standard way: words and values are saved in a map (which is a key-value structure like a Ruby hash), so I pull the old quantity from the map and then update the map with the new quantity for the word.

So let's look again at the second (or third) version:

Second version:
  1. defp count_words(words) do
  2. Enum.reduce(words, Map.new, 
  3. fn(word, map) ->
  4. Map.update(map, word, 1, &(&1 + 1))
  5. end)
  6. end
So, the heart of this is still a Map function; here, we call Map.update with the map to be updated (map); the key to be updated (word); the initial value to be used if the key is not found (1), and a function that tells how to transform the value if the key is found ("&(&1 + 1)").

We could rewrite that to make it clearer for new Elixir users, like: 
Map.update(map, word, 1, fn(x) -> x + 1 end)
But the real magic is Enum.reduce, which does all the work of going through a list until it's empty and resolving all the data in that list into a single structure or value. For instance, a classic use of Enum.reduce would be to sum all the numbers of a list:
Enum.reduce([1, 2, 3], 0, fn(x, acc) -> (x + acc) end)
So we have the list to be reduced ([1, 2, 3]); the initial value to use as the accumulator (0); and a function that tells reduce how to resolve all the elements of the list into a single value ("fn(x, acc) -> (x + acc) end").

(P.S. That's the long way to write the function, which I did to make the action clear; i.e., we take two parameters, the element of the list (x) and the accumulator (which starts at 0), and we add each of them. The really short way to say that would be &(&1 + &2). Awesome.)

So the Enum.reduce in this function takes the list of words; accumulates it in an empty map; and the function that it uses to resolve the list into the map is ... the Map.update function that adds one to the value of the word each time it finds that word.

2 comments:

  1. This comment has been removed by the author.

    ReplyDelete
  2. Sweet. Can you please explain how the regex works?

    ReplyDelete