Saturday, December 20, 2014

Word frequency counter in Elixir

Before I got into web/software dev seriously, I was pretty serious about literary analysis. (Ask me sometime about how pregnancy and history work in Edgar Rice Burroughs's Princess of Mars and Ursula K. Le Guin's Left Hand of Darkness.) And these two loves came together a few years ago in a new-ish field called "digital humanities": the crunching of literary data with computers.

For instance, we could go through an entire writer's work and see what words got used most often--or not at all. (Fun trivia for nerds: Lovecraft uses the word "squamous" only once, which is funny because parodies of Lovecraft love that word.)

Which is a long intro to explain why I like writing word-frequency counters in new programming languages. So, to count words in Elixir, you could use this:
  1. defmodule Words do
  2. @doc """
  3. Count the number of words in the sentence.
  4. Words are compared case-insensitively.
  5. """
  6. @spec count(String.t) :: map()
  7. def count(sentence) do
  8. sentence
  9. |> prep 
  10. |> count_words
  11. end

  12. defp prep(sentence) do
  13. sentence
  14. |> String.replace(~r/([^\w-]|_)+/u, " ")
  15. |> String.downcase
  16. |> String.split
  17. end

  18. defp count_words(words) do
  19. Enum.reduce(words, Map.new, 
  20. fn(word, map) ->
  21. Map.update(map, word, 1, &(&1 + 1))
  22. end)
  23. end

  24. end
Commentary: @doc and """ are for heredocs. Now if I type "h count" into the terminal, I'll get back that info.

You'll also note two things: (1) the program is written with two helper functions, in classic modular fashion (and these functions are defined with defp, which makes them private functions, only call-able by functions within the module); (2) Elixir uses pipes (|>) as a way of handling and handing off data. And I love pipes. 

Check out prep, a pretty straightforward way to prep a sentence for counting (with line numbers to help follow): 
(14) it takes the sentence; 
(15) runs it through a regex replacer to get rid of anything that isn't a word; 
(16) then runs that new string of just letters through the downcase function; 
(17) then runs that newly downcased string through a split function, which works like all split functions seem to work, taking a string and returning a list of strings. 

Now, if I wasn't piping, I would have to include the parameter, like
String.split(sentence)
But when piping, the first parameter is assumed to be whatever is piped in. Now, without piping, I could write this sequence of functions pretty easily, and it would look like this:

String.split (String.downcase (String.replace(sentence, ~r/([^\w-]|_)+/u, " ")))

Which I can read, but which is a little less intuitive, because you have to read it backwards, with every left-side function taking as parameter the output of the right-side function. Yuck.

Then we get to the heart of the word counter program, the count_words function. This function is doing something interesting--and wasn't my first version of this.

My first version:
  1. defp count_words([], acc), do: acc 
  2. defp count_words([head | tail], acc) do 
  3. quantity = Map.get(acc, head, 0) 
  4. acc = Map.put(acc, head, quantity + 1) 
  5. count_words(tail, acc)
  6. end
My first version was a fairly standard tail recursive function that went through the list, calling itself until the list is empty. When the list is empty--i.e., when line 1 is called because the first parameter matches the empty list []--it returns the accumulator. If the list is not empty, it processes the word in a pretty standard way: words and values are saved in a map (which is a key-value structure like a Ruby hash), so I pull the old quantity from the map and then update the map with the new quantity for the word.

So let's look again at the second (or third) version:

Second version:
  1. defp count_words(words) do
  2. Enum.reduce(words, Map.new, 
  3. fn(word, map) ->
  4. Map.update(map, word, 1, &(&1 + 1))
  5. end)
  6. end
So, the heart of this is still a Map function; here, we call Map.update with the map to be updated (map); the key to be updated (word); the initial value to be used if the key is not found (1), and a function that tells how to transform the value if the key is found ("&(&1 + 1)").

We could rewrite that to make it clearer for new Elixir users, like: 
Map.update(map, word, 1, fn(x) -> x + 1 end)
But the real magic is Enum.reduce, which does all the work of going through a list until it's empty and resolving all the data in that list into a single structure or value. For instance, a classic use of Enum.reduce would be to sum all the numbers of a list:
Enum.reduce([1, 2, 3], 0, fn(x, acc) -> (x + acc) end)
So we have the list to be reduced ([1, 2, 3]); the initial value to use as the accumulator (0); and a function that tells reduce how to resolve all the elements of the list into a single value ("fn(x, acc) -> (x + acc) end").

(P.S. That's the long way to write the function, which I did to make the action clear; i.e., we take two parameters, the element of the list (x) and the accumulator (which starts at 0), and we add each of them. The really short way to say that would be &(&1 + &2). Awesome.)

So the Enum.reduce in this function takes the list of words; accumulates it in an empty map; and the function that it uses to resolve the list into the map is ... the Map.update function that adds one to the value of the word each time it finds that word.

Saturday, December 13, 2014

Shiny new computer, cruddy old flu, fun new Fizzbuzz in Elixir

I just updated my LinkedIn profile, so I might as well update you here: my new job is as a junior developer at Clutch Analytics, the digital invention arm of Windhaven Insurance. And this last week was my first week there.

Which makes it the perfect week for me to get the flu. Ah well, that's only tangentially related to coding, so I'll leave it at this: a slight fever might, occasionally, be helpful when studying a new language.

Another issue that's only slightly related to coding, but which I have to bring up here for other junior developers--especially those who come straight out of school without much work experience: starting a new job involves a lot of logistics, including filling out paperwork. (Note to self: fill out paperwork this weekend.)

So now I have a new(er) computer, set up with my preferred configuration; the lingering effects of a flu; some health insurance paperwork to fill out; and some fun Elixir challenges. To that end, I've (thank god) gotten back to committing to my Github, which has a new repo for Elixir challenges. The first, of course, is Fizzbuzz.

Saturday, December 6, 2014

Elixir Week, Day Five: Not quite as Elixiry as expected

When I titled my last blog post "Day One," I thought I would do a little Elixir each day and report on the fun things I found--

like the difference between case and cond; or how the h before a function name actually gets the documentation from the code file

--but since this week was also my last week home before moving to Austin, I've had quite a few other things to do. Like find a place to live and shower my dog with affection in the form of walks.

And also look through some Tom Green Country library books. In the future, I will create a page just for the books I've read, with some notes, both general and specific.

For instance, I read that O'Reilly book on Version Control with Git. (I do love to see which animal gets which book: Version Control is a bat.) So the general remark might be: "In-depth look at Git, more than really necessary for daily use. Excellent chapter on merge control." And some specific remarks might be "Use rm command."

(For some reason, I've never had reason to rm a file from my git index.)

And about Learning JavaScript Design Patterns, I'd probably say "Get it for the MV* chapter; though some of the other material feels a little too abstract if you're not trying to apply it to project; and some of it feels a little too concrete if ditto. Not an introductory book to design patterns."

But that seems to be the way with computer books, I find: there's a lot of danger that they'll be too abstract or too concrete or too in-depth for what you need right now.

So, you read any good books lately?

Tuesday, December 2, 2014

Elixir Week, Day One: books and just-in-time-learning

In a week, I start work at my first engineer/developer job; and so, while I'm trying to wrap up some other projects (including purely logistical (i.e., boring) projects having to do with transportation and shelter, ho-hum), I'm also trying to do some prep work for that job.

Which involves getting myself up to speed on Elixir, a functional, concurrent language built on top of the Erlang Virtual Machine. (Shout out here to my 10+ year old comp sci class for covering virtual machines. Of course, we couldn't cover Elixir back then, since Elixir is only 2 or so years old.) And so, I'm declaring the first week of December, Elixir Week.

::fanfare::

Elixir Week, Day One

Thankfully, I had no trouble installing Elixir with just brew install elixir on my MacBook, which nicely installed Erlang as well. I've seen so many hours gets eaten by some seemingly simple step that I'm always thankful when this first step goes smoothly.

(If it doesn't for you, there's a few different methods for installing Elixir, some more involved than others, and hopefully one will work for you. See here for installation instructions.)

And so I could dive right into the official (I guess) Getting Started guide for Elixir, which starts, as most guides do, by talking about basic types, like integers, strings, and atoms.

Er, what's an atom? In Ruby, we'd call it a Symbol, but the Elixir guide goes on to define an atom:
Atoms are constants where their name is their own value.
And already I know this is a language that Jorge Luis Borges would love. There's a lot more in that getting started guide, but I just want to point out (for myself) some things that will be useful to remember.

> When talking about a function, the convention is name / arity, where arity is how many parameters it takes. So: IO.puts/1 refers to a function (IO.puts) that takes one parameter.

> To get help about a function, type and then the function name/arity.

> Elixir data types are immutable. (Not news, but something I'm going to need repeated.)

Books and JIT learning

Reading up on Elixir is fun, but reading about a language is, let's say, only the first step in getting to know a language. Trust me: I used to pronounce "awry" as "aw-ree" because I only ever saw it in books.

I'm not sure if there's an ideal way to learn a language; it probably differs for everyone. For me, I like a little abstract learning and a little getting-my-hands-dirty in equal measure. So while I've been reading the Getting Started guide, I've occasionally turned to my Elixir terminal and tried a few things. "Oh, I can do a boolean AND like this--but can I do it like this?"

Which is to say that I'm a big fan of just-in-time learning. I can pack my head full of data about a language, but they only start to become real information when I put it into practice. (Or: fail to put it into practice. That's a teachable moment right there.)

And that's why I have a complicated relationship with many coding books. Give me a book with some coding challenges interspersed and I'm pretty happy. Don't--and I won't be. Which is why right now, while I'm in San Angelo, I'm flipping though a library book on Git. It's not a deep dive into all the mechanics of Git. (Do I really need to know the history of Git to do my job? Probably not.) But it is one of those aforementioned projects that I'm trying to wind down before work starts.