First Clojure tutorial

This post is the first of what I hope will become a series about Clojure. Clojure is a young language, and though there is a lot of documentation already on the Internet and in the blogs of many enthusiasts, I figured there would be no harm in having some more. In this series, I will start with a simple script and with each post, I will improve the program by introducing new Clojure features. The whole thing is not mapped out, and as such, I am very receptive to constructive comments on how to make these posts better.

The problem

The problem we will tackle is a fairly simple one: scraping web sites. I am a web comics enthusiast and I read quite a few of them, but I don't like visiting 20 web sites to view them all, and their RSS feeds are often not to my satisfaction: some feeds only give you a link to the latest strip, while others are padded with news, announcements or ads that you don't want. So we will write a script that extracts the latest strip from each web comic site and creates an RSS feed from those.

What we'll do today

In this first post, we'll get something very simple working: the program will download the content of a web site, extract the strip link and print it. This will give us a first look at Clojure's data structures and at Java interop. I will not assume that you know any Clojure, so I'll try to explain as we go along. If something is unclear, you can always check out the documentation on the official Clojure web site. The pseudocode of our application is as follows:

for each comic:
    get the html
    extract the image with a regex
    display the complete image URL

Data

We'll start our program by defining our data. We want to scrape several comic strips without writing one function per web comic, so we'll need a standard way to represent the different comics. We will need four pieces of data:

- the name of the comic
- the URL of the page where the latest strip is published
- a regular expression that matches the link to the strip's image
- an optional URL prefix used to turn relative image links into absolute ones

Most sites use relative links for their images, so a prefix is needed to build an absolute URL; if no prefix is given, we will assume that the URL of the latest strip page is to be used as the prefix. We will represent the data of one comic with a hash-map, and we will put all those hash-maps inside a vector. Here's the result with two comics:

(def *comics*
  [{:name "Penny-Arcade"
    :url "http://www.penny-arcade.com/comic/"
    :regex #"images/\d{4}/.+?[.](?:png|gif|jpg)"
    :prefix "http://www.penny-arcade.com/"}
   {:name "We The Robots"
    :url "http://www.wetherobots.com/"
    :regex #"comics/.+?[.](?:jpg|png|gif)"}])

A few notes about this piece of code:

- def binds a name to a value by creating a var; the asterisks around *comics* are a common naming convention ("earmuffs") for top-level, program-wide values.
- {} is the literal syntax for a hash-map and [] the literal syntax for a vector; commas between elements are optional, because Clojure treats them as whitespace.
- Words starting with a colon, such as :name, are keywords; they evaluate to themselves and are a natural choice for map keys.
- #"..." is a regular expression literal, compiled at read time into a java.util.regex.Pattern.
- The second comic has no :prefix entry; looking up a key that is absent from a map simply returns nil, and we will take advantage of that shortly.

That's actually quite a lot of notes for such a short piece of code! Now that we have our data, let's look at the next step, fetching the HTML from a URL.
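
To see a couple of those notes in action, here's a quick check at the REPL: a keyword can be called as a function that looks itself up in a map, a trick this program relies on throughout (the results assume the *comics* definition above):

(:name (first *comics*))    ; => "Penny-Arcade"
(:prefix (second *comics*)) ; => nil, We The Robots has no :prefix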

Fetching the HTML

Java has a class to read documents through the HTTP protocol, which means that Clojure has a class to read documents through the HTTP protocol. Sadly, Java does not have a method to download an entire document as a string, so we'll have to write our own function to do the deed. The classes we'll need can always be accessed by their fully qualified names (e.g. java.io.BufferedReader), but this tends to make the code long-winded. We'll use import to make the short class names available in the current namespace.

(import '(java.net URL)
        '(java.io BufferedReader InputStreamReader))

import takes an arbitrary number of quoted lists in which the first element is a symbol naming a package and the rest are the classes to be added to the namespace. Here, we import URL, BufferedReader and InputStreamReader; nothing from java.lang needs importing, since that package is always available.
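
For comparison, here's what creating a URL looks like with and without the import; both forms are equivalent, the import simply saves typing:

;; fully qualified, works without any import
(java.net.URL. "http://www.wetherobots.com/")

;; shorter, thanks to the import above
(URL. "http://www.wetherobots.com/")

Now, let's look at the code to download an HTML page: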

(defn fetch-url
  "Return the web page at the given address as a string."
  [address]
  (let [url (URL. address)]
    (with-open [stream (.openStream url)]
      (let [buf (BufferedReader. (InputStreamReader. stream))]
        (apply str (line-seq buf))))))

Let me first explain quickly what this function does: fetch-url takes one argument, address, uses it to create a new URL object, opens a stream to that URL, reads all the lines from the stream, joins them together and returns one big string. Now, line by line:

- defn defines a new function; the string after the name is its documentation string, and [address] is its vector of parameters.
- (URL. address) is shorthand for (new URL address): a class name followed by a dot calls that class's constructor.
- let binds the new URL object to the local name url for the rest of its body.
- with-open binds stream to the opened connection and guarantees that the stream is closed when the body finishes, even if an exception is thrown.
- (.openStream url) is Clojure's syntax for calling a Java method, here openStream, on an object.
- We wrap the raw stream in an InputStreamReader and a BufferedReader so that we can read it line by line.
- line-seq returns a lazy sequence of the lines in the reader, and (apply str ...) concatenates them into a single string. The newline characters are dropped along the way, which is fine for our purposes.
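
You can already try the function at the REPL; something like this should work, though the exact number depends on the page, of course:

(count (fetch-url "http://www.wetherobots.com/"))
; => the length of the page's HTML, likely in the tens of thousands

As an aside, newer versions of Clojure can fetch a URL into a string with a single call to slurp, but writing the function out by hand is a nice interop exercise.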

Phew, that was a lot to take in! Now that we've covered the second line of our pseudocode, we're ready to extract the image links.

Extracting the image link

The function used to get the image link is much shorter than fetch-url. We pass it a comic (one of our maps), use the Clojure function re-find to find the string we are looking for, and return that string with the prefix prepended.

Let's look at the code:

(defn image-url
  "Return the absolute URL of the image of a comic.
  If the comic has a prefix, prepend it to the URL,
  otherwise use the :url value."
  [comic]
  (let [src (fetch-url (:url comic))
        image (re-find (:regex comic) src)]
    (str (or (:prefix comic) (:url comic))
         image)))

This should now look familiar to you: a function of one argument with a documentation string. We won't look at every line; instead, I'll explain the important parts:

- (:url comic) shows keywords acting as functions: calling a keyword with a map looks that keyword up in the map.
- re-find takes a compiled pattern and a string, and returns the first match as a string, or nil if nothing matches.
- or returns the first of its arguments that is logically true; since looking up a missing key returns nil, (or (:prefix comic) (:url comic)) falls back to the comic's URL when no prefix was given, exactly as promised in the Data section.
- str concatenates its arguments, producing the absolute URL of the image.
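
To see re-find in isolation, here is a small example with a hand-made snippet of HTML (the snippet is invented, but the pattern is the one from our *comics* definition):

(re-find #"comics/.+?[.](?:jpg|png|gif)"
         "<img src=\"comics/2008-10-22-Storytime.jpg\" />")
; => "comics/2008-10-22-Storytime.jpg"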

Printing the URLs

Finally, we can print the URLs. We will use the doseq macro for this purpose, which is essentially a foreach loop: it takes a binding vector pairing a name with the collection to iterate over, followed by a body that is executed once for each element. We will print the name of each comic and the URL of its latest strip.
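
If doseq is new to you, here is a minimal example before we apply it to our comics:

(doseq [n [1 2 3]]
  (println n))
; prints 1, 2 and 3, each on its own line

And here it is applied to our comics: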

(doseq [comic *comics*]
  (println (str (:name comic) ": " (image-url comic))))

This should give us the following output:

Penny-Arcade: http://www.penny-arcade.com/images/2008/20081029.jpg
We The Robots: http://www.wetherobots.com/comics/2008-10-22-Storytime.jpg

Next time

Next time, we'll look at how multimethods can help us handle cases such as xkcd, where we want not only the URL of the strip but also its alt text, so that we capture the complete strip.