20 Days of Clojure: Day 8

Rich Hickey comments on yesterday’s post on the clojure google group:

An interesting point of comparison with purely-functional solutions such as that which Okasaki writes about is that the Clojure vector can be log32N specifically because I can use mutation – when cloning, i.e. the tree is log32N deep and a node can be copied in O(1) time due to arraycopy + indexed array set. And yes, the vectors are optimized for lookup – in practice, log32N blows logN out of the water.

Exactly — PersistentVector’s internals are not purely functional. New vectors (during construction only) make copies of the Object arrays from existing vectors, then make changes to the copies before returning. As Rich points out, this makes some operations O(1). For comparison, a clojure vector of 32k elements is only 2 levels deep, but if implemented with a log2N structure it would be 15 deep (with average access in maybe 7 operations). Since access by index is a very common operation, you can see that this makes a big difference.

Today, I want to look at PersistentHashMap.java which is the implementation of clojure hash-maps. There are actually four implementations of the map interface (hash-map, sorted-map, struct-map, and array-map), but hash-map is what the reader will give you when you build up a map with {}.
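
As a quick sanity check (this is just my own REPL experiment, not from the post), the hash-map constructor hands back exactly the class we're about to dig into:

    (class (hash-map :a 1 :b 2))   ; => clojure.lang.PersistentHashMap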

A comment on the top of the source file says that it’s a persistent version of Phil Bagwell’s Hash Array Mapped Trie. According to that paper, it looks like the bits of hashes of the key are used to traverse a trie. Let’s see how it’s done in clojure.

The first thing I like to look at is how lookup is done, that gives me a sense of the structure. So, to start with, here’s entryAt:

 public IMapEntry entryAt(Object key){
  return root.find(RT.hash(key), key);
 }

Simple enough — it uses the root node to start the traversal. The root node is an INode, and clojure uses polymorphism to get different behavior for different nodes. There are five types of nodes: EmptyNode, LeafNode, HashCollisionNode, BitmapIndexedNode, and FullNode. An empty hash-map has its root set to an instance of EmptyNode, which is an implementation of the Null Object design pattern, and find() on it returns null. (By the way, the entire structure is a Composite design pattern, and by using each node type to handle part of the traversal, I guess he’s also using a Chain of Responsibility. I mention this because there are many OO design patterns used in the Java implementation of clojure, some by name, so if you are going to look at the code, you’ll want to be familiar with these patterns.)

If I assoc() onto a hash-map, root.assoc() is called which is supposed to return a new root. EmptyNode.assoc() simply creates a LeafNode with the hash, key and value, so a map with one key has its root set to a LeafNode.

If I call find() on a LeafNode, that’s pretty simple too. If this is the node I am looking for, it returns this, otherwise null.

If I assoc onto a LeafNode, here is what can happen:

  1. If the key and value match the LeafNode, then it is just returned — nothing is created since you haven’t changed anything.
  2. If just the key is the same as this node, a new LeafNode is created with the same key and the new value and returned (remember, this is a persistent data-structure, so we do not change the LeafNode, we create new nodes to build up a new hash-map).
  3. If the hash is the same, but the key is different, then a new HashCollisionNode is created that holds onto this LeafNode and a new one that has the same key and the new value. HashCollisionNodes simply hold onto an array of LeafNodes whose keys have the same hash.
  4. If the new key doesn’t have the same hash (the common case), then a BitmapIndexedNode will be created. At first it contains just the old LeafNode in its array of nodes, and then assoc() is called on it with the new key and value.

Case #4 is where we start to build up the trie. The hash of the key is going to be used to traverse the trie later, and a BitmapIndexedNode knows if it contains the LeafNode you are looking for. A BitmapIndexedNode has a bitmap of 32 bits. The bit corresponding to the bottom five bits of the hash of the LeafNode is turned on to indicate that the LeafNode is to be found in this branch (a node can have up to 32 branches). For example, if the bottom five bits of the key hash are 01100, then the 12th bit of the bitmap is turned on, and the node is put at this branch.
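
Here is a quick sketch of that bit math at the REPL (my own illustration with a made-up hash fragment; it is not the Java source):

    (let [h   2r01100                      ; pretend the key hash ends in 01100
          idx (bit-and h 0x1f)             ; bottom five bits => 12
          bit (bit-shift-left 1 idx)]      ; the bitmap bit to turn on
      [idx (Integer/toBinaryString bit)])
    ;; => [12 "1000000000000"]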

If the bitmap bit for the new leaf node is off, then a new BitmapIndexedNode is created with a new bitmap with the new bit turned on, and the node is put in its array which will be one bigger than the one we started with. If we kept doing this and hit each bit, the 32nd assoc would result in a FullNode — which is a simple case of a BitmapIndexedNode, except it knows that if it had a bitmap, all bits would be set. Again, remember, we create a new node because we never change nodes in a persistent data structure.

If the new leaf node has a bottom five bits that match a node already in this BitmapIndexedNode (the corresponding bit is already turned on), then we assoc() this key/value onto the node with the same five bits.

I haven’t mentioned this yet, but assoc takes a shift argument — so far, that has always been 0, which meant using the bottom five bits (a 0 shift) of the hash in any BitmapIndexedNodes. On this assoc call, we’re now going to pass in (shift+5) so that the next 5 bits will be used on any BitmapIndexedNodes created under here. Later when we traverse, we’ll use the bottom five bits on the first BitmapIndexedNode we see, then the next five bits on the next one and so on. When we’ve used up all 32 bits of the hash, we should be at our LeafNode or a HashCollisionNode (meaning all 32 bits matched).
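
To make the slicing concrete, here is a small sketch (my own, with a made-up hash value, not the library code) of how shift 0 picks the branch at the first BitmapIndexedNode and shift 5 picks the branch one level down:

    (let [h 2r1010100110]                      ; a made-up hash fragment
      [(bit-and h 0x1f)                        ; bottom five bits => 2r00110 = 6
       (bit-and (bit-shift-right h 5) 0x1f)])  ; next five bits   => 2r10101 = 21
    ;; => [6 21]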

So here’s find():

  1. If the node is an EmptyNode, it returns null.
  2. If the node is a LeafNode, it returns itself if the key matches, otherwise null.
  3. If the node is a HashCollisionNode, it calls find() on each LeafNode it holds; if any of them returns itself, that is returned, otherwise null.
  4. If the node is a BitmapIndexedNode, it first checks to see if it has a node where the five bits being used match, and if so calls find() on that node; if it doesn’t, it returns null.
  5. If the node is a FullNode, it acts like a BitmapIndexedNode with all bits of the bitmap set, so it just calls find() on the node corresponding to the five bits in the hash at its level.


The data structure is persistent because it never changes constructed nodes. During an assoc, all non-traversed paths will be shared, and only the path that changed will be new.

I should also mention that FullNode is purely an optimization — a BitmapIndexedNode could be used in its place, but having FullNodes sprinkled throughout the map saves some branch checks. Using a bitmap in BitmapIndexedNode is also an optimization. An INode array of 32 elements could always have been constructed, with null signifying that there is no node in that branch, but that would be very wasteful of space; by using the bitmap, BitmapIndexedNodes have an INode array of exactly the length they need.
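
The trick for mapping a five-bit slice to a slot in that exact-length array is to count the 1 bits in the bitmap below the slice's bit (this is the technique from Bagwell's paper; the snippet below is my own sketch of it, not the Java source):

    (defn array-index [bitmap five-bits]
      (let [bit (bit-shift-left 1 five-bits)]
        (Long/bitCount (bit-and bitmap (dec bit)))))

    ;; a node whose bitmap has bits 3, 12, and 20 set holds a three-element array;
    ;; the branch for five-bit value 12 lives at index 1
    (array-index 2r100000001000000001000 12)   ; => 1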

20 Days of Clojure: Day 7

Ok, I’m going to try to describe how vectors are implemented in clojure. The implementation is in:

   src/jvm/clojure/lang/PersistentVector.java

if you want to see the code. I’m going to try to go through the important bits here.

The PersistentVector object has three members: two ints named cnt and shift, and an Object[] named root. A simple PersistentVector built from this line:

   (def v1 [1 2 3])

would look like this (using curly braces for root because it is a Java array):

   cnt: 3
   shift: 0
   root: { 1, 2, 3 }

cnt is the count of elements, and (count v1) simply returns it, so it is O(1). I’ll explain shift later; root is the object array. When shift is 0, this line:

   (nth v1 1)

simply resolves to root[1] and returns 2, which in this simple case is also O(1). If I do this:

   (def v2 (assoc v1 1 4))

which returns a new vector, but with the value at index 1 set to 4, you get another PersistentVector that looks like this:

   cnt: 3
   shift: 0
   root: { 1, 4, 3 }

The 1 and the 3 are shared between the v1 and v2 arrays. If I do this:

   (conj v1 5)

I’ll get yet another PersistentVector that looks like this:

   cnt: 4
   shift: 0
   root: { 1, 2, 3, 5 }

with the 1, 2, and 3 shared with v1 (and the 1 and 3 shared with v2). This is all very simple until you conj onto a vector of 32 elements. When the root array has 32 elements, adding one more element (the 33rd) returns a new PersistentVector that looks like this (assume the starting array had 1, 2, 3, … 32):

   cnt: 33
   shift: 5
   root: { {1, 2, 3, …, 32 }, { 33 } }

Root is a Java Object[] with the 0th element set to the Object[] from the input vector to conj (not a clone, but the actual one), and the next element is an array of just the new value. If I conj onto that, I get a new Vector:

   cnt: 34
   shift: 5
   root: { {1, 2, 3, …, 32 }, { 33, 34 } }

Now, I can explain shift. When a method that takes an index is called, a bit shift is done on the index to determine how many levels deep in the structure we need to go to get to the part of the data structure that has the element at that index. For instance, here is the main Java for nth(i):

      Object[] arr = root;
      for(int level = shift; level > 0; level -= 5)
         arr = (Object[]) arr[(i >>> level) & 0x01f];
      return arr[i & 0x01f];

So, when i < 32, i >>> level is 0, and arr will be root[0] (the array of 32 elements). Then we return arr[i & 0x01f] (which is i % 32) to get the ith element in that array.
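
You can check that index arithmetic from the REPL (my own sketch of the bit operations above, not the library code):

    (let [i 10]
      [(bit-and (bit-shift-right i 5) 0x1f)   ; => 0, so arr stays root[0]
       (bit-and i 0x1f)])                     ; => 10, which is i % 32
    ;; => [0 10]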

When i == 32, then (i >>> level) is 1, arr is root[1], and then we return arr[i%32] which is the 0th element. Now, if I do an assoc to set the 0th element to 100, I get this PersistentVector:

   cnt: 34
   shift: 5
   root: { {100, 2, 3, 4, …, 32 }, { 33, 34 } }

assoc calls a recursive function (doAssoc). First it clones root so that the new root is an array of two elements, each an object array. Then it determines that index 0 is in the 0th branch and recurses on that branch, decrementing the shift by 5. This recursive call clones the array at root[0]. Since the shift is now 0, it is at the base case of the recursion, so it sets root[0][0] to 100 (the bottom five bits of the index pick the slot). All of the numbers and the entire array at root[1] are shared with the starting vector. Here is the Java code for that (doAssoc is called with the arguments shift, root, the index, and the new value) — the return value is the new root:

   private static Object[] doAssoc(int level, Object[] arr, int i, Object val){
      Object[] ret = arr.clone();
      if(level == 0)
         {
         ret[i & 0x01f] = val;
         }
      else
         {
         int subidx = (i >>> level) & 0x01f;
         ret[subidx] = doAssoc(level - 5, (Object[]) arr[subidx], i, val);
         }
      return ret;
   }

By now, you might have realized that we are building up a tree. If I keep conjing on numbers in order starting at 1, eventually I will conj on 65. When I do, I get this:

   cnt: 65
   shift: 5
   root: { {1, 2, 3, 4, …, 32 }, { 33, 34, 35, … 64 }, { 65 } }

Graphically, it looks like this (for 1..32 elements):

+-------------+
|{1,2,3,...32}|
+-------------+

and then for up to 32² elements, it looks like this:

             +-------+
             | {   } |
             +--+-+--+
                | |
       +--------+ +----------+
       |                     |
+------+------+     +--------+-------+
|{1,2,3,...32}|     |{33,34,35,...64}|
+-------------+     +----------------+

And we get another level at 32²+1 (1025) elements, and another at 32k+1. This explains how index access is O(log32N). Another important point is that not only is a PersistentVector immutable from an interface perspective, it’s internally immutable: each individual Object[] that makes up the tree is never changed in a constructed vector. If it needs to change in a new vector, it is cloned and the changed clone is used in the new PersistentVector — the benefit of this is that the vector never needs to lock. This may seem like a small point, and I believe it is not a requirement of persistence, but this extra work makes vectors very suitable for concurrent applications.

This implementation is different from the one described in the Okasaki paper from yesterday. Rich appears to be trading away O(1) consing for better index performance, which was O(log2N) in that paper.

(clojure day (March 20) in Northampton, MA is coming soon)

Update: Made a few corrections

20 Days of Clojure: Day 6

So far, I’ve been working on the “clojure is a dialect of Lisp” part of the description of clojure. One of the things Lisps allow is purely functional programming, but to enforce it you need immutable datatypes. One of the things that bothered me as a C programmer is that I could never understand how that could work efficiently for anything more complicated than a string. Then I saw persistent data-structures. Their existence in clojure is what made me think that purely functional programming could work in real programs.

See the clojure page on data structures, but more importantly, check out clojure TV on collections.

If you’re interested in how these data structures are implemented, you want to read Okasaki. I would start here, with his description of a persistent random access list, which I think is one of his more accessible papers.

Tomorrow, I’m going to try to expand on this by describing how vectors are implemented in clojure.

20 Days of Clojure: Day 5

This SICP lecture shows some of the power of Lisp by building a symbolic differentiator in a very small amount of code, and is also the first one to introduce the idea that code is data and data is code (watch from the beginning to 29:00 — at 13:00 he starts talking about using Lisp as the data format):

Here is the first symbolic differentiator he writes (but in clojure):

    (defn atom? [x]
        (not (instance? clojure.lang.IPersistentCollection x)))

    (defn const? [expr val] (and (atom? expr) (not= expr val)))
    (defn same-var? [expr val] (and (atom? expr) (= expr val)))
    (defn arith? [expr sym] (and (not (atom? expr)) (= (first expr) sym)))
    (defn sum? [expr] (arith? expr '+))
    (defn prod? [expr] (arith? expr '*))

    (defn a1 [expr] (frest expr))
    (defn a2 [expr] (first (rrest expr)))
    (defn m1 [expr] (frest expr))
    (defn m2 [expr] (first (rrest expr)))

    (defn deriv [expr val]
        (cond
            (const? expr val) 0
            (same-var? expr val) 1
            (sum? expr)
                (list '+ (deriv (a1 expr) val) (deriv (a2 expr) val))
            (prod? expr)
                (list '+ (list '* (m1 expr) (deriv (m2 expr) val))
                    (list '* (m2 expr) (deriv (m1 expr) val)))
            :else nil))

The things to note:

  1. Quote syntax, which uses '
  2. first and rest which are car and cdr. Clojure also supports combinations like frest (cadr) and rrest (cddr)
  3. In clojure, extra parens are usually not necessary — look at how cond is implemented to use variable arguments rather than a single list argument (which would require extra parens)
  4. My first usage of Java interop (using instance? to find out if an expression is atomic or a collection)
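
For example, here is roughly what this version returns at the REPL (the results are unsimplified, just as in the lecture):

    (deriv '(+ x 3) 'x)   ; => (+ 1 0)
    (deriv '(* x y) 'x)   ; => (+ (* x 0) (* y 1))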

20 Days of Clojure: Day 4

Continuing from yesterday, we made our first function that takes a function and returns a function. Go to 43:00 in this video to see why (you can stop at 54:00):

In this example, we take the first-class functions idea further and define a derivative function (that takes and returns functions) and a newton function (which takes a function and uses fixed-point). Here it is in clojure (you can find fixed-point and other functions in yesterday’s post):

    (defn square [x] (* x x))

    (defn deriv [f]
        (let [dx 0.000001]
            (fn [x] (/ (- (f (+ x dx)) (f x)) dx))))

    (defn newton [f guess]
        (let [df (deriv f)]
            (fixed-point (fn [x] (- x (/ (f x) (df x)))) guess)))

    (defn sqrt [x] (newton (fn [y] (- x (square y))) 1))
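
Trying it out (the exact digits depend on dx and on the tolerance inside fixed-point, so this is approximate):

    (prn (sqrt 9))   ; prints a double very close to 3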

(clojure day (March 20) in Northampton, MA is coming soon)

20 Days of Clojure: Day 3

Ok, so first we made a very hard-coded sqrt, then we improved it by adding the concept of fixed points, which made the algorithm easier to see in the code. Here’s how we can improve it even further. Go to 32:30 in this video (stop at the break at 43:00):

First we define an average-damp function. This function takes a function as its argument and returns another function, which takes a number and returns the average of that number and the result of calling the function argument on that number. It’s easier to say that in clojure:

    (defn average-damp [f]
        (fn [x] (average x (f x))))

f is a function, average-damp creates a new function that takes x (a number) and that new function averages x and f(x).

Now that I have that, I define sqrt like this (you can find average and fixed-point in yesterday’s post):

    (defn sqrt [x] (fixed-point (average-damp (fn [y] (/ x y))) 1))
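
As with yesterday's version, you can try it out (the exact value printed depends on the tolerance in fixed-point):

    (prn (sqrt 9))   ; prints a rational number very close to 3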

Tomorrow, we’ll use this to implement another way of finding sqrt.

20 Days of Clojure: Day 2

Yesterday, we built a simple sqrt using Heron of Alexandria’s algorithm. The problem with the code is that it had some generically useful ideas, but it was all hard-coded to only work with sqrt. Also, it was not clear what exactly the algorithm was.

In the SICP lecture they fix it by passing functions around since Lisp treats functions as first-class data-types. Here’s the fix explained (23:30 in the video, stop at 32:30):

See the video for the Scheme implementation. Here it is in clojure.

    (defn sum ([] 0)
        ([x] x)
        ([x & more] (+ x (apply sum more) )))

    (defn average
        ([] 0)
        ([x] x)
        ([x & more]
            (/ (+ x (apply sum more)) (+ 1 (count more)))))

    (defn abs [x] (if (< x 0) (- x) x))

    (defn close-enough? [u v]
        (let [tolerance 0.00001]
            (< (abs (- u v)) tolerance)))

    (defn fixed-point [f start]
        (loop [old start new (f start)]
            (if (close-enough? old new)
                new
                (recur new (f new)))))

    (defn sqrt [x] (fixed-point (fn [y] (average (/ x y) y)) 1 ))

    (prn (sqrt 9))

A fixed point of a function f is a value y such that f(y) = y. This improved code, which implements the same algorithm as before, uses a generic fixed-point function that takes a function and a guess and tries to find the fixed point by setting the next guess to f(guess). You can see it all in the video. The things in clojure you need to know:

  1. fn [arg1 arg2] is lambda (λ), so where fixed-point takes a function as the first argument, we create one on the fly to be (average (/ x y) y), where x is the argument to sqrt (closed over by our anonymous function) and y is the argument defined by fn.
  2. Since clojure compiles to JVM bytecode and uses Java calling conventions, it does not have automatic tail-call optimization. Instead, you use the recur special operator, which creates a constant-space loop back to the nearest enclosing loop or function definition (there is a tiny sketch below). Loop/recur is explained on the Functional Programming page of the clojure site (at the bottom).
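
Here is a tiny, unrelated sketch just to show the loop/recur shape (5! computed in constant stack space):

    (loop [n 5 acc 1]
      (if (zero? n)
        acc
        (recur (dec n) (* acc n))))   ; => 120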

But, even this can be improved, as we’ll see tomorrow.

20 Days of Clojure: Day 1

Ok, this might be hard to keep up, but nothing ventured …

Rich Hickey is going to be coming to Northampton on March 20th to talk about clojure. In preparation for that, I am going to blog every day in March about clojure — then, hopefully we can post the results of his talk.

If you don’t have clojure yet — go get it. Start here to learn basic clojure syntax (there is also a download link). Clojure is trivial to get running and you definitely want to play with it a little to understand the rest of this. The quick description of clojure is that it’s a dialect of Lisp that runs on the JVM. It has support for immutable and persistent data-structures and a concurrent programming model.

Today I’m going to start with some of the work I did to learn clojure through translating SICP lectures. In the first lecture, the first interesting program starts at 56:20 in this video:

Starting there presumes you know some basic Lisp syntax. This is the program he eventually writes to calculate the square root (in Scheme):

    (define (improve guess x)
        (average guess (/ x guess)))

    (define (good-enough? guess x)
        (< (abs (- (square guess) x)) .001))

    (define (try guess x)
        (if (good-enough? guess x)
            guess
            (try (improve guess x) x)))

    (define (sqrt x) (try 1 x))

I have left out implementations for square and average here. This code and the algorithm are explained in the video. Here it is in clojure (I defined average and square — you could just use Java, but I am trying to keep it all in clojure for now):

    (defn sum ([] 0)
        ([x] x)
        ([x & more] (+ x (apply sum more) )))

    (defn average
        ([] 0)
        ([x] x)
        ([x & more]
            (/ (+ x (apply sum more)) (+ 1 (count more)))))

    (defn abs [x] (if (< x 0) (- x) x))

    (defn square [x] (* x x))

    (defn improve [guess x]
        (average guess (/ x guess)))

    (defn good-enough? [guess x]
        (< (abs (- (square guess) x)) 0.001))

    (defn try-sqrt [guess x]
        (if (good-enough? guess x)
            guess
            (try-sqrt (improve guess x) x)))

    (defn sqrt [x] (try-sqrt 1 x))

    (prn (sqrt 9))

If you take this and put it in a file named sqrt.clj, you can run it like this (after downloading clojure):

    java -cp clojure.jar clojure.lang.Script sqrt.clj

And it will print out:

    65537/21845

Explanation of the differences between the two examples:

  1. clojure uses defn to define functions and puts arguments in []
  2. sum and average implementations show how clojure can overload on arity (number of arguments)
  3. try is reserved for exception handling, so I renamed the try function to try-sqrt
  4. the answer is in clojure’s rational number format; since all of the computation was adding and averaging integers and rationals, the result is a rational number and clojure preserves that (a quick illustration follows below)
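
A quick illustration of that (my own REPL check, not from the original post):

    (/ 65537 21845)            ; => 65537/21845, the ratio is preserved
    (double (/ 65537 21845))   ; => a double just over 3.00009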

Tomorrow, I’ll improve this to make it more generic. A lot of what I used here is explained on the clojure Functional Programming support page.

Book Review: Collective Intelligence: Part I

I’m at about the halfway point in Programming Collective Intelligence by Toby Segaran, and even if the second half was blank, I’d be telling everyone to read this book. It’s been a while since I’ve read a technical book this good. It excels because of a combination of three factors:

  1. Useful algorithms
  2. Shown in real-life situations
  3. With real data that you can get yourself

The basic idea of the book is to show different ways of using user-generated datasets to extract information and make decisions. Each chapter tackles a different problem and shows a few ways to approach it. All of the examples are in Python, but even if you don’t know Python, you should be fine if you know any structured language. Practically every example either uses some dataset that he tells you how to get, or a public API that is free to access.

This book is the first I’ve read that relies on the current maturity level of the web. A few years ago, it wouldn’t have been possible to provide so many relevant examples. In a few years, the API’s might be out-of-date — but don’t let that stop you — the algorithms are timeless.

The choice of Python is a good one for a few reasons. Firstly, it’s pretty clear to read even if you don’t know it. Secondly, list-comprehensions are a pretty good way to think about data crunching (which these algorithms do a lot), and, finally, there are many freely available Python libraries to deal with the public APIs he uses.

To give you an idea of Toby’s approach, I’m going to run through my favorite chapter so far, the one on Optimization. The quick description of the problem domain should be familiar to every programmer — solve NP-complete problems like the traveling salesperson. Basically, optimization techniques can be used in situations where (1) you can’t analytically solve the problem, (2) the solution set is too large to exhaustively search, (3) you can express the cost of a possible solution such that better solutions have a lower cost than worse ones, and (4) solutions that are “near” each other have similar costs.

Toby’s first example is how a family can choose flights so that they arrive at the same airport from all over the country at nearly the same time and depart for home a few days later at nearly the same time. The cost function is a combination of the actual costs of the tickets and the difference in time between the earliest and latest flights. He goes on to provide a dataset of flights, describes setting up optimization problems in general, and shows the python code to implement the cost function.

Then he introduces a few optimization algorithms, implements them, and discusses their relative strengths and weaknesses. The next section applies these techniques to real data on Kayak.com using the standard Python DOM parser, minidom. The final section shows how to apply these techniques to other problems like pairing students in a dorm and network visualization. Other chapters get a similarly exhaustive treatment.

I really hope that this is the future of technical books.

Django, WebFaction, RapidWeaver and Christmas

Atalasoft closes for the week between Christmas and New Year’s, so I have time to catch up on some things I’ve been wanting to do — namely move this site to WebFaction. I chose WebFaction once I decided that I wanted to learn Django (after reading this great comparison of Django and Rails). Following along the forums, there were more than a few good mentions of WebFaction as a good Django host.

I’m a couple of weeks into Django and a couple of days into WebFaction, and so far I’m pretty happy. It’s Christmas today, so even though I had some questions for WebFaction, I really didn’t expect them to get back to me — I was pleasantly surprised that they did. I guess I should expect 24×7, 365 support from an ISP, but I hadn’t been getting it before.

In addition to new hosting and a new webapp framework, I’ve also settled on RapidWeaver for content management. I moved to Mac at home a couple of years ago, and I still hadn’t found a desktop CMS that was as good as CityDesk on the PC. RapidWeaver has beautiful themes built in, and a lot of the features that I’ve gotten used to. The built-in blog is nice (it would be nicer with an import from RSS), but it didn’t take too long to transfer everything over.

All of this work is to get this site going again.