20 Days of Clojure: Day 8

Rich Hickey comments on yesterday’s post on the clojure google group:

An interesting point of comparison with purely-functional solutions such as that which Okasaki writes about is that the Clojure vector can be log32N specifically because I can use mutation – when cloning, i.e. the tree is log32N deep and a node can be copied in O(1) time due to arraycopy + indexed array set. And yes, the vectors are optimized for lookup – in practice, log32N blows logN out of the water.

Exactly — PersistentVector’s internals are not purely functional. New objects (during construction only) make an array copy of Object arrays from existing vectors, then make changes to the copied array before returning. As Rich points out, this makes some operations O(1). And for comparison, a clojure vector of 32k elements is only 2 levels deep, but if implemented with a log2N structure would be 15 deep (with average access in maybe 7 operations). Since access by index is a very common operation, you can see that this would make a big difference.

Today, I want to look at PersistentHashMap.java which is the implementation of clojure hash-maps. There are actually four implementations of the map interface (hash-map, sorted-map, struct-map, and array-map), but hash-map is what the reader will give you for when you build up a map with {}.

A comment on the top of the source file says that it’s a persistent version of Phil Bagwell’s Hash Array Mapped Trie. According to that paper, it looks like the bits of hashes of the key are used to traverse a trie. Let’s see how it’s done in clojure.

The first thing I like to look at is how lookup is done, that gives me a sense of the structure. So, to start with, here’s entryAt:

 public IMapEntry entryAt(Object key){
  return root.find(RT.hash(key), key);

Simple enough — it uses the root node to start the traversal. The root node is an INode, and clojure uses polymorphism to get different behavior for different nodes. There are five types of nodes,
EmptyNode, LeafNode, HashCollisionNode, BitmapIndexedNode, and FullNode. An empty hash-map has its root set to an instance of EmptyNode, which is an implementation of the Null Object design pattern, and find() on it returns null (by the way, the entire structure is a Composite design pattern, and by using each node type to handle part of traversal, I guess he’s also using a Chain of Responsibility — I mention this because there are many OO design patterns used in the Java implementation of clojure — some by name, so if you are going to look at the code, you’ll want to be familiar with these patterns).

If I assoc() onto a hash-map, root.assoc() is called which is supposed to return a new root. EmptyNode.assoc() simply creates a LeafNode with the hash, key and value, so a map with one key has its root set to a LeafNode.

If I call find() on a LeafNode, that’s pretty simple too. If this is the node I am looking for, it returns this, otherwise null.

If I assoc onto a LeafNode, here is what can happen:

  1. If the key and value match the LeafNode, then it is just returned — nothing is created since you haven’t changed anything.
  2. If just the key is the same as this node, a new LeafNode is created with the same key and the new value and returned (remember, this is a persistent data-structure, so we do not change the LeafNode, we create new nodes to build up a new hash-map).
  3. If the hash is the same, but the key is different, then a new HashCollisionNode is created that holds onto this LeafNode and a new one that has the same key and the new value. HashCollisionNodes simply hold onto an array of LeafNodes whose keys have the same hash.
  4. If the new key doesn’t have the same hash (the common case), then a BitmapIndexedNode will be created. At first it contains just the old LeafNode in its array of nodes, and then assoc() is called on it with the new key and value.

Case #4 is where we start to build up the trie. The hash of the key is going to be used to traverse the trie later, and a BitmapIndexedNode knows if it contains the LeafNode you are looking for. A BitmapIndexedNode has a bitmap of 32 bits. The bit corresponding to the bottom five bits of the hash of the LeafNode is turned on to indicate that the LeafNode is to be found in this branch (it can have 32 branches). For example if the bottom five bits of the key hash are 01100, then the 12th bit of the bitmap is turned on, and the node is put at this branch.

If the bitmap bit for the new leaf node is off, then a new BitmapIndexedNode is created with a new bitmap with the new bit turned on, and the node is put in its array which will be one bigger than the one we started with. If we kept doing this and hit each bit, the 32nd assoc would result in a FullNode — which is a simple case of a BitmapIndexedNode, except it knows that if it had a bitmap, all bits would be set. Again, remember, we create a new node because we never change nodes in a persistent data structure.

If the new leaf node has a bottom five bits that match a node already in this BitmapIndexedNode (the corresponding bit is already turned on), then we assoc() this key/value onto the node with the same five bits.

I haven’t mentioned this yet, but assoc takes a shift argument — so far, that has always been 0 and that meant to use the bottom five bits (use a 0 shift) of the hash in any BitmapIndexedNodes. On this assoc call, we’re now going to pass in (shift+5) so that the next 5 bits will be used on any BitmapIndexedNodes created under here. Later when we traverse, we’ll use the bottom five bits on the first BitmapIndexedNode we see, then the next five bits on the next one and so on. When we’ve used up all 32 bits of the hash, we should be at our LeafNode or a HashCollisionNode (meaning all 32 bits matched).

So here’s find():

  1. If the node is an EmptyNode, it returns null.
  2. If the node is a LeafNode, it returns itself if the key matches, otherwise null.
  3. If the node is a HashCollisionNode, it calls find() on each LeafNode it holds, if any return themselves, then that is returned, otherwise null.
  4. If the node is a BitmapIndexedNode, it first checks to see if it has a node where the five bits being used matches and then calls find() on it, if it doesn’t, it returns null.
  5. If the node is a FullNode, it acts like a BitmapIndexedNode with all bits of the bitmap set, so it just calls find() on the node corresponding the five bits in the hash at its level.

The data structure is persistent because it never changes constructed nodes. During an assoc, all non-traversed paths will be shared, and only the path that changed will be new.

I should also mention that FullNode is purely an optimization — a BitmapIndexedNode could be used in its place, but it saves some branch calls to have them sprinkled throughout the map. Also, using a bitmap in BitmapIndexedNode is also an optimization. An INode array of 32 elements could have always been constructed with null signifying that there was no nodes in that branch, but that would be very wasteful in space, and so by using the bitmap, BitmapIndexedNodes have an INode array of the exact length they need.