20 Days of Clojure: Day 7 | Lou Franco: code, apps, and writings

Ok, I’m going to try to describe how vectors are implemented in clojure. The implementation is in:

src/jvm/clojure/lang/PersistentVector.java

if you want to see the code. I’m going to try to go through the important bits here.

The PersistentVector object has three members: two ints named cnt and shift, and an Object[] named root. A simple PersistentVector built from this line:

(def v1 [1 2 3])

would look like this (using curly braces for root because it is a Java array):

   cnt: 3
   shift: 0
   root: { 1, 2, 3 }

cnt is the count of elements, so (count v1) simply returns it and is O(1). I’ll explain shift later. root is the object array. When shift is 0, this line:

(nth v1 1)

simply resolves to root[1] and returns 2, which in this simple case is also O(1). If I do this:

(def v2 (assoc v1 1 4))

which returns a new vector, but with the value at index 1 set to 4, you get another PersistentVector that looks like this:

   cnt: 3
   shift: 0
   root: { 1, 4, 3 }

The 1 and the 3 are shared between the v1 and v2 arrays. If I do this:

(conj v1 5)

I’ll get yet another PersistentVector that looks like this:

   cnt: 4
   shift: 0
   root: { 1, 2, 3, 5 }
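To make the single-level case concrete, here is a minimal sketch of assoc and conj on a flat root array (shift 0). The class and method names (FlatVectorSketch, assocFlat, conjFlat) are made up for illustration and are not taken from PersistentVector.java; the sketch only assumes what is described above, namely that a new array is built and the element references are shared.

```java
import java.util.Arrays;

// Hypothetical illustration class; not part of Clojure.
final class FlatVectorSketch {
    // assoc on a single-level vector: copy the array, then change one slot.
    static Object[] assocFlat(Object[] root, int i, Object val) {
        Object[] ret = root.clone();   // shallow copy: element references are shared
        ret[i] = val;
        return ret;
    }

    // conj while everything still fits in one level: copy into a longer array.
    static Object[] conjFlat(Object[] root, Object val) {
        Object[] ret = Arrays.copyOf(root, root.length + 1);
        ret[root.length] = val;
        return ret;
    }

    public static void main(String[] args) {
        Object[] v1 = {1, 2, 3};
        Object[] v2 = assocFlat(v1, 1, 4);
        Object[] v3 = conjFlat(v1, 5);
        System.out.println(Arrays.toString(v1)); // [1, 2, 3]
        System.out.println(Arrays.toString(v2)); // [1, 4, 3]
        System.out.println(Arrays.toString(v3)); // [1, 2, 3, 5]
        System.out.println(v1[0] == v2[0]);      // true: same element object
    }
}
```

The original array is never written to, which is why v1 still prints as [1, 2, 3] after both operations.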

with the 1, 2, and 3 shared with v1 (and the 1 and 3 shared with v2). This is all very simple until you conj onto a vector of 32 elements. When the root array has 32 elements, adding one more element (the 33rd) returns a new PersistentVector that looks like this (assume the starting array had 1, 2, 3, …, 32):

   cnt: 33
   shift: 5
   root: { {1, 2, 3, …, 32 }, { 33 } }

Root is a Java Object[] with the 0th element set to the Object[] from the input vector to conj (not a clone, but the actual one), and the next element is an array of just the new value. If I conj onto that, I get a new Vector:

   cnt: 34
   shift: 5
   root: { {1, 2, 3, …, 32 }, { 33, 34 } }
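The two steps above (going from 32 to 33 elements, then from 33 to 34) can be sketched roughly as follows. This is my own illustration of the behavior as described, not the code from PersistentVector.java; OverflowSketch and both method names are hypothetical, and the sketch only handles these two cases.

```java
import java.util.Arrays;

// Hypothetical illustration class; covers only the two cases described above.
final class OverflowSketch {
    static final int WIDTH = 32;

    // 32 -> 33 elements: the full array becomes branch 0 of a new root (no clone),
    // and the new value goes into a fresh one-element array as branch 1.
    static Object[] conjOntoFullLeaf(Object[] fullLeaf, Object val) {
        if (fullLeaf.length != WIDTH)
            throw new IllegalArgumentException("leaf must hold exactly 32 elements");
        return new Object[] { fullLeaf, new Object[] { val } };
    }

    // 33 -> 34 elements: clone the root, and replace the last leaf with a longer copy.
    // Assumes the last leaf is not yet full.
    static Object[] conjIntoLastLeaf(Object[] root, Object val) {
        Object[] newRoot = root.clone();
        int last = root.length - 1;
        Object[] oldLeaf = (Object[]) root[last];
        Object[] newLeaf = Arrays.copyOf(oldLeaf, oldLeaf.length + 1);
        newLeaf[oldLeaf.length] = val;
        newRoot[last] = newLeaf;
        return newRoot;
    }

    public static void main(String[] args) {
        Object[] leaf = new Object[WIDTH];
        for (int k = 0; k < WIDTH; k++) leaf[k] = k + 1;   // 1..32

        Object[] root33 = conjOntoFullLeaf(leaf, 33);
        Object[] root34 = conjIntoLastLeaf(root33, 34);

        System.out.println(root33[0] == leaf);                       // true: reused, not cloned
        System.out.println(Arrays.toString((Object[]) root34[1]));   // [33, 34]
        System.out.println(Arrays.toString((Object[]) root33[1]));   // [33]
    }
}
```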

Now I can explain shift. It records how many levels deep the structure is: when a method that takes an index is called, the index is bit-shifted at each level to pick the branch that leads to the part of the data structure holding the element at that index. For instance, here is the main Java for nth(i):

      Object[] arr = root;
      for(int level = shift; level > 0; level -= 5)
         arr = (Object[]) arr[(i >>> level) & 0x01f];
      return arr[i & 0x01f];

So, with shift at 5, when i < 32, i >>> level is 0 and arr will be root[0] (the array of 32 elements). Then we return arr[i & 0x01f] (which is i % 32) to get the ith element in that array.
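To see the index arithmetic in action, here is a small sketch that wraps the nth loop from above in a static method and runs it against the 34-element, two-level structure (shift 5). NthSketch and its main method are mine, added only so the loop can be run on its own; the loop body is the one quoted above.

```java
// Hypothetical wrapper class around the nth loop quoted in the post.
final class NthSketch {
    static Object nth(Object[] root, int shift, int i) {
        Object[] arr = root;
        // Walk down one level per 5 bits of shift, picking a branch each time.
        for (int level = shift; level > 0; level -= 5)
            arr = (Object[]) arr[(i >>> level) & 0x01f];
        // The low 5 bits select the slot inside the bottom-level array.
        return arr[i & 0x01f];
    }

    public static void main(String[] args) {
        Object[] leaf0 = new Object[32];
        for (int k = 0; k < 32; k++) leaf0[k] = k + 1;   // 1..32
        Object[] leaf1 = { 33, 34 };
        Object[] root = { leaf0, leaf1 };

        for (int i : new int[] { 0, 31, 32, 33 }) {
            System.out.println("i=" + i
                + " branch=" + ((i >>> 5) & 0x01f)
                + " slot=" + (i & 0x01f)
                + " value=" + nth(root, 5, i));
        }
    }
}
```

Indexes 0 and 31 land in branch 0, while 32 and 33 land in branch 1 at slots 0 and 1.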

When i == 32, then (i >>> level) is 1, arr is root[1], and then we return arr[i%32] which is the 0th element. Now, if I do an assoc to set the 0th element to 100, I get this PersistentVector:

   cnt: 34
   shift: 5
   root: { {100, 2, 3, 4, …, 32 }, { 33, 34 } }

assoc calls a recursive function (doAssoc). First it clones root so that the new root is an array of two elements, each an object array. Then it determines that index 0 is in the 0th branch and does an assoc on that, decrementing the shift by 5; the index is passed along unchanged and only masked down to (index % 32) at the bottom level. This recursive call clones the array at root[0]. Since shift is now 0, it is at the base case of the recursion, and so it sets element 0 of the cloned array to 100. All of the other numbers and the entire array at root[1] are shared with the starting vector. Here is the Java code for that (doAssoc is called with the arguments shift, root, the index, and the new value); the return value is the new root:

   private static Object[] doAssoc(int level, Object[] arr, int i, Object val){
      Object[] ret = arr.clone();
      if(level == 0)
         {
         ret[i & 0x01f] = val;
         }
      else
         {
         int subidx = (i >>> level) & 0x01f;
         ret[subidx] = doAssoc(level - 5, (Object[]) arr[subidx], i, val);
         }
      return ret;
   }
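Here is a small runnable sketch that exercises doAssoc on the 34-element example and checks which arrays end up shared. DoAssocSketch and sampleRoot are hypothetical names of mine; doAssoc itself is reproduced from above (reformatted) so that the example compiles on its own.

```java
// Hypothetical wrapper class; doAssoc is the method quoted in the post.
final class DoAssocSketch {
    static Object[] doAssoc(int level, Object[] arr, int i, Object val) {
        Object[] ret = arr.clone();              // copy only the array on the path
        if (level == 0) {
            ret[i & 0x01f] = val;                // bottom level: set the slot
        } else {
            int subidx = (i >>> level) & 0x01f;  // branch to descend into
            ret[subidx] = doAssoc(level - 5, (Object[]) arr[subidx], i, val);
        }
        return ret;
    }

    // Hypothetical helper: builds the two-level root { {1..32}, {33, 34} }.
    static Object[] sampleRoot() {
        Object[] leaf0 = new Object[32];
        for (int k = 0; k < 32; k++) leaf0[k] = k + 1;
        Object[] leaf1 = { 33, 34 };
        return new Object[] { leaf0, leaf1 };
    }

    public static void main(String[] args) {
        Object[] oldRoot = sampleRoot();
        Object[] newRoot = doAssoc(5, oldRoot, 0, 100);

        System.out.println(((Object[]) newRoot[0])[0]);  // 100
        System.out.println(((Object[]) oldRoot[0])[0]);  // 1: the old vector is untouched
        System.out.println(newRoot[1] == oldRoot[1]);    // true: branch 1 is shared
        System.out.println(newRoot[0] == oldRoot[0]);    // false: branch 0 was cloned
    }
}
```

Only the arrays on the path from the root to the changed slot are copied; everything else is reused by reference.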

By now, you might have realized that we are building up a tree. If I keep conjing on numbers in order starting at 1, eventually I will conj on 65. At that point, I get this:

   cnt: 65
   shift: 5
   root: { {1, 2, 3, 4, …, 32 }, { 33, 34, 35, … 64 }, { 65 } }

Graphically, it looks like this (for 1..32 elements):

+--------------+
| {1,2,3,…,32} |
+--------------+

and then for up to 32² elements, it looks like this:

              +-------+
              | {   } |
              +--+-+--+
                 | |
       +---------+ +----------+
       |                      |
+------+-------+     +--------+--------+
| {1,2,3,…,32} |     | {33,34,35,…,64} |
+--------------+     +-----------------+

And we get another level at 32²+1 (1025) elements, and another each time the count passes the next power of 32 (32ᵏ+1). This explains how index access is O(log₃₂N). Another important point is that a PersistentVector is not only immutable from an interface perspective but also internally immutable: each individual Object[] that makes up the tree is never changed in a constructed vector. If it needs to change in a new vector, it is cloned and the changed clone is used in the new PersistentVector. The benefit of this is that the vector never needs to lock. This may seem like a small point, and I believe it is not a requirement of persistence, but this extra work makes vector very suitable for concurrent applications.
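As a quick check of the growth rule, here is a sketch that computes how many levels a vector of a given size would have, and the matching shift. It follows the rule as described in this post (a new level each time the count passes a power of 32) and is not read from PersistentVector.java; DepthSketch, levels, and shiftFor are hypothetical names.

```java
// Hypothetical illustration of the growth rule described in the post.
final class DepthSketch {
    // Number of array levels needed to hold cnt elements, 32 per array.
    static int levels(long cnt) {
        int levels = 1;
        long capacity = 32;          // one level holds up to 32 elements
        while (cnt > capacity) {
            capacity *= 32;          // each extra level multiplies capacity by 32
            levels++;
        }
        return levels;
    }

    // shift grows by 5 bits per extra level, so a flat vector has shift 0.
    static int shiftFor(long cnt) {
        return 5 * (levels(cnt) - 1);
    }

    public static void main(String[] args) {
        for (long n : new long[] { 32, 33, 1024, 1025, 32768, 32769 }) {
            System.out.println(n + " elements: levels=" + levels(n)
                + " shift=" + shiftFor(n));
        }
    }
}
```

A million elements needs only 4 levels, which is why the lookup cost grows so slowly.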

This implementation is different from the one described in the Okasaki paper from yesterday. Rich appears to be trading away O(1) consing for better index performance, which was O(log₂N) in that paper.

(clojure day (March 20) in Northampton, MA is coming soon)

Update: Made a few corrections
