Monthly Archives: January 2008

Book Review: Collective Intelligence: Part I

51AnWLR89xL._AA240_

I’m at about the halfway point in Programming Collective Intelligence by Toby Segaran, and even if the second half was blank, I’d be telling everyone to read this book. It’s been a while since I’ve read a technical book this good. It excels because of a combination of three factors:

  1. Useful algorithms
  2. Shown in real-life situations
  3. With real data that you can get yourself

The basic idea of the book is to show different ways of using user-generated datasets to extract information and make decisions. Each chapter tackles a different problem and shows a few ways to approach it. All of the examples are in Python, but even if you don’t know Python, you should be fine if you know any structured language. Practically every example either uses some dataset that he tells you how to get, or a public API that is free to access.

This book is the first I’ve read that relies on the current maturity level of the web. A few years ago, it wouldn’t have been possible to provide so many relevant examples. In a few years, the API’s might be out-of-date — but don’t let that stop you — the algorithms are timeless.

The choice of Python is a good one for a few reasons. Firstly, it’s pretty clear to read even if you don’t know it. Secondly, list-comprehensions are a pretty good way to think about data crunching (which these algorithms do a lot), and, finally, there are many freely available Python libraries to deal with the public APIs he uses.

To give you an idea of Toby’s approach, I’m going to run through my favorite chapter so far, the one on Optimization. The quick description of the problem domain should be familiar to every programmer — solve NP complete problems like the traveling salesperson. Basically, Optimization techniques can be used in situations where (1) you can’t analytically solve the problem (2) the solution set is too large to exhaustively search (3) you can express the cost of a possible solution such that better solutions have a lower cost than worse ones and (4) solutions that are “near” each other have similar values for cost.

Toby’s first example is how a family can choose flights so that they arrive at the same airport from all over the country at nearly the same time and depart for home a few days later at nearly the same time. The cost function is a combination of the actual costs of the tickets and the difference in time between the earliest and latest flights. He goes on to provide a dataset of flights, describes setting up optimization problems in general, and shows the python code to implement the cost function.

Then he introduces a few optimization algorithms, implements them, and discusses their relative strengths and weaknesses. The next section applies these techniques to real data on Kayak.com using the standard Python DOM parser, minidom. The final section shows how to apply these techniques to other problems like pairing students in a dorm and network visualization. Other chapters get a similarly exhaustive treatment.

I really hope that this is the future of technical books.