The aim of this book is to show how to carry out analyses of data sets — most particularly, computationally intensive analyses of very large data sets. It is split about half and half between data-oriented computing (using primarily C and the SQL database query language) and statistics. It is not an introductory statistics book. Instead, it is directed toward graduate students or independent researchers as a supplement to the standard first-year statistics texts.
That is not to say that the author does not have strong feelings about teaching statistics. Consider this, from the first chapter: “Statistics has two goals, which directly conflict. The first is to find patterns in static... The second goal is a fight against apophenia, the human tendency to invent patterns in random static.”
As a consequence, the author strongly favors separating the teaching of descriptive statistics from inferential statistics. This is something of an aside, but it is thought-provoking for anyone who has taught statistics and pondered the universe of confusions that inferential statistics can create in the minds of students. The author practices his belief by concentrating first on descriptive statistics and emphasizing how one can build statistical models and gain considerable understanding without resorting to inferential tests. There is nothing exotic or trendy here: the author rarely needs anything beyond ordinary least squares, maximum likelihood estimation, and bootstrapping. But he makes good and creative use of the basic tools.
The half of the book that focuses on data-oriented computing spends a fair amount of time teaching the basics of the C language. The author does this competently and with humor, but it will never make for fascinating reading. C code is prominent throughout the book, but things get more interesting once we reach the statistical modeling and see some examples.
The book would benefit from more extended examples and perhaps a more detailed case study or two. Where the author shines is his common sense and the practical tips he offers along the way. I have never seen a better short summary of the common probability distributions than the one that appears on page 235 with the heading “Every probability distribution tells a story.”
This is not a book for everyone. However, if you or your students are interested in getting down and dirty with massive amounts of data, and writing code to make sense of it, then this would be a great book for you or for them.
Bill Satzer is a senior intellectual property scientist at 3M Company, having previously been a lab manager at 3M for composites and electromagnetic materials. His training is in dynamical systems and particularly celestial mechanics; his current interests are broadly in applied mathematics and the teaching of mathematics.