This monograph (not really a textbook, although there are minimal exercises at the end of each chapter) is aimed at professionals in business, government, or other fields who might be thrust into a data mining/statistical analysis project with no knowledge of the field. The writing is at a very low mathematical level, so the book should be accessible to all its intended readers. The intent seems mostly to make readers conversant with what might be involved in such a project, rather than to make them technical experts.
The organization of the book mirrors the steps one should take in such a project: problem definition, preparation of data, constructing tables and graphs, doing statistics, grouping observations into like clusters and forming associative rules, making predictions (via models, neural nets, etc.), and finally producing a report or other such deliverable and implementing recommendations. At each step, an overview of what might be undertaken is presented, not an exhaustive catalog of all possible analyses. The details of data cleaning (for example), which is acknowledged to be “one of the most time-consuming parts of a data analysis,” are mercifully left to the experts: the technical staff (IT types) and those with the specific subject matter expertise to make the necessary judgments.
The biggest strength of the book is its good description of the logic behind such complex topics as cluster analysis, decision/classification and regression trees, and neural networks, which is easily accessible to a nontechnician. The targeted “manager” will understand the basic concept, while details are left to the experts.
There are some errors and some (to this reviewer, major) omissions in the text, however. While discussing t-tests for comparing two means, the author describes only the pooled test. While probably correct for his example, this is against current wisdom: with technology, it’s safer to always use the unpooled test. He states that the hypotheses in an analysis of variance are that the sample means are/are not equal; the test actually concerns the population means. He also states (p. 92) that if “r is around 0 then there appears to be little or no relationship between the variables.” This last statement gives rise to much bad statistics: a correlation around 0 simply means there is little or no linear relationship between the variables. One must always plot the data; many curved relationships give correlations around 0. In fact, he examines a linear correlation for variables that exhibit a definite curved relationship.
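The point about correlation can be checked with a few lines of code. The following is a minimal sketch, not an example from the book: the data are made up for illustration, and `pearson_r` is a hypothetical helper written here using only the Python standard library. It compares a perfectly curved relationship (y = x²) with a perfectly linear one over the same symmetric range of x values.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations about the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Illustrative (made-up) data: x values symmetric about 0.
xs = [-3, -2, -1, 0, 1, 2, 3]
curved = [x ** 2 for x in xs]      # y is completely determined by x, but not linearly
linear = [2 * x + 1 for x in xs]   # y is an exact linear function of x

r_curved = pearson_r(xs, curved)
r_linear = pearson_r(xs, linear)

print(f"r for y = x^2:    {r_curved:.3f}")
print(f"r for y = 2x + 1: {r_linear:.3f}")
```

Here y = x² depends entirely on x, yet its correlation with x is 0, while the linear case gives 1. A scatterplot of the first data set would show the relationship immediately, which is why the plot should come before the coefficient.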
Now to the major omissions. While correctly stating that a chi-squared test of association does not tell you what sort of relationship might be present, he fails to go back to the table and examine the contributions to the statistic and the expected values; these do give a clear indication of the relationship. The author focuses solely on simple linear regression (or transforming data to linearize it); in most data mining situations, the focus will be on building a multiple regression model, perhaps some form of logistic regression or other categorical model, etc.
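To illustrate what going back to the table can reveal, here is a minimal sketch using a hypothetical 2×2 table of counts (the numbers and group labels are invented for illustration and do not come from the book). It computes the expected counts under independence and each cell's contribution, (observed − expected)² / expected, to the chi-squared statistic, using only the Python standard library.

```python
# Hypothetical observed counts: rows are groups A and B,
# columns are outcomes "yes" and "no".
observed = [
    [30, 10],
    [20, 40],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Expected count under independence: (row total * column total) / grand total.
expected = [[r * c / total for c in col_totals] for r in row_totals]

# Each cell's contribution to the chi-squared statistic.
contributions = [
    [(o - e) ** 2 / e for o, e in zip(obs_row, exp_row)]
    for obs_row, exp_row in zip(observed, expected)
]

chi_squared = sum(sum(row) for row in contributions)

for label, obs_row, exp_row, con_row in zip("AB", observed, expected, contributions):
    for outcome, o, e, c in zip(("yes", "no"), obs_row, exp_row, con_row):
        direction = "more" if o > e else "fewer"
        print(f"group {label}, {outcome}: observed {o}, expected {e:.1f}, "
              f"contribution {c:.2f} ({direction} than expected)")

print(f"chi-squared statistic: {chi_squared:.2f}")
```

In this made-up table, group A has more "yes" outcomes than independence would predict (30 observed against 20 expected) and group B has fewer, and the cells in group A contribute most to the statistic. That direction and location of the departure is exactly the information the test statistic alone does not convey.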
Lastly, some of his references (Agresti for categorical data analysis and Kleinbaum et al. for applied regression analysis, for example) will be inaccessible to the target audience.
Patricia Humphrey is an Associate Professor of Statistics in the Department of Mathematical Sciences at Georgia Southern University. She has been a member of Project NExT-SE since 1998, and has served as a co-leader for Section NExT for several years. She is Chair-Elect of the SIGMAA-StatEd for 2007. She is the author of several ancillary technology manuals for introductory statistics.
