Practitioners of exploratory data analysis who use MATLAB will want a copy of this book. Exploratory data analysis (EDA) involves trying to discover structure in data. The authors discuss many EDA methods, including graphical approaches. With the book comes the EDA Toolbox (downloadable from the text website) for use with MATLAB. It contains code for all of the algorithms discussed in the text.
The authors do not attempt to provide thorough theoretical underpinnings for the algorithms. The focus is on praxis. Yet the book is not merely a catalogue of routines. On the contrary, each method is introduced with an explanation of the ideas behind it. Equations used in the algorithm are often derived or at least explained. Then an example using real data illustrates the algorithm.
To get a feel for the exposition, let us observe how two methods, hierarchical clustering and projection pursuit, are handled. First consider hierarchical clustering. The introduction to the method explains, among other things, the difference between agglomerative clustering and divisive clustering. The authors choose to cover only the former. They justify this decision by stating that divisive clustering is computationally expensive, but they add two qualifications to this assertion along with relevant pointers to the literature. The authors proceed to explain the gist of agglomerative clustering and describe the steps of the algorithm. Next follows a list of various metrics for measuring the distance between clusters, with a mention of the pros and cons of each metric. To round off the discussion of the algorithm, dendrograms are introduced as a sensible way to display cluster structure. (Dendrograms will receive a full treatment in the final unit of the book in which all of the graphical techniques are covered together.)
Finally, an example illustrates the clustering. Some genetic data from yeast (in the form of a 384-by-17 matrix available from the text website) is clustered, and the resulting structure is displayed in a dendrogram. The dendrogram is unbalanced, and the authors note that this is a not uncommon result for the metric they used. So they redo the clustering using another metric, and a balanced dendrogram appears. This is a good illustration of how the authors strategically inject helpful observations and guidance into the examples throughout the book.
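To make the effect of the between-cluster metric concrete, here is a minimal sketch of naive agglomerative clustering. It is written in plain Python and is not the book's MATLAB code. The choice of single versus complete linkage, the function names, and the toy one-dimensional data are all assumptions made for illustration; the example above does not depend on which two metrics the authors actually compare.

```python
from itertools import combinations
from math import dist


def agglomerate(points, linkage="single"):
    """Naive agglomerative clustering (illustrative, not the book's code).

    Returns the merge history as (members_a, members_b, distance) tuples,
    where members are tuples of indices into `points`.
    """
    # Single linkage uses the closest pair of points between two clusters,
    # complete linkage the farthest pair.
    reduce = min if linkage == "single" else max
    clusters = [(i,) for i in range(len(points))]
    history = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest linkage distance.
        best = None
        for a, b in combinations(clusters, 2):
            d = reduce(dist(points[i], points[j]) for i in a for j in b)
            if best is None or d < best[2]:
                best = (a, b, d)
        a, b, d = best
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(tuple(sorted(a + b)))
        history.append((a, b, d))
    return history


def merge_sizes(history):
    """Sizes of the two clusters joined at each step, smaller first."""
    return [tuple(sorted((len(a), len(b)))) for a, b, _ in history]


# Hypothetical toy data: points on a line with slowly growing gaps.
points = [(0.0,), (1.0,), (2.1,), (3.3,), (4.6,), (6.0,)]
single = agglomerate(points, "single")
complete = agglomerate(points, "complete")
print("single  :", merge_sizes(single))
print("complete:", merge_sizes(complete))
```

On this toy data, single linkage absorbs one point at a time into a growing chain, which corresponds to an unbalanced dendrogram, while complete linkage first forms pairs and then joins the pairs, which corresponds to a more balanced one.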
Examples are crafted to fit the demands of the algorithms they illustrate. The code provided in the clustering example, for instance, is only a few lines long. A much more drawn-out example accompanies the discussion of projection pursuit, the core of which is implemented via the EDA Toolbox function ppeda. Several pages are spent explaining the ideas and building up the algorithm. Then an example is presented using another data set available from the text website, this one involving the grain sizes from various sand samples. The function ppeda requires five inputs, and some preprocessing is needed before calling it. So the authors present MATLAB code fragments that set typical values for the inputs. They also present code that spheres the data, which is necessary prior to calling ppeda. After the data has been readied and all of the inputs determined, ppeda is finally called. Because the ppeda function itself includes no plotting, the authors provide a final code fragment that produces scatterplots of the data. The example is capped with a figure containing scatterplots of a couple of projections of the sand data.
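For readers unfamiliar with the term, sphering means centering the data and transforming it so that its sample covariance matrix is the identity. The following is a minimal sketch in plain Python, not the book's MATLAB fragment. It assumes a Cholesky-based whitening transform, which is one valid way to sphere data and differs from an eigenvector-based transform only by a rotation. The function names and the small data set are hypothetical.

```python
from math import sqrt


def covariance(rows):
    """Return the centered data and its sample covariance matrix."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    return centered, cov


def cholesky(a):
    """Lower-triangular factor `low` with low * low^T = a (a positive definite)."""
    d = len(a)
    low = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = sqrt(s) if i == j else s / low[j][j]
    return low


def sphere(rows):
    """Center the data and whiten it so its sample covariance is the identity."""
    centered, cov = covariance(rows)
    low = cholesky(cov)
    d = len(low)
    sphered = []
    for c in centered:
        z = []
        for i in range(d):
            # Forward substitution: solve low * z = c for z.
            z.append((c[i] - sum(low[i][k] * z[k] for k in range(i))) / low[i][i])
        sphered.append(z)
    return sphered


# Hypothetical two-variable data with correlated columns.
data = [[2.0, 1.0], [3.0, 4.0], [5.0, 4.5], [7.0, 9.0], [8.0, 8.5]]
sphered = sphere(data)
_, cov_sphered = covariance(sphered)
print(cov_sphered)  # approximately the 2-by-2 identity matrix
```

Removing location, scale, and correlation in this way means that any structure a projection pursuit search finds afterward is not simply a reflection of the data's covariance.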
Although theory is not the focus of the text, the summaries provided along the way are helpful, especially those at the beginning of chapters and sections. They are particularly useful for orienting the reader whose background in statistics and data analysis is less than comprehensive. Here is the opening of Chapter 7 on smoothing scatterplots:
In many applications, we might make distributional and model assumptions about the underlying process that generated the data, so we can use a parametric approach in our analysis. The parametric approach offers many benefits (known sampling distributions, lower computational burden, etc.), but it can be dangerous if the assumed model is incorrect. At the other end of the data-analytic spectrum, we have the non-parametric approach, where one does not make any formal assumptions about the underlying structure or process. When our goal is to summarize the relationship between variables, then smoothing methods are a bridge between these two approaches. Smoothing methods make the weak assumption that the relationship between variables can be represented by a smooth curve or surface. In this chapter, we cover several methods for scatterplot smoothing, as well as some ways to assess the results.
A beginner could even pick up some statistical practice simply by reading this book. Here is some of the authors’ guidance on checking assumptions:
We can construct a normal probability plot to determine whether the normality assumption is reasonable… [It] can be used when we have just one predictor or many predictors. One could also construct a histogram of the residuals to visually assess the distribution.
To check the assumption of constant variance, we can plot the absolute value of the residuals against the fitted values…Here we expect to see a horizontal band of points with no patterns or trends.
Finally, to see if there is bias in our estimated curve, we can graph the residuals against the independent variables, where we would also expect to see a horizontal band of points. This is called a residual dependence plot [Cleveland, 1993]. If we have multiple predictors, then we can construct one of these plots for each variable. As the next example shows, we can enhance the diagnostic power of these scatterplots by superimposing a loess smooth.
The example carries all of this out, displaying the code and the resulting plots.
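The three checks quoted above reduce to three sets of plot coordinates. Here is a minimal sketch in plain Python, not the book's MATLAB code. It assumes an ordinary least-squares straight line in place of the loess smooth used in the book, a common plotting-position formula for the normal quantiles, and made-up data. All function and key names are hypothetical.

```python
from statistics import NormalDist, mean


def fit_line(x, y):
    """Ordinary least-squares fit; returns (intercept, slope)."""
    xb, yb = mean(x), mean(y)
    slope = (sum((a - xb) * (b - yb) for a, b in zip(x, y))
             / sum((a - xb) ** 2 for a in x))
    return yb - slope * xb, slope


def diagnostics(x, y):
    """Coordinates for the three residual plots described in the quotation."""
    intercept, slope = fit_line(x, y)
    fitted = [intercept + slope * a for a in x]
    resid = [b - f for b, f in zip(y, fitted)]
    n = len(resid)
    # Assumed plotting positions (i - 0.5) / n for the normal quantiles.
    quantiles = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return {
        # Normality: sorted residuals against standard normal quantiles.
        "normal_plot": list(zip(quantiles, sorted(resid))),
        # Constant variance: absolute residuals against fitted values.
        "spread_plot": list(zip(fitted, [abs(r) for r in resid])),
        # Bias: residuals against the predictor (residual dependence plot).
        "dependence_plot": list(zip(x, resid)),
    }


# Hypothetical data lying roughly on a straight line.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]
plots = diagnostics(x, y)
for name, pairs in plots.items():
    print(name, [(round(u, 2), round(v, 2)) for u, v in pairs])
```

Each list of pairs can be passed to any scatterplot routine. Roughly linear points in the first plot and patternless horizontal bands in the other two are what the quoted guidance says to look for.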
So this book does not merely document routines; it shows how to do EDA. The helpful summaries, intuitive explanations, and comprehensive examples make the text so much more than a software cookbook. The presentation is varied (as with hierarchical clustering versus projection pursuit), so that methods that require shorter or longer explanations or examples are treated appropriately. The authors have done a great service by bringing together so many EDA routines, but their main accomplishment in this dynamic text is providing the understanding and tools to do EDA.
To this end, all of the code fragments that constitute a given example are provided together in a MATLAB m-file on the text website. This is done for each example in the book. So for any algorithm, a user could simply run the code from the example as is (making appropriate assignments to variables). Or the user could modify the code, for example packaging the fragments into a larger function, thereby automating the entire process. The authors thus provide not only a library of valuable EDA functions via the EDA Toolbox; they also provide, via the examples, typical and easily modifiable frameworks in which to call the functions.
The EDA Toolbox is even user-friendly in that much of its functionality is available not only from the MATLAB command line, but also via GUIs. A master GUI leads to the other GUIs, but they can be accessed independently as well. Each data analysis GUI has graphing capabilities and/or a convenient button that opens up the Graphical EDA GUI in which various graphical display options are available. There are other linkages across the GUIs, for example buttons to the dimensionality reduction GUI for data preprocessing. (A separate time-saving GUI not incorporated into this framework nor listed in the appendix with the rest of the GUIs generates data from one of six finite mixture models discussed in the text.)
An appendix lists many downloadable MATLAB toolboxes that implement various EDA algorithms. Another appendix lists the functions in the EDA Toolbox and the EDA Toolbox GUIs along with other MATLAB functions that are useful for EDA. Many of the latter are included only in the MATLAB Statistics Toolbox, so ideally it will be included in the user’s MATLAB installation. For MATLAB beginners, a brief introduction to the software is provided in another appendix. It includes pointers to further resources online.
This text, along with the EDA Toolbox, is an excellent resource. Even readers with limited background can quickly begin analyzing data and plotting it in interesting ways. For practitioners of exploratory data analysis who use MATLAB, and ideally also the Statistics Toolbox, I highly recommend this book.
David A. Huckaby is associate professor of mathematics at Angelo State University.