Data Clustering: Theory, Algorithms, and Applications

Guojun Gan, Chaoqun Ma, and Jianhong Wu
[Reviewed by Fabio Mainardi]
Data Clustering: Theory, Algorithms, and Applications describes more than 50 algorithms for clustering data, grouped according to the underlying methodology: center-based, search-based, graph-based, grid-based, density-based, and model-based. Hierarchical clustering and fuzzy clustering are covered as well. Pseudo-code is provided for each algorithm and the relevant mathematical concepts are succinctly presented. Theorems are generally stated but not proved; references to the relevant publications allow the interested reader to explore specific algorithms and concepts in more depth.
Part I of the book presents very general material on data types, data standardization, similarity measures, and visualization (more precisely, algorithms to visualize high-dimensional data, like t-SNE or MDS).
Part II, in addition to listing the various clustering algorithms, includes a chapter on the evaluation of clustering: “to compare the clustering results of different clustering algorithms, it is necessary to develop some validity criteria. Also, if the number of clusters is not given in the clustering algorithms, it is a highly nontrivial task to find the optimal number of clusters in the dataset” (from page 277). A variety of indices are covered, including the Rand index and the Dunn index.
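To make the idea of a validity index concrete, here is a minimal sketch of the Rand index in Python. It is not taken from the book: the function name `rand_index` and the toy label lists are assumptions chosen for illustration. The index is the fraction of point pairs on which two clusterings agree, that is, pairs placed together in both or apart in both.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of point pairs on which two clusterings agree.

    Hypothetical helper for illustration; labels are any hashable
    cluster identifiers, one per data point.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same points")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        raise ValueError("at least two points are required")
    # A pair counts as an agreement when both clusterings put the two
    # points in the same cluster, or both put them in different clusters.
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


# Toy labelings (made up for this example).
same_partition = rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed_partition = rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

Because the index depends only on which points are grouped together, renaming the clusters leaves it unchanged. This quadratic-time version is meant for reading, not for large datasets.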
Finally, Part III presents some software for clustering and Part IV is dedicated to two applications: clustering gene expression data and clustering variable annuity policies. Chapter 19 contains a listing of available packages for R and Python; note, however, that one can easily find up-to-date listings online.
The list of algorithms chosen by the authors is certainly not exhaustive; however, the most commonly used algorithms are included, the most conspicuous omission being perhaps spectral clustering.
For a reader looking for a textbook, this one might not be the optimal choice: no problems/exercises are proposed, and the presentation is somewhat terse. A classical reference for an introductory course would be, for example, Kaufman and Rousseeuw. I see this book more as a reference for the experienced data scientist, when one needs, for example, a quick reference on density-based algorithms, and pointers to the related bibliography. 
I have some doubts about how many readers will actually adopt the lightweight Java framework presented in Chapter 20, especially because the vast majority of researchers and industry practitioners today use R, Python, MATLAB, or SAS. I might be wrong, but I think that two more chapters with applications to ‘real-world’ problems would add a lot more value to the book.


Fabio Mainardi is a mathematician working as a senior data scientist at Nestlé Research, Switzerland. His mathematical interests are number theory, functional analysis, discrete mathematics and probability.