
Statistical Modeling and Machine Learning for Molecular Biology

Alan M. Moses
Publisher: Chapman & Hall/CRC
Publication Date: 2017
Number of Pages: 264
Format: Paperback
Series: Chapman & Hall/CRC Mathematical and Computational Biology Series
Price: 69.95
ISBN: 9781482258592
Category: Monograph

[Reviewed by Christopher P. Thron, on 09/18/2017]

This book is intended for biologists, not mathematicians. The stated goal is to serve as a “travel guide” to statistics and machine learning. The comparison is apt. Like most travel guides, it gives only surface-level descriptions of most of the areas visited (as the author himself admits). The most thoroughly covered topics are from classical statistics: hypothesis tests, multiple testing (which is apparently a critical issue in genetics-related applications), parameter estimation (several elementary maximum likelihood calculations are worked out in extensive detail), and regression (univariate and multivariate). Topics ordinarily classified as “machine learning” that receive a fair amount of attention include techniques for clustering (hierarchical, k-means, k-medoids, Gaussian mixture) and classification (logistic regression, LDA, Naïve Bayes). More recent popular (and powerful) techniques such as Support Vector Machines and Random Forests receive only passing mention. Neural networks are barely touched on.

Mathematicians might be drawn to this book looking for a reference that will help them understand applications in molecular biology. Unfortunately, the book is not well-suited for such a purpose, because no attempt is made to explain either the terminology or the practical situations being addressed (“expression levels”, “annotations”, “gene enrichment”, “eQTLs”, and so on).

The back cover asserts that the book “assumes no background in statistics”. This is publisher’s hyperbole. Beginners would be utterly confused by the sketchy expositions of basic concepts, as well as the notational sloppiness. The concepts of random variable and event are never defined, and no notational distinction is made between random variables and the values that they may assume. The difference between a probability density and a probability mass function is never explained: indeed in Figure 2.5, the two are conflated. The author tends to make up his own informal notations. This is a questionable practice in an introductory book, whose readers are presumably just starting out in the field, and are not well-grounded in the basic concepts. The notational laxness simply introduces more occasions for confusion.

Besides carelessness with the notation, the presentation is marred by inaccurate statements scattered throughout the book, including:

  • (p. 23) “If 2 events are independent, then the conditional probability should be the same as the joint probability” (should say “unconditional” instead of “joint”)
  • (p. 30) The sums for the Wilcoxon test statistics are indexed incorrectly, and the two “different” statistics given are negatives of each other (they sum to 0).
  • (p. 38) Referring to a uniform distribution, “The probability of observing X is 1/ (maximum – minimum)” (this is the density, not a probability; for a continuous distribution the probability of observing any particular value is zero).
  • (p. 69) A diagonal covariance leads to an isotropic distribution (this is only true if all diagonal entries are equal).
  • (p. 79) The author claims that p-values can’t be generalized to higher dimensions because one cannot characterize multidimensional data points that are “more extreme” than the observation. But such a characterization is straightforward: points with lower probability density are “more extreme.” Given a multidimensional observation and a hypothesized distribution, the corresponding p-value is easy to estimate by Monte Carlo (a sketch follows this list).
  • (p. 145) “Regression aims to model the statistical dependence between i.i.d. observations in two or more dimensions.” In fact, no assumptions are made on the distribution of the independent variables, which are not even treated as random variables in the mathematical analysis. Also on the same page, the author assumes homoscedasticity without telling the reader (the standard model, written out after this list, makes both points explicit).
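
To illustrate the p. 79 point, here is a minimal sketch (my own, not taken from the book) of a Monte Carlo estimate of a multidimensional p-value, where “more extreme” means “lower density under the null.” The standard bivariate normal null and the particular observed point are arbitrary choices made purely for illustration.

    # Hypothetical illustration: Monte Carlo p-value in two dimensions.
    import numpy as np
    from scipy.stats import multivariate_normal

    null = multivariate_normal(mean=[0.0, 0.0], cov=np.eye(2))   # assumed null distribution
    observed = np.array([1.8, -2.1])                             # the multidimensional observation
    samples = null.rvs(size=100_000, random_state=0)             # draws from the null

    # The p-value is the null probability of a point at least as "extreme",
    # i.e., with density no larger than the observation's density.
    p_value = np.mean(null.pdf(samples) <= null.pdf(observed))
    print(p_value)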
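
For the p. 145 point, the usual textbook form of the simple linear regression model (my paraphrase, not the book’s notation) makes both issues visible: the independent variables enter as fixed, non-random quantities, and the errors share a single variance, which is exactly the homoscedasticity assumption:

    y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad \varepsilon_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2), \qquad i = 1, \dots, n.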

The figures, too, would have benefited from proofreading. The 3-D graphs in Figure 4.2 are scarcely recognizable as such, and several figures use lowercase letters for variables that are uppercase in the text.

One very attractive feature of the book is the author’s obvious enthusiasm for the mathematics, not just as a tool but as something wonderful and amazing in its own right.

A second edition that addresses the shortcomings noted above could provide a valuable bridge between molecular biologists and mathematicians.


Christopher Thron (thron@tamuct.edu) is associate professor of mathematics at Texas A&M University-Central Texas. Previously he was a systems engineer for NEC, Motorola, and Freescale. His interests include computational mathematics (algorithms, modeling, and simulation), especially the design and provision of high-quality training in computational mathematics in developing countries.

Introduction

Statistical modeling
Statistics and probability
Multiple testing
Multivariate statistics and parameter estimation

Clustering
Distance-based
Gaussian mixture models
Simple linear regression
Multiple regression and generalized linear models
Regularization
Linear classification
Non-linear classification
Evaluating classifiers and ensemble methods

Correlated data in one dimension
Hidden Markov models
Local regression