Extending the Linear Model with R: Generalized Linear, Mixed Effects and Nonparametric Regression Models

Julian J. Faraway
Publisher: Chapman & Hall/CRC
Publication Date: 2017
Number of Pages: 399
Format: Hardcover
Edition: 2
Series: Texts in Statistical Science
Price: 99.95
ISBN: 9781498720960
Category: Textbook

[Reviewed by Robert W. Hayden, on 11/28/2017]

R is a programming language for statistics. Julian Faraway appears to have an excellent reputation within the R community, at least if we judge by how often his work is cited approvingly. The work at hand seems to bear this out, and suggests that he offers sage advice, not only on how to code things in R, but also on what to code. To explain that assessment, we need first to explain what this book is about.

Many MAA members will have taught or taken an introductory statistics course. There one might study “regression,” which in that context generally means fitting lines to bivariate data. Students often get the impression that the process is called “linear” because we fit lines, but statisticians generally mean something else by “linear” here. The idea is that the equations we have to solve to find the slope and intercept are linear. A good point of reference is an exercise commonly set when students are learning to solve systems of linear equations in more than two variables. The student might be given the coordinates of three points in the plane and asked to find the equation \(y=ax^2+bx+c\) of a parabola that passes through those points. The student then plugs the coordinates of each point in turn into the quadratic template and obtains three linear equations in \(a\), \(b\) and \(c\).
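That exercise is easy to reproduce in R. A minimal sketch, using three made-up points rather than anything from the book:

    # Three points the parabola must pass through (invented for illustration)
    x <- c(1, 2, 3)
    y <- c(2, 3, 6)
    # Each point gives one linear equation a*x^2 + b*x + c = y,
    # so the coefficient matrix has columns x^2, x, and 1
    A <- cbind(x^2, x, 1)
    solve(A, y)  # a = 1, b = -2, c = 3, i.e. y = x^2 - 2x + 3

The unknowns are the coefficients, not \(x\): the system is linear in \(a\), \(b\) and \(c\) even though the curve is not a line.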

A statistician would consider fitting a quadratic (or any polynomial) to data an instance of linear regression because the equations we solve for the coefficients are linear. (Of course, a student in a first course rarely sees these equations, only a formula for the solution.) The real next step is not fitting curves but fitting data in higher dimensions. This is much more complicated, and generally takes up an entire course in (linear) multiple regression. This book covers topics beyond that course. In addition to assuming the reader has taken an introductory statistics course and a regression course, the author assumes the reader is comfortable with the vocabulary and notation of matrix algebra, though of course R will handle all the computations. R is a good choice here, as it could be hard to find another program that covers all the many techniques discussed in this book. No prior knowledge of R is assumed, but experience with some programming language will be very helpful.
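In R the point is transparent: a quadratic fit is still a call to the linear-model function lm(). A minimal sketch on simulated data (the data and variable names are ours, not the book's):

    set.seed(1)
    x <- runif(50, 0, 10)
    y <- 2 + 0.5 * x - 0.1 * x^2 + rnorm(50)  # quadratic truth plus noise
    fit <- lm(y ~ x + I(x^2))  # "linear" regression: linear in the coefficients
    coef(fit)                  # estimates of the intercept and two coefficients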

Faraway groups his extensions of multiple regression into three classes. For the sake of brevity, this review will simplify and paraphrase those classes, we hope without misrepresenting the author’s intent. The first class relaxes the assumption, implicit in all of the foregoing, that for each fixed combination of values of the independent variables, \(y\) is a random variable following a normal distribution. Perhaps the simplest example that does not meet that assumption is predicting a binary outcome, such as the winner of the World Series or the survival of a patient. Two outcomes are a far cry from the possible values of a normal distribution, which have the cardinality of the continuum. The book covers many other sorts of failures of the normality assumption as well, and many remedies.
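The workhorse for the binary case is logistic regression, which the book treats early on. A hedged sketch using R's glm() on simulated data (the dose-response setup is our invention, not the book's heart disease example):

    set.seed(2)
    dose <- runif(100, 0, 5)                        # hypothetical predictor
    p <- plogis(-2 + 1.2 * dose)                    # true probability of survival
    survived <- rbinom(100, 1, p)                   # binary response: 0 or 1
    fit <- glm(survived ~ dose, family = binomial)  # logistic regression, logit link
    summary(fit)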

The second class of extensions is made up of situations where the observations are not independent. Classical examples are time series and repeated measures, such as a patient’s blood pressure taken at various times over the course of a year.
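Mixed-effects models are the book's main tool for this class. A minimal sketch with the lme4 package, one standard R choice for such models, on simulated blood-pressure data of our own invention; a random intercept per patient captures the within-patient dependence:

    library(lme4)
    set.seed(3)
    patient <- factor(rep(1:20, each = 5))       # 20 patients, 5 visits each
    month <- rep(c(1, 3, 6, 9, 12), times = 20)  # visit times in months
    patient_eff <- rnorm(20, sd = 8)             # patient-specific baseline shift
    bp <- 120 + patient_eff[as.integer(patient)] + 0.2 * month + rnorm(100, sd = 5)
    fit <- lmer(bp ~ month + (1 | patient))      # random intercept per patient
    summary(fit)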

The final category is “nonparametric” regression. Often in statistics “nonparametric” is taken to mean that we make no assumptions about population parameters, but here it means we do not even try to estimate parameters. As a simple example, fitting a quadratic equation to bivariate data involves assuming the true relation is indeed quadratic and estimating the coefficients of the quadratic. But in many an engineering lab, a smooth curve may be fit to data by eye, perhaps with the aid of French curves. This means we have no equation for the curve, but we may be content with graphical predictions or interpolations. If that seems like a big disadvantage, bear in mind that fitting a parametric model requires that we know the right form of equation to fit. Nonparametric fitting lets the data determine the curve. This third class includes some data science methods such as regression trees and neural networks.
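A smoothing spline is the computer's version of the French curve. A minimal base-R sketch on simulated data (ours, not the book's); the fit produces a curve but no reportable equation:

    set.seed(4)
    x <- sort(runif(100, 0, 2 * pi))
    y <- sin(x) + rnorm(100, sd = 0.3)  # smooth truth, no polynomial form assumed
    fit <- smooth.spline(x, y)          # smoothness chosen by cross-validation
    plot(x, y)
    lines(predict(fit), col = "red")    # the data-driven curve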

Now that we know what the book is about, let us examine what sort of book it is. It is probably not a textbook for the material covered, which is very diverse. Instead, it might profitably be used as a lab manual in conjunction with one or more textbooks on the methods. Numerous references are included in each chapter to classic texts and articles. The book also makes an excellent reference for the non-specialist in these areas. In addition to the references, the reader gets a brief summary of the main results and issues, followed by example R code for applying the methods. There is sufficient detail about R that one could figure out how to do these analyses oneself, but there is no systematic presentation of the language itself — just enough to deal with the situation du jour. For a student, the most useful part of this book may be the exercises, which could serve as a model for all other applied statistics textbooks. Rather than drill exercises, here the reader is typically asked to apply multiple methods to a real data set as a way of learning about both the methods and the data. The examples in the text are similar, with many data sets well chosen to make an important point.

We end with an assortment of minor issues. The paper makes it hard to write notes or highlight without showing through to the other side. The text would benefit greatly from the inclusion of color graphics. There are too many typos. The author does not always separate exploration from inference. (Classical inference is for testing hypotheses generated before the data are gathered. There is nothing wrong with exploring data to generate new hypotheses, but those hypotheses normally need to be tested on new data.) The writing style is generally clear but terse and sometimes choppy.

Despite some minor flaws, this book is highly recommended as a reference, lab manual, or source of examples to extend book learning to real situations.


After a few years in industry, Robert W. Hayden (bob@statland.org) taught mathematics at colleges and universities for 32 years and statistics for 20 years. In 2005 he retired from full-time classroom work. He contributed the chapter on evaluating introductory statistics textbooks to the MAA’s Teaching Statistics.

Table of Contents

Introduction

Binary Response
Heart Disease Example
Logistic Regression
Inference
Diagnostics
Model Selection
Goodness of Fit
Estimation Problems

Binomial and Proportion Responses
Binomial Regression Model
Inference
Pearson’s χ² Statistic
Overdispersion
Quasi-Binomial
Beta Regression

Variations on Logistic Regression
Latent Variables
Link Functions
Prospective and Retrospective Sampling
Prediction and Effective Doses
Matched Case-Control Studies

Count Regression
Poisson Regression
Dispersed Poisson Model
Rate Models
Negative Binomial
Zero Inflated Count Models

Contingency Tables
Two-by-Two Tables
Larger Two-Way Tables
Correspondence Analysis
Matched Pairs
Three-Way Contingency Tables
Ordinal Variables

Multinomial Data
Multinomial Logit Model
Linear Discriminant Analysis
Hierarchical or Nested Responses
Ordinal Multinomial Responses

Generalized Linear Models
GLM Definition
Fitting a GLM
Hypothesis Tests
GLM Diagnostics
Sandwich Estimation
Robust Estimation

Other GLMs
Gamma GLM
Inverse Gaussian GLM
Joint Modeling of the Mean and Dispersion
Quasi-Likelihood GLM
Tweedie GLM

Random Effects
Estimation
Inference
Estimating Random Effects
Prediction
Diagnostics
Blocks as Random Effects
Split Plots
Nested Effects
Crossed Effects
Multilevel Models

Repeated Measures and Longitudinal Data
Longitudinal Data
Repeated Measures
Multiple Response Multilevel Models

Bayesian Mixed Effect Models
STAN
INLA
Discussion

Mixed Effect Models for Nonnormal Responses
Generalized Linear Mixed Models
Inference
Binary Response
Count Response
Generalized Estimating Equations

Nonparametric Regression
Kernel Estimators
Splines
Local Polynomials
Confidence Bands
Wavelets
Discussion of Methods
Multivariate Predictors

Additive Models
Modeling Ozone Concentration
Additive Models Using mgcv
Generalized Additive Models
Alternating Conditional Expectations
Additivity and Variance Stabilization
Generalized Additive Mixed Models
Multivariate Adaptive Regression Splines

Trees
Regression Trees
Tree Pruning
Random Forests
Classification Trees
Classification Using Forests

Neural Networks
Statistical Models as NNs
Feed-Forward Neural Network with One Hidden Layer
NN Application
Conclusion

Appendix A: Likelihood Theory
Appendix B: About R

Bibliography

Index