“At the source of every error which is blamed on the computer
you will find at least two human errors, including the error of
blaming it on the computer.”
— From the “fortunes” database, found on many Unix-like computers
They say you can’t judge a book by its cover, but what about its dust jacket? In this case it’s luridly designed with a stylized skull and crossbones, a sensational title, and even a scapegoat: mathematics. On the face of it we seem to have some fearless yellow journalism, an impression enhanced by the fact that the dust jacket is indeed yellow.
Fortunately for us, and for mathematics, the content is a reasonable editorial, not on mathematics, but on the various ways in which the author feels that “big data” is misused today. There is an introduction, ten chapters, and a conclusion. Each chapter discusses a particular area in which “big data” is commonly used. Most MAA members teach at the college level, and so might be interested in the coverage here of teacher evaluations and ratings of colleges.
If you look at a data science textbook, the idea there is that we search through masses of data to find useful correlations that help us make predictions. O’Neil charges that variables are often selected on the basis of human prejudices, but does not offer much to substantiate that charge. What may well be true is that variables based on prejudice are considered as possibilities, but it’s unlikely they would be retained if they prove to have no predictive power. In fact, Data Science may prove an enemy of prejudice by eliminating such variables on the basis of evidence.
On the other hand, the author gives little attention to the old adage that “correlation is not causation.” Data Science models are used for making predictions rather than unearthing causal connections. If the data show that people who buy a certain brand of dog food also often buy red wine, the Data Scientist will not ask why, but simply provide those buying that brand with coupons for red wine when they pay for their dog food. In such applications, it may be worthwhile to do so even if we get only two additional sales per 1000 coupons.
But things look quite different if we deny a person a job, or contract renewal or a mortgage or parole from prison, based on such a model. Would we be comfortable firing 1000 teachers to get rid of two incompetents? Many of the examples in this book concern such high-stakes situations. These are probably not typical uses of Data Science, yet they are important because their errors are so costly. Which brings us back to the “fortunes” quote. Is it the techniques that are at fault or the use humans make of them?
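The contrast between the coupon case and the firing case comes down to simple arithmetic, which can be sketched as follows. The hit rate of two in 1000 is the one used above; every dollar figure is a hypothetical value invented purely for illustration and comes neither from the book nor from any real study.

```python
def net_value(hits, misses, gain_per_hit, cost_per_miss):
    """Net value of acting on a model's predictions.

    hits   -- number of actions where the prediction was right
    misses -- number of actions where the prediction was wrong
    """
    return hits * gain_per_hit - misses * cost_per_miss


# Two correct predictions per 1000 actions, as in the examples above.
HITS, MISSES = 2, 998

# Hypothetical low-stakes case: a wasted coupon costs about a cent,
# while an extra bottle of wine sold brings in a few dollars of profit.
coupons = net_value(HITS, MISSES, gain_per_hit=10.00, cost_per_miss=0.01)

# Hypothetical high-stakes case: wrongly firing a competent teacher is
# assumed to do as much harm as removing an incompetent one does good.
firings = net_value(HITS, MISSES, gain_per_hit=50_000, cost_per_miss=50_000)

print(f"coupons: {coupons:+,.2f}")
print(f"firings: {firings:+,.2f}")
```

Under these assumed figures the same hit rate yields a small profit for coupons and a very large loss for firings. The point is not the particular numbers but that the cost of a wrong prediction, not the accuracy alone, decides whether acting on the model is defensible.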
The author makes a couple of novel points about feedback in Data Science models. The first is that many of these models are applied in a way that does not allow verification of the accuracy of the predictions. Unfortunately, this is most common for high-stakes models. The person handing out wine coupons probably will keep track of whether anyone uses them. But we will never know if turning Lynn down for a mortgage was the right decision.
Then there is what I will call the self-fulfilling prophecy feedback loop. Perhaps an initial study shows high crime rates in a poor neighborhood, so more police are sent there. More police will mean more crimes are detected, thereby raising the (reported) crime rate further. The rise in crime rate will make it harder for those who live in the neighborhood to get jobs or mortgages if those are based on models that include neighborhood crime rates. Then getting turned down for a mortgage may lead to higher auto insurance rates. Soon the poverty that led to high crime rates in the first place will increase, and lead to still higher crime rates. Is this a consequence of chasing correlations and ignoring causes?
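The first half of that loop, in which more police produce more reported crime, can be illustrated with a toy simulation. This is only a sketch under assumptions of my own: the function name, the parameters, and all the numbers are hypothetical, the underlying amount of crime is held fixed, and the later links in the chain (jobs, mortgages, insurance, poverty) are not modeled at all.

```python
def simulate_reported_crime(true_crimes, patrols, rounds,
                            detect_per_patrol=0.05,
                            patrols_per_report=0.5):
    """Toy model of the police-detection feedback loop.

    true_crimes        -- crimes actually committed each round (held constant)
    patrols            -- initial number of patrols in the neighborhood
    detect_per_patrol  -- assumed fraction of crimes each patrol detects
    patrols_per_report -- assumed extra patrols sent per reported crime
    Returns the list of reported crime counts, one per round.
    """
    reported = []
    for _ in range(rounds):
        # More patrols detect a larger share of the same crimes (capped at 100%).
        detection = min(1.0, patrols * detect_per_patrol)
        seen = true_crimes * detection
        reported.append(seen)
        # The reported figure, not the true one, drives the next allocation.
        patrols += patrols_per_report * seen
    return reported


history = simulate_reported_crime(true_crimes=100, patrols=2, rounds=4)
print(history)
```

In this sketch the reported figure climbs from round to round even though the number of crimes committed never changes, which is the sense in which the prophecy fulfills itself.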
Some of the online comments on this book dismiss it as mere politics. Certainly the Conclusion chapter places the author near Bernie Sanders on the political spectrum, which will not be a surprise after reading the rest of the book. Yet, like National Public Radio or Consumers Union, O’Neil provides a valuable service by attempting an honest presentation of the facts. After reading this book from cover to cover, I have to think that the critics are unable to listen to someone who does not fully agree with them, that is, to those from whom they might learn something.
There are 34 pages of references, but they favor blogs, mass media, and reports from advocacy groups. There is a seven-page index. The book as a whole is highly recommended for the issues it raises, whether or not one agrees with all the author’s points or her proposed solutions.
After a few years in industry, Robert W. Hayden (firstname.lastname@example.org) taught mathematics at colleges and universities for 32 years and statistics for 20 years. In 2005 he retired from full-time classroom work. He now teaches statistics online at statistics.com and does summer workshops for high school teachers of Advanced Placement Statistics. He contributed the chapter on evaluating introductory statistics textbooks to the MAA's Teaching Statistics.