Richard De Veaux wants everyone in the world to love statistics. An ambitious goal, but the “strange, whirlwind, and personal view of data mining” that De Veaux offered at a recent MAA lecture spoke powerfully to the Williams College professor’s ability to improve public opinion of this much-misunderstood discipline. De Veaux had the audience laughing, learning, and—perhaps most importantly—leaving with the impression that statistics is neither boring nor sinister, but lively and readily deployable for the common good.
Titled “Data Mining: Fool’s Gold? Or the Mother Lode?” De Veaux’s talk was not only part of the MAA’s NSA-supported Distinguished Lecture Series, but also a celebration of Mathematics Awareness Month, jointly sponsored by the American Statistical Association.
De Veaux started simply, with a Google search for “data mining” and Dilbert cartoons to illustrate “the search for hidden patterns in lots of data.” It’s making models and predictions—doing statistics, in other words—on “big data,” with bigness open to interpretation, he noted.
De Veaux was so bowled over at a recent conference by the volume of data—petabytes an hour—informing scientists’ hunt for the Higgs boson (a particle, or set of particles, that might give others mass) that he scaled back claims about the size of the data sets he analyzes. “I changed my slides quickly,” he recalled. “I crossed out ‘big’ and I said, ‘well, medium-sized.’”
Exploratory statistics on big data—that’s what data mining is. What it isn’t—contrary to the wishful thinking of CEOs or the ease with which characters on the television show 24 milk information from vast seas of numbers—is automatic or instantaneous. Executives want to “buy some software . . . push F6, and have knowledge start spewing out immediately,” De Veaux said, but it takes time, thought, and, on occasion, even humanity to coax secrets out of spreadsheets.
De Veaux used the tools of his trade to illuminate some mysteries. He showed how a simple model called a decision tree can yield insight into which variables are most predictive of a certain outcome. To gauge the likelihood that a particular passenger on the ill-fated Titanic would survive the ship’s sinking, for example, it is most important to know the passenger’s gender. If the passenger was male, the next most predictive piece of information is whether he was an adult or a child at the time of the luxury liner’s maiden voyage. A decision tree is simple, De Veaux conceded, but it often tells a story, as can other tools in statistics.
For the story behind the data charges his daughter, Scyrine, accumulated during her first year at Wesleyan University, De Veaux turned to a mosaic plot. He transformed records readily obtainable from Verizon into a colorful array of rectangles that encoded visually which numbers Scyrine texted and when. “What do you do at five?” he asked his daughter, his interest piqued by a precipitous drop in messaging at that hour.
De Veaux’s data mining didn’t tell him exactly what Scyrine was up to at 5 p.m. (turns out that’s when she usually went to the gym); it just helped him spot an aberration, identifying an area of possible interest.
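The aggregation behind a plot like that is simple to sketch. The snippet below uses invented message timestamps (not the actual Verizon records) with a deliberate lull at 5 p.m., and flags the quietest hour as the aberration worth asking about.

```python
# Sketch of the idea behind the mosaic-plot anecdote: tally texts per
# hour and surface the hour that stands out. Timestamps are invented.
from collections import Counter

# Hypothetical message hours (24-hour clock); every hour appears twice
# except 17, where there is a single message.
hours = [12, 12, 13, 13, 14, 14, 15, 15, 16, 16, 17,
         18, 18, 19, 19, 20, 20, 21, 21, 22, 22]
counts = Counter(hours)

# The hour with the fewest messages is the anomaly to investigate.
quiet_hour = min(counts, key=counts.get)
print(quiet_hour)  # 17, i.e., 5 p.m.
```

As in the talk, the computation doesn't explain the drop (the gym, as it turned out); it only points a human toward the question.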
Decision trees, mosaic plots, bootstrap aggregation, and random forests—these tools and techniques can narrow a field of hundreds or thousands of variables down to a promising handful. They can help zero in on variables useful in predicting whether the recipient of a sheet of address labels from Paralyzed Veterans of America will donate money to the charity or whether an ingot of aluminum destined to become a Boeing Dreamliner will crack when rolled into sheeting.
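The variable-narrowing step can be sketched with a random forest (an ensemble built on bootstrap aggregation). The data below are synthetic, with only two of ten candidate columns actually driving the outcome; the forest's feature importances recover them.

```python
# Hedged sketch: ranking candidate variables with a random forest.
# The dataset is synthetic; only columns 0 and 1 affect the outcome.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))            # 10 candidate variables
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # outcome depends on only two

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Importances for the two informative columns dwarf the noise columns,
# narrowing ten variables down to a promising handful.
ranked = np.argsort(forest.feature_importances_)[::-1]
print(ranked[:2])  # the two most predictive variables
```

This is the automated half of the process; deciding what to do with the shortlisted variables is the "thinking" half De Veaux insists on.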
However, De Veaux stressed, “data mining always involves thinking.” Solutions arise most often when one alternates between number crunching and big-picture consideration of what makes sense in the real-world context of the problem.
De Veaux summarized it this way: “Good data + Domain knowledge + Data mining + Thinking = Success.”
This MAA Distinguished Lecture was funded by the National Security Agency.