
Data Mining: Fool's Gold? Or the Mother Lode?

Richard De Veaux wants everyone in the world to love statistics. An ambitious goal, but the “strange, whirlwind, and personal view of data mining” that De Veaux offered at a recent MAA lecture spoke powerfully to the Williams College professor’s ability to improve public opinion of this much-misunderstood discipline. De Veaux had the audience laughing, learning, and—perhaps most importantly—leaving with the impression that statistics is neither boring nor sinister, but lively and readily deployable for the common good.   

Titled “Data Mining: Fool’s Gold? Or the Mother Lode?” De Veaux’s talk was not only part of the MAA’s NSA-supported Distinguished Lecture Series, but also a celebration of Mathematics Awareness Month, jointly sponsored by the American Statistical Association.

De Veaux started simply, with a Google search for “data mining” and Dilbert cartoons to illustrate “the search for hidden patterns in lots of data.” It’s making models and predictions—doing statistics, in other words—on “big data,” with bigness open to interpretation, he noted.

At a recent conference, De Veaux was so bowled over by the volume of data (petabytes an hour) informing scientists’ hunt for the Higgs boson, a particle or set of particles that might give other particles mass, that he scaled back claims about the size of the data sets he analyzes. “I changed my slides quickly,” he recalled. “I crossed out ‘big’ and I said, ‘well, medium-sized.’”

Exploratory statistics on big data: that’s what data mining is. What it isn’t, contrary to the wishful thinking of CEOs or the ease with which characters on 24 milk information from vast seas of numbers, is automatic or instantaneous. Executives want to “buy some software . . . push F6, and have knowledge start spewing out immediately,” De Veaux said, but it takes time, thought, and on occasion even humanity to coax secrets out of spreadsheets.

Data mining got its start in the field of customer relationship management, but today it finds application everywhere, according to De Veaux: in fraud detection and credit scoring, genomics and clinical trials, national security, and naturally, social networks. “Google and Facebook are constantly changing their privacy policy because they want all of your data,” said De Veaux. “It’s not clear yet what they’re going to do with all of this.”

De Veaux used the tools of his trade to illuminate some mysteries. He showed how a simple model called a decision tree can yield insight into which variables are most predictive of a certain outcome. To gauge the likelihood that a particular passenger on the ill-fated Titanic would survive the ship’s sinking, for example, it is most important to know the passenger’s gender. If the passenger was male, the next most predictive piece of information is whether he was an adult or a child at the time of the luxury liner’s maiden voyage. A decision tree is simple, De Veaux conceded, but it often tells a story, as can other tools in statistics.
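For readers curious how a decision tree decides which variable matters most, here is a minimal sketch in Python. It is not De Veaux's analysis: the passenger records are made up for illustration (they are not the real Titanic figures), the `ticket_class` variable is an added assumption, and the choice of Gini impurity as the scoring rule is also an assumption, since the lecture did not specify one. The idea is that the tree tries each variable, measures how much cleaner the survived/died groups become after splitting on it, and picks the variable with the biggest improvement.

```python
# Each made-up record is (sex, age_group, ticket_class, survived).
# These counts are invented for illustration; they are NOT real Titanic data.
def make(sex, age, ticket_class, n_survived, n_died):
    return ([(sex, age, ticket_class, True)] * n_survived
            + [(sex, age, ticket_class, False)] * n_died)

passengers = (
    make("male", "adult", "first", 6, 12)
    + make("male", "adult", "third", 4, 28)
    + make("male", "child", "first", 3, 0)
    + make("male", "child", "third", 4, 3)
    + make("female", "adult", "first", 14, 1)
    + make("female", "adult", "third", 13, 7)
    + make("female", "child", "first", 1, 0)
    + make("female", "child", "third", 3, 1)
)

# Position of each variable inside a record.
FEATURES = {"sex": 0, "age_group": 1, "ticket_class": 2}


def gini(rows):
    """Gini impurity of the survived/died mix: 0 means the group is pure."""
    if not rows:
        return 0.0
    p = sum(1 for r in rows if r[3]) / len(rows)
    return 2 * p * (1 - p)


def gini_gain(rows, feature):
    """How much impurity drops when rows are grouped by one variable."""
    idx = FEATURES[feature]
    groups = {}
    for r in rows:
        groups.setdefault(r[idx], []).append(r)
    weighted = sum(len(g) / len(rows) * gini(g) for g in groups.values())
    return gini(rows) - weighted


def best_split(rows, features):
    """Pick the variable whose split reduces impurity the most."""
    return max(features, key=lambda f: gini_gain(rows, f))


# First question the tree asks about every passenger.
root = best_split(passengers, ["sex", "age_group", "ticket_class"])

# Next question, asked only of the male passengers.
males = [r for r in passengers if r[0] == "male"]
next_for_males = best_split(males, ["age_group", "ticket_class"])

print("first split:", root)
print("next split for males:", next_for_males)
```

With these invented counts the sketch happens to reproduce the ordering described in the talk, sex first and then age for male passengers, but only because the toy data were built that way. Real tree-building software repeats the same step on every branch and adds rules for when to stop.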

For the story behind the data charges his daughter, Scyrine, accumulated during her first year at Wesleyan University, De Veaux turned to a mosaic plot. He transformed records readily obtainable from Verizon into a colorful array of rectangles that encoded visually which numbers Scyrine texted and when. “What do you do at five?” he asked his daughter, his interest piqued by a precipitous drop in messaging at that hour. 

De Veaux’s data mining didn’t tell him exactly what Scyrine was up to at 5 p.m. (turns out that’s when she usually went to the gym); it just helped him spot an aberration, identifying an area of possible interest.

Decision trees, mosaic plots, bootstrap aggregation, and random forests: these tools and techniques can narrow a field of hundreds or thousands of variables down to a promising handful. They can help zero in on variables useful in predicting whether the recipient of a sheet of address labels from Paralyzed Veterans of America will donate money to the charity or whether an ingot of aluminum destined to become a Boeing Dreamliner will crack when rolled into sheeting.

However, De Veaux stressed, “data mining always involves thinking.” Solutions arise most often when one alternates between number crunching and big-picture consideration of what makes sense in the real-world context of the problem. 

De Veaux summarized it this way: “Good data + Domain knowledge + Data mining + Thinking = Success.”

Katharine Merow



This MAA Distinguished Lecture was funded by the National Security Agency.