When data scientist Cathy O’Neil gave a Distinguished Lecture at the MAA Carriage House on October 16, she asked her audience some pretty personal questions. For instance: “How much do you care about sexiness in your movies?”
O’Neil didn’t compel anyone to answer, however; she rhetorically inquired after listeners’ enjoyment of racy films only to help explain how websites come up with product recommendations for customers.
In “Start Your Own Netflix,” which she characterized as “a straight-up, data-is-awesome, math-modeling-is-fun, let’s-think-about-how-we-do-these-things” talk, O’Neil described three different recommendation systems, discussed the massive amounts of data needed to make recommendation engines work, and indicated one way to circumvent the difficulties posed by big data.
A recommendation system is “the thing that tells you, ‘here’s what you might want to try next’”: Netflix does it for movies, Spotify for songs, Google News for headlines. O’Neil touched on three approaches: latent factor analysis, covisitation, and latent topic analysis. Latent factor analysis involves huge matrices you never actually construct; covisitation creates clusters of users based on similarities in their viewing/listening/reading histories; and latent topic analysis has, as she put it, “a lot of Bayesian statistics behind it.”
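For readers who want to see the latent factor idea in miniature, here is a rough Python sketch. It is an illustration under stated assumptions, not O’Neil’s own code: the ratings are made up, the function names are hypothetical, and stochastic gradient descent is only one common way to fit such a model. It does echo her remark about matrices, though: the full user-by-item table is never built, and only the ratings that actually exist are touched.

```python
import random

def dot(a, b):
    """Inner product of two equal-length factor vectors."""
    return sum(x * y for x, y in zip(a, b))

def train_latent_factors(ratings, n_users, n_items, k=2,
                         steps=500, lr=0.01, reg=0.05, seed=0):
    """Fit k latent factors per user and per item by stochastic gradient descent.

    `ratings` is a list of (user, item, rating) triples. The full
    n_users x n_items matrix is never constructed; each update looks
    at one observed rating.
    """
    rng = random.Random(seed)
    users = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users)]
    items = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(steps):
        for u, i, r in ratings:
            err = r - dot(users[u], items[i])
            for f in range(k):
                uf, vf = users[u][f], items[i][f]
                # Move both factors to shrink the error; `reg` keeps them small.
                users[u][f] += lr * (err * vf - reg * uf)
                items[i][f] += lr * (err * uf - reg * vf)
    return users, items

def mean_squared_error(users, items, ratings):
    """Average squared gap between known ratings and the model's guesses."""
    return sum((r - dot(users[u], items[i])) ** 2
               for u, i, r in ratings) / len(ratings)

# Hypothetical data: 3 users, 4 movies, ratings on a 1-to-5 scale.
ratings = [(0, 0, 5), (0, 1, 4), (0, 3, 1), (1, 0, 4),
           (1, 2, 2), (2, 1, 1), (2, 2, 5), (2, 3, 4)]

users, items = train_latent_factors(ratings, n_users=3, n_items=4)

# A guess for a pair that has no rating: user 0 and movie 2.
guess = dot(users[0], items[2])
print(round(guess, 2))
```

The parameter `k` is the number of taste dimensions the model is allowed to use; the toy data here is far too small to say anything about how many a real system needs.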
A surprising insight came to light as O’Neil conjured the unwieldy matrices of latent factor analysis: No matter what the context, the number of factors you care about is . . . wait for it . . . about 20.
“It’s kind of sad,” O’Neil said, “but we basically only have 20 dimensions of taste.”
O’Neil attributed this insight to friends at hunch.com. The website, which provides personalized recommendations based on users’ answers to a set of questions, has found that you have a consumer pretty well pegged after replies to just 20 questions.
O’Neil revisited this idea when she rattled off a list of caveats at the end of her talk. She questioned whether the data informing recommendation engines comes from a representative sample of the population, and seemed to doubt that the preferences of folks with the leisure to rate Netflix movies closely reflect those of the rest of us. She also mentioned the so-called filter bubble question, the idea that if “we’re just blindly walking around” making decisions based on suggestions floated by algorithms informed by our own experience, we’ll never be exposed to anything novel.
“I personally would like to have new experiences every now and then,” O’Neil said.
Before musing on the societal ramifications of recommendation engines, however, O’Neil outlined a less philosophical problem they pose. To do their job well, recommendation engines require more data than can fit on any one computer. But the more computers involved in an operation, the higher the probability of something going wrong, even if the likelihood of any particular machine failing is low.
“If you have 0.3 percent failure rate [per computer] and you have 1,000 computers, then you have a 95 percent chance of something failing,” O’Neil said. “And that means you’re not going to get the right answer.”
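The arithmetic behind that figure can be checked in a few lines of Python. The sketch below assumes that machines fail independently of one another, which the talk as reported here does not spell out, and the function name is ours.

```python
def prob_any_failure(per_machine_rate: float, machines: int) -> float:
    """Chance that at least one of `machines` independent machines fails.

    Every machine survives with probability (1 - rate), so all of them
    survive with probability (1 - rate) ** machines.
    """
    return 1.0 - (1.0 - per_machine_rate) ** machines

p = prob_any_failure(0.003, 1000)
print(f"{p:.1%}")  # prints 95.0%
```

With a single machine the same function simply returns the 0.3 percent rate; the trouble comes from multiplying a thousand small risks together.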
The situation sounds pretty dire, but . . . MapReduce to the rescue! MapReduce is a programming model for processing large data sets in parallel across a cluster of machines, and O’Neil made it sound almost magical.
In one implementation, engineers at Google “created a mechanism whereby you gain something and you lose something,” O’Neil said. “What you gain is the ability to forget about failure.”
What you lose, it turns out, is flexibility, but it’s hard to get sore about that imposition when someone has just promised to effectively banish the possibility of failure.
And, besides, at least according to O’Neil, the algorithms needed to implement the recommendation systems she’d discussed all prove amenable to the MapReduce treatment.
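To give a feel for the shape of a MapReduce job, here is a toy version in Python that runs on one machine. It is a sketch under loud assumptions: the viewing histories are invented, every function name is hypothetical, and nothing here is distributed or fault tolerant. A real framework would spread the map and reduce work over many machines and rerun any piece that failed. What the sketch does show is the constraint on flexibility: the whole computation has to be phrased as a map step that emits key-value pairs and a reduce step that combines the values for each key.

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply `mapper` to every record and collect the (key, value) pairs."""
    for record in records:
        yield from mapper(record)

def shuffle(pairs):
    """Group values by key, as a framework would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Combine the grouped values for each key into a single result."""
    return {key: reducer(key, values) for key, values in groups.items()}

def view_mapper(record):
    """Emit (title, 1) for every title in one user's viewing history."""
    _user, titles = record
    for title in titles:
        yield title, 1

def count_reducer(_title, ones):
    """Add up the 1s to get the number of viewers for a title."""
    return sum(ones)

# Hypothetical viewing histories: (user, list of titles watched).
histories = [
    ("ann", ["Alien", "Heat"]),
    ("bob", ["Alien"]),
    ("cy", ["Heat", "Alien"]),
]

counts = reduce_phase(shuffle(map_phase(histories, view_mapper)), count_reducer)
print(counts)  # {'Alien': 3, 'Heat': 2}
```

Because each call to `view_mapper` looks at one record and each call to `count_reducer` looks at one key, the pieces do not depend on one another, which is what lets a framework hand them to different machines.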
After offering that aforementioned list of caveats, O’Neil closed her talk by reiterating the enjoyment she derives from the no-holds-barred wrangling of big data sets: “That’s what’s cool about being a data scientist,” she said. “You have to be scrappy.” —Katharine Merow
This MAA Distinguished Lecture was funded by the National Security Agency.