Mathematics faculty, especially those of us who teach at smaller institutions, increasingly find ourselves called upon not only to provide instruction in statistics, but also to act as statistical consultants for students and colleagues in other departments and to direct student research in projects related to data analysis. Some of us find ourselves designing and teaching a second course in statistics, or even teaching the elements of the much-ballyhooed field of “Data Science.”
Data analysis these days is often conceptualized as an activity that draws on three areas of competence:
- statistics;
- substantive knowledge of the area of application (biology, sociology, etc.);
- computer programming/hacking skills.
Although the first area (statistics) requires a substantial change in perspective — it is not really a branch of mathematics at all — at least the methodology of inferential statistics employs reassuringly-heavy doses of familiar fields of mathematics such as linear algebra and probability theory. In addition, many mathematicians have gone before us into statistical territory, and are active participants in several energetic organizations (including the SIGMAA on Statistics Education and the United States Conference on the Teaching of Statistics) to help fellow mathematicians along. As for the area of substantive knowledge, we have learned to rely on our colleagues (and students) for guidance.
It’s really the third area — programming/hacking skills — that holds many of us back. These are not a timeless set of skills, and acquiring them is a matter of breadth, flexibility and patience, rather than depth, concentration and clever deduction. Hacking simply isn’t a part of traditional mathematical training. In many ways it runs counter to the mathematical temperament, so the sense of satisfaction attendant upon having, at last, wrangled one’s computer into Doing Something is very much an acquired taste, for most of us.
The tools with which we must acquire at least a passing familiarity are so many, and the initial hurdles (installation and configuration of free software, for instance) are potentially so frustrating, that we need a guide: someone who inspires us with what can eventually be accomplished, and who tries to help us past some of the initial points of blockage.
Christopher Gandrud, a political economist at the Hertie School of Governance in Berlin, aims to act as such a guide. His text is organized around the concept of reproducible research.
According to Roger Peng, a biostatistician at the Bloomberg School of Public Health at Johns Hopkins University, research in the computational sciences is said to be reproducible provided that: “the data and [computer] code used to make a finding are available and they are sufficient for an independent researcher to recreate the finding.” Although this definition addresses the reliability of scientific research, it also has implications for the practical conduct of research. Conducting one’s research in a reproducible way makes for
- better work habits (it’s much easier to resume your work after a break, or revise it much later);
- easier collaboration with colleagues and more effective teaching when you are advising students on projects;
- enhanced impact of finished research (your work gets more attention when others can access it easily).
The three basic components of reproducible research in data analysis are:
- a programming environment for statistical analysis;
- a system for gathering, storing and accessing data, and for recording various versions of one’s data and one’s analysis;
- a system for communicating the results of analysis.
For each of these three components one can choose between several tools, but Gandrud emphasizes a set of options that has become overwhelmingly popular in statistics and the natural and social sciences.
For a programming environment he recommends R, along with the Integrated Development Environment made by RStudio. R is a statistical programming language that was built to resemble Bell Labs' S, but unlike S it is entirely free software and is extensible through various convenient web-based repositories.
To store and access data Gandrud distinctly favors the version control system known as Git along with the popular web service GitHub. Git and GitHub require a greater learning investment than standard cloud storage services such as Dropbox, but they offer considerably greater flexibility and power. The reproducible research novice may begin with Dropbox but will find herself moving to Git after a few months, especially since it has been integrated conveniently into RStudio. In addition, new contributed R packages permit applications written in R to be run directly from their GitHub repositories.
For communication of results Gandrud recommends the remarkable R package known as knitr, which implements the paradigm of literate programming, that is, the weaving of text and computer code into a single text file that is then processed (“knit”) into any of several chosen formats: HTML, PDF or even a Word document. The source file is a data report that contains all of the code necessary to read in the data, run statistical routines, produce graphs, etc., but also includes one’s analysis and interpretation. The knitted result is a complete and polished report, whereas the source file is a record that permits a colleague to see exactly how one arrived at one’s numerical results, and to reproduce these results exactly.
The literate programming paradigm is remarkably convenient. If you find a mistake in your data, you do not have to re-run all of your routines, make all-new graphs and insert the results into your report. You simply edit the data file and push a button to re-knit your report. Reports can be distributed as hard-copies or published immediately to the web. They need not be static documents, either. With appropriate formatting they can be knit into interactive documents that can be read by anyone with a computer that runs R, or that can be hosted on the web by special “Shiny” servers that run R on the back end.
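For readers who have not seen a knitted report, the following is a minimal sketch of the workflow in R, not an example taken from Gandrud's text. It assumes that the rmarkdown package (which calls knitr behind the scenes) and pandoc are installed, as they are in a standard RStudio setup. The file name report.Rmd and the use of R's built-in cars data are illustrative choices only.

```r
# A minimal sketch: write a small R Markdown source file, then knit it.
# Assumes the rmarkdown package is installed (it calls knitr and pandoc).
library(rmarkdown)

# Text and code live together in one source file.
# The title, chunk label and file name below are hypothetical.
source_lines <- c(
  "---",
  "title: \"Stopping distances\"",
  "output: html_document",
  "---",
  "",
  "The mean speed in the built-in cars data is `r round(mean(cars$speed), 1)` mph.",
  "",
  "```{r scatterplot}",
  "plot(dist ~ speed, data = cars)",
  "```"
)
rmd_file <- file.path(tempdir(), "report.Rmd")
writeLines(source_lines, rmd_file)

# Knit the source into an HTML report; render() returns the output path.
html_file <- render(rmd_file, quiet = TRUE)
print(html_file)
```

If the data or the code in the source file changes, calling render() again regenerates the number in the text and the graph together. In RStudio the Knit button plays the same role for a source file that is open in the editor.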
Remarkably, all of the above tools are completely free. (R is free by design. RStudio exists in free and commercial versions. GitHub charges a monthly fee only for private repositories. Shiny servers come in free and commercial versions as well.) Students therefore have access to all of them. Nowadays when I advise an honors thesis or summer research project involving data analysis, I make sure that the student reviews R and we always begin with a Git and GitHub tutorial.
I entered the world of data computing with great reluctance, and only because I saw the benefits of teaching statistics with R, even for elementary students. The first edition of Reproducible Research with R and RStudio was an invaluable companion in the early stages of my journey, and I trust that the second edition will be equally useful to aspiring data analysts.
Addendum: The following additional resources may be of interest to fellow instructors:
- For insights into the R language beyond basic procedural programming and statistical commands, consult The Art of R Programming by Norman Matloff, and then Advanced R by the noted R developer Hadley Wickham (available free online at http://adv-r.had.co.nz/).
- R has a steep learning curve, but when suitably modified it may be used to teach undergraduates, even non-majors in elementary courses. Most of the necessary mediation is accomplished by Project Mosaic (http://mosaic-web.org/), which has developed the mosaic package for R (https://cran.r-project.org/web/packages/mosaic/index.html). An even more simplified approach is implemented in the tigerstats package (https://cran.r-project.org/web/packages/tigerstats/index.html).
- Students will learn a bit of R in your statistics courses, but if you want them to sharpen their R skills and become acquainted with advanced data collection, Git and GitHub, then you need not design instructional materials all on your own. Refer students to Coursera’s Data Science Specialization (https://www.coursera.org/specialization/jhudatascience/1/courses). The courses themselves may be taken for free (with no certificate earned), and they are excellent as assignments to students prior to a research project, or as part of a data analysis capstone course.
Homer White is Professor of Mathematics at Georgetown College, in Kentucky. A typical Jack-of-All-Trades small-college mathematician, he enjoys the teaching of statistics at all levels, statistical consultation, and even institutional research. His interests and occasional forays into research in the history of mathematics include the geometrical works of Leonhard Euler and the mathematics of classical India.