I first learned of reproducible research around 1997 or so, in the WaveLab documentation. The idea is that all your results should be easily reproduced by somebody (including yourself six months down the road) with no ambiguity, missed steps, or fudges. At the time it seemed like such an obvious thing to do that I wondered why it wasn’t a universal approach. Trying to apply the ideas with the mediocre tools available at the time showed me why it wasn’t, but I always thought the objective was noble.
Fast forward (what feels like a million technological years) to when I first heard about knitr; I think it was about 2013. knitr is a package for use with R, and integrates beautifully with RStudio. It allows for weaving together text, R code, and the results of R calculations into one document. I eagerly installed it and started to use it productively, with very little pain. I have been an enthusiastic, and I thought relatively knowledgeable, user ever since. I was wrong about the latter.
The book under review, written by the creator of the knitr package, is a gold mine of ideas: things I had no idea knitr could do (integrate with other languages like Python), and tricks to get around some of the awkward things I needed to do (moving all the code to an appendix for tech-fearful readers). It also explains the guts of the system, and is especially informative about how knitr can cache the results of time-intensive calculations, so that they do not have to be rerun each time you compile the document if nothing they depend on has changed.
The book is well written, but some details are a little complex, so it is best to read it in front of a computer so you can try things out. The main problem with this book, as with all technology books, is that it is already a little outdated; you can get more up-to-date documentation on the web. On the other hand, it is hard to find things on the web that you don’t know to look for, so the book is very useful for showing you what knitr can do that you would never have thought to search for.
In an industrial context the data you are analyzing is frequently badly messed up, so you must repeatedly obtain new versions of the data and update the results. In this scenario, knitr has proved hugely useful, allowing reports to be updated with one push of a button. Some of the techniques in the book will make this process even more sophisticated (e.g., change a parameter and get slides as output rather than a document). I am looking forward to implementing them in my workflow, with this book as my guide.
Peter Rabinovitch is a Senior Performance Engineer at Akamai, and has been doing data science since long before “data science” was a thing. knitr has saved his butt many times.