This book presents 15 real-world applications of data mining with the R language. These applications have the following properties:
Topics: They span a broad spectrum of topics, including web traffic, sociological analysis of Twitter messages, digital libraries, marketing optimization, customer profiling, bank loan default prediction, image classification, crime analysis, and sports.
Data: They span a broad spectrum of data challenges in terms of data size, data type, data mining goals, and analytic methodologies.
Techniques: They illustrate a variety of visual and analytic techniques, including handling of missing values, outlier detection, missing-value imputation, correlation coefficient analysis, decision trees, and model selection.
Readers who wish to evaluate the material further before deciding whether to use the book can find free R code, tutorials, and data for the book at the RDataMining website.
The target audience of this book includes industry data miners, data analysts, and data scientists. The book can be used as a primary or secondary text in industrial training courses on data mining. It can also be used as a secondary support text in university courses related to data mining.
It should be emphasized that methods for working with data have expanded tremendously in the past few decades; consequently, our teaching of statistics should be adjusted to deal with what students will find in the real world. The book contains a wealth of modern material that should be covered in more depth in statistics courses: for example, missing data, outlier detection, missing-value imputation, correlation coefficient matrices, principles of model selection, text mining, and decision trees. The book can be used as the basis for a project in a two- or four-week module in a second-semester statistics course. It could similarly be used in a first-semester statistics course; although one would have to cut some material to make room for a two-week project module, the resulting experience for the students makes this well worthwhile. The book is very well suited for projects; however, since it lacks exercises, it should not be used as a primary text in such courses.
A key feature of R is that, unlike other statistical software such as Excel, Minitab, and SAS, it is free. Besides being free, R is expandable, with over 4000 packages supported by R communities around the world. R is widely used for both statistics and data mining in scientific and business applications. We can make the following comparisons:
· Excel: Although Excel is accessible, being part of the Microsoft Office suite, it is limited in the additional statistical packages it offers (for example, Excel is poor for text mining and decision trees).
· Minitab: Minitab, while student-friendly and almost completely menu-driven, does not offer the rich set of statistical functions available in SAS and R, such as functions for outlier detection, decision trees, and text mining.
· SAS: The SAS language, www.sas.com, is a good alternative to R, since SAS offers a generous free package for academics, including tutorials and learning materials (see http://www.sas.com/en_us/industry/higher-education/on-demand-for-academics.html). However, SAS itself is not free, whereas R can be used on a local machine and is always available (e.g., after the course).
The book covers many popular, recent packages; many were written, or rest on theory developed, since 2010. The book presents detailed examples of topics, for example text mining and decision trees, that previously could only be handled with SAS.
Russell Jay Hendel, RHendel@Towson.Edu, holds a Ph.D. in theoretical mathematics and an Associateship from the Society of Actuaries. He teaches at Towson University. His interests include discrete number theory, applications of technology to education, problem writing, actuarial science and the interaction between mathematics, art and poetry.