Tidy Modeling with R
Version 0.0.1.9009 (2021-05-06)
This is the website for Tidy Modeling with R. This book is a guide to using a new collection of software in the R programming language for model building, and it has two main goals:
First and foremost, this book introduces how to use our software to create models. We focus on a dialect of R called the tidyverse that is designed to be a better interface for common tasks using R. If you’ve never heard of or used the tidyverse, Chapter 2 provides an introduction. In this book, we demonstrate how the tidyverse can be used to produce high-quality models. The tools used to do this are referred to as the tidymodels packages.
Second, we use the tidymodels packages to encourage good methodology and statistical practice. Many models, especially complex predictive or machine learning models, can work very well on the data at hand but may fail when exposed to new data. Often, this issue is due to poor choices made during the development and/or selection of the models. Whenever possible, our software, documentation, and other materials attempt to prevent these and other pitfalls.
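As a small illustration of the second goal, the sketch below shows how a model can look good on the data at hand and still do poorly on new data. It is not taken from the book: the simulated data, the helper names (simulate, rmse), and the choice of base R's lm() instead of the tidymodels packages are all assumptions made to keep the example self-contained.

```r
# Illustrative sketch with simulated data (not from the book).
set.seed(2021)

# Hypothetical data-generating process: a straight line plus noise.
simulate <- function(n) {
  x <- runif(n, min = 0, max = 1)
  data.frame(x = x, y = 1 + 2 * x + rnorm(n, sd = 0.5))
}

train <- simulate(20)       # the "data at hand"
new_data <- simulate(1000)  # data the models have never seen

# A simple model and a deliberately over-flexible one.
simple_fit <- lm(y ~ x, data = train)
flexible_fit <- lm(y ~ poly(x, degree = 10), data = train)

# Root mean squared error of a fitted model on a given data set.
rmse <- function(fit, data) {
  sqrt(mean((data$y - predict(fit, newdata = data))^2))
}

results <- data.frame(
  model = c("simple", "flexible"),
  rmse_train = c(rmse(simple_fit, train), rmse(flexible_fit, train)),
  rmse_new = c(rmse(simple_fit, new_data), rmse(flexible_fit, new_data))
)
print(results)
```

Because the flexible model contains the straight line as a special case, its training error can never be larger than that of the simple model. Its error on new data is not constrained in the same way and will typically be worse. Judging a model only by its fit to the training data is one of the poor choices that the paragraph above warns against.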
This book is not intended to be a comprehensive reference on modeling techniques; we suggest other resources to learn such nuances. For general background on the most common type of model, the linear model, we suggest Fox (2008). For predictive models, Kuhn and Johnson (2013) is a good resource. Kuhn and Johnson (2020) is also referenced heavily here, mostly because it is freely available online. For machine learning methods, Goodfellow, Bengio, and Courville (2016) is an excellent (but formal) source of information. In some cases, we describe the models used in this text, but in a way that is less mathematical and, hopefully, more intuitive.
Investigating and analyzing data is an important part of the modeling process, and an excellent resource on this topic is Wickham and Grolemund (2016).
We do not assume that readers have extensive experience in model building and statistics. Some statistical knowledge is required, including familiarity with random sampling, variance, correlation, basic linear regression, and other topics that are usually found in a basic undergraduate statistics or data analysis course.
Tidy Modeling with R is currently a work in progress. As we create it, this website is updated. Be aware that, until it is finalized, the content and/or structure of the book may change.
This openness also allows users to contribute if they wish. Most often, this comes in the form of correcting typos, grammar, and other aspects of our work that could use improvement. Instructions for making contributions can be found in the contributing.md file. Also, be aware that this effort has a code of conduct.
The tidymodels packages are fairly young in the software lifecycle. We will do our best to maintain backwards compatibility and, at the completion of this work, will archive and tag the specific versions of software that were used to produce it.
This book was written in RStudio using bookdown. The tmwr.org website is hosted via Netlify and automatically built after every push by GitHub Actions. The complete source is available on GitHub. We generated all plots in this book using ggplot2 and its black and white theme (theme_bw()). This version of the book was built with R version 4.0.5 (2021-03-31), pandoc version 2.7.3, and the following packages:
|package|version|source|
|nlme|3.1-152|CRAN (R 4.0.5)|
|rpart|4.1-15|CRAN (R 4.0.5)|
Fox, J. 2008. Applied Regression Analysis and Generalized Linear Models. 2nd ed. Thousand Oaks, CA: Sage.
Goodfellow, I, Y Bengio, and A Courville. 2016. Deep Learning. MIT Press.
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Kuhn, M, and K Johnson. 2020. Feature Engineering and Selection: A Practical Approach for Predictive Models. CRC Press.
Wickham, H, and G Grolemund. 2016. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O’Reilly Media, Inc.