8 A model workflow

In the previous two chapters, we discussed the recipes and parsnip packages. These packages can be used to prepare the data for analysis and fitting the model. This chapter introduces a new object called a model workflow. The purpose of this object is to encapsulate the major pieces of the modeling process (previously discussed in Section 1.3). The workflow is important in two ways. First, using a workflow object encourages good methodology since it is a single point of entry to the estimation components of a data analysis. Second, it enables the user to better organize their projects. These two points are discussed in the following sections.

8.1 Where does the model begin and end?

So far, when we have used the term “the model”, we have meant a structural equation that relates some predictors to one or more outcomes. Let’s consider again linear regression as an example. The outcome data are denoted as \(y_i\), where there are \(i = 1 \ldots n\) samples in the training set. Suppose that there are \(p\) predictors \(x_{i1}, \ldots, x_{ip}\) that are used in the model. Linear regression produces a model equation of

\[ \hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1x_{i1} + \ldots + \hat{\beta}_px_{ip} \]

While this is a linear model, it is only linear in the parameters. The predictors could be nonlinear terms (such as the \(log(x_i)\)).

The conventional way of thinking about the modeling process is that it only includes the model fit.

For some data sets that are straightforward in nature, fitting the model itself may be the entire process. However, there are a variety of choices and additional steps that often occur before the model is fit:

  • While our example model has \(p\) predictors, it is common to start with more than \(p\) candidate predictors. Through exploratory data analysis or using domain knowledge, some of the predictors may be excluded from the analysis. In other cases, a feature selection algorithm may be used to make a data-driven choice for the minimum predictor set for the model.
  • There are times when the value of an important predictor is missing. Rather than eliminating this sample from the data set, the missing value could be imputed using other values in the data. For example, if \(x_1\) were missing but was correlated with predictors \(x_2\) and \(x_3\), an imputation method could estimate the missing \(x_1\) observation from the values of \(x_2\) and \(x_3\).
  • It may be beneficial to transform the scale of a predictor. If there is not a priori information on what the new scale should be, we can estimate the proper scale using a statistical transformation technique, the existing data, and some optimization criterion. Other transformations, such as PCA, take groups of predictors and transform them into new features that are used as the predictors.

While the examples above are related to steps that occur before the model fit, there may also be operations that occur after the model is created. When a classification model is created where the outcome is binary (e.g., event and non-event), it is customary to use a 50% probability cutoff to create a discrete class prediction, also known as a “hard prediction”. For example, a classification model might estimate that the probability of an event was 62%. Using the typical default, the hard prediction would be event. However, the model may need to be more focused on reducing false positive results (i.e., where true non-events are classified as events). One way to do this is to raise the cutoff from 50% to some greater value. This increases the level of evidence required to call a new sample an event. While this reduces the true positive rate (which is bad), it may have a more dramatic effect on reducing false positives. The choice of the cutoff value should be optimized using data. This is an example of a post-processing step that has a significant effect on how well the model works, even though it is not contained in the model fitting step.

It is important to focus on the broader modeling process, instead of only fitting the specific model used to estimate parameters. This broader process includes any preprocessing steps, the model fit itself, as well as potential post-processing activities. In this book, we will refer to this broader process as the model workflow and include in it any data-driven activities that are used to produce a final model equation.

In other software, such as Python or Spark, similar collections of steps are called pipelines. In tidymodels, the term “pipeline” already connotes a sequence of operations chained together with a pipe operator (such as %>%). Rather than using ambiguous terminology in this context, we call the sequence of computational operations related to modeling workflows.

Binding together the analytical components of a data analysis is important for another reason. Future chapters will demonstrate how to accurately measure performance, as well as how to optimize structural parameters (i.e. model tuning). To correctly quantify model performance on the training set, Chapter 10 advocates using resampling methods. To do this properly, no data-driven parts of the analysis should be excluded from validation. To this end, the workflow must include all significant estimation steps.

To illustrate, consider PCA signal extraction. This was previously mentioned in Section 6.6 as a way to replace correlated predictors with new artificial features that are uncorrelated and capture most of the information in the original set. The new features would be used as the predictors and least squares regression could be used to estimate the model parameters.

There are two ways of thinking about the model workflow. The incorrect method would be to think of the PCA preprocessing step as not being part of the modeling process:

The fallacy here is that, although PCA does significant computations to produce the components, its operations are assumed to have no uncertainty associated with them. The PCA components are treated as known and, if not included in the model workflow, the effect of PCA could not be adequately measured.

An appropriate approach would be:

In this way, the PCA preprocessing is considered part of the modeling process.

8.2 Workflow basics

The workflows package allows the user to bind modeling and preprocessing objects together. Let’s start again with the Ames data and a simple linear model:

library(tidymodels)  # Includes the workflows package

lm_model <- 
  linear_reg() %>% 
  set_engine("lm")

A workflow always requires a parsnip model object:

lm_wflow <- 
  workflow() %>% 
  add_model(lm_model)

lm_wflow
#> ══ Workflow ═════════════════════════════════════════════════════════════════════════
#> Preprocessor: None
#> Model: linear_reg()
#> 
#> ── Model ────────────────────────────────────────────────────────────────────────────
#> Linear Regression Model Specification (regression)
#> 
#> Computational engine: lm

Notice that we have not yet specified how this workflow should preprocess the data: Preprocessor: None.

If our model were very simple, a standard R formula can be used as a preprocessor:

lm_wflow <- 
  lm_wflow %>% 
  add_formula(Sale_Price ~ Longitude + Latitude)

lm_wflow
#> ══ Workflow ═════════════════════════════════════════════════════════════════════════
#> Preprocessor: Formula
#> Model: linear_reg()
#> 
#> ── Preprocessor ─────────────────────────────────────────────────────────────────────
#> Sale_Price ~ Longitude + Latitude
#> 
#> ── Model ────────────────────────────────────────────────────────────────────────────
#> Linear Regression Model Specification (regression)
#> 
#> Computational engine: lm

Workflows have a fit() method that can be used to create the model. Using the objects create in Section 7.5:

lm_fit <- fit(lm_wflow, ames_train)
lm_fit
#> ══ Workflow [trained] ═══════════════════════════════════════════════════════════════
#> Preprocessor: Formula
#> Model: linear_reg()
#> 
#> ── Preprocessor ─────────────────────────────────────────────────────────────────────
#> Sale_Price ~ Longitude + Latitude
#> 
#> ── Model ────────────────────────────────────────────────────────────────────────────
#> 
#> Call:
#> stats::lm(formula = ..y ~ ., data = data)
#> 
#> Coefficients:
#> (Intercept)    Longitude     Latitude  
#>     -316.37        -2.08         3.01

We can also predict() on the fitted workflow:

predict(lm_fit, ames_test %>% slice(1:3))
#> # A tibble: 3 x 1
#>   .pred
#>   <dbl>
#> 1  5.22
#> 2  5.29
#> 3  5.28

The predict() method follows all of the same rules and naming conventions that we described for the parsnip package in Section 7.3.

Both the model and preprocessor can be removed or updated:

lm_fit %>% update_formula(Sale_Price ~ Longitude)
#> ══ Workflow ═════════════════════════════════════════════════════════════════════════
#> Preprocessor: Formula
#> Model: linear_reg()
#> 
#> ── Preprocessor ─────────────────────────────────────────────────────────────────────
#> Sale_Price ~ Longitude
#> 
#> ── Model ────────────────────────────────────────────────────────────────────────────
#> Linear Regression Model Specification (regression)
#> 
#> Computational engine: lm

Note that, in this new object, the output shows that the previous fitted model was removed since the new formula is inconsistent with the previous model fit.

8.3 Workflows and recipes

Instead of using model formulas, recipe objects can also be used to preprocess data for modeling. Section 7.5 summarized a recipe that specified several preprocessing and feature engineering steps. These are encapsulated inside the object ames_rec and are attached to the workflow:

lm_wflow %>% 
  add_recipe(ames_rec)
#> Error: A recipe cannot be added when a formula already exists.

That did not work! We can only have one preprocessing method at a time, so we need to remove the formula before adding the recipe.

lm_wflow <- 
  lm_wflow %>% 
  remove_formula() %>% 
  add_recipe(ames_rec)
lm_wflow
#> ══ Workflow ═════════════════════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: linear_reg()
#> 
#> ── Preprocessor ─────────────────────────────────────────────────────────────────────
#> 5 Recipe Steps
#> 
#> ● step_log()
#> ● step_other()
#> ● step_dummy()
#> ● step_interact()
#> ● step_ns()
#> 
#> ── Model ────────────────────────────────────────────────────────────────────────────
#> Linear Regression Model Specification (regression)
#> 
#> Computational engine: lm

We described the prep(), bake(), and juice() functions in Section 6.8 for using the recipe with a modeling function. This can be onerous, so the fit() method for workflow objects automates this process:

# Does `prep()`, `juice()`, and `fit()` in one step:
lm_fit <- fit(lm_wflow, ames_train)

# Does `bake()` and `predict()` automatically:
predict(lm_fit, ames_test %>% slice(1:3))
#> # A tibble: 3 x 1
#>   .pred
#>   <dbl>
#> 1  5.31
#> 2  5.30
#> 3  5.17

If we need the bare model object or recipe, there are pull_* functions that can retrieve them:

# Get the recipe and run `tidy()` method: 
lm_fit %>% 
  pull_workflow_prepped_recipe() %>% 
  tidy()
#> # A tibble: 5 x 6
#>   number operation type     trained skip  id            
#>    <int> <chr>     <chr>    <lgl>   <lgl> <chr>         
#> 1      1 step      log      TRUE    FALSE log_vqtIn     
#> 2      2 step      other    TRUE    FALSE other_K2XxD   
#> 3      3 step      dummy    TRUE    FALSE dummy_ht3jv   
#> 4      4 step      interact TRUE    FALSE interact_XqiIK
#> 5      5 step      ns       TRUE    FALSE ns_u4pZz

# To tidy the model fit: 
lm_fit %>% 
  # This returns the parsnip object:
  pull_workflow_fit() %>% 
  # Now tidy the linear model object:
  tidy() %>% 
  slice(1:5)
#> # A tibble: 5 x 5
#>   term                       estimate std.error statistic   p.value
#>   <chr>                         <dbl>     <dbl>     <dbl>     <dbl>
#> 1 (Intercept)                -0.531    0.296        -1.79 7.32e-  2
#> 2 Gr_Liv_Area                 0.648    0.0161       40.3  1.46e-264
#> 3 Year_Built                  0.00194  0.000139     13.9  2.69e- 42
#> 4 Neighborhood_College_Creek -0.0898   0.0332       -2.71 6.83e-  3
#> 5 Neighborhood_Old_Town      -0.0516   0.0129       -4.01 6.20e-  5

8.4 How does a workflow use the formula?

Recall from Sections 3.2 and 6.1 that the formula method has multiple purposes. One of these is to properly encode the original data into an analysis ready format. This can involve executing in-line transformations (e.g., log(x)), creating dummy variable columns, creating interactions or other column expansions, and so on. However, there are many statistical methods that require different types of encodings:

  • Most packages for tree-based models use the formula interface but do not encode the categorical predictors as dummy variables.

  • Packages can use special in-line functions that tell the model function how to treat the predictor in the analysis. For example, in survival analysis models, a formula term such as strata(site) would indicate that the column site is a stratification variable. This means that it should not be treated as a regular predictor and does not have a corresponding location parameter estimate in the model.

  • A few R packages have extended the formula in ways that base R functions cannot parse or execute. In multilevel models (e.g. mixed models or hierarchical Bayesian models), a model term such as (week | subject) indicates that the column week is a random effect that has different slope parameter estimates for each value of the subject column.

A workflow is a general purpose interface. When add_formula() is used, how should the workflow pre-process the data? Since the preprocessing is model dependent, workflows attempts to emulate what the underlying model would do whenever possible. If it is not possible, the formula processing should not do anything to the columns used in the formula. Let’s look at this in more detail.

Tree-based models

When we fit a tree to the data, the parsnip package understands what the modeling function would do. For example, if a random forest model is fit using the ranger or randomForest packages, the workflow knows predictors columns that are factors should be left as-is.

As a counter example, a boosted tree created with the xgboost package requires the user to create dummy variables from factor predictors (since xgboost::xgb.train() will not). This requirement is embedded into the model specification object and a workflow using xgboost will create the indicator columns for this engine. Also note that a different engine for boosted trees, C5.0, does not require dummy variables so none are made by the workflow.

This determination is made for each model and engine combination.

Special formulas and in-line functions

A number of multilevel models have standardized on a formula specification devised in the lme4 package. For example, to fit a regression model that has random effects for subjects, we would use the following formula:

library(lme4)
lmer(distance ~ Sex + (age | Subject), data = Orthodont)

The effect of this is that each subject will have an estimated intercept and slope parameter for age.

The problem is that standard R methods can’t properly process this formula:

model.matrix(distance ~ Sex + (age | Subject), data = Orthodont)
#> Warning in Ops.ordered(age, Subject): '|' is not meaningful for ordered factors
#>      (Intercept) SexFemale age | SubjectTRUE
#> attr(,"assign")
#> [1] 0 1 2
#> attr(,"contrasts")
#> attr(,"contrasts")$Sex
#> [1] "contr.treatment"
#> 
#> attr(,"contrasts")$`age | Subject`
#> [1] "contr.treatment"

The result is a zero row data frame.

The issue is that the special formula has to be processed by the underlying package code, not the standard model.matrix() approach.

Even if this formula could be used with model.matrix(), this would still present a problem since the formula also specifies the statistical attributes of the model.

The solution in workflows is an optional supplementary model formula that can be passed to add_model(). For example, using the previously mentioned strata() function in the survival package, the add_formula() specification provides the bare column names and then the actual formula given to the model is set within add_model():

library(survival)

parametric_model <- 
  surv_reg() %>% 
  set_engine("survival")

parametric_workflow <- 
  workflow() %>% 
  # This formula passes the data along as-is: 
  add_formula(futime + fustat ~ age + rx) %>% 
  add_model(parametric_model, 
            # This formula is given to the model
            formula = Surv(futime, fustat) ~ age + strata(rx))

random_effects <- fit(parametric_workflow, data = ovarian)
random_effects
#> ══ Workflow [trained] ═══════════════════════════════════════════════════════════════
#> Preprocessor: Formula
#> Model: surv_reg()
#> 
#> ── Preprocessor ─────────────────────────────────────────────────────────────────────
#> futime + fustat ~ age + rx
#> 
#> ── Model ────────────────────────────────────────────────────────────────────────────
#> Call:
#> survival::survreg(formula = Surv(futime, fustat) ~ age + strata(rx), 
#>     data = data, model = TRUE)
#> 
#> Coefficients:
#> (Intercept)         age 
#>     12.8734     -0.1034 
#> 
#> Scale:
#>   rx=1   rx=2 
#> 0.7696 0.4704 
#> 
#> Loglik(model)= -89.4   Loglik(intercept only)= -97.1
#>  Chisq= 15.36 on 1 degrees of freedom, p= 9e-05 
#> n= 26

Notice how, in the call printed above, the model-specific formula was used.

As an alternative to using two formulas, we could use a straightforward recipe instead:

parametric_workflow <- 
  workflow() %>% 
  add_recipe(recipe(futime + fustat ~ age + rx, data = ovarian)) %>% 
  add_model(parametric_model, 
            formula = Surv(futime, fustat) ~ age + strata(rx))

# No errors:
random_effects <- fit(parametric_workflow, data = ovarian)

There are additional preprocessor interfaces in development that allow tidy specifications of the variables without using a formula.

8.5 Future plans

Currently, a workflow can accept two types of preprocessors: a model formula or a recipe. While recipes are very flexible, they do not encompass everything that a modeling practitioner might want to do to the data prior to modeling. For example, a supervised feature filter might be of interest. Such a filter would screen predictors against the outcome using the training set and use the results to remove the least relevant predictors. In the future, a specification for this type of pre-model operation will be passed to a workflow to include it in the modeling process. This is a critical step that can have a profound effect on model performance. As discussed in the next chapter, a common pitfall in modeling is to exclude this step from the model evaluation process.

There are also operations that might occur after the model is fit. An example of such a post-processor would be cutoff selection for two-class problems. Previously in this chapter, we discussed the idea of modifying the cutoff for a two-class problem. In the future, workflows will be able to attach a custom cutoff that is applied to probabilities after the model fit. Other approaches, such as probability calibration, could also be added as post-processors.

8.6 Chapter summary

In this chapter, you learned that the modeling process encompasses more than just estimating the parameters of an algorithm that connects predictors to an outcome. This process also includes preprocessing steps, such as those outlined in Chapter 6, and operations taken after a model is fit. We introduced a concept called a model workflow that can capture the important components of the modeling process.

For the Ames data, the code used in later chapters is:

library(tidymodels)
data(ames)

ames <- mutate(ames, Sale_Price = log10(Sale_Price))

set.seed(123)
ames_split <- initial_split(ames, prob = 0.80, strata = Sale_Price)
ames_train <- training(ames_split)
ames_test  <-  testing(ames_split)

ames_rec <- 
  recipe(Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type + 
           Latitude + Longitude, data = ames_train) %>%
  step_log(Gr_Liv_Area, base = 10) %>% 
  step_other(Neighborhood, threshold = 0.01) %>% 
  step_dummy(all_nominal()) %>% 
  step_interact( ~ Gr_Liv_Area:starts_with("Bldg_Type_") ) %>% 
  step_ns(Latitude, Longitude, deg_free = 20)

lm_model <- linear_reg() %>% set_engine("lm")

lm_wflow <- 
  workflow() %>% 
  add_model(lm_model) %>% 
  add_recipe(ames_rec)

lm_fit <- fit(lm_wflow, ames_train)