11 Comparing models with resampling

Once we create two or more models, the next step is to compare them. In some cases, comparisons might be within-model, where the same model might be evaluated with different features or preprocessing methods. Alternatively, between-model comparisons, such as when we compared linear regression and random forest models in Chapter 10, are the more common scenario.

In either case, the result is a collection of resampled summary statistics (e.g. RMSE, accuracy, etc.) for each model. The first section of this chapter discusses important aspects of these statistics. Two additional sections follow describing how to formally compare models.

11.1 Resampled performance statistics

In Chapter 10, a random forest model for the Ames data was resampled with 10-fold cross-validation. These results were saved in the object rf_res. See Section 10.6 for a summary of the code used so far.

We’ll estimate two additional models with the same resamples. First, the previous linear regression model, using the preprocessing defined in the ames_rec recipe, is resampled:

lm_with_splines_res <-
  lm_wflow %>%
  fit_resamples(resamples = ames_folds, control = keep_pred)

Secondly, a less complex recipe without the splines terms is evaluated to determine if the extra 40 spline terms are worth keeping. Once added to a workflow, it is resampled:

lm_with_splines_res <-
  lm_wflow %>%
  fit_resamples(resamples = ames_folds, control = keep_pred)

no_spline_rec <- 
  recipe(Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type + 
           Latitude + Longitude, data = ames_train) %>%
  # Recall that Sale_Price is pre-logged
  step_log(Gr_Liv_Area, base = 10) %>% 
  step_other(Neighborhood, threshold = 0.01) %>% 
  step_dummy(all_nominal()) %>% 
  step_interact( ~ Gr_Liv_Area:starts_with("Bldg_Type_") ) 

lm_no_splines_res <- 
  lm_wflow %>%
  remove_recipe() %>% 
  add_recipe(no_spline_rec) %>% 
  fit_resamples(resamples = ames_folds, control = keep_pred)

The difference in performance appears relatively minor:

#> # A tibble: 2 x 6
#>   .metric .estimator   mean     n std_err .config             
#>   <chr>   <chr>       <dbl> <int>   <dbl> <chr>               
#> 1 rmse    standard   0.0782    10 0.00192 Preprocessor1_Model1
#> 2 rsq     standard   0.802     10 0.0113  Preprocessor1_Model1
#> # A tibble: 2 x 6
#>   .metric .estimator   mean     n std_err .config             
#>   <chr>   <chr>       <dbl> <int>   <dbl> <chr>               
#> 1 rmse    standard   0.0771    10 0.00211 Preprocessor1_Model1
#> 2 rsq     standard   0.808     10 0.0115  Preprocessor1_Model1

Considering these results, it appears that the additional terms do not profoundly improve the mean RMSE or R2 statistics. The difference is small, but it might be larger than the experimental noise in the system, i.e., considered statistically significant. We can formally test the hypothesis that the additional terms increase R2.

Before comparing models, it is important to discuss the within-resample correlation for resampling statistics. Each model was measured with the same cross-validation folds, and results for the same resample tend to be similar.

In other words, there are some resamples where performance across models tends to be low and others where it tends to be high. In statistics, this is called a resample-to-resample component of variation.

To illustrate, let’s gather the individual resampling statistics for the two linear models and the random forest. We will focus on the R2 statistic for each model, which measures correlation.

no_splines_rsq <- 
  collect_metrics(lm_no_splines_res, summarize = FALSE) %>% 
  filter(.metric == "rsq") %>% 
  select(id, `no splines` = .estimate)

splines_rsq <- 
  collect_metrics(lm_with_splines_res, summarize = FALSE) %>% 
  filter(.metric == "rsq") %>% 
  select(id, `with splines` = .estimate)

rf_rsq <- 
  collect_metrics(rf_res, summarize = FALSE) %>% 
  filter(.metric == "rsq") %>% 
  select(id, `random forest` = .estimate)  

rsq_estimates <- 
  inner_join(no_splines_rsq, splines_rsq, by = "id") %>% 
  inner_join(rf_rsq, by = "id") 

corrr::correlate(rsq_estimates %>% select(-id))
#> Correlation method: 'pearson'
#> Missing treated using: 'pairwise.complete.obs'
#> # A tibble: 3 x 4
#>   term          `no splines` `with splines` `random forest`
#>   <chr>                <dbl>          <dbl>           <dbl>
#> 1 no splines          NA              0.977           0.881
#> 2 with splines         0.977         NA               0.845
#> 3 random forest        0.881          0.845          NA

These high correlations indicate that, across models, there are large within-resample correlations. To see this visually, the R2 statistics are shown for each model with lines connecting the resamples:

rsq_estimates %>% 
  pivot_longer(cols = c(-id), names_to = "model", values_to = "rsq") %>% 
  mutate(model = reorder(model, rsq)) %>% 
  ggplot(aes(x = model, y = rsq, group = id, col = id)) + 
  geom_line(alpha = .5, lwd = 1.25) + 
  theme(legend.position = "none") + 
  labs(x = NULL, y = expression(paste(R^2, "statistics")))

If the resample-to-resample effect was not real, there would not be any parallel lines. A statistical test for the correlations evaluates whether the magnitudes of these correlations are not simply noise. For the linear models:

rsq_estimates %>% 
  with( cor.test(`no splines`, `random forest`) ) %>% 
  tidy() %>% 
  select(estimate, starts_with("conf"))
#> # A tibble: 1 x 3
#>   estimate conf.low conf.high
#>      <dbl>    <dbl>     <dbl>
#> 1    0.881    0.565     0.972

The within-resample correlation appears to be real.

What effect does the extra correlation have on our analysis? Consider the variance of a difference of two variables:

\[Var[X - Y] = Var[X] + Var[Y] - 2 Cov[X, Y]\]

The last term is the covariance between two items. If there is a significant positive covariance, then any statistical test of this difference would be critically under-powered comparing the difference in two models. In other words, ignoring the resample-to-resample effect would bias our model comparisons towards finding no differences between models.

This characteristic of resampling statistics will come into play in the next two sections.

Before making model comparisons or looking at the resampling results, it can be helpful to define a relevant practical effect size. Since these analyses focus on the R2 statistics, the practical effect size is the change in R2 that we would consider to be a realistic difference that matters. For example, we might think that two models are not practically different if their R2 values are within \(\pm 2\)%. If this were the case, differences smaller than 2% are not deemed important even if they are statistically significant.

Practical significance is subjective; two people can have very different ideas on the threshold for importance. However, as shown later, this consideration can be very helpful when deciding between models.

11.2 Simple hypothesis testing methods

Consider the familiar linear statistical model:

\[y_{ij} = \beta_0 + \beta_1x_{i1} + \ldots + \beta_px_{ip} + \epsilon_{ij}\]

This versatile model is used to create regression models as well as being the basis for the popular analysis of variance (ANOVA) technique for comparing groups. With the ANOVA model, the predictors (\(x_{ij}\)) are binary dummy variables for different groups. From this, the \(\beta\) parameters estimate whether two or more groups are different from one another using hypothesis testing techniques.

In our specific situation, the ANOVA can also make model comparisons. Suppose the individual resampled R2 statistics serve as the the outcome data (i.e., the \(y_{ij}\)) and the models as the predictors in the ANOVA model. A sampling of this data structure would be:

Y = rsq model X1 X2 id
0.8450 no splines 0 0 Fold01
0.8566 with splines 1 0 Fold01
0.8848 random forest 0 1 Fold01
0.7811 no splines 0 0 Fold02
0.7852 with splines 1 0 Fold02
0.8433 random forest 0 1 Fold02

The X1 and X2 columns in the table are indicators based on the values in the model column.

For our model comparison, the specific ANOVA model is:

\[y_{ij} = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \epsilon_{ij}\]


  • \(\beta_0\) is the estimate of the mean R2 statistic for linear models without splines,

  • \(\beta_1\) is the change in mean R2 when splines are added to the linear model, and

  • \(\beta_2\) is the change in mean R2 between a linear model without splines and the random forest model.

From these model parameters, hypothesis tests and p-values are generated to statistically compare models, but we must contend with how to handle the resample-to-resample effect. Historically, the resample groups would be considered a block effect and an appropriate term was added to the model. Alternatively, the resample effect could be considered a random effect where these particular resamples were drawn at random from a larger population of possible resamples. However, we aren’t really interested in these effects; we only want to adjust for them in the model so that the variances of the interesting differences are properly estimated.

Treating the resamples as random effects is theoretically appealing. Methods for fitting an ANOVA model with this type of random effect could include the linear mixed model (Faraway 2016) or a Bayesian hierarchical model (shown in the next section).

A simple and fast method for comparing two models at a time is to use the differences in R2 values as the outcome data in the ANOVA model. Since the outcomes are matched by resample, the differences do not contain the resample-to-resample effect and, for this reason, the standard ANOVA model is appropriate. To illustrate, this call to lm() tests the difference between the two linear regression models:

compare_lm <- 
  rsq_estimates %>% 
  mutate(difference = `with splines` - `no splines`)

lm(difference ~ 1, data = compare_lm) %>% 
  tidy(conf.int = TRUE) %>% 
  select(estimate, p.value, starts_with("conf"))
#> # A tibble: 1 x 4
#>   estimate p.value conf.low conf.high
#>      <dbl>   <dbl>    <dbl>     <dbl>
#> 1  0.00587  0.0393 0.000360    0.0114

# Alternatively, a paired t-test could also be used: 
rsq_estimates %>% 
  with( t.test(`with splines`, `no splines`, paired = TRUE) ) %>%
  tidy() %>% 
  select(estimate, p.value, starts_with("conf"))
#> # A tibble: 1 x 4
#>   estimate p.value conf.low conf.high
#>      <dbl>   <dbl>    <dbl>     <dbl>
#> 1  0.00587  0.0393 0.000360    0.0114

We could evaluate each pair-wise difference in this way. Note that the p-value indicates a statistically significant signal; the collection of spline terms for longitude and latitude do appear to have an effect. However, the difference in R2 is estimated at 0.59%. If our practical effect size were 2%, we might not consider these terms worth including in the model.

What’s a p-value? From Wasserstein and Lazar (2016): “Informally, a p-value is the probability under a specified statistical model that a statistical summary of the data (e.g., the sample mean difference between two compared groups) would be equal to or more extreme than its observed value.”

In other words, if this analysis were repeated a large number of times under the null hypothesis of no differences, the p-value reflects how extreme our observed results would be in comparison.

The next section takes a more general approach to making formal comparisons between models using random effects and Bayesian statistics (McElreath 2020). While the model is more complex than the ANOVA method shown above, the interpretation is more simple and straight-forward than the p-value approach.

11.3 Bayesian methods

The previous ANOVA model for these data had the form:

\[y_{ij} = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \epsilon_{ij}\]

where the residuals \(\epsilon_{ij}\) are assumed to be independent and follow a Gaussian distribution with zero mean and constant standard deviation of \(\sigma\). From this assumption, statistical theory shows that the estimated regression parameters follow a multivariate Gaussian distribution and, from this, p-values and confidence intervals are derived.

A Bayesian linear model makes additional assumptions. In addition to specifying a distribution for the residuals, we require prior distribution specifications for the model parameters ( \(\beta_j\) and \(\sigma\) ). These are distributions for the parameters that the model assumes before being exposed to the observed data. For example, a simple set of prior distributions for our model might be:

\[ \begin{align} \epsilon_{ij} &\sim N(0, \sigma) \notag \\ \beta_j &\sim N(0, 10) \notag \\ \sigma &\sim \text{exponential}(1) \notag \end{align} \]

These priors set the possible/probable ranges of the model parameters and have no unknown parameters. For example, the prior on \(\sigma\) indicates that values must be larger than zero, are very right-skewed, and have values that are usually less than 3 or 4.

Note that the regression parameters have a pretty wide prior distribution, with a standard deviation of 10. In many cases, we might not have a strong opinion about the prior beyond it being symmetric and bell shaped. The large standard deviation implies a fairly uninformative prior; it is not overly restrictive in terms of the possible values that the parameters might take on. This allows the data to have more of an influence during parameter estimation.

Given the observed data and the prior distribution specifications, the model parameters can then be estimated. The final distributions of the model parameters are combinations of the priors and the maximum likelihood estimates. These posterior distributions of the parameters are the key distributions of interest. They are a full probabilistic description of the model’s estimated parameters.

To adapt our Bayesian ANOVA model so that the resamples are adequately modeled, we consider a random intercept model. Here, we assume that the resamples impact the model only by changing the intercept. Note that this constrains the resamples from having a differential impact on the regression parameters \(\beta_j\); these are assumed to have the same relationship across resamples. This model equation is:

\[y_{ij} = (\beta_0 + b_{i}) + \beta_1x_{i1} + \beta_2x_{i2} + \epsilon_{ij}\]

This is not an unreasonable model for resampled statistics which, when plotted across models (as above), tend to have fairly parallel effects across models (i.e., little cross-over of lines).

For this model configuration, an additional assumption is made for the prior distribution of random effects. A reasonable approach is another symmetric distribution such as another bell-shaped curve. Given the effective sample size of 10 in our data, let’s use a prior that is wider than a standard normal distribution. We’ll use a t-distribution with a single degree of freedom (i.e. \(b_i \sim t(1)\)), which has heavier tails than an analogous Gaussian distribution.

The tidyposterior package has functions to fit such Bayesian models for the purpose of comparing resampled models. For an object created using the rsample package (such as our ames_folds tibble), the corresponding resampling estimates of performance are in columns of the data frame. From this object, the perf_mod() function determines an appropriate Bayesian model and fits using the resampling statistics. For example, let’s fit a model to the three sets of model R2 statistics:

# Bring the R^2 values into the original rsample object:
ames_two_models <- 
  ames_folds %>% 
  bind_cols(rsq_estimates %>% arrange(id) %>% select(-id))

ames_two_models %>% slice(1:4)
#> # A tibble: 4 x 5
#>   splits           id     `no splines` `with splines` `random forest`
#>   <list>           <chr>         <dbl>          <dbl>           <dbl>
#> 1 <split [2K/220]> Fold01        0.845          0.857           0.885
#> 2 <split [2K/220]> Fold02        0.781          0.785           0.843
#> 3 <split [2K/220]> Fold03        0.740          0.756           0.793
#> 4 <split [2K/220]> Fold04        0.835          0.838           0.864

The tidyposterior package uses the Stan software for specifying and fitting the models via the rstanarm package. The functions within that package have default priors (see ?priors for more details). The model below uses the default priors for all parameters except for the random intercepts (which follow a t-distribution). The estimation process uses random numbers so the seed is set within the function call. The estimation process is iterative and is replicated several times in collections called chains. The iter parameter tells the function how long to run the estimation process in each chain. When several chains are used, their results are combined (assume that this is validated by diagnostic assessments).


# The rstanarm package creates copious amounts of output; those results
# are not shown here but are worth inspecting for potential issues. 
rsq_anova <-
    prior_intercept = student_t(df = 1),
    chains = 4,
    iter = 5000,
    seed = 2

The resulting object has information on the resampling process as well as the Stan object embedded within (in an element called stan). We are most interested in the posterior distributions of the regression parameters. The package has a tidy() method that extracts these posterior distributions into a tibble:

model_post <- 
  rsq_anova %>% 
  # Take a random sample from the posterior distribution
  # so set the seed again to be reproducible. 
  tidy(seed = 35) %>% 
#> Instead of posterior_linpred(..., transform=TRUE) please call posterior_epred(), which provides equivalent functionality.

#> Rows: 30,000
#> Columns: 2
#> $ model     <chr> "no splines", "no splines", "no splines", "no splines", "no spl…
#> $ posterior <dbl> 0.8096, 0.8073, 0.8115, 0.8179, 0.8131, 0.8114, 0.8077, 0.8077,…

The three posterior distributions are:

model_post %>% 
  mutate(model = fct_inorder(model)) %>%
  ggplot(aes(x = posterior)) + 
  geom_histogram(bins = 50, col = "white", fill = "blue", alpha = 0.4) + 
  facet_wrap(~ model, ncol = 1) + 
  labs(x = expression(paste("Posterior for mean ", R^2)))

These histograms describe the estimated probability distributions of the mean R2 value for each model. There is some overlap, especially for the two linear models.

One wonderful aspect of using resampling with Bayesian models is that, once we have the posteriors for the parameters, it is trivial to get the posterior distributions for combinations of the parameters. For example, to compare the two linear regression models, we are interested in the difference in means. The posterior of this difference is computed by sampling from the individual posteriors and taking the differences. The contrast_models() function can do this. To specify the comparisons to make, the list_1 and list_2 parameters take character vectors and compute the differences between the models in those lists (parameterized as list_1 - list_2).

To compare the linear models:

rqs_diff <-
                  list_1 = "with splines",
                  list_2 = "no splines",
                  seed = 36)
#> Instead of posterior_linpred(..., transform=TRUE) please call posterior_epred(), which provides equivalent functionality.
#> Instead of posterior_linpred(..., transform=TRUE) please call posterior_epred(), which provides equivalent functionality.

rqs_diff %>% 
  as_tibble() %>% 
  ggplot(aes(x = difference)) + 
  geom_vline(xintercept = 0, lty = 2) + 
  geom_histogram(bins = 50, col = "white", fill = "red", alpha = 0.4) + 
  labs(x = expression(paste("Posterior for mean difference in ", R^2, 
                            " (splines - no splines)")))

The posterior shows that the center of the distribution is greater than zero (indicating that the model with splines had larger values) but does overlap with zero to a degree. The summary() method for this object computes the mean of the distribution as well as credible intervals.

summary(rqs_diff) %>% 
#> # A tibble: 1 x 6
#>   contrast                   probability    mean    lower  upper  size
#>   <chr>                            <dbl>   <dbl>    <dbl>  <dbl> <dbl>
#> 1 with splines vs no splines       0.853 0.00585 -0.00367 0.0151     0

The probability column reflects the proportion of the posterior that is greater than zero. This is the probability that the positive difference is real. The value is large, providing a strong case for statistical significance.

However, the estimate of the mean difference is fairly close to zero. Recall that the practical effect size we suggested previously is 2%. With a posterior distribution, the probability of being practically significant is also computed. In Bayesian analysis, this is a “ROPE estimate” (for Region Of Practical Equivalence, Kruschke and Liddell (2018)). To estimate this, the size option to the summary function is used:

summary(rqs_diff, size = 0.02) %>% 
  select(contrast, starts_with("pract"))
#> # A tibble: 1 x 4
#>   contrast                   pract_neg pract_equiv pract_pos
#>   <chr>                          <dbl>       <dbl>     <dbl>
#> 1 with splines vs no splines         0       0.990    0.0097

The pract_equiv column is the proportion of the posterior that is within [-size, size]. This large value indicates that, for our effect size, there is an overwhelming probability that the two models are practically the same.

The same process can be followed to compare the random forest model to one or both of the linear regressions that were resampled.

How does the number of resamples affect this analysis? More resamples increases the precision of the overall resampling estimate; that precision propagates to this type of analysis. For illustration, additional resamples were added using repeated cross-validation. How did the posterior distribution change? The figure below shows the 90% credible intervals with up to 100 resamples (generated from 10 repeats of 10-fold cross-validation):

The width of the intervals decreases as more resamples are added. Clearly, going from ten resamples to thirty has a larger impact than going from eighty to 100. There are diminishing returns for using a “large” number of resamples (“large” will be different for different data sets).

11.4 Chapter summary

This chapter describes formal statistical methods for testing differences in performance between models. We demonstrated the within-resample effect, where results for the same resample tend to be similar; this aspect of resampled summary statistics requires appropriate analysis in order for valid model comparisons. Further, although statistical significance and practical significance are both important concepts for model comparisons, they are different.


Faraway, J. 2016. Extending the Linear Model with R: Generalized Linear, Mixed Effects and Nonparametric Regression Models. CRC press.

Kruschke, J, and T Liddell. 2018. “The Bayesian New Statistics: Hypothesis Testing, Estimation, Meta-Analysis, and Power Analysis from a Bayesian Perspective.” Psychonomic Bulletin and Review 25 (1): 178–206.

McElreath, R. 2020. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. CRC press.

Wasserstein, R, and N Lazar. 2016. “The ASA Statement on P-Values: Context, Process, and Purpose.” The American Statistician 70 (2): 129–33.