12 Model tuning and the dangers of overfitting

Model parameters are unknown values that require estimation in order to use the model. Using simple linear regression again as the example:

\[ y_i = \beta_0 + \beta_1 x_i + \epsilon_i\]

Given data on the outcome \(y\) and the predictor \(x\), the two parameters \(\beta_0\) and \(\beta_1\) can be directly estimated from the data using:

\[\hat \beta_1 = \frac{\sum_i (y_i-\bar{y})(x_i-\bar{x})}{\sum_i(x_i-\bar{x})^2}\]

and

\[\hat \beta_0 = \bar{y}-\hat \beta_1 \bar{x}.\] After estimating these model parameters, the other important quantities (e.g., the covariance matrix, confidence intervals, etc.) can be computed. Because these values can be computed directly from the data, we say that there is an analytical method for estimating these parameters.
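
To make the closed-form estimates concrete, here is a minimal R sketch on simulated data (the object names and simulated values are purely illustrative):

```r
# Simulate a small data set for illustration
set.seed(123)
x <- runif(30, min = 0, max = 10)
y <- 2 + 3 * x + rnorm(30, sd = 1)

# Closed-form least squares estimates from the equations above
beta_1_hat <- sum((y - mean(y)) * (x - mean(x))) / sum((x - mean(x))^2)
beta_0_hat <- mean(y) - beta_1_hat * mean(x)

# These should match the analytical solution used by lm()
c(beta_0_hat, beta_1_hat)
coef(lm(y ~ x))
```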

There are situations where a model’s equation contains unknown parameters that can’t be directly estimated from the data. For the \(K\)-nearest neighbors model, the prediction equation for a new value \(x_0\) is

\[\hat y = \frac{1}{K}\sum_{\ell = 1}^K x_\ell^*\] where \(K\) is the number of neighbors and the \(x_\ell^*\) are the \(K\) closest values to \(x_0\) in the training set.
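
A minimal sketch of this prediction equation, taken literally for a single predictor (this is illustrative only, not the tidymodels implementation), might look like:

```r
# Predict for a new value x_0 by averaging the K closest training values
knn_predict <- function(x_train, x_0, K = 5) {
  # Order training values by distance to x_0 and keep the K closest
  closest <- x_train[order(abs(x_train - x_0))[seq_len(K)]]
  # Average them, as in the prediction equation
  mean(closest)
}

set.seed(123)
knn_predict(x_train = runif(50), x_0 = 0.5, K = 5)
```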

The model itself is not defined by a model equation; the prediction equation shown above defines it. This characteristic, along with the possible intractability of the distance measure, makes it effectively impossible to create a set of equations that can be solved for \(K\) (iteratively or otherwise). The number of neighbors has a profound impact on the model; it governs how flexible the class boundary will be: for small values of \(K\), the boundary is very elaborate, while large values might produce an unreasonably smooth boundary.

This is a good example of a tuning parameter: an unknown structural value that has significant impact on the model but cannot be directly estimated from the data.
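
In tidymodels, such a value is flagged for optimization rather than fixed in advance. As a minimal sketch (assuming the kknn engine package is installed; details appear later in the chapter):

```r
library(tidymodels)

# Flag `neighbors` as a tuning parameter to be optimized later,
# instead of fixing K to a single value
knn_spec <-
  nearest_neighbor(neighbors = tune()) %>%
  set_mode("classification") %>%
  set_engine("kknn")

knn_spec
```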

This chapter is strategic in nature. The outline is to:

  • Provide additional examples of tuning parameters.

  • Demonstrate that poor choices of these values can lead to overfitting.

  • Introduce several tactics for finding optimal values for the tuning parameters.

  • Present tidymodels functions for working with tuning parameters.

Subsequent chapters go into more detail on optimization methods for these values.

12.1 Tuning parameters for different types of models

12.2 What do we optimize?

12.3 The consequences of poor parameter estimates

12.4 Two general strategies for optimization

12.5 Tuning parameters in tidymodels

12.6 Chapter summary