Error and robustness: “in-sample” and “out-of-sample” metrics

Edited by Pier Giuseppe Giribone

One of the primary causes of failure in generalization is overfitting, a statistical concept that was well known long before machine learning methods came into widespread use. The following example proposes a criterion for measuring the validity of a model, based on dividing the dataset into a training set and a test set. Consider, as an example, the task of finding the regression that best represents the law governing the process described by the experimental points shown in Figure 1.

Figure 1 – Polynomial regression models compared

Figure 1 shows 40 points not described by a known law. The blue color associated with the entire dataset means that 100% of the sample will be used to train the interpretive models.

In particular, three traditional polynomial regression models are taken into consideration in order to better interpret the law that generated them:

– The blue line is obtained from a linear model, characterized by two parameters that must be estimated from the training dataset (the blue points).

– The orange line is generated by a quadratic model, characterized by three parameters to be estimated.

– The green line is obtained from a fiftieth-order polynomial model, which consequently has a very large number of parameters.

The model represented by the green line intuitively overfits the data, while the one represented by the blue line fits the data too poorly (underfitting). The orange line is the one that, even visually, best captures the essence of the law that generated the data.

The quadratic model is therefore the one that best generalizes the relationship inherent in the data. The real question is how this insight can be transmitted to a computer.

The key is to define a statistical measure of the model's error, that is, one that compares the model's predictions with the experimentally observed values and so quantifies the gap between the model and the data.

Two of the most widely used error measures for regression are:

- Mean Absolute Error (MAE): defined as the sum of the errors taken in absolute value divided by the number of elements present in the sample.

- Mean Squared Error (MSE): defined as the mean of the squared errors.
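
The two definitions above can be written out directly. The following is a minimal sketch in plain Python; the function names and the four sample values are illustrative assumptions and do not come from the dataset in Figure 1.

```python
def mean_absolute_error(observed, predicted):
    """Sum of the absolute errors divided by the number of elements."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def mean_squared_error(observed, predicted):
    """Mean of the squared errors."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)


# Illustrative values only (hypothetical, not the article's data).
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.0, 6.0]

mae = mean_absolute_error(observed, predicted)
mse = mean_squared_error(observed, predicted)
print(mae)  # 0.875
print(mse)  # 1.3125
```

Because the MSE squares each error, the single large miss (4.0 predicted as 6.0) weighs much more heavily on it than on the MAE.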

If we applied these measures directly to the entire training set, the overfit model would mistakenly be judged the best, since it has the lowest MAE and MSE of the three approaches.

In-sample performance (all 40 points)

Underfit model: MAE = 7.28, MSE = 75.99

Correct-fit model: MAE = 2.10, MSE = 6.72

Overfit model: MAE = 0.35, MSE = 0.36

To correctly identify the best model, these metrics must be computed on data not included in the training set. This portion of the data, which the algorithm does not see during training, is called the test set.

The proposed procedure is to fit the model not on all 40 experimental points, but to exclude a portion of them (for example 15%) from the training set and use it to estimate out-of-sample performance statistics, which are more suitable for evaluating the quality of our algorithm.

Figure 2 shows the data selected for training the models in blue, while the data to be used as a comparison is in orange. The split between training and testing must be random.
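
A random split of this kind can be sketched as follows. This is only an illustration using the standard library: the function name `train_test_split`, the fixed seed and the placeholder points are assumptions made for the example, not the procedure or data used to produce Figure 2.

```python
import random


def train_test_split(points, test_fraction=0.15, seed=None):
    """Shuffle the points and hold out a fraction of them as the test set."""
    rng = random.Random(seed)       # a seed makes the split reproducible
    shuffled = list(points)         # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    n_test = round(test_fraction * len(shuffled))
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder data: 40 hypothetical (x, y) pairs.
points = [(i, i * i) for i in range(40)]
train, test = train_test_split(points, test_fraction=0.15, seed=42)
print(len(train), len(test))  # 34 6
```

Shuffling before cutting is what makes the split random: if the data are ordered (for example by x), taking the last 15% would test the models only at one end of the range.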

Figure 2 – Train-test split

By estimating the performance of the three models with the new training dataset (blue dots in Figure 2), we obtain results similar to the previous ones:

In-Sample performance

Underfit model: MAE = 7.04, MSE = 71.28

Correct-fit model: MAE = 1.87, MSE = 5.77

Overfit model: MAE = 0.0009, MSE = 0.000000172

The overfit model now has virtually zero error measures. Let's now evaluate the same metrics on the test points.

Testing the models on data not considered in the training phase reveals the instability of the overfit model.

Out-of-Sample performance

Underfit model: MAE = 8.56, MSE = 103.09

Correct-fit model: MAE = 3.31, MSE = 12.6

Overfit model: MAE = 10e+9, MSE = 10e+16 

In summary, statistical measurements conducted on “new” data allow for a reliable and independent external measure of a model’s performance.

Underfit models are characterized by high in-sample and out-of-sample errors, while overfit models have extremely low in-sample errors and extremely high (or unstable) out-of-sample errors. Correct models have good and stable performance for both in-sample and out-of-sample errors.

The challenge is therefore to find the number of parameters that allows an optimal trade-off between model stability and in-sample, but above all out-of-sample, performance.
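
The whole comparison can be sketched in a few lines of Python with NumPy. The data below are synthetic: the quadratic law, the noise level and the seed are assumptions made for illustration, and degree 15 stands in for the article's very high-order polynomial, so the numbers will not match the tables above.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)  # fixed seed, for reproducibility only

# Hypothetical data: 40 noisy samples of an assumed quadratic law.
x = np.linspace(-5.0, 5.0, 40)
y = 2.0 * x**2 - 3.0 * x + 1.0 + rng.normal(scale=2.0, size=x.size)

# Random split: 15% of the points (6 of 40) are held out as the test set.
order = rng.permutation(x.size)
n_test = round(0.15 * x.size)
test_idx, train_idx = order[:n_test], order[n_test:]


def errors(model, xs, ys):
    """Return (MAE, MSE) of the model on the given points."""
    residuals = ys - model(xs)
    return float(np.mean(np.abs(residuals))), float(np.mean(residuals**2))


results = {}
for degree in (1, 2, 15):  # underfit, correct fit, overfit candidates
    # Least-squares polynomial fit on the training points only.
    model = Polynomial.fit(x[train_idx], y[train_idx], deg=degree)
    results[degree] = {
        "in_sample": errors(model, x[train_idx], y[train_idx]),
        "out_of_sample": errors(model, x[test_idx], y[test_idx]),
    }

for degree, r in results.items():
    print(degree, "in-sample:", r["in_sample"], "out-of-sample:", r["out_of_sample"])

# Choose the degree with the lowest out-of-sample MSE.
best_degree = min(results, key=lambda d: results[d]["out_of_sample"][1])
print("degree with lowest out-of-sample MSE:", best_degree)
```

The in-sample MSE can only decrease as the degree grows, because each higher-degree model contains the lower-degree ones as special cases. This is why the degree is chosen on the out-of-sample error instead.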

In the polynomial regression just discussed, using a polynomial whose degree is very close to the number of experimental points made it certain that a function fitting the points perfectly could be found.

But in doing so, the most important property of a statistical model was completely lost: the ability to generalise. The overfitted model proved to be unstable in the vicinity of the experimental points, significantly increasing the calculated error.

The phenomenon of overfitting, as just demonstrated using classical econometric models, is not new in traditional statistics, but it plays a truly critical role in Machine Learning in general and in deep neural networks in particular, i.e. wherever the number of model parameters is high.
