We have chosen a difficult prediction problem. We predict the future in an ever-changing market, using inconsistent institutional data that were collected for other purposes. Let’s think of some of the ways this process could fail …
- Things change next year because of new admissions policies.
- Things change next year because of something that occurred over the past few years but was not collected in institutional data.
- Things change next year because we actively help institutions improve in new ways that are not measured historically.
- Things change next year because the industry as a whole changes.
- We could keep going…
These reasons why future predictions could fail fall into the category of irreducible error. Irreducible error is the name we give to the inherent variability in what we are trying to predict. Things in the system will change, we will not have measured or modeled those changes, and so our predictions will never be error-free.
Somewhat fortunately for us, our competitors face the same problem. Due to changes they couldn’t measure or anticipate, our competitors will always be at risk of making bad predictions. But for the same reasons, we are also at risk of making inaccurate predictions.
If we can’t make perfect predictions thanks to irreducible error, how do we make better, imperfect predictions?
Unlike irreducible error, we can actively work to minimize the reducible error in our predictions. Sources of error that are reducible are called bias and variance.
Bias is the amount that an average prediction differs from the correct value. Variance is a measure of the inconsistency of a prediction made from an algorithm trained on different datasets. Ideally, we want low bias and low variance, which would mean that our predictions don’t differ much from the true value, and our algorithm does well at identifying the true relationship between the target outcome and its predictors.
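These two definitions can be made concrete with a small simulation. The sketch below (purely illustrative, not any production model; the "true value," noise level, and sample sizes are invented) trains the same simple predictor on many resampled datasets, then measures bias as the average miss and variance as the spread of the predictions across datasets:

```python
# Illustrative sketch: measure the bias and variance of a predictor by
# training it on many independently drawn datasets. All numbers here are
# made up for demonstration.
import random
import statistics

random.seseed = None  # (no-op; seeding happens below)
random.seed(0)

TRUE_VALUE = 50.0   # the quantity we are trying to predict
NOISE_SD = 10.0     # irreducible noise in each observation

def draw_dataset(n=20):
    """One training dataset: noisy observations of the true value."""
    return [random.gauss(TRUE_VALUE, NOISE_SD) for _ in range(n)]

def predict(dataset):
    """A deliberately simple predictor: the sample mean of the data."""
    return statistics.mean(dataset)

# Train the same predictor on 2,000 different datasets.
predictions = [predict(draw_dataset()) for _ in range(2000)]

bias = statistics.mean(predictions) - TRUE_VALUE   # average miss
variance = statistics.pvariance(predictions)       # spread across datasets

print(f"bias = {bias:.3f}, variance = {variance:.3f}")
```

Because the sample mean is an unbiased estimator, the measured bias hovers near zero, while the variance reflects how much the prediction jumps around from one training dataset to the next.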
However, at a certain point, a trade-off arises in our efforts to reduce both types of error. It can be difficult to achieve both low bias and low variance.
When we train a model using historical data, we teach it to recognize patterns that are specific to those data. When we train it intensely on one specific set of historical data, and don’t limit the model’s complexity, it will become very good at predicting (with low bias) those same data.
But when we apply that same, highly complex model to data it has not seen before, that model may make inconsistent predictions (with high variance), because the patterns in the new data are slightly different. This is called overfitting. The model understands, too well, the patterns in the data it was trained on, and the model does not generalize well to new data.
To correct for overfitting, we actually need to make the model slightly less predictive of the data it was trained on, so it will be more flexible and better able to accurately predict similar data it has not seen before. We may need to allow more bias to reduce the variance in our predictions, which often involves limiting the model’s complexity.
The opposite problem, underfitting, arises when a model makes consistently similar predictions (low variance), but they are all wrong (high bias). This occurs when the model is not sufficiently complex to capture the relationship between the target outcome and the predictive data. In this case, we need to add more predictive data or fit a different type of algorithm that can better model this relationship.
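The over/underfitting spectrum can be sketched with one tunable model. Below, a k-nearest-neighbor regressor (a stand-in for illustration; the data-generating function and sample sizes are invented) is fit at three complexity levels: k=1 memorizes the training data (overfitting), k equal to the whole dataset predicts a single constant everywhere (underfitting), and a moderate k sits in between:

```python
# Hypothetical sketch of over- vs underfitting with k-nearest-neighbor
# regression in one dimension. Data and settings are invented.
import random

random.seed(1)

def true_fn(x):
    return 3.0 * x * x   # the real relationship we are trying to learn

def make_data(n=40):
    xs = [random.uniform(0, 1) for _ in range(n)]
    return [(x, true_fn(x) + random.gauss(0, 0.5)) for x in xs]

def knn_predict(train, x, k):
    """Average the y-values of the k training points nearest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mse(train, test, k):
    return sum((knn_predict(train, x, k) - y) ** 2
               for x, y in test) / len(test)

train, test = make_data(), make_data()

# k=1 overfits: perfect on training data, worse on new data.
# k=len(train) underfits: one constant prediction, high bias everywhere.
for k in (1, 5, len(train)):
    print(f"k={k:2d}  train MSE={mse(train, train, k):.3f}  "
          f"test MSE={mse(train, test, k):.3f}")
```

The k=1 model scores a perfect zero on its own training data yet misses on fresh data, while the constant model misses everywhere; tuning k is exactly the kind of bias-for-variance trade described above.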
Why are our predictions “less wrong” than our competitors’?
Ultimately, our predictions are “less wrong” because we construct our models based on how well they reduce the bias and variance in the prediction error.
Our competitors build causal inference models by intentionally including specific predictors that they believe actually cause the outcome to occur. This is called theoretical model specification, and it is well suited to explaining which variables affect the outcome and how. Analysts theorize that particular relationships matter, and they build the model accordingly. Period.
But, no matter how smart our competitors’ analysts are, their theories alone can’t assess the bias and variance in their models. Since they “know” what variables cause the outcome to occur before they run the model, they see no need to re-specify the model to account for reducible error.
At Capture, we use empirical model specification, which means that predictive variable selection, algorithm selection, and ensemble model selection are all performed based on how well they minimize the reducible error. This is our competitive advantage. Our models are built and assessed by how well they actually make predictions, while our competitors’ models are built and assessed by how plausibly their analysts believe the chosen variables cause the outcome.
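In code, empirical model specification boils down to a simple loop: fit each candidate model, score it on data it has never seen, and keep whichever scores best. The sketch below is a minimal, hypothetical version (the candidate models, data-generating process, and validation split are all invented for illustration):

```python
# Hedged sketch of empirical model specification: score each candidate
# model on held-out data and keep the one with the lowest validation
# error. Candidates and data are invented for illustration.
import random

random.seed(2)

def make_data(n=100):
    xs = [random.uniform(0, 10) for _ in range(n)]
    ys = [2.0 + 1.5 * x + random.gauss(0, 1) for x in xs]
    return xs, ys

def fit_constant(xs, ys):
    """Candidate 1: always predict the training mean."""
    mu = sum(ys) / len(ys)
    return lambda x: mu

def fit_linear(xs, ys):
    """Candidate 2: ordinary least squares for y = a + b*x."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return lambda x: a + b * x

def val_mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

train_x, train_y = make_data()
val_x, val_y = make_data()   # held-out data the models never trained on

candidates = {"constant": fit_constant, "linear": fit_linear}
scores = {name: val_mse(fit(train_x, train_y), val_x, val_y)
          for name, fit in candidates.items()}
best = min(scores, key=scores.get)
print(scores, "-> selected:", best)
```

The selection criterion is observed prediction error on held-out data, not anyone’s prior belief about which variables matter; in practice this idea scales up to cross-validation across many variables, algorithms, and ensembles.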
If you had to act on a prediction, which type would you want?
By Pete Barwis, Ph.D., Senior Data Scientist, Capture Higher Ed