Most higher education consultancies rely on a toolbox of prediction methods borrowed from the social sciences. These tools are good at explaining why certain behaviors emerge during admissions cycles. Unfortunately, they are often less well-suited to accurately predicting outcomes in a modern enrollment management context.
In fact, most enrollment management professionals rely on just two predictive algorithms: linear and logistic regression. Linear regression is used to predict a numeric outcome (with a measurement unit like the number of dollars). Logistic regression is used to predict a categorical outcome with two possible values (coded 0 or 1; e.g., “enrolled or not enrolled”).
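The distinction between the two can be sketched in a few lines of scikit-learn. All data, variable names, and dollar figures below are hypothetical, chosen only to illustrate the two kinds of outcomes.

```python
# Sketch: the two workhorse models of enrollment prediction, fit with
# scikit-learn on tiny made-up data (all numbers are illustrative).
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Hypothetical predictors: [campus visits, GPA] for six applicants.
X = np.array([[0, 3.2], [1, 3.5], [2, 3.8], [0, 2.9], [3, 3.9], [1, 3.1]])

# Linear regression: a numeric outcome with a measurement unit
# (e.g., net tuition revenue in dollars).
dollars = np.array([8000, 9500, 11000, 7500, 12500, 8800])
lin = LinearRegression().fit(X, dollars)

# Logistic regression: a categorical outcome coded 0 or 1
# (enrolled or not enrolled).
enrolled = np.array([0, 0, 1, 0, 1, 0])
log = LogisticRegression().fit(X, enrolled)

# Each model estimates one coefficient per predictor plus an intercept;
# the logistic model returns a probability of enrolling for each applicant.
print(lin.coef_)
print(log.predict_proba(X[:1]))
```

The linear model predicts a dollar amount directly; the logistic model predicts the probability that the 0/1 outcome equals 1.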
Linear regression dates back to the 1800s, and logistic regression, which is a special case of the generalized linear model, has been in use since the 1950s. Both models continue to be used extensively today, in spite of their age, because they are reliable, interpretable, and efficient.
However, both of these models assume that an outcome, or a transformation of the outcome, can be expressed as a function of a linear combination of predictor (independent) variables. Sadly, the assumption of a linear relationship between predictors and an outcome rarely holds up well in real life.
It is often the case that the number of times a student visits campus has a different effect on their likelihood of enrolling for recruited athletes than it does for non-athletes, or for first-generation students compared to students whose parents attended college.
The way these conditional relationships are specified in a linear or logistic regression model is to include interaction terms that multiply these predictors together. But this process assumes that the person specifying the model already knows every important conditional relationship present in real life before the model is run. In a complex world, it is unlikely that even a very capable scientist will be able to specify a linear model in such a way that it captures all of the conditional relationships that, together, make the best possible prediction of an outcome.
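A minimal sketch of such an interaction term, using hypothetical data: the effect of campus visits on enrollment is allowed to differ for recruited athletes, but only because the modeler thought to multiply the two predictors together by hand.

```python
# Sketch: a hand-specified interaction term in a logistic regression.
# Data and variable names are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression

visits   = np.array([0, 1, 2, 3, 0, 1, 2, 3])
athlete  = np.array([0, 0, 0, 0, 1, 1, 1, 1])
enrolled = np.array([0, 0, 1, 1, 0, 1, 1, 1])

# The model only "sees" the conditional relationship if the modeler
# explicitly includes this multiplied column; omit it, and visits are
# forced to have the same effect for athletes and non-athletes.
interaction = visits * athlete
X = np.column_stack([visits, athlete, interaction])

model = LogisticRegression().fit(X, enrolled)
# Three coefficients: visits, athlete, and the athlete-specific
# adjustment to the effect of visits.
print(model.coef_)
```

Every conditional relationship the modeler fails to anticipate is simply absent from the model, no matter how strong it is in the data.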
When real-life outcomes are complicated, and the relationships of predictors to those outcomes matter conditionally in many different ways, we need a different kind of predictive algorithm. Recent advances in computer science and statistics have produced a wealth of predictive algorithms that can detect and exploit non-linear relationships in data, making more accurate predictions than can be made when we assume outcomes are simple linear functions of their predictors.
These machine learning algorithms pick up on relationships in different ways, making some algorithms better suited for certain types of data. There is no one-size-fits-all master algorithm that always works best regardless of what we are predicting. This is termed the “No Free Lunch” theorem. Instead of relying on a single algorithm like the generalized linear model, we need to test different, but well-suited, machine learning algorithms and select the combination (ensemble) that performs best on a particular data set.
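The testing process described above can be sketched as follows. This is a minimal stand-in for a full model-selection pipeline, using synthetic data and three common candidate algorithms; the specific algorithms and settings are illustrative, not a description of any particular production system.

```python
# Sketch: rather than committing to one algorithm in advance, fit several
# candidates and keep whichever cross-validates best on this data set.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# Synthetic stand-in for an admissions data set.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

candidates = {
    "logistic": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=0),
    "boosting": GradientBoostingClassifier(random_state=0),
}

# Score each candidate with 5-fold cross-validation; per the No Free
# Lunch theorem, which one wins depends on the data at hand.
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in candidates.items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

In practice the winning candidates can also be combined into a weighted ensemble rather than keeping a single winner, which is the approach the article describes.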
Capture’s groundbreaking Envision prediction platform does just that.
The process of fitting numerous machine learning algorithms and testing them all is much slower, more complicated, and less interpretable than running a single, generalized linear model, but it enables us to make more reliable and accurate predictions.
By Pete Barwis, Ph.D., Senior Data Scientist, Capture Higher Ed