Part 1—Time and Pools
(The following is Part 1 of a two-part blog series on four components of good and bad quality data for predictive admissions models: time, pools, measurement, and coding. Today, we're going to talk about time and pools.)
To make enrollment management predictions, we use history to inform our algorithmic guesses about the future. If history (at least the past few years) is a good representation of the future (the next year), we can make good predictions.
If history is not a good representation of the future, we will make poor predictions regardless of how well our prediction algorithm performs on historical data. And if history is a good representation of the future, but our data are measured or coded inconsistently between historical and future years, we will also struggle to make good predictions.
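One way to check whether history is a good representation of the future is to evaluate a model with a temporal holdout: fit on past years and score on the most recent year, rather than on a random split. Here is a minimal sketch in Python; the records, field names, and years are synthetic and purely illustrative.

```python
# Temporal (year-based) holdout: train on earlier years, test on the
# most recent year. Field names and data below are hypothetical.

def train_test_split_by_year(records, holdout_year):
    """Train on all years before holdout_year; test on holdout_year only."""
    train = [r for r in records if r["year"] < holdout_year]
    test = [r for r in records if r["year"] == holdout_year]
    return train, test

records = [
    {"year": 2019, "enrolled": 1},
    {"year": 2019, "enrolled": 0},
    {"year": 2020, "enrolled": 1},
    {"year": 2021, "enrolled": 0},
]

train, test = train_test_split_by_year(records, holdout_year=2021)
print(len(train), len(test))  # 3 1
```

If performance on the held-out year is much worse than performance on a random split of historical data, that gap is evidence that something about the data-generating process has shifted between years.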
Many of the data-quality problems that plague admission data can be attributed to inconsistencies in measurement and storage over time. Enrollment management consultants change (for better or for worse), student populations change, CRM packages switch, admission workers enter and leave, etc. All of these changes can have a negative impact on the quality of future predictions made with historical student record data.
Typically, colleges and universities have consistent data quality within years, but between years, data are often inconsistent. Database fields are added or subtracted, field names change, demographic categories are added or subtracted, category codes change, numeric fields’ ranges change, etc. Inconsistently measured and stored data between years is our biggest data quality challenge.
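When category codes change between years, one practical mitigation is an explicit per-year code map applied before modeling, so every year's records land in one canonical scheme. A hypothetical sketch (the years, codes, and labels are all invented for illustration):

```python
# Harmonize year-specific category codes into one canonical labeling
# before modeling. All codes and labels here are hypothetical.

CODE_MAPS = {
    2020: {"H": "hispanic", "W": "white", "B": "black"},
    2021: {"1": "hispanic", "2": "white", "3": "black", "4": "asian"},
}

def harmonize(record):
    """Translate a year-specific ethnicity code to a canonical label."""
    mapping = CODE_MAPS.get(record["year"], {})
    return mapping.get(record["ethnicity_code"], "unknown")

print(harmonize({"year": 2020, "ethnicity_code": "H"}))  # hispanic
print(harmonize({"year": 2021, "ethnicity_code": "3"}))  # black
print(harmonize({"year": 2021, "ethnicity_code": "9"}))  # unknown
```

The "unknown" fallback matters: codes that appear in one year but have no mapping should surface explicitly rather than silently merge into a real category.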
Student pools fluctuate over time. Some fluctuations do not pose problems for predictive models, but some do. The problematic fluctuations are those in which the selection process that generates the student pool changes from year to year.
For example, the suspect pool is highly inconsistent from year to year at most schools. Why? For one, most schools source suspects differently every year, so the composition and size of the pool varies dramatically. A highly variable selection process makes it difficult to make predictions about the future for suspects, because what constitutes a suspect changes every year.
Also, considerable inconsistencies in the yearly suspect pool size mean that we will struggle to make accurate predictions about the size of the next year's applicant pool. But we know enough about suspects from year to year to make lower-risk predictions about which students are simply more likely to apply to a particular school than other students.
Now, selection for applicant pools is also variable, but not nearly as inconsistent as for suspect pools. Why? Applicants typically self-select into the applicant pool, which creates some stability across time in the types of students who apply from year to year. A dramatic example: the student applying to the local community college is probably not also applying to Harvard. And the other students applying to that community college are likely to be similar in demographic, behavioral, and contextual ways.
Also, the relative stability of applicant pools means we can make more reliable predictions about what applicants are likely to do (relative to suspects). But since schools control the next stage of the student decision journey (admission), there is little reason to make predictions about what applicants will do.
Selection for the accepted pool is much more consistent than for the prior two pools. Accepted students passed a rigorous assessment process that distinguished them from dissimilar students, according to each school's admission criteria. These criteria are often similar across time. The similarity in selection process and student pools over time enables us to make good predictions about the behavior of accepted students (i.e., will they enroll or not?).
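Because the accepted pool is selected consistently, even a simple baseline, such as the historical enrollment (yield) rate within each student segment, can provide a useful predicted probability of enrollment. A minimal sketch; the segment names and records are synthetic, and this is illustrative rather than any particular production model:

```python
# Baseline yield model: the historical enrollment rate per segment
# serves as next year's predicted enrollment probability for admits
# in that segment. Segments and data below are synthetic.
from collections import defaultdict

def yield_rates(history):
    """Historical enrollment rate per segment among accepted students."""
    counts = defaultdict(lambda: [0, 0])  # segment -> [enrolled, total]
    for r in history:
        counts[r["segment"]][0] += r["enrolled"]
        counts[r["segment"]][1] += 1
    return {seg: e / n for seg, (e, n) in counts.items()}

history = [
    {"segment": "in_state", "enrolled": 1},
    {"segment": "in_state", "enrolled": 1},
    {"segment": "in_state", "enrolled": 0},
    {"segment": "out_of_state", "enrolled": 0},
    {"segment": "out_of_state", "enrolled": 1},
]

rates = yield_rates(history)
print(rates)  # {'in_state': 0.666..., 'out_of_state': 0.5}
```

A richer model would add student-level predictors, but the point stands: this baseline is only trustworthy because the selection process producing accepted students is stable across years.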
So, consistency over time and the selection criteria of student pools are two of the four components of good and bad quality data for predictive models. Next week, in Part 2 of "Data Considerations for Predictive Admissions Products," we will discuss two components of data consistency: measurement and coding.
By Pete Barwis, Ph.D., Senior Data Scientist, Capture Higher Ed