Part 2—Measurement and Coding
The following is Part 2 of a two-part blog series on four components of good and bad quality data for predictive models: time, pools, measurement and coding. Today, we’re going to talk about measurement and coding. Go here to read Part 1—Time and Pools.
Measurement is the assignment of a value to an object or event. The perceptions and needs of colleges and universities change over time, and so does how they assign values to students.
Take race, for example. As the thinking on racial categories changes over time, schools adjust how they define and collect race data. While five categories may have been sufficient in 2005, in 2015, race might be considered one or more of a set of 25 or more categories.
As measurement changes over time, our ability to make predictions is impacted. If we knew the racial composition in 2005 as a five-category measure—African American, 15 percent; Asian, 10 percent; Hispanic, 8 percent; white, 60 percent; and other, 7 percent—and we tried to predict the 2006 distribution with only that information, what would we guess the categories and percentages would be?
If all we knew were the 2005 values, we would just guess those values again for 2006. But what if we expanded the race categories in 2006 to 25 categories and added an option to pick more than one race? We would probably do a pretty bad job predicting the composition of the new set of categories, because we have no comparable historical information to draw on.
As measurement changes from year to year, those changes can render the things being measured largely useless, or even harmful, for prediction, unless the changes over time can be reduced to a least common denominator.
Say we measured a campus visit from 2012-2014 as whether or not a student visited campus, but in 2015, we measured the number of times a student visited campus. We could reduce the 2015 counts to simply whether or not those students visited at all. This would make the measure comparable over time and useful for prediction.
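That reduction to a least common denominator is a one-line transformation in practice. A minimal sketch, assuming hypothetical column names (`visited`, `visit_count`): the 2015 count is collapsed to a yes/no flag so it can be stacked with the earlier years.

```python
# Sketch of harmonizing a measurement change across years.
# 2012-2014 stored a yes/no "visited" flag; 2015 stored a "visit_count".
# Column names here are hypothetical, not from any particular CRM.
import pandas as pd

early = pd.DataFrame({"year": [2013, 2014], "visited": [True, False]})
recent = pd.DataFrame({"year": [2015, 2015], "visit_count": [0, 3]})

# Least common denominator: did the student visit at all?
recent["visited"] = recent["visit_count"] > 0

# Now both eras share one comparable measure and can be stacked.
combined = pd.concat(
    [early[["year", "visited"]], recent[["year", "visited"]]],
    ignore_index=True,
)
```

Note the information loss runs one way: a count can always be reduced to a yes/no flag, but the earlier yes/no data can never be expanded into counts.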
Alternatively, if we counted a campus visit as attending a formal campus tour from 2012-2014, but in 2015, we started also counting self-guided tours and meetings with faculty as visits, then that measurement change will pose problems for predicting enrollment behavior, because the operational definition of a campus visit changed between years.
Predictive algorithms cannot distinguish between changes in how behavior is measured and changes in real life behavior. If history shows that a visit makes it three times more likely a student will enroll, but suddenly the number of visits increases by 30 percent due to a change in measurement, we will over-predict the number of enrollments in the next year.
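The arithmetic behind that over-prediction is worth making concrete. A hedged illustration with made-up numbers (the enrollment rates and visitor counts below are hypothetical, not from the post): if visitors enroll at three times the rate of non-visitors, every student reclassified as a "visitor" by a definition change inflates the forecast by the difference between the two rates.

```python
# Hypothetical numbers: a model cannot tell a measurement change from a
# behavior change. Suppose 30% of visitors and 10% of non-visitors enroll
# (visitors enroll at 3x the rate), and 1,000 students visited last year.
p_enroll_visitor = 0.30      # assumed historical enrollment rate, visitors
p_enroll_nonvisitor = 0.10   # assumed rate, non-visitors

visitors_last_year = 1000
# A broader visit definition counts 30% more "visitors" this year,
# with no change in real behavior.
visitors_this_year = int(visitors_last_year * 1.3)

extra_visitors = visitors_this_year - visitors_last_year
# Those 300 students would have been "non-visitors" under the old
# definition, so the model adds the rate gap for each as spurious
# predicted enrollments.
over_prediction = extra_visitors * (p_enroll_visitor - p_enroll_nonvisitor)
# roughly 60 spurious predicted enrollments
```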
The most common problem that occurs between years of student data is a change in the way the same phenomena are coded. When a school changes its CRM, many of the field/column names change with it. This makes appending those columns across time difficult for the school, and for us.
Even when schools don’t change their CRM provider, they may redefine categories in some columns or change how columns are stored. Changes between years might include minor differences in coding, like ‘M’ and ‘F’ becoming ‘Male’ and ‘Female’ for gender, or more substantial changes, like switching the date format stored in a column. These sorts of differences require an analyst to review them and manually recode the data so it is comparable between years.
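Those recoding steps usually end up as small normalization functions. A minimal sketch, assuming one hypothetical school stored gender as ‘M’/‘F’ with MM/DD/YYYY dates in one year and ‘Male’/‘Female’ with ISO dates the next; both years are mapped onto a single scheme before any modeling.

```python
# Sketch of manual recoding between years. The specific codes and
# formats are assumptions for illustration, not a real school's schema.
from datetime import datetime

GENDER_MAP = {"M": "Male", "F": "Female"}

def normalize_gender(value):
    # Translate old single-letter codes; pass through values that
    # already use the new scheme.
    return GENDER_MAP.get(value, value)

def normalize_date(value):
    # Accept either year's format and emit ISO 8601 (YYYY-MM-DD).
    for fmt in ("%m/%d/%Y", "%Y-%m-%d"):
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {value!r}")
```

Raising on an unrecognized format, rather than silently passing it through, is deliberate: a coding change you didn’t anticipate should surface during cleaning, not after the model is trained.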
The most time-consuming and manual part of making predictions is data cleaning: converting data from a relational database, or from separate relational databases, into a single analytical table that can be used by a predictive algorithm.
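The shape of that final step can be sketched briefly. Assuming two hypothetical relational tables, one row per student and one row per campus visit, the analytical table has exactly one row per student, with the visit behavior summarized into columns a predictive algorithm can consume.

```python
# Minimal sketch of flattening relational tables into one analytical
# table. Table and column names are hypothetical.
import pandas as pd

students = pd.DataFrame({"student_id": [1, 2, 3], "gpa": [3.4, 2.9, 3.8]})
visits = pd.DataFrame(
    {"student_id": [1, 1, 3],
     "visit_date": ["2015-02-01", "2015-03-10", "2015-04-02"]}
)

# Summarize the one-to-many visits table to one row per student...
visit_counts = (
    visits.groupby("student_id").size().rename("visit_count").reset_index()
)

# ...then attach it to the student table; students with no visit rows
# get a count of zero rather than a missing value.
analytical = students.merge(visit_counts, on="student_id", how="left")
analytical["visit_count"] = analytical["visit_count"].fillna(0).astype(int)
```

The `how="left"` join and the `fillna(0)` matter: student 2 never visited, so without them that student would either drop out of the table or carry a missing value into the algorithm.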
In short, good quality data for prediction are data that come from comparable student pools, are measured the same way, and are coded similarly across time.
By Pete Barwis, Ph.D., Senior Data Scientist, Capture Higher Ed