TY - JOUR
T1 - Task-specific information outperforms surveillance-style big data in predictive analytics
AU - Bjerre-Nielsen, Andreas
AU - Kassarnig, Valentin
AU - Lassen, David Dreyer
AU - Lehmann, Sune
PY - 2021/4/6
Y1 - 2021/4/6
N2 - Increasingly, human behavior can be monitored through the collection of data from digital devices revealing information on behaviors and locations. In the context of higher education, a growing number of schools and universities collect data on their students with the purpose of assessing or predicting behaviors and academic performance, and the COVID-19-induced move to online education dramatically increases what can be accumulated in this way, raising concerns about students' privacy. We focus on academic performance and ask whether predictive performance for a given dataset can be achieved with less privacy-invasive, but more task-specific, data. We draw on a unique dataset on a large student population containing both highly detailed measures of behavior and personality and high-quality third-party reported individual-level administrative data. We find that models estimated using the big behavioral data are indeed able to accurately predict academic performance out of sample. However, models using only low-dimensional and arguably less privacy-invasive administrative data perform considerably better and, importantly, do not improve when we add the high-resolution, privacy-invasive behavioral data. We argue that combining big behavioral data with "ground truth" administrative registry data can ideally allow the identification of privacy-preserving task-specific features that can be employed instead of current indiscriminate troves of behavioral data, with better privacy and better prediction resulting.
AB - Increasingly, human behavior can be monitored through the collection of data from digital devices revealing information on behaviors and locations. In the context of higher education, a growing number of schools and universities collect data on their students with the purpose of assessing or predicting behaviors and academic performance, and the COVID-19-induced move to online education dramatically increases what can be accumulated in this way, raising concerns about students' privacy. We focus on academic performance and ask whether predictive performance for a given dataset can be achieved with less privacy-invasive, but more task-specific, data. We draw on a unique dataset on a large student population containing both highly detailed measures of behavior and personality and high-quality third-party reported individual-level administrative data. We find that models estimated using the big behavioral data are indeed able to accurately predict academic performance out of sample. However, models using only low-dimensional and arguably less privacy-invasive administrative data perform considerably better and, importantly, do not improve when we add the high-resolution, privacy-invasive behavioral data. We argue that combining big behavioral data with "ground truth" administrative registry data can ideally allow the identification of privacy-preserving task-specific features that can be employed instead of current indiscriminate troves of behavioral data, with better privacy and better prediction resulting.
KW - Academic performance
KW - Big data
KW - Prediction
KW - Privacy
U2 - 10.1073/pnas.2020258118
DO - 10.1073/pnas.2020258118
M3 - Journal article
C2 - 33790010
AN - SCOPUS:85103745351
VL - 118
JO - Proceedings of the National Academy of Sciences of the United States of America
JF - Proceedings of the National Academy of Sciences of the United States of America
SN - 0027-8424
IS - 14
M1 - e2020258118
ER -