Transcend: Detecting Concept Drift in Malware Classification Models

Abstract

Building machine learning models of malware behavior is widely accepted as a panacea for effective malware classification. A crucial requirement for building sustainable learning models, though, is to train on a wide variety of malware samples. Unfortunately, malware evolves rapidly, and it thus becomes hard, if not impossible, for learning models to generalize to future, previously unseen behaviors. Consequently, most malware classifiers become unsustainable in the long run, rapidly growing antiquated as malware continues to evolve. In this work, we propose Transcend, a framework to identify aging classification models in vivo during deployment, well before the machine learning model's performance starts to degrade. This is a significant departure from conventional approaches that retrain aging models retrospectively, once poor performance has been observed. Our approach statistically compares samples seen during deployment with those used to train the model, thereby building metrics of prediction quality. We show how Transcend identifies concept drift in two separate case studies, on Android and Windows malware, raising a red flag before the model starts making consistently poor decisions due to out-of-date training.
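The statistical comparison described above is rooted in conformal evaluation: each deployment-time sample is assigned a p-value measuring how well it conforms to the training data under a chosen nonconformity score, and consistently low p-values signal drift. The following is a minimal sketch of that idea, using a hypothetical distance-to-centroid nonconformity measure for illustration (the paper's nonconformity measures are tied to the underlying classifier, not this particular choice):

```python
import numpy as np

def p_value(test_score, calibration_scores):
    """Fraction of calibration nonconformity scores at least as large as the
    test sample's score. Low p-values flag samples unlike the training data,
    i.e. candidates for concept drift."""
    calibration_scores = np.asarray(calibration_scores)
    return float(np.mean(calibration_scores >= test_score))

# Hypothetical setup: nonconformity = distance to the training centroid.
rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(500, 2))         # in-distribution samples
centroid = train.mean(axis=0)
cal_scores = np.linalg.norm(train - centroid, axis=1)

familiar = np.array([0.1, -0.2])                    # near the training data
drifted = np.array([6.0, 6.0])                      # far from the training data
print(p_value(np.linalg.norm(familiar - centroid), cal_scores))  # high p-value
print(p_value(np.linalg.norm(drifted - centroid), cal_scores))   # near-zero p-value
```

In deployment, a threshold on such p-values (or on derived per-class statistics) decides whether to trust a prediction or to quarantine the sample for retraining, which is the "red flag" the abstract refers to.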

Publication
In USENIX Security Symposium. Vancouver, BC, Canada.

Bibtex

@inproceedings{203684,
author = {Roberto Jordaney and Kumar Sharad and Santanu K. Dash and Zhi Wang and Davide Papini and Ilia Nouretdinov and Lorenzo Cavallaro},
title = {Transcend: Detecting Concept Drift in Malware Classification Models},
booktitle = {26th {USENIX} Security Symposium ({USENIX} Security 17)},
year = {2017},
isbn = {978-1-931971-40-9},
address = {Vancouver, BC},
pages = {625--642},
url = {https://www.usenix.org/conference/usenixsecurity17/technical-sessions/presentation/jordaney},
publisher = {{USENIX} Association},
}