Malicious software (malware) represents one of the most pressing threats to the security of the Internet and its users, and the need for automated, learning-based approaches has rapidly become clear. Machine learning has long been acknowledged as a promising technique for identifying and classifying malware threats; unfortunately, such a powerful technique is often treated as a black-box panacea, and results, frequently obtained only in lab settings, are accepted without questioning their quality. Even worse, little effort is made to understand whether deployed machine learning algorithms decay in real-world settings, which further obscures the effectiveness of such approaches in a context where data distributions and classes change steadily. As a first step towards addressing these shortcomings, we propose conformal evaluator (CE), a framework to assess the quality of machine learning tasks. In particular, CE defines statistical metrics to build assessment analyses that measure, for a given algorithm under evaluation (AUE), how the data are statistically distributed according to the AUE, and the statistical confidence of the AUE's choices. When effort is initially spent on designing and evaluating machine learning in lab settings, such analyses offer the opportunity to detect overlapping classes and to understand how data points are distributed around decision regions and boundaries (the generalization problem). In addition, CE's analyses enable addressing concept drift in real-world settings, where poor statistical quality identifies decay in the machine learning tasks (e.g., a new behavior or malware family), which may suggest further model re-training or feature re-engineering.
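To make the two CE metrics concrete, the sketch below shows one common way such statistics are computed in the conformal-prediction literature: per-class p-values derived from a nonconformity score. The centroid-distance score, the function names, and the toy data are illustrative assumptions, not the exact definitions used by CE; "credibility" here is the p-value of the AUE's assigned class, and "confidence" is one minus the largest p-value among the other classes.

```python
import numpy as np

def nonconformity(x, reference):
    # Hypothetical nonconformity measure: distance to the class
    # centroid (higher means x conforms less to the class).
    return np.linalg.norm(x - reference.mean(axis=0))

def p_value(x, reference):
    # Fraction of reference points whose own leave-one-out
    # nonconformity is at least as large as that of x.
    score_x = nonconformity(x, reference)
    scores = [nonconformity(reference[i], np.delete(reference, i, axis=0))
              for i in range(len(reference))]
    return float(np.mean([s >= score_x for s in scores]))

def credibility_confidence(x, classes, assigned):
    # Credibility: statistical support for the class the AUE assigned.
    # Confidence: how clearly x is separated from every other class.
    pvals = {c: p_value(x, pts) for c, pts in classes.items()}
    cred = pvals[assigned]
    conf = 1.0 - max(p for c, p in pvals.items() if c != assigned)
    return cred, conf

# Toy example: two well-separated classes and a test point that the
# AUE labels as class "A".
classes = {
    "A": np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]),
    "B": np.array([[5.0, 5.0], [6.0, 5.0], [5.0, 6.0], [6.0, 6.0]]),
}
cred, conf = credibility_confidence(np.array([0.5, 0.5]), classes, "A")
```

Low credibility would flag a point the model places poorly within its predicted class (e.g., a drifting sample or a new malware family), while low confidence would flag overlap with another class's decision region.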