Security vendors – Sophos included – continue touting the benefits of machine learning-based malware analysis. But, as we’ve written in recent weeks, it must be managed properly to be effective. The technology can be abused by bad actors and corrupted by poor data entry.
Sophos data scientists spoke about the challenges and remedies at length during Black Hat USA 2017 and BSidesLV, and have continued to do so. The latest example is an article by data scientist Hillary Sanders about the importance of proper labeling.
Sometimes, says Sanders, the labels companies feed into their models are wrong.
Dirty labels, bad results
As she put it, supervised machine learning works like this (a simplified code sketch follows the list):
- Researchers give a model (a function) some data (like some HTML files) and a bunch of associated desired output labels (like 0 and 1 to denote benign and malicious).
- The model looks at the HTML files and their associated labels, then adjusts itself to fit the data so that it can correctly guess the output label (0 or 1) from the input data (the HTML files) alone.
- Researchers define the ground truth for the model by telling it that “this is the perfectly accurate state of the world, now learn from it so you can accurately guess labels from new data”.
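To make that workflow concrete, here is a minimal, self-contained sketch of the same loop using scikit-learn. The random "files" and features are stand-ins for illustration; this is not the pipeline Sanders describes.

```python
# Minimal supervised-learning sketch (illustrative stand-in, not Sophos's pipeline).
# Inputs: feature vectors extracted from files; labels: 0 = benign, 1 = malicious.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))          # input data (e.g. features derived from HTML files)
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # desired output labels (the "ground truth")

# The model adjusts itself to fit the labelled training data...
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_train, y_train)

# ...and is then asked to guess labels for files it has never seen.
print("held-out accuracy:", model.score(X_test, y_test))
```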
The problem, she says, is when researchers give their models labels that aren’t correct:
Perhaps it’s a new type of malware that our systems have never seen before and hasn’t been flagged properly in our training data. Perhaps it’s a file that the entire security community has cumulatively mislabeled through a snowball effect of copying each other’s classifications. The concern is that our model will fit to this slightly mislabeled data and we’ll end up with a model that predicts incorrect labels.
To top it off, she adds, researchers won’t be able to estimate their errors properly, because they’ll be evaluating the model against those same incorrect labels. How serious this concern is depends on a few factors (a rough simulation follows the list):
- The number of incorrect labels in a dataset
- The complexity of the model
- Whether the incorrect labels are randomly distributed across the data or highly clustered
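As a rough illustration of the first and third points (with made-up numbers, not data from the article), you can flip a random fraction of training labels and watch accuracy against correctly labelled test data fall away:

```python
# Rough illustration (hypothetical data): flip a growing share of training labels
# and measure how a simple model does against correctly labelled test data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 20))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

for noise in (0.0, 0.1, 0.3):
    flipped = y_train.copy()
    bad = rng.choice(len(flipped), size=int(noise * len(flipped)), replace=False)
    flipped[bad] = 1 - flipped[bad]  # randomly distributed incorrect labels
    model = LogisticRegression().fit(X_train, flipped)
    # Scoring against noisy test labels would also misestimate the error,
    # which is the evaluation problem Sanders points out.
    print(f"{noise:.0%} of labels flipped -> clean-label test accuracy: "
          f"{model.score(X_test, y_test):.3f}")
```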
In the article, Sanders uses plots to show what can go wrong; they appear in its “problem with labels” section.
Getting it right
After guiding readers through the examples of what can go wrong, Sanders outlines what her team does to get it right. To minimize the number and impact of bad labels in their data, the team…
- Only uses malware samples that have been verified as inherently malicious through sandbox analysis and confirmed by multiple vendors.
- Tries not to overtrain, and thus overfit, their models. “The goal is to be able to detect never-before-seen malware samples, by looking at similarities between new files and old files, rather than just mimic existing lists of known malware,” she says.
- Attempts to improve their labels by analyzing false positives and false negatives found during model testing. In other words, she explains, “we take a look at the files that we think our model misclassified (like the red circled file in the plot below), and make sure it actually misclassified them”.
She adds:
What’s really cool is that very often – our labels were wrong, and the model was right. So our models can actually act as a data-cleaning tool.
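As a minimal sketch of that idea (assuming you already have a trained classifier and a labelled dataset, as in the snippets above; the confidence threshold and function name are placeholders, not Sophos tooling), you can queue up files where the model confidently disagrees with the stored label for re-verification:

```python
# Sketch of using a trained model as a label-cleaning aid (illustrative only).
import numpy as np

def suspicious_labels(model, X, y, threshold=0.9):
    """Indices where the model is confident and contradicts the stored label."""
    prob_malicious = model.predict_proba(X)[:, 1]
    confident_malicious = (prob_malicious >= threshold) & (y == 0)
    confident_benign = (prob_malicious <= 1 - threshold) & (y == 1)
    return np.where(confident_malicious | confident_benign)[0]

# Usage, with the model and data from the earlier sketches:
# for i in suspicious_labels(model, X_test, y_test):
#     ...re-check file i (sandbox analysis, multi-vendor confirmation) before trusting its label.
```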