
Machine learning for malware: what could possibly go wrong?


Security vendors, Sophos included, continue to tout the benefits of machine learning-based malware analysis. But, as we’ve written in recent weeks, the technology must be managed properly to be effective: it can be abused by bad actors and undermined by poorly labeled training data.

Sophos data scientists spoke about the challenges and remedies at length during Black Hat USA 2017 and BSidesLV, and have continued to do so. The latest example is an article by data scientist Hillary Sanders about the importance of proper labeling.

Sometimes, says Sanders, the labels companies feed into their models are wrong.

Dirty labels, bad results

As she explains, supervised machine learning works by training a model on samples whose labels are already known, so that it can predict the labels of files it has never seen before.
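To make that concrete, here is a minimal sketch of such a supervised training loop. It is purely illustrative: the scikit-learn classifier, the synthetic feature vectors, and the 0/1 benign/malicious labels are assumptions for demonstration, not the pipeline Sophos actually uses.

```python
# Minimal sketch of supervised malware classification (illustrative only:
# the features, model choice, and data are hypothetical, not Sophos' pipeline).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Pretend each file is summarized as a fixed-length feature vector
# (byte histograms, imported-API counts, etc.) with a known label:
# 0 = benign, 1 = malicious.
X = rng.normal(size=(1000, 20))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # toy "ground truth" labels

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The model learns from examples whose labels it is told are correct...
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# ...and is then asked to predict labels for files it has never seen.
print("accuracy on unseen files:", model.score(X_test, y_test))
```

The “supervised” part is simply that the labels in `y` are treated as ground truth, which is exactly why wrong labels are so damaging.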

The problem, she says, arises when researchers feed their models labels that aren’t correct:

Perhaps it’s a new type of malware that our systems have never seen before and hasn’t been flagged properly in our training data. Perhaps it’s a file that the entire security community has cumulatively mislabeled through a snowball effect of copying each other’s classifications. The concern is that our model will fit to this slightly mislabeled data and we’ll end up with a model that predicts incorrect labels.

To top it off, she adds, researchers won’t be able to estimate their errors properly, because they’ll be evaluating their model against incorrect labels. How serious a concern this is depends on a couple of factors.
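The effect of evaluating against bad labels is easy to simulate. The sketch below, a hypothetical illustration rather than code from Sanders’ article, flips a fraction of the test labels and compares the accuracy a researcher would measure with the accuracy against the true labels.

```python
# Sketch: how label noise in the evaluation set distorts measured error.
# Hypothetical illustration; the data and the 15% noise rate are invented.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

X = rng.normal(size=(2000, 10))
y_true = (X[:, 0] > 0).astype(int)              # true labels

X_train, X_test, y_train, y_test = train_test_split(X, y_true, random_state=1)
model = LogisticRegression().fit(X_train, y_train)

# Corrupt 15% of the test labels, as if those files had been
# cumulatively mislabeled by the wider community.
y_noisy = y_test.copy()
flip = rng.random(len(y_noisy)) < 0.15
y_noisy[flip] = 1 - y_noisy[flip]

print("accuracy vs. true labels: ", model.score(X_test, y_test))
print("accuracy vs. noisy labels:", model.score(X_test, y_noisy))
# The second number is what the researcher actually observes,
# so the model's real error rate is misjudged.
```

The point is not the size of the gap, which depends on how the mislabeling falls, but that the measured number is no longer a trustworthy estimate of the model’s real error.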

In the article, Sanders uses a series of plots to show examples of how things can go wrong; those charts are in the “problem with labels” section.

Getting it right

After guiding readers through the examples of what can go wrong, Sanders outlines what her team does to get it right. To minimize the amount and effects of bad labels in their data, the team…

She adds:

What’s really cool is that very often – our labels were wrong, and the model was right. So our models can actually act as a data-cleaning tool.
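That observation also suggests a practical recipe: train a model on the labels you have, then flag the samples where it confidently disagrees with the recorded label and send them back for review. The sketch below is an assumption-laden illustration of that idea (invented data, noise rate, and confidence thresholds), not Sophos’ actual cleaning process.

```python
# Sketch: using a trained model to surface suspected label errors.
# Hypothetical illustration; data, noise rate, and thresholds are invented.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(2)

X = rng.normal(size=(1500, 10))
y_true = (X[:, 0] + X[:, 1] > 0).astype(int)

# Simulate a dirty label set: 5% of the recorded labels are wrong.
y_recorded = y_true.copy()
bad = rng.random(len(y_recorded)) < 0.05
y_recorded[bad] = 1 - y_recorded[bad]

# Out-of-fold probabilities: each sample is scored by a model that never
# saw that sample's (possibly wrong) label during training.
clf = GradientBoostingClassifier(random_state=2)
p_malicious = cross_val_predict(clf, X, y_recorded, cv=5,
                                method="predict_proba")[:, 1]

# Flag samples where the model confidently disagrees with the recorded
# label; these become candidates for human re-review.
suspect = (((y_recorded == 0) & (p_malicious > 0.95)) |
           ((y_recorded == 1) & (p_malicious < 0.05)))

print("samples flagged for re-labeling:", suspect.sum())
print("of those, actually mislabeled:  ", (suspect & bad).sum())
```

Using out-of-fold predictions keeps the model from simply memorizing a bad label and then agreeing with it.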