Naked Security

Garbage in, garbage out: a cautionary tale about machine learning

Security based on machine learning is only as good as the data it feeds on, as Sophos data scientist Hillary Sanders explained at Black Hat 2017

Here’s the thing about machine learning: use the right datasets and it’ll help you root out malware with great accuracy and efficiency. But the models are what they eat. Feed them a diet of questionable, biased data and they’ll produce garbage.

That’s the message Sophos data scientist Hillary Sanders delivered at Black Hat USA 2017 on Wednesday in a talk called “Garbage in, Garbage Out: How Purportedly Great Machine Learning Models Can Be Screwed Up By Bad Data”.

The machine learning movement

A lot of security experts tout machine learning as the next step in anti-malware technology. Indeed, Sophos’ acquisition of Invincea earlier this year was designed to bring machine learning into the fold.

Machine learning is considered a more efficient way to stop malware in its tracks before it becomes a problem for the end user. Some of the high points:

  • Deep learning neural network models lead to better detection and lower false positives.
  • It roots out code that shares characteristics with known malware, similarities that often escape human analysts.
  • Behavioral-based detections provide extensive coverage of the tactics and techniques employed by advanced adversaries.

But it would be dishonest to suggest that machine learning is the silver bullet – the security remedy that can do no wrong. As Sanders noted, no technology is perfect and its creators should always analyze weaknesses and come up with bigger and better models.

Biased data

In her talk, Sanders explained the problem this way:

  1. Model accuracy claimed by security machine learning researchers is always wrong.
  2. It’s almost always biased in an overly optimistic direction.
  3. Estimating the severity of that bias is important, and will help ensure your model isn’t garbage.

She said:

Standard model validation results can be misleading. We want to know how our model is going to actually do in the wild, so we can make sure it doesn’t fail horribly. This is impossible. But we can still estimate. If we have access to an unbiased sample of deployment-like data, we can simulate our model’s deployment errors via time decay analysis. However, if we don’t have access to deployment-like data, then it’s impossible to accurately estimate how well our model will do on deployment, because we don’t have the right data to test it on.
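
In practice, a time decay analysis of that sort can be prototyped in a few lines. The snippet below is a minimal, hypothetical sketch: it assumes each labelled sample carries a first-seen timestamp, and it uses a simple scikit-learn classifier as a stand-in for the deep learning models Sanders describes. None of the names or features come from Sophos’ pipeline.

```python
# Hypothetical time decay analysis: train on the oldest samples, then score
# successively newer buckets to see how accuracy erodes over time.
# Column names, features and the stand-in classifier are illustrative
# assumptions only; this is not Sophos' pipeline.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def time_decay_analysis(df, feature_cols, label_col="is_malicious",
                        time_col="first_seen", n_buckets=4):
    df = df.sort_values(time_col)                     # oldest samples first
    cutoff = len(df) // 2                             # older half = training history
    train, future = df.iloc[:cutoff], df.iloc[cutoff:]

    model = LogisticRegression(max_iter=1000)         # stand-in for a deep model
    model.fit(train[feature_cols], train[label_col])

    accuracies = []
    for idx in np.array_split(np.arange(len(future)), n_buckets):
        bucket = future.iloc[idx]                     # one chronological bucket
        preds = model.predict(bucket[feature_cols])
        accuracies.append(accuracy_score(bucket[label_col], preds))
    return accuracies  # accuracy per bucket; a downward drift is the "decay"
```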

The next best option, she said, is to test how sensitive your models are to new datasets they weren’t trained on, and to pick training datasets and model configurations that perform consistently well across a variety of test sets, not just test sets drawn from the same parent source as the training data.

That helps give us a sort of very rough ‘confidence interval’ surrounding deployment accuracy, and also improves the likelihood that our model won’t do poorly on deployment.
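
That sensitivity check is straightforward to prototype as well. The sketch below is hypothetical and reuses the same assumptions as before: labelled feeds from several independent sources, pre-computed features, and a simple stand-in classifier in place of a production deep learning model.

```python
# Hypothetical cross-source sensitivity check: train on each source in turn,
# then score against held-out data from every source. Rows of the resulting
# matrix are training sources, columns are test sources.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def cross_source_matrix(sources, feature_cols, label_col="is_malicious"):
    # sources: dict mapping a source name to its labelled DataFrame.
    # Hold out a slice of every source so even the same-source cell
    # is scored on samples the model never saw during training.
    splits = {name: train_test_split(df, test_size=0.2, random_state=0)
              for name, df in sources.items()}

    names = list(sources)
    matrix = pd.DataFrame(index=names, columns=names, dtype=float)
    for train_name, (train_df, _) in splits.items():
        model = LogisticRegression(max_iter=1000)     # stand-in classifier
        model.fit(train_df[feature_cols], train_df[label_col])
        for test_name, (_, test_df) in splits.items():
            preds = model.predict(test_df[feature_cols])
            matrix.loc[train_name, test_name] = accuracy_score(
                test_df[label_col], preds)
    return matrix
```

The diagonal cells mirror the familiar, optimistic validation numbers; the off-diagonal cells give a rough feel for how the model might fare on data it has never seen, which is much closer to what deployment looks like.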

Minimize the probability of failing spectacularly

Since machine learning in security is still relatively new, there’s no bullet-proof answer to how to root out the garbage. But Sanders suggested some starting points.

In order to select the best possible training set and model configuration, you must map the limitations of your fitted model so you have a more accurate starting point, she said.

To show what that measurement looks like in practice, Sanders walked Black Hat attendees through sensitivity results from the same deep learning model, designed to detect malicious URLs, trained and tested across three different sources of URL data.

By simulating the errors, we can better develop training datasets and model configurations that are most likely to perform reliably well on deployment, Sanders said.
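
One simple way to act on results like those, continuing the hypothetical sketch above, is to favour the training source or model configuration whose worst cross-source accuracy is highest, rather than the one with the most flattering same-source score.

```python
# Continuing the hypothetical cross_source_matrix sketch: pick the training
# source (or model configuration) whose worst cross-source accuracy is
# highest, i.e. the choice that degrades least on foreign data.
def most_robust_choice(matrix):
    worst_case = matrix.min(axis=1)   # worst accuracy achieved by each training source
    return worst_case.idxmax()        # training source with the best worst case
```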