The following is by Sophos Chief Data Scientist Joshua Saxe, who recently came to us with the acquisition of Invincea:
Sophos acquired security machine learning company Invincea this year, and we are integrating its capabilities into our products. But many security vendors – traditional and next generation – already employ machine learning in their products. What makes our approach different and, more importantly, how much better does it perform?
I’m going to show how and why Sophos uses deep neural networks – deep learning – to block threats, contrasting this with the older, conventional machine learning approaches used in other products. I’ll also look at how our deep learning approach leads to superior detection rates and a lighter resource footprint, and why we are strongly positioned to maintain our technological leadership in the security machine learning space.
Deep learning vs. conventional machine learning
As mentioned, we are uniquely focused on a deep learning approach to security data science. In contrast, most security companies, regardless of whether they’re considered traditional or next-gen, tend to focus on decision tree algorithms to detect malware and malicious behavior. Machine learning decision trees use well-understood methods developed in the 1990s for detecting cyber attacks. These are relatively easy to use and tune, and provide adequate results.
This is an example of how a decision tree created by a machine learning algorithm might detect whether a binary is malicious. The decision tree plays a “20 questions” game to detect malware. This approach often works reasonably well, but it is fundamentally less capable than deep learning. Why?
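To make the “20 questions” analogy concrete, here is a toy sketch in Python of the kind of tree such an algorithm might encode. The features, thresholds and question order here are invented for illustration (borrowing the “seen on 100 network workstations” feature mentioned below); they are not actual Sophos detection logic:

```python
# A toy decision tree over hand-picked binary-file features.
# Every feature and threshold below is illustrative, not real detection logic.

def classify(sample):
    """Walk a fixed question tree and return 'malicious' or 'benign'."""
    # Question 1: has the file been seen widely across the network?
    if sample["workstations_seen_on"] >= 100:
        return "benign"  # widely deployed files are usually legitimate
    # Question 2: does it import functions commonly used for code injection?
    if sample["imports_code_injection_apis"]:
        return "malicious"
    # Question 3: does it look packed (high-entropy sections)?
    if sample["section_entropy"] > 7.0:
        return "malicious"
    return "benign"

# A rare binary that imports injection APIs ends up in the malicious leaf.
verdict = classify({"workstations_seen_on": 3,
                    "imports_code_injection_apis": True,
                    "section_entropy": 6.2})
```

A real machine-learned decision tree differs only in that the algorithm picks the questions and thresholds from training data; the features themselves are still chosen by humans, which is the key limitation discussed next.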
Deep learning automates the identification process
While decision trees and other conventional machine learning approaches do “learn” from data, they require that the features of binaries, log files and network flows be manually selected before training and detection can happen.
For example, in the decision tree image above, a human selected the features such as “seen on 100 network workstations”, and the learning algorithm merely decided which question to ask at each step to reach a detection decision.
This manual selection process is where sub-optimal results creep in: the human is relying on intuition to judge which features are important.
In contrast, deep learning automatically identifies optimal features using learning methods inspired by the brain. Because of this difference, deep neural networks have overtaken conventional machine learning across the technology landscape, which is the major reason why machine learning heavyweights like Google, Facebook, Microsoft, Baidu, Amazon and others are abandoning older approaches and investing heavily in this technique.
To understand how deep learning works let’s start with an example from computer vision:
A deep neural network can learn complex features and identify objects in images after being trained only on raw image pixels.
The figure shows how a neural network is able to detect lines, then car parts and then car makes and models, automatically and organically through its internal learning process. Interestingly, neuroscientists and computer scientists have shown that this process of hierarchical pattern recognition occurs in a strikingly similar way in actual animal brains.
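The same idea, a network inventing its own intermediate features, can be demonstrated at toy scale. The from-scratch Python sketch below trains a tiny two-layer network on the XOR function, which no single linear rule (or single decision-tree question) can express: the hidden units have to discover useful intermediate features on their own during training. This is a self-contained illustration, not Sophos code:

```python
import math
import random

random.seed(0)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# XOR: the label is 1 exactly when the two inputs differ.
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]

# 2 inputs -> 2 hidden units -> 1 output, all weights random at the start.
w1 = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(2)]
b1 = [0.0, 0.0]
w2 = [random.uniform(-1, 1) for _ in range(2)]
b2 = 0.0

def forward(x):
    h = [sigmoid(sum(w1[j][i] * x[i] for i in range(2)) + b1[j])
         for j in range(2)]
    y = sigmoid(sum(w2[j] * h[j] for j in range(2)) + b2)
    return h, y

def loss():
    return sum((forward(x)[1] - t) ** 2 for x, t in data)

initial = loss()
lr = 0.5
for _ in range(3000):                      # plain gradient descent
    for x, t in data:
        h, y = forward(x)
        dy = 2 * (y - t) * y * (1 - y)     # gradient at the output unit
        for j in range(2):
            dh = dy * w2[j] * h[j] * (1 - h[j])  # backprop into hidden unit j
            w2[j] -= lr * dy * h[j]
            for i in range(2):
                w1[j][i] -= lr * dh * x[i]
            b1[j] -= lr * dh
        b2 -= lr * dy
final = loss()
```

Nothing told the hidden units what to compute; the training loop alone shapes them into the intermediate features needed to separate XOR, which is the principle that scales up to lines, car parts and car models in the figure above.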
Now contrast the way deep neural networks learn to recognize features organically to the old way computer vision systems recognized objects by looking at the figure below:
In contrast to deep neural networks, the object detector in the figure does feature detection as the result of manual tinkering by human engineers.
Don’t worry about understanding the details, though. Just understand that every box in the image represents a manually designed subsystem, which leads to terrible system complexity and, most importantly, to a less accurate overall machine learning system. As it turns out, neural networks are better at learning to recognize optimal features automatically than human engineers are at designing them by hand. The same principle holds for determining whether software is malicious or a network intrusion is occurring: these events are characterized by large amounts of data with large feature sets, and deep learning excels at organically deriving the features that distinguish malicious from benign in cybersecurity.
Deep learning’s ability to scale
Deep learning systems have another advantage over conventional approaches: whereas conventional approaches struggle with scale, deep learning systems learn easily from Internet-scale volumes of data using a process called stochastic gradient descent.
This means that we can train our deep learning systems on hundreds of millions of examples of malicious and benign documents, executables files, URLs and HTML pages continually harvested by SophosLabs. The trained system, when deployed, captures the knowledge intrinsic to the training set. It is able to bring this knowledge of the threat landscape into a deployed model that accurately detects new malware.
The conventional decision-tree based machine learning methods popular in the security industry simply cannot scale like deep learning methods.
To understand why, consider that to “learn”, decision trees need to compute over all or most of their training data at every learning step (and millions of such steps are required). In contrast, deep neural networks inspect only a fixed-size batch of training samples at each learning step (usually a few hundred samples or fewer). This means that while decision trees’ training memory requirements are massive and scale proportionally with the size of the available training data, a deep neural network’s memory requirements remain constant as the available training data grows.
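A minimal Python sketch of this fixed-memory, streaming style of training, using a logistic-regression-style model updated by stochastic gradient descent as a stand-in for a deep network. The data stream and its toy labeling rule are invented for illustration; the point is that only one small batch is ever held in memory, no matter how long the stream is:

```python
import itertools
import math
import random

random.seed(1)

def example_stream(n):
    """Simulate streaming n labeled examples from disk or the network.
    Toy rule: label is 1 when the feature values sum to a positive number."""
    for _ in range(n):
        x = [random.uniform(-1, 1) for _ in range(3)]
        yield x, 1 if sum(x) > 0 else 0

def train_sgd(stream, batch_size=8, lr=0.5):
    w = [0.0, 0.0, 0.0]
    b = 0.0
    while True:
        # Only `batch_size` examples are resident in memory at once,
        # regardless of the total size of the training stream.
        batch = list(itertools.islice(stream, batch_size))
        if not batch:
            return w, b
        for x, t in batch:
            logit = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-logit))
            g = p - t  # gradient of the log-loss w.r.t. the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g

w, b = train_sgd(example_stream(10_000))
```

Because the loop reads from a generator, the same code trains on ten thousand or ten million examples with identical memory use; only wall-clock time grows, which is exactly the scaling property described above.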
This dramatic difference in scalability gives deep neural networks a huge advantage. To demonstrate the payoff of this scale advantage in terms of detection accuracy, the chart above shows the relationship between the amount of training data we show a malware detection machine learning system and its resulting accuracy at detecting “zero-day” malware samples. The horizontal axis represents the amount of training data we show the system on a given experimental training run and the vertical axis shows its resulting accuracy.
It’s clear from the chart that the system benefits greatly from seeing more training data and, as shown by the blue trend line, will continue to benefit as it’s shown millions and tens of millions of examples of malicious and benign artifacts.
Deep learning also benefits from the large volumes of security-related data that have traditionally been the Achilles’ heel of Security Operations Centers staffed by humans. The more data produced, the better the deep learning algorithm performs. The problem with most conventional machine learning approaches is that while they would benefit from tens of millions of training examples, it is extremely difficult, if not impossible, to scale them to this volume of data.
Deep learning is the rapidly advancing bleeding edge
And the cherry on top of Sophos’ deep learning approach is that it is future-proofed. By investing in deep learning research and development, we’ve hitched our wagon to a star: a research community that includes scientists and engineers at major universities as well as at Google, Facebook, Amazon and Microsoft. This means that innovations in the neural network field can be translated and adapted into improved accuracy and performance in our intrusion and malware detection systems.
Not content to be mere practitioners of other people’s innovations, we invest and innovate in deep neural network research ourselves, and we contribute our own research to the scientific community in the form of papers and conference presentations.
The deep learning technical trend we are part of becomes clear in the Google Trends plot above, which compares the web popularity of a staple conventional machine learning technique with that of deep learning. The chart demonstrates what everyone in the machine learning community knows: deep learning has long since overtaken conventional machine learning as the dominant technical trend and is leaving it in the dust.
Just as choosing a hardware chip as the base platform for a device has long-run ramifications, so does choosing your machine learning technology. Deep learning gives us the performance and scalability needed to solve today’s most challenging security problems.
Where the rubber meets the road
When we measure quantitatively, we find that deep learning, intelligently applied to security problems, yields higher detection coverage, fewer false positives and a smaller on-device footprint than other approaches.
To get a feel for the improvement, consider the chart below, which compares three machine learning approaches developed in our research group at Sophos, each of which detects “zero day” malicious URLs that don’t appear on blacklists.
The horizontal axis of the chart shows the false positive rate (the percentage of benign URLs incorrectly classified as malicious) of each detector as we adjust its sensitivity (how aggressively it flags URLs as malicious), and the vertical axis shows the detection rate (the percentage of malicious URLs correctly classified as malicious) at that sensitivity.
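Concretely, each point on such a curve comes from fixing one score threshold and measuring both rates. A short Python sketch of that computation, with made-up scores and labels rather than real detector output:

```python
def roc_point(scores, labels, threshold):
    """Detection rate (TPR) and false positive rate at one score threshold.
    labels: 1 = malicious, 0 = benign; higher score = more suspicious."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    pos = sum(labels)
    neg = len(labels) - pos
    return tp / pos, fp / neg

# Toy scores for four malicious (label 1) and four benign (label 0) URLs:
scores = [0.95, 0.80, 0.60, 0.40, 0.55, 0.30, 0.20, 0.10]
labels = [1,    1,    1,    1,    0,    0,    0,    0]

tpr, fpr = roc_point(scores, labels, threshold=0.5)
# At this threshold, three of four malicious URLs are caught
# and one of four benign URLs is wrongly flagged.
```

Sweeping the threshold from high to low traces out the whole curve: each detector in the chart is one such sweep.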
The first approach, given by the red line, uses a decision tree approach called “random forest” in conjunction with state-of-the-art features invented by a group of academic computer scientists.
The second approach, shown by the blue line, uses a deep neural network that uses these same manually extracted features. Finally, the green line shows Sophos’ prototype URL detector, an end-to-end deep learning system that automatically identifies optimal URL features from raw URL character strings.
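To give a feel for what “raw URL character strings” means as model input: the URL is turned into a fixed-length sequence of character codes, with no hand-crafted features at all, and the network’s convolutional layers then learn useful substrings themselves (the approach described in the eXpose paper referenced at the end of this post). Below is a sketch of just that encoding step; the 200-character window and the printable-ASCII alphabet are illustrative choices, not the exact parameters of our system:

```python
MAX_LEN = 200  # illustrative fixed input width; shorter URLs are zero-padded

def encode_url(url):
    """Map a raw URL to a fixed-length sequence of integer character codes.
    Code 0 is reserved for padding; printable ASCII maps to 1..95."""
    codes = []
    for ch in url[:MAX_LEN].lower():
        o = ord(ch)
        # Characters outside printable ASCII fall back to the padding id.
        codes.append(o - 31 if 32 <= o <= 126 else 0)
    codes += [0] * (MAX_LEN - len(codes))  # right-pad to a fixed length
    return codes

seq = encode_url("http://example.com/suspicious?id=1")
```

An embedding layer then turns each integer into a learned vector, and convolutions over those vectors play the role that hand-built URL features play in the other two approaches.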
Here’s the punchline: at a false positive rate of one false positive for every million URLs, a deep learning approach achieves a 72% detection rate for previously unseen malicious URLs that don’t appear on blacklists. The conventional decision tree approach can reach this same accuracy, but only if you’re willing to accept a false positive rate of one false positive per one thousand benign URLs – a thousand-fold increase in false positives.
Accuracy is far from the only payoff here though. The deep learning approach has a dramatically lower resource impact: it costs only about 10 megabytes to store the deep neural network on an endpoint, whereas the decision tree approach is basically impossible to deploy on low-resource business endpoints, requiring about 5 gigabytes of disk space. And while it takes about 20 milliseconds to scan a URL using the deep learning approach, it takes 250 milliseconds to scan it with the decision tree approach. These differences in detection times matter when we’re scanning millions of URLs a day on a firewall or doing an enterprise-wide scan of executable binaries.
At Sophos we’ve taken a unique approach to our security machine learning capabilities: we’ve invested heavily in deep neural network technology over more prevalent methods that, while still dominant in the security industry, are being rapidly abandoned by the machine learning computer science community. To recap the advantages that come with it:
- Deep learning automatically identifies what’s important in raw data and in so doing yields better accuracy,
- Deep learning is “big data native”, scaling easily such that it can “memorize” the broad threat landscape and generalize from it to novel threats;
- Deep learning is the dominant technology trend in artificial intelligence, meaning that Sophos’ deep learning strategy benefits from innovation from the major industry players; and
- Deep learning yields better detection rates, lower false positives and a dramatically lower footprint than conventional machine learning detection systems.
Sophos’ data science team will continue to innovate in the security deep learning space, and future blog posts will explore specific Sophos deep learning detection technologies.
Saxe, Joshua and Konstantin Berlin. “Deep neural network based malware detection using two dimensional binary program features.” Malicious and Unwanted Software (MALWARE), 2015 10th International Conference on. IEEE, 2015.
Saxe, Joshua and Konstantin Berlin. “eXpose: A Character-Level Convolutional Neural Network with Embeddings For Detecting Malicious URLs, File Paths and Registry Keys.” arXiv preprint arXiv:1702.08568 (2017).
Berlin, Konstantin and Joshua Saxe. “Improving Zero-Day Malware Testing Methodology Using Statistically Significant Time-Lagged Test Samples.” arXiv preprint arXiv:1608.00669 (2016).