By Jason Zhang
Criminals continue to leverage the features of Adobe’s PDF document format to engage in malware and phishing attacks, with no sign of a slowdown. Last year at Black Hat USA, I gave a presentation about PDF-based malware detection using machine learning. We discovered that the best AV engine could only catch fewer than 85% of not-previously-seen malicious PDF samples, but by combining a machine learning model with the AV engine, we were able to improve our detection to accurately detect more than 95% of novel malicious PDFs.
But at the volumes we see of malicious document file types, that 5% can be pretty significant. So since last year, we’ve continued the work on using machine learning to attack the mal-PDF and malicious Microsoft Office document (maldoc) problem. A few weeks ago, at Black Hat Asia 2019 in Singapore, I presented our latest machine learning (ML) research on malware detection using antivirus (AV) heuristic rules, in which most traditional AV solutions are based: Weapons of Office Destruction: Prevention with Machine Learning.
More specifically, the approach leverages the thousands of heuristic rules already used in traditional AV solutions. We believe that strong, domain knowledge-based heuristic rules in extant AV engines provide a powerful, as-yet untapped signal for training ML detection engines that is highly complementary to more typical statistical features that are used in the absence of strong domain knowledge. In essence, the heuristic rules trian and refine the machine learning algorithm so it can do its job better.
Testing maldoc prevention
We evaluated the effectiveness of our proposed approach against one of the most popular malware delivery vectors – weaponized Microsoft Office documents. The broad-brush popularity of Office documents led them to become one of the main malware delivery mechnisms via email attachments or web downloads.
According to a previous Sophos report, over 80% of document-based malware were delivered via either Word or Excel files. Even though these attacks are not new in nature, the increasing volume and complexity of the attacks pose a huge challenge to traditional signature-based AV products.
In the demonstration, we used a comprehensive list of more than 3000 heuristic rules to train the machine learning model. We collected a sample set of about 100,000 Office documents — a mix of malicious and benign files — for testing and to further refine the model.
Our results exhibited promising performance and detection rates. Our proposed approach gets us much closer to the accuracy rates we were trying to achieve, which we were able to measure and show in the Receiver Operating Characteristic (ROC) curve shown in the chart above.
The ROC chart compares true- to false-positives, and shows how the model was able to come extremely close to a perfect “true positive” rate, averaging at greater than 99.5% for text and document file types, and more than 98.5% of compressed .zip archives.`
the proposed approach can obtain ROC curves with AUC over 0.99 for the proposed machine learning model
These test results, which used recent, real-world malware samples, exhibit promising performance: Applying machine learning to the problem of coming to a verdict on an unknown document significantly outperformed ten well-known competitors’ commercial AV scanners, with a much higher true positive rate (TPR).
That’s a big win for machine learning in its fight against malware.