The sixth sense for cyber defense: Multimodal AI

Younghoo Lee

4 days ago

At the 2024 Virus Bulletin conference, Sophos Principal Data Scientist Younghoo Lee presented a paper on SophosAI’s research into ‘multimodal’ AI (a system that integrates diverse data types into a unified analytical framework). In his talk, Lee explored the team’s novel empirical research on applying multimodal AI to the detection of spam, phishing, and unsafe web content.

What is multimodal AI?

Multimodal AI represents a significant shift in artificial intelligence. Rather than traditional single-mode analysis, multimodal systems can process multiple data streams simultaneously, synthesizing data from multiple inputs.

In the context of cybersecurity – and particularly when it comes to classifying threats – this is a powerful capability. Rather than analyzing textual and visual content separately, a multimodal system can process both, and ‘understand’ the intricate relationships between them.

For example, in phishing detection, multimodal AI examines the linguistic patterns and writing style of the text alongside the visual fidelity of logos and branding elements, while also analyzing the semantic consistency between textual and visual components. This holistic approach means that the system can identify sophisticated attacks that might appear, to more traditional systems, to be legitimate. Moreover, multimodal AI can learn from, and adapt to, the correlations between different data types, developing a sense of how legitimate and malicious content differs across multiple dimensions.

Capabilities

In his research, Lee details some of the detection capabilities of multimodal AI systems:

Text analysis and natural language understanding

Analysis of linguistic patterns, writing style, and contextual cues to identify manipulation attempts
Detection of social engineering tactics such as manufactured urgency and unusual requests for sensitive information
Maintenance of an evolving database of phishing pretexts and narratives

Visual intelligence and brand verification

Comparison of logos, corporate styling, and visual layouts to legitimate templates
Detection of subtle differences in brand colors, fonts, and layouts
Examination of image metadata and digital signatures

Advanced URL and security analysis

Identification of deceptive techniques like typosquatting and homograph attacks
Analysis of relationships between displayed link text and actual destinations
Detection of attempts to obscure malicious URLs with styling and formatting tricks

Case study: A fake Costco email

The below image is a genuine phishing attempt, designed to trick recipients into thinking that they have won a prize from Costco. The email looks official, complete with imitated Costco logo and branding.

Figure 1: A screenshot of a phishing email, purportedly from Costco

Multimodal AI can identify several suspicious aspects of this email, including:

Phrases used to incite urgency and action
The sender’s email domain not matching legitimate domains
Inconsistencies with logos and images

As a result, the system assigns a high score to the email, flagging it as suspicious.

SophosAI also applied multimodal AI to NSFW (not safe for work) websites containing content relating to gambling, weapons, and more. As with the classification of phishing emails, detection leverages a number of capabilities, including the evaluation of keywords and phrases (agnostic of language), and analysis of imagery and graphics.

Experimental results

To test the efficacy of multimodal AI compared to traditional machine learning models such as Random Forest and XGBoost, SophosAI conducted a series of empirical experiments. The full results are available in Lee’s whitepaper and Virus Bulletin talk – but, briefly, traditional models performed well when detecting known threats, and struggled with new, unseen phishing emails. Their F1 scores (a measure that balances precision and recall to give an overall representation of accuracy between 0 and 1) were as low as 0.53 with unseen samples, reaching a high of 0.66. In contrast, multimodal AI (using GPT-4o) performed very well in detecting new phishing attempts, achieving F1 scores up to 0.97 even on unseen brands.

It was a similar story with NSFW content; traditional models achieved F1 scores of around 0.84-0.88, but models with multimodal AI embeddings achieved scores of up to 0.96.

Conclusion

The digital landscape is in a state of constant evolution, bringing with it an array of new threats – including the use of generative AI to deceive users. Phishing emails now meticulously, and routinely, mimic legitimate communications, while NSFW websites conceal harmful content behind deceptive visuals. While traditional cybersecurity methods remain important, they are increasingly inadequate on their own. Multimodal AI offers an innovative layer of defense that enhances our comprehension of content.

By effectively detecting sophisticated phishing emails and accurately classifying NSFW websites, multimodal AI not only protects users more effectively but also adapts to new threats. The experimental results Lee presents in his paper show significant improvements over traditional methods.

Going forward, incorporating multimodal AI into cybersecurity strategies is not just beneficial; it is crucial for ensuring the protection of our digital environment amid growing complexities and threats.

For further information, Lee’s full whitepaper is available here. A recording of his 2024 Virus Bulletin talk is available here (along with the slides).