Sophos AI to present on how to defang malicious AI models at Black Hat Europe

Sean Gallagher

20 hours ago

At this week’s Black Hat Europe in London, SophosAI’s Senior Data Scientist Tamás Vörös will deliver a 40-minute presentation entitled “LLMbotomy: Shutting the Trojan Backdoors” at 1:30 PM. Vörös’ talk, an expansion of a presentation he gave at the recent CAMLIS conference, delves into the risks posed by Trojanized Large Language Models (LLMs) and how those using potentially weaponized LLMs can mitigate them.

Existing research on LLMs has primarily focused on external threats, such as “prompt injection” attacks that could be used to extract data embedded in previously submitted instructions from other users, and other input-based attacks on the LLMs themselves. SophosAI’s research, presented by Vörös, examined embedded threats, such as Trojan backdoors that are inserted into LLMs during training and triggered by specific inputs to cause harmful behaviors. These embedded threats could be introduced deliberately by someone involved in the model’s training, or inadvertently through data poisoning. The research investigated not only how these Trojans could be created, but also a method to disable them.

SophosAI’s research demonstrated the use of targeted “noising” of an LLM’s neurons, identifying those critical to the operation of the LLM through their activation patterns. The technique was shown to effectively neutralize most Trojans embedded in a model. A full report on the research presented by Vörös will be published after Black Hat Europe.
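
To make the idea of targeted noising concrete, here is a toy sketch in Python. It is not SophosAI’s method, which has not been published yet. It assumes a single ReLU layer represented as a list of weight rows, ranks neurons by their mean activation on clean inputs, and adds Gaussian noise to the weights of a chosen subset. The helper names (mean_activation, add_noise), the layer sizes, the noise scale, and the choice to noise the three least active neurons are all illustrative assumptions; the announcement does not say which neurons are noised or how much noise is applied.

```python
import random


def relu(x):
    return x if x > 0.0 else 0.0


def mean_activation(weights, inputs):
    """Return the mean ReLU activation of each neuron over a batch of inputs.

    weights: one row of input weights per neuron
    inputs:  list of input vectors
    """
    scores = []
    for row in weights:
        total = 0.0
        for x in inputs:
            total += relu(sum(w * xi for w, xi in zip(row, x)))
        scores.append(total / len(inputs))
    return scores


def add_noise(weights, neuron_idx, sigma, rng):
    """Return a copy of weights with Gaussian noise added to the selected neurons."""
    noised = [list(row) for row in weights]
    for i in neuron_idx:
        noised[i] = [w + rng.gauss(0.0, sigma) for w in noised[i]]
    return noised


# Toy data: 8 neurons, 4 input features, 32 "clean" samples (all hypothetical).
rng = random.Random(0)
weights = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(8)]
clean_inputs = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(32)]

# Rank neurons by how strongly they fire on clean inputs (ascending order).
scores = mean_activation(weights, clean_inputs)
order = sorted(range(len(scores)), key=lambda i: scores[i])

# Assumption for this sketch: perturb the three least active neurons and
# leave the rest untouched.
targets = order[:3]
noised_weights = add_noise(weights, targets, sigma=0.5, rng=rng)
```

In a real model the same two steps, scoring neurons from recorded activations and perturbing a selected subset, would be applied to the layers of the LLM, and the noised model would then be checked both for normal task performance and for whether the backdoor trigger still works.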