Adversarial Confusion Attacks: Making GPT-5 Hallucinate (researchgate.net)
3 points by bron123 4 months ago | 1 comment


We’ve released preliminary work introducing a new attack vector against multimodal LLMs. Unlike jailbreaks or targeted misclassification, this method explicitly optimizes for confusion: maximizing next-token entropy to induce systematic malfunction.

The attack successfully fooled GPT-5, producing structured hallucinations from an adversarial image.

The ultimate goal is to prevent AI agents from reliably operating on websites by embedding adversarial “confusion images” into their visual environments.
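The entropy-maximization objective described above can be sketched on a toy surrogate. Everything here is an assumption for illustration: GPT-5 exposes no gradients, so a real attack would need a differentiable surrogate or a black-box estimator; this stand-in uses a small linear-softmax "model" and PGD-style signed gradient ascent on the output entropy, constrained to an L-infinity ball around the clean input.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical surrogate: a linear layer + softmax over a tiny "vocabulary".
W = rng.normal(size=(16, 8))

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def entropy(p):
    return -np.sum(p * np.log(p + 1e-12))

def entropy_grad_wrt_x(x, W):
    # p = softmax(W^T x); chain rule: dH/dx = W @ J_softmax @ dH/dp.
    p = softmax(W.T @ x)
    dHdp = -(np.log(p + 1e-12) + 1.0)          # dH/dp_i
    J = np.diag(p) - np.outer(p, p)            # softmax Jacobian (symmetric)
    return W @ (J @ dHdp)

x = rng.normal(size=16)                        # clean "image" features
x_adv = x.copy()
eps, lr = 0.5, 0.1
best_H, best_x = -1.0, x_adv.copy()
for _ in range(200):
    g = entropy_grad_wrt_x(x_adv, W)
    x_adv = x_adv + lr * np.sign(g)            # signed ascent step (PGD-style)
    x_adv = np.clip(x_adv, x - eps, x + eps)   # project back into L_inf ball
    H = entropy(softmax(W.T @ x_adv))
    if H > best_H:                             # keep the most confusing iterate
        best_H, best_x = H, x_adv.copy()

x_adv = best_x
p0, p1 = softmax(W.T @ x), softmax(W.T @ x_adv)
print(f"clean entropy: {entropy(p0):.3f}, adversarial entropy: {entropy(p1):.3f}")
```

The attack succeeds when the perturbed input pushes the output distribution toward uniform (entropy near log of the vocabulary size) while staying within the perturbation budget, which is the "confusion" objective rather than a targeted misclassification.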



