EVADE: Multimodal Benchmark for Evasive Content Detection in E-Commerce Applications

1Shenzhen Key Laboratory for High Performance Data Mining, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
2University of Chinese Academy of Sciences, 3Tongji University
4School of Biomedical Engineering, UNSW Sydney
5Alibaba Group

EVADE is the first expert-curated, Chinese, multimodal benchmark specifically designed to evaluate foundation models on evasive content detection in e-commerce.

Abstract

E-commerce platforms increasingly rely on Large Language Models (LLMs) and Vision–Language Models (VLMs) to detect illicit or misleading product content. However, these models remain vulnerable to evasive content: inputs (text or images) that superficially comply with platform policies while covertly conveying prohibited claims. Unlike traditional adversarial attacks that induce overt failures, evasive content exploits ambiguity and context, making it far harder to detect. Existing robustness benchmarks provide little guidance for this high-stakes, real-world challenge. We introduce EVADE, the first expert-curated, Chinese, multimodal benchmark specifically designed to evaluate foundation models on evasive content detection in e-commerce. The dataset contains 2,833 annotated text samples and 13,961 images spanning six high-stakes product categories, including body shaping, height growth, and health supplements. Two complementary tasks assess distinct capabilities: Single-Violation, which probes fine-grained reasoning under short prompts, and All-in-One, which tests long-context reasoning by merging overlapping policy rules into unified instructions. Notably, the All-in-One setting significantly narrows the performance gap between partial and full-match accuracy, suggesting that clearer rule definitions improve alignment between human and model judgment. We benchmark 26 mainstream LLMs and VLMs and observe substantial performance gaps: even state-of-the-art models frequently misclassify evasive samples. By releasing EVADE and strong baselines, we provide the first rigorous standard for evaluating evasive-content detection, expose fundamental limitations in current multimodal reasoning, and lay the groundwork for safer and more transparent content moderation systems in e-commerce.

Data Distribution of EVADE

Figure 1. Visualization of the EVADE dataset distribution and prompt lengths across violation categories. The top-right figure presents a word cloud of representative EVADE keywords.

Performance on Single-Violation Task

Table 1. Overall performance of all models on the Single-Violation task of EVADE.

Table 2. Overall performance of the Qwen3 series models on the Single-Violation task of EVADE. The numbers in parentheses on the right-hand side of the table indicate the change in performance after enabling thinking mode, relative to the same model and metric on the left-hand side.

Performance on All-in-One Task

Table 3. Performance of all models on the All-in-One task of EVADE. Values to the left of the slash indicate partial accuracy; values to the right indicate full accuracy.

Table 4. Overall performance of the Qwen3 series models on the All-in-One task of EVADE. Values to the left of the slash indicate partial accuracy; values to the right indicate full accuracy. The numbers in parentheses on the right-hand side of the table indicate the change in performance after enabling thinking mode, relative to the same model and metric on the left-hand side.
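To make the partial/full distinction concrete, the sketch below computes both metrics under one plausible reading of the captions: treating each sample's gold and predicted violation labels as sets, a prediction is a partial match if it shares at least one label with the gold set, and a full match only if the two sets coincide. The exact metric definition used in the paper may differ; this is an illustrative interpretation, not the official evaluation code.

```python
def partial_and_full_accuracy(preds, golds):
    """Compute partial- and full-match accuracy over a list of samples.

    preds, golds: lists of sets of violation-category labels, one set per
    sample. Partial match = at least one shared label; full match = the
    predicted set equals the gold set exactly.
    """
    assert len(preds) == len(golds) and preds, "need aligned, non-empty lists"
    n = len(preds)
    partial = sum(1 for p, g in zip(preds, golds) if p & g) / n
    full = sum(1 for p, g in zip(preds, golds) if p == g) / n
    return partial, full


# Toy example with three samples (hypothetical labels):
preds = [{"height_growth"}, {"body_shaping", "health"}, {"none"}]
golds = [{"height_growth"}, {"body_shaping"}, {"health"}]
partial, full = partial_and_full_accuracy(preds, golds)
# partial = 2/3 (first two samples overlap), full = 1/3 (only the first is exact)
```

Under this reading, partial accuracy always upper-bounds full accuracy, which is why a narrowing gap between the two in the All-in-One setting suggests better alignment between model and human judgment.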

Analysis on the Effect of RAG

Figure 2. Comparison of LLMs and VLMs before and after the introduction of RAG. Here, L- denotes the Llama-3.1 models, Q- denotes the Qwen2.5 models, DS- denotes the DeepSeek models, Q3- denotes the Qwen3 models, and IVL- denotes the InternVL3 models.
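A RAG setup of the kind compared above retrieves the policy rules most relevant to a product listing and prepends them to the moderation prompt before the model is queried. The minimal sketch below illustrates the idea with a naive token-overlap retriever standing in for the embedding-based retriever a production pipeline would use; the rule texts, function names, and prompt template are hypothetical, and the model call itself is omitted.

```python
def retrieve_rules(query: str, rules: list[str], k: int = 2) -> list[str]:
    """Rank policy rules by token overlap with the query, return top-k.

    A deliberately simple stand-in for dense-embedding retrieval.
    """
    q_tokens = set(query.lower().split())
    return sorted(
        rules,
        key=lambda r: len(q_tokens & set(r.lower().split())),
        reverse=True,
    )[:k]


def build_moderation_prompt(query: str, rules: list[str], k: int = 2) -> str:
    """Prepend the retrieved rules to the query; the resulting prompt
    would then be sent to the LLM/VLM (model call not shown)."""
    context = "\n".join(f"- {r}" for r in retrieve_rules(query, rules, k))
    return (
        f"Policy rules:\n{context}\n\n"
        f"Product text:\n{query}\n\n"
        "Does this text violate any of the rules above?"
    )


# Hypothetical rule snippets for illustration:
policy_rules = [
    "height growth claims for supplements are prohibited",
    "body shaping claims require clinical evidence",
    "cure or treatment claims for health supplements are banned",
]
prompt = build_moderation_prompt("pills that promise height growth", policy_rules, k=1)
```

Grounding the prompt in the specific rules at stake, rather than the full policy document, is one plausible reason RAG shifts performance in Figure 2.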

Bad Cases

Figure 3. Bad cases from the All-in-One task.

BibTeX

@misc{xu2025evademultimodalbenchmarkevasive,
    title        = {EVADE: Multimodal Benchmark for Evasive Content Detection in E-Commerce Applications},
    author       = {Ancheng Xu and Zhihao Yang and Jingpeng Li and Guanghu Yuan and Longze Chen and Liang Yan and Jiehui Zhou and Zhen Qin and Hengyun Chang and Hamid Alinejad-Rokny and Bo Zheng and Min Yang},
    year         = {2025},
    eprint       = {2505.17654},
    archivePrefix = {arXiv},
    primaryClass = {cs.CL},
    url          = {https://arxiv.org/abs/2505.17654}
}