🚀 The first long-horizon multimodal deep-research MLLM
🔍 Multi-turn, multi-entity, multi-scale visual & textual search on the real, noisy web
🏆 SOTA on 6 multimodal factual benchmarks with only 8B / 30B models
📊 A vision-centric benchmark for multimodal deep research evaluation
🔬 Requires genuine visual search — not solvable by text-only cues or model priors
🌐 Reflects real-world settings with iterative entity-level localization and multi-hop reasoning
Figure 1. (A) We identify two key limitations of existing multimodal deep-research paradigms. (A.1) Prior methods largely ignore the search-engine hit-rate problem: a single full-image or entity-level query often fails to retrieve the required evidence, and querying different-scale crops of the same entity yields highly variable results. (A.2) Existing methods are constrained in both reasoning depth and retrieval breadth, typically producing only short trajectories. In contrast, our approach supports dozens of reasoning steps and hundreds of engine interactions. (B) Pipeline Overview: we synthesize high-quality VQA instances and multi-turn trajectories, then integrate multimodal deep-research capabilities into an MLLM via SFT and RL training, enabling long-horizon reasoning with multi-turn, multi-entity, and multi-scale visual and textual search. (C) Performance Comparison: our model achieves SoTA performance on six benchmarks with comparatively few parameters.
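As a concrete illustration of the loop in panel (B), the minimal sketch below shows how such a long-horizon agent can be driven: the model alternates reasoning steps with tool calls (image cropping, reverse-image search, text search) until it commits to an answer. The tool names and the `model`/`tools` interfaces are illustrative assumptions, not the exact API of this repository.

```python
# Minimal sketch of a ReAct-style multimodal deep-research loop.
# Tool names and the model/tool interfaces are illustrative assumptions,
# not the exact API of this repository.
from typing import Callable, Dict, List


def run_deep_research(
    model: Callable[[List[dict]], dict],   # returns {"thought", "tool", "args"} or {"answer": ...}
    tools: Dict[str, Callable[..., str]],  # e.g. crop_image, image_search, text_search
    question: str,
    image_path: str,
    max_steps: int = 64,                   # budget for dozens of ReAct steps
) -> str:
    messages: List[dict] = [{"role": "user", "content": question, "image": image_path}]
    for _ in range(max_steps):
        step = model(messages)             # one reasoning step
        if "answer" in step:               # the model decided it has enough evidence
            return step["answer"]
        name, args = step["tool"], step.get("args", {})
        observation = tools[name](**args)  # real crop / image-search / text-search call
        messages.append({"role": "assistant", "content": step.get("thought", ""),
                         "tool": name, "args": args})
        messages.append({"role": "tool", "name": name, "content": observation})
    return "No confident answer within the step budget."
```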
Dozens of ReAct steps, hundreds of tool calls for deep research
Greatly improves search hit rate under real web noise via a multi-scale approach
30K multimodal trajectories (SFT) + 15K VQA (RL) with real tools
Outperforms GPT-5 / Gemini-2.5-Pro / Claude-4-Sonnet agents
Figure 2. Motivation: existing Vision-DeepResearch benchmarks often fail to measure realistic multimodal search, since many questions can be solved via text-only cues or model priors without genuine visual verification, and whole-image search frequently retrieves near-duplicate images with identifying metadata ("perfect retrieval"). VDR-Bench is designed to be visual-search–centric and to reflect real-world settings that require iterative, entity-level localization (e.g., multi-round cropping), cross-modal evidence collection, and multi-hop reasoning.
Figure 3. Our Data Pipeline. Top panel: We construct a complete multimodal deep-research synthesis pipeline. Leveraging the capabilities of an MLLM and a text-based DeepResearch foundation LLM, we generate long-horizon, multi-tool trajectories. The process involves multi-entity and multi-scale visual cropping and search (producing visual search trajectories), followed by text-based deep research via vision→text bridging (producing text search trajectories). Bottom panel: We obtain high-quality factual VQA instances via a rigorous verification and obfuscation procedure—including entity-level stringent image verification and filtering, random walks over real search engines and web pages, and joint entity/answer obfuscation—which are then used for trajectory synthesis and RL training.
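The multi-entity, multi-scale cropping step above can be pictured as expanding each detected entity box into a tight crop plus wider context crops and sending every crop to reverse-image search, since different scales of the same entity can yield very different hit rates. The sketch below is a minimal illustration assuming a Pillow image and a normalized bounding box; the scale factors and function name are illustrative choices, not the pipeline's exact values.

```python
# Minimal sketch: multi-scale crops around one entity box (assumed interface).
from PIL import Image


def multi_scale_crops(image: Image.Image, box: tuple, scales=(1.0, 1.5, 2.5)) -> list:
    """`box` is (x0, y0, x1, y1) in normalized [0, 1] coordinates."""
    w, h = image.size
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    bw, bh = x1 - x0, y1 - y0
    crops = []
    for s in scales:                           # tight crop plus wider context crops
        half_w, half_h = bw * s / 2, bh * s / 2
        left   = max(0, int((cx - half_w) * w))
        top    = max(0, int((cy - half_h) * h))
        right  = min(w, int((cx + half_w) * w))
        bottom = min(h, int((cy + half_h) * h))
        crops.append(image.crop((left, top, right, bottom)))
    return crops

# Each crop is then queried against a reverse-image-search tool; whichever scale
# returns an identifiable page becomes the evidence for the next reasoning turn.
```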
Figure 4. VDR-Bench is constructed via a multi-stage, vision-centric workflow: (Step 1) Annotators manually crop salient regions (objects, logos, landmarks, individuals) and perform web-scale visual search; (Step 2) Candidate entities are extracted from retrieved results and verified through MLLM-assisted and human checking processes; (Step 3) Verified visual entities are used to generate seed VQA pairs that require explicit recognition and grounding; (Step 4) Question difficulty is expanded via knowledge-graph–based multi-hop reasoning through random walks; and (Step 5) Automatic solvability checks and human quality filtering ensure each instance requires visual evidence, remains unambiguous, and avoids trivial or near-duplicate retrieval.
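Step 4's difficulty expansion can be pictured as a short random walk over a knowledge graph rooted at the verified entity, with the question rewritten so that its answer is the walk's endpoint. The toy graph, relation names, and question templating below are illustrative assumptions, not the benchmark's actual data.

```python
# Toy sketch of knowledge-graph random walks for multi-hop question expansion (Step 4).
# The graph, relation names, and templating are illustrative, not the real data.
import random

KG = {
    "Eiffel Tower":   [("designed_by", "Gustave Eiffel"), ("located_in", "Paris")],
    "Gustave Eiffel": [("born_in", "Dijon")],
    "Paris":          [("country", "France")],
    "Dijon":          [("region", "Burgundy")],
}


def random_walk(entity: str, hops: int = 2, seed: int = 0) -> list:
    """Walk `hops` relation edges away from the seed entity; return the (relation, node) path."""
    rng, path, node = random.Random(seed), [], entity
    for _ in range(hops):
        edges = KG.get(node, [])
        if not edges:
            break
        relation, node = rng.choice(edges)
        path.append((relation, node))
    return path

# A walk such as designed_by -> born_in turns the seed question "What landmark is shown?"
# into the 2-hop question "In which city was the designer of the pictured landmark born?",
# whose answer is the walk's endpoint.
print(random_walk("Eiffel Tower", hops=2))
```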
Evaluation on 6 multimodal factual benchmarks
| Model | VDR | FVQA | MMSearch+ | MMSearch | LiveVQA | BC-VL | Avg. |
|---|---|---|---|---|---|---|---|
| Direct Answer | |||||||
| GPT-5 | 9.8 | 57.3 | 19.1 | 33.3 | 57.5 | 47.2 | 37.4 |
| Gemini-2.5 Pro | 8.0 | 60.7 | 14.5 | 39.8 | 60.3 | 43.1 | 37.7 |
| Claude-4-Sonnet | 2.0 | 35.3 | 4.0 | 18.7 | 38.5 | 29.3 | 21.3 |
| Qwen3-VL-8B-Thinking | 5.6 | 24.0 | 2.7 | 15.8 | 43.3 | 25.1 | 19.4 |
| Qwen3-VL-30B-A3B-Thinking | 4.4 | 32.7 | 4.5 | 19.3 | 49.0 | 34.6 | 24.1 |
| Agent Workflow | |||||||
| GPT-5 | 20.4 | 69.0 | 17.2 | 63.7 | 73.3 | 46.1 | 48.3 |
| Gemini-2.5 Pro | 18.8 | 68.3 | 22.2 | 69.0 | 76.0 | 49.9 | 50.7 |
| Claude-4-Sonnet | 13.6 | 69.0 | 23.1 | 67.2 | 69.7 | 48.6 | 48.5 |
| Qwen3-VL-8B-Thinking | 17.6 | 51.3 | 12.2 | 45.6 | 56.3 | 37.1 | 36.7 |
| Qwen3-VL-30B-A3B-Thinking | 23.2 | 63.0 | 13.6 | 53.2 | 62.0 | 44.1 | 43.2 |
| Multimodal DeepResearch MLLM | |||||||
| MMSearch-R1-7B | -- | 58.4 | -- | 53.8 | 48.4 | -- | -- |
| WebWatcher-32B | -- | -- | -- | 55.3 | 58.7 | 26.7 | -- |
| Ours | |||||||
| Qwen3-VL-8B-Instruct (Agentic) | 17.0 | 58.7 | 11.3 | 52.0 | 63.0 | 38.6 | 40.1 |
| Vision-DeepResearch-8B | 29.2 (+12.2) | 64.7 (+6.0) | 20.4 (+9.1) | 69.6 (+17.6) | 76.7 (+13.7) | 42.6 (+4.0) | 50.5 (+10.4) |
| Qwen3-VL-30B-A3B-Instruct (Agentic) | 20.2 | 57.7 | 10.0 | 55.0 | 60.0 | 42.6 | 40.9 |
| Vision-DeepResearch-30B-A3B | 37.8 (+17.6) | 74.2 (+16.5) | 28.5 (+18.5) | 69.6 (+14.6) | 77.6 (+17.6) | 53.7 (+11.1) | 56.9 (+16.0) |
Table 1. Benchmark results across different settings, with improvement (Δ) relative to the corresponding base MLLM under the agentic-workflow setting. VDR: VDR-Bench, MMSearch+: MMSearch-Plus, BC-VL: BrowseComp-VL. Our Vision-DeepResearch models achieve the best performance among all methods, substantially outperforming both proprietary models (GPT-5, Gemini-2.5-Pro, Claude-4-Sonnet) and existing multimodal deep-research MLLMs (MMSearch-R1, WebWatcher).
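The Δ values in Table 1 are plain per-benchmark differences between a Vision-DeepResearch model and its agentic base; as a small worked example, the 8B row can be reproduced from the scores above (the benchmark keys are just labels):

```python
# Reproduce the Δ column of Table 1 for the 8B model: improvement over its agentic base.
BASE_8B = {"VDR": 17.0, "FVQA": 58.7, "MMSearch+": 11.3, "MMSearch": 52.0, "LiveVQA": 63.0, "BC-VL": 38.6}
OURS_8B = {"VDR": 29.2, "FVQA": 64.7, "MMSearch+": 20.4, "MMSearch": 69.6, "LiveVQA": 76.7, "BC-VL": 42.6}


def delta(ours: dict, base: dict) -> dict:
    """Per-benchmark improvement plus the improvement on the benchmark average."""
    d = {k: round(ours[k] - base[k], 1) for k in ours}
    d["Avg."] = round(sum(ours.values()) / len(ours) - sum(base.values()) / len(base), 1)
    return d

print(delta(OURS_8B, BASE_8B))  # {'VDR': 12.2, 'FVQA': 6.0, ..., 'Avg.': 10.4}
```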
| Setting | VDR | MMS+ | BC-VL | Avg. |
|---|---|---|---|---|
| Direct Answer | 4.8 | 3.6 | 27.6 | 12.0 |
| WIS (Whole Image Search) | 11.8 | 10.0 | 26.1 | 16.0 |
| WIS + TS (Text Search) | 16.0 | 23.5 | 48.4 | 29.3 |
| CIS (Cropped Image Search) | 15.4 | 22.7 | 30.8 | 23.0 |
| CIS + TS (Full Pipeline) | 37.8 | 28.5 | 53.7 | 40.0 |
Table 2. Ablation study on rollout pipeline. WIS: Whole Image Search, TS: Text Search, CIS: Cropped Image Search (multi-entity, multi-scale). The full pipeline (CIS+TS) achieves the best performance, demonstrating that multi-scale visual cropping and text search are jointly necessary for robust multimodal deep research.
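The Table 2 settings differ only in which tools the rollout is allowed to call; one hedged way to express them as configurations is shown below (the tool names are illustrative assumptions, not this repository's exact identifiers):

```python
# Illustrative tool sets for the Table 2 rollout ablations (tool names are assumptions).
ABLATION_SETTINGS = {
    "Direct Answer": [],                                                      # no tools
    "WIS":           ["whole_image_search"],
    "WIS+TS":        ["whole_image_search", "text_search"],
    "CIS":           ["crop_image", "cropped_image_search"],                  # multi-entity, multi-scale
    "CIS+TS":        ["crop_image", "cropped_image_search", "text_search"],   # full pipeline
}


def allowed_tools(setting: str) -> list:
    """Return the tool names an agent may call under a given ablation setting."""
    return ABLATION_SETTINGS[setting]
```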
| Model | VDR | MMS+ | BC-VL | Avg. |
|---|---|---|---|---|
| Qwen3-VL-30B-A3B-Instruct (Base) | 20.2 | 10.0 | 42.6 | 24.3 |
| + 16K VQA traj. (SFT) | 24.4 | 23.5 | 50.9 | 32.9 |
| + 8K QA traj. (SFT) | 27.0 | 23.5 | 50.1 | 33.5 |
| + 6K fuzzy VQA traj. (SFT) | 33.2 | 26.0 | 51.4 | 36.9 |
| + RL training | 37.8 | 28.5 | 53.7 | 40.0 |
Table 3. Ablation results on training data and methods. Each row adds components incrementally. VQA trajectories provide the foundation, QA trajectories enable text-based deep research transfer, fuzzy multi-hop VQA covers long-tail settings, and RL training refines long-horizon decision making through online interaction.
Performance Comparison of Models Across Different Categories (Accuracy %)
| Model / Setting | People | Object | Arch. | Nature | Sci&Tech | Art&Music | Sports | Movie | Game | Other | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Gemini 2.5 Pro | |||||||||||
| Direct Answer | 6.4 | 9.8 | 9.8 | 8.2 | 12.0 | 11.8 | 4.2 | 2.0 | 7.7 | 9.6 | 8.2 |
| CIS+TS | 14.9 | 15.7 | 27.5 | 12.2 | 24.0 | 17.6 | 12.5 | 10.2 | 1.9 | 25.0 | 16.2 |
| CIS+TS+MVF | 38.3 | 23.5 | 33.3 | 24.5 | 22.0 | 39.2 | 25.0 | 24.5 | 21.2 | 48.1 | 30.0 |
| GPT-5 | |||||||||||
| Direct Answer | 4.4 | 9.8 | 11.7 | 12.3 | 10.0 | 7.8 | 8.4 | 8.2 | 3.8 | 13.5 | 9.5 |
| CIS+TS | 20.8 | 17.6 | 14.0 | 16.7 | 24.5 | 21.2 | 12.5 | 19.3 | 20.8 | 25.0 | 19.2 |
| CIS+TS+MVF | 23.4 | 25.5 | 23.5 | 20.4 | 18.0 | 27.5 | 22.9 | 30.6 | 30.8 | 42.3 | 26.6 |
| Claude-4-Sonnet | |||||||||||
| Direct Answer | 2.1 | 3.9 | 7.8 | 6.2 | 10.0 | 7.8 | 2.2 | 0.0 | 3.8 | 5.6 | 5.6 |
| CIS+TS | 14.9 | 9.8 | 19.6 | 16.3 | 18.0 | 11.8 | 10.4 | 4.1 | 3.8 | 23.1 | 13.2 |
| CIS+TS+MVF | 12.5 | 17.6 | 24.0 | 35.4 | 15.1 | 26.9 | 16.7 | 12.3 | 23.1 | 24.4 | 20.6 |
| Qwen3-VL-30B-A3B-Instruct | |||||||||||
| Direct Answer | 3.9 | 3.9 | 6.1 | 2.0 | 4.1 | 0.0 | 7.7 | 3.8 | 0.0 | 7.8 | 3.9 |
| CIS+TS | 17.0 | 19.6 | 17.6 | 16.3 | 20.0 | 5.9 | 14.6 | 10.2 | 5.8 | 44.2 | 17.2 |
| CIS+TS+MVF | 25.5 | 21.6 | 23.5 | 18.4 | 8.0 | 23.5 | 16.7 | 18.4 | 28.8 | 26.9 | 21.2 |
| Qwen3-VL-235B-A22B-Instruct | |||||||||||
| Direct Answer | 6.2 | 3.9 | 10.0 | 22.9 | 7.5 | 13.5 | 6.2 | 3.5 | 7.5 | 7.5 | 8.8 |
| CIS+TS | 25.2 | 19.5 | 24.0 | 21.1 | 18.5 | 17.1 | 10.7 | 29.1 | 16.6 | 31.5 | 21.2 |
| CIS+TS+MVF | 25.0 | 23.5 | 30.0 | 31.2 | 30.2 | 28.8 | 20.8 | 22.8 | 30.2 | 32.5 | 27.4 |
Table 4. Performance Comparison of Models Across Different Categories on VDR-Bench. Direct Answer: models directly answer VQA without search tools. CIS+TS: Cropped Image Search + Text Search. MVF: Multi-turn Visual Forcing strategy. The MVF strategy consistently improves performance across all models, with Gemini 2.5 Pro achieving the highest overall score (30.0%) after applying MVF.
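MVF is applied at inference time on top of CIS+TS. As a hedged sketch only, one reading of "Multi-turn Visual Forcing" is a gate that redirects the agent to additional cropped-image-search turns whenever it tries to answer before completing a minimum number of visual turns; the function below encodes that interpretation and is not the paper's exact procedure.

```python
# Hedged sketch only: one possible reading of "visual forcing", i.e. require a minimum
# number of visual-search turns before the agent may commit to a final answer.
# This is an interpretation of the MVF name, not the paper's exact strategy.
def gate_final_answer(step: dict, visual_turns_done: int, min_visual_turns: int = 2) -> dict:
    """Redirect an early answer attempt into another forced visual-verification turn."""
    if "answer" in step and visual_turns_done < min_visual_turns:
        return {"tool": "cropped_image_search",
                "args": {"note": "forced additional visual verification"}}
    return step
```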
@article{huang2026vision,
title={Vision-DeepResearch: Incentivizing DeepResearch Capability in Multimodal Large Language Models},
author={Huang, Wenxuan and Zeng, Yu and Wang, Qiuchen and Fang, Zhen and Cao, Shaosheng and Chu, Zheng and Yin, Qingyu and Chen, Shuang and Yin, Zhenfei and Chen, Lin and others},
journal={arXiv preprint arXiv:2601.22060},
year={2026}
}
@article{vdr-bench,
title={VDR-Bench: Rethinking Visual and Textual Search for Multimodal Large Language Models},
author={Zeng, Yu and Huang, Wenxuan and Fang, Zhen and Chen, Shuang and Shen, Yufan and Cai, Yishuo and Wang, Xiaoman and Yin, Zhenfei and Chen, Lin and Chen, Zehui and Huang, Shiting and Zhao, Yiming and Hu, Yao and Torr, Philip and Ouyang, Wanli and Cao, Shaosheng},
journal={preprint},
year={2026}
}