Vision-DeepResearch: Incentivizing DeepResearch Capability in Multimodal Large Language Models

1CUHK MMLab 2East China Normal University 3University of Science and Technology of China 4Xiaohongshu Inc. 5Harbin Institute of Technology 6Zhejiang University 7University of California, Los Angeles 8University of Oxford 9Shenzhen Loop Area Institute
*: Equal Contribution †: Project Leader ✉: Corresponding Author

🚀 The first long-horizon multimodal deep-research MLLM

🔍 Multi-turn, multi-entity, multi-scale visual & textual search in real noisy web

🏆 SOTA on 6 multimodal factual benchmarks with only 8B / 30B models


Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models

1CUHK MMLab 2University of Science and Technology of China 3East China Normal University 4Xiaohongshu Inc. 5The University of California, Los Angeles 6Zhejiang University 7Peking University 8University of Oxford 9Shenzhen Loop Area Institute
*: Equal Contribution †: Project Leader ✉: Corresponding Author

📊 A vision-centric benchmark for multimodal deep research evaluation

🔬 Requires genuine visual search — not solvable by text-only cues or model priors

🌐 Reflects real-world settings with iterative entity-level localization and multi-hop reasoning

Overview

Vision-DeepResearch
Vision-DeepResearch Teaser

Figure 1. (A) We identify two key limitations of existing multimodal deep-research paradigms: (A.1) Prior methods largely ignore the search-engine hit-rate problem—a single full-image or entity-level query often fails to retrieve the required evidence, and querying different-scale crops of the same entity yields highly variable results. (A.2) Existing methods are constrained in both reasoning depth and retrieval breadth, typically producing only short trajectories. In contrast, our approach supports dozens of reasoning steps and hundreds of engine interactions. (B) Pipeline Overview: We synthesize high-quality VQA instances and multi-turn trajectories, then integrate multimodal deep-research capabilities into an MLLM via SFT and RL training, enabling long-horizon reasoning with multi-turn, multi-entity, and multi-scale visual and textual search. (C) Performance Comparison: Our model achieves SOTA performance on six benchmarks at a much smaller parameter scale.
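
The long-horizon behavior in Figure 1(B) can be pictured as a ReAct-style loop that alternates between a reasoning step and a tool call until the model commits to an answer. The snippet below is a minimal runnable sketch under assumed names and schemas (call_mllm, image_search, text_search, and the action dictionary are placeholders, not the released implementation).

```python
# A minimal runnable sketch of the ReAct-style loop, with stubbed tools so it runs
# stand-alone. Function names, the action schema, and the step budget are assumptions.

MAX_STEPS = 50  # "dozens of reasoning steps"; each step issues at most one tool call here


def image_search(crop_description: str) -> str:
    # placeholder for a real web-scale reverse-image search over an entity crop
    return f"[image-search results for {crop_description}]"


def text_search(query: str) -> str:
    # placeholder for a real text search engine
    return f"[text-search results for {query!r}]"


def call_mllm(context: list) -> dict:
    # placeholder for one reasoning step of the trained MLLM; a real model would
    # return {"action": "image_search" | "text_search" | "answer", "argument": ...}
    return {"action": "answer", "argument": "stub answer"}


def deep_research(image_path: str, question: str) -> str:
    """Interleave reasoning with visual and textual search until an answer is produced."""
    context = [("user", image_path, question)]
    for _ in range(MAX_STEPS):
        step = call_mllm(context)  # think, then choose one action
        if step["action"] == "image_search":      # multi-entity / multi-scale crop query
            context.append(("tool", image_search(step["argument"])))
        elif step["action"] == "text_search":     # vision-to-text bridged query
            context.append(("tool", text_search(step["argument"])))
        else:                                     # "answer": terminate the trajectory
            return step["argument"]
    return "no answer within the step budget"


print(deep_research("example.jpg", "Which year was this landmark completed?"))
```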

🚀

First Long-Horizon Multimodal Deep-Research MLLM

Dozens of ReAct steps, hundreds of tool calls for deep research

🔎

Multi-Entity Visual Search

Multi-scale, multi-entity cropping greatly improves search hit rate under real web noise

📚

End-to-End Training

30K multimodal trajectories (SFT) + 15K VQA (RL) with real tools

🏆

SOTA Performance

Outperforms GPT-5 / Gemini-2.5-Pro / Claude-4-Sonnet agents


VDR-Bench
VDR-Bench Teaser

Figure 2. Motivation: Existing Vision-DeepResearch benchmarks often fail to measure realistic multimodal search: many questions can be solved via text-only cues or model priors without genuine visual verification, and whole-image search frequently retrieves near-duplicate images with identifying metadata ("perfect retrieval"). VDR-Bench is designed to be visual-search–centric and to reflect real-world settings that require iterative, entity-level localization (e.g., multi-round cropping), cross-modal evidence collection, and multi-hop reasoning.

Data Pipeline

Vision-DeepResearch Data Pipeline

Figure 3. Our Data Pipeline. Top panel: We construct a complete multimodal deep-research synthesis pipeline. Leveraging the capabilities of an MLLM and a text-based DeepResearch foundation LLM, we generate long-horizon, multi-tool trajectories. The process involves multi-entity and multi-scale visual cropping and search (producing visual search trajectories), followed by text-based deep research via vision→text bridging (producing text search trajectories). Bottom panel: We obtain high-quality factual VQA instances via a rigorous verification and obfuscation procedure—including entity-level stringent image verification and filtering, random walks over real search engines and web pages, and joint entity/answer obfuscation—which are then used for trajectory synthesis and RL training.
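
The "multi-entity and multi-scale visual cropping" step in Figure 3 can be illustrated with a short Pillow sketch: for each candidate entity box, crops are generated at several context scales and every crop is sent to image search, since retrieval hit rate varies sharply with how much surrounding context a query image contains. The scale factors, box format, and stand-in inputs below are illustrative assumptions, not the paper's exact settings.

```python
# A sketch of multi-entity, multi-scale cropping before visual search (Figure 3,
# top panel). The scale set, box format, and stand-in inputs are assumptions.

from PIL import Image

SCALES = (1.0, 1.5, 2.5)  # assumed context-expansion factors around each entity box


def multi_scale_crops(image, box):
    """Yield crops of one entity at several context scales for image search."""
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    w, h = x1 - x0, y1 - y0
    for s in SCALES:
        half_w, half_h = s * w / 2, s * h / 2
        yield image.crop((
            max(0, int(cx - half_w)),
            max(0, int(cy - half_h)),
            min(image.width, int(cx + half_w)),
            min(image.height, int(cy + half_h)),
        ))


# Each (entity, scale) crop is sent to the search engine; whichever scale actually
# retrieves matching pages is kept, since hit rate varies sharply with scale.
img = Image.new("RGB", (640, 480))        # stand-in for a real web image
entity_boxes = [(100, 120, 220, 260)]     # stand-in for detected entity regions
crops = [c for b in entity_boxes for c in multi_scale_crops(img, b)]
print(len(crops), "crops ready for image search")
```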

VDR-Bench Data Pipeline

Figure 4. VDR-Bench is constructed via a multi-stage, vision-centric workflow: (Step 1) Annotators manually crop salient regions (objects, logos, landmarks, individuals) and perform web-scale visual search; (Step 2) Candidate entities are extracted from retrieved results and verified through MLLM-assisted and human checking processes; (Step 3) Verified visual entities are used to generate seed VQA pairs that require explicit recognition and grounding; (Step 4) Question difficulty is expanded via knowledge-graph–based multi-hop reasoning through random walks; and (Step 5) Automatic solvability checks and human quality filtering ensure each instance requires visual evidence, remains unambiguous, and avoids trivial or near-duplicate retrieval.
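
Step 4 (difficulty expansion via knowledge-graph random walks) can be sketched with a toy graph: starting from a verified visual entity, the walk follows a few relations, and the resulting chain becomes a multi-hop question whose answer is the final node. The graph, relations, and question template below are made-up illustrations; the real pipeline operates over much larger graphs and is followed by the Step 5 solvability and quality checks.

```python
# Toy sketch of Step 4: expand a seed visual entity into a multi-hop question by
# random-walking a knowledge graph. Entities, relations, and the template are
# made up for illustration; the real graph and templates differ.

import random

# toy knowledge graph: entity -> list of (relation, neighbor)
GRAPH = {
    "Eiffel Tower": [("designed by", "Gustave Eiffel"), ("located in", "Paris")],
    "Gustave Eiffel": [("born in", "Dijon")],
    "Paris": [("capital of", "France")],
    "Dijon": [],
    "France": [],
}


def random_walk(start, hops, rng):
    """Walk up to `hops` edges from a verified visual entity, recording the relations."""
    chain, node = [], start
    for _ in range(hops):
        edges = GRAPH.get(node, [])
        if not edges:
            break
        relation, node = rng.choice(edges)
        chain.append(relation)
    return chain, node


rng = random.Random(0)
relations, answer = random_walk("Eiffel Tower", hops=2, rng=rng)
question = ("Starting from the entity shown in the image, follow the chain: "
            + " -> ".join(relations) + ". Which entity do you reach?")
print(question)
print("answer:", answer)
```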

Performance

Evaluation on 6 multimodal factual benchmarks

Model VDR FVQA MMSearch+ MMSearch LiveVQA BC-VL Avg.
Direct Answer
GPT-5 9.8 57.3 19.1 33.3 57.5 47.2 37.4
Gemini-2.5 Pro 8.0 60.7 14.5 39.8 60.3 43.1 37.7
Claude-4-Sonnet 2.0 35.3 4.0 18.7 38.5 29.3 21.3
Qwen3-VL-8B-Thinking 5.6 24.0 2.7 15.8 43.3 25.1 19.4
Qwen3-VL-30B-A3B-Thinking 4.4 32.7 4.5 19.3 49.0 34.6 24.1
Agent Workflow
GPT-5 20.4 69.0 17.2 63.7 73.3 46.1 48.3
Gemini-2.5 Pro 18.8 68.3 22.2 69.0 76.0 49.9 50.7
Claude-4-Sonnet 13.6 69.0 23.1 67.2 69.7 48.6 48.5
Qwen3-VL-8B-Thinking 17.6 51.3 12.2 45.6 56.3 37.1 36.7
Qwen3-VL-30B-A3B-Thinking 23.2 63.0 13.6 53.2 62.0 44.1 43.2
Multimodal DeepResearch MLLM
MMSearch-R1-7B -- 58.4 -- 53.8 48.4 -- --
WebWatcher-32B -- -- -- 55.3 58.7 26.7 --
Ours
Qwen3-VL-8B-Instruct (Agentic) 17.0 58.7 11.3 52.0 63.0 38.6 40.1
Vision-DeepResearch-8B 29.2 (+12.2) 64.7 (+6.0) 20.4 (+9.1) 69.6 (+17.6) 76.7 (+13.7) 42.6 (+4.0) 50.5 (+10.4)
Qwen3-VL-30B-A3B-Instruct (Agentic) 20.2 57.7 10.0 55.0 60.0 42.6 40.9
Vision-DeepResearch-30B-A3B 37.8 (+17.6) 74.2 (+16.5) 28.5 (+18.5) 69.6 (+14.6) 77.6 (+17.6) 53.7 (+11.1) 56.9 (+16.0)

Table 1. Benchmark results across different settings, with improvement (Δ) relative to the corresponding base MLLM in the agentic-workflow setting. VDR: VDR-Bench, MMSearch+: MMSearch-Plus, BC-VL: BrowseComp-VL. Our Vision-DeepResearch models achieve the best performance among all methods, substantially outperforming both proprietary models (GPT-5, Gemini-2.5-Pro, Claude-4-Sonnet) and existing multimodal deep-research MLLMs (MMSearch-R1, WebWatcher).
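
For reference, the Avg. column matches a plain mean over the six benchmarks, and each Δ is the difference against the corresponding agentic base-model row. The snippet below reproduces the Vision-DeepResearch-8B numbers from Table 1 as a sanity check (benchmark order and scores copied from the table).

```python
# Sanity check for Table 1's Avg. and Δ columns: Avg. is the mean of the six
# benchmark scores, Δ is the gap to the agentic base model. Values are copied
# from the Vision-DeepResearch-8B and Qwen3-VL-8B-Instruct (Agentic) rows.

BENCHMARKS = ["VDR", "FVQA", "MMSearch+", "MMSearch", "LiveVQA", "BC-VL"]
base = [17.0, 58.7, 11.3, 52.0, 63.0, 38.6]  # Qwen3-VL-8B-Instruct (Agentic)
ours = [29.2, 64.7, 20.4, 69.6, 76.7, 42.6]  # Vision-DeepResearch-8B


def avg(xs):
    return round(sum(xs) / len(xs), 1)


deltas = {b: round(o - v, 1) for b, o, v in zip(BENCHMARKS, ours, base)}
print("Avg.:", avg(ours))                             # 50.5
print("Per-benchmark Δ:", deltas)                     # {'VDR': 12.2, 'FVQA': 6.0, ...}
print("Δ on Avg.:", round(avg(ours) - avg(base), 1))  # 10.4
```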

Ablation Studies

Pipeline Ablation
Setting VDR MMS+ BC-VL Avg.
Direct Answer 4.8 3.6 27.6 12.0
WIS (Whole Image Search) 11.8 10.0 26.1 16.0
WIS + TS (Text Search) 16.0 23.5 48.4 29.3
CIS (Cropped Image Search) 15.4 22.7 30.8 23.0
CIS + TS (Full Pipeline) 37.8 28.5 53.7 40.0

Table 2. Ablation study on rollout pipeline. WIS: Whole Image Search, TS: Text Search, CIS: Cropped Image Search (multi-entity, multi-scale). The full pipeline (CIS+TS) achieves the best performance, demonstrating that multi-scale visual cropping and text search are jointly necessary for robust multimodal deep research.
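
One way to read Table 2 is as an ablation over which tools the agent may call at rollout time. The mapping below is a schematic summary of those settings; the tool names are placeholders rather than the actual tool identifiers used in the system.

```python
# Schematic view of the Table 2 settings as tool subsets exposed to the agent at
# rollout time. Tool names here are placeholders, not the system's identifiers.

ABLATION_TOOLSETS = {
    "Direct Answer": [],                                    # no search tools at all
    "WIS": ["whole_image_search"],                          # query with the full image
    "WIS + TS": ["whole_image_search", "text_search"],
    "CIS": ["cropped_image_search"],                        # multi-entity, multi-scale crops
    "CIS + TS": ["cropped_image_search", "text_search"],    # full pipeline
}


def tools_for(setting):
    """Return the tool list the agent may call under a given ablation setting."""
    return ABLATION_TOOLSETS[setting]


print(tools_for("CIS + TS"))
```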

Training Data & Methods Ablation
Model VDR MMS+ BC-VL Avg.
Qwen3-VL-30B-Instruct (Base) 20.2 10.0 42.6 24.3
+ 16K VQA traj. (SFT) 24.4 23.5 50.9 32.9
+ 8K QA traj. (SFT) 27.0 23.5 50.1 33.5
+ 6K fuzzy VQA traj. (SFT) 33.2 26.0 51.4 36.9
+ RL training 37.8 28.5 53.7 40.0

Table 3. Ablation results on training data and methods. Each row adds components incrementally. VQA trajectories provide the foundation, QA trajectories enable text-based deep research transfer, fuzzy multi-hop VQA covers long-tail settings, and RL training refines long-horizon decision making through online interaction.

VDR-Bench Detailed Results

Performance Comparison of Models Across Different Categories (Accuracy %)

Model / Setting People Object Arch. Nature Sci&Tech Art&Music Sports Movie Game Other Overall
Gemini 2.5 Pro
Direct Answer 6.4 9.8 9.8 8.2 12.0 11.8 4.2 2.0 7.7 9.6 8.2
CIS+TS 14.9 15.7 27.5 12.2 24.0 17.6 12.5 10.2 1.9 25.0 16.2
CIS+TS+MVF 38.3 23.5 33.3 24.5 22.0 39.2 25.0 24.5 21.2 48.1 30.0
GPT-5
Direct Answer 4.4 9.8 11.7 12.3 10.0 7.8 8.4 8.2 3.8 13.5 9.5
CIS+TS 20.8 17.6 14.0 16.7 24.5 21.2 12.5 19.3 20.8 25.0 19.2
CIS+TS+MVF 23.4 25.5 23.5 20.4 18.0 27.5 22.9 30.6 30.8 42.3 26.6
Claude-4-Sonnet
Direct Answer 2.1 3.9 7.8 6.2 10.0 7.8 2.2 0.0 3.8 5.6 5.6
CIS+TS 14.9 9.8 19.6 16.3 18.0 11.8 10.4 4.1 3.8 23.1 13.2
CIS+TS+MVF 12.5 17.6 24.0 35.4 15.1 26.9 16.7 12.3 23.1 24.4 20.6
Qwen3-VL-30B-A3B-Instruct
Direct Answer 3.9 3.9 6.1 2.0 4.1 0.0 7.7 3.8 0.0 7.8 3.9
CIS+TS 17.0 19.6 17.6 16.3 20.0 5.9 14.6 10.2 5.8 44.2 17.2
CIS+TS+MVF 25.5 21.6 23.5 18.4 8.0 23.5 16.7 18.4 28.8 26.9 21.2
Qwen3-VL-235B-A22B-Instruct
Direct Answer 6.2 3.9 10.0 22.9 7.5 13.5 6.2 3.5 7.5 7.5 8.8
CIS+TS 25.2 19.5 24.0 21.1 18.5 17.1 10.7 29.1 16.6 31.5 21.2
CIS+TS+MVF 25.0 23.5 30.0 31.2 30.2 28.8 20.8 22.8 30.2 32.5 27.4

Table 4. Performance Comparison of Models Across Different Categories on VDR-Bench. Direct Answer: models directly answer VQA without search tools. CIS+TS: Cropped Image Search + Text Search. MVF: Multi-turn Visual Forcing strategy. The MVF strategy consistently improves performance across all models, with Gemini 2.5 Pro achieving the highest overall score (30.0%) after applying MVF.
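
Table 4 expands MVF only as "Multi-turn Visual Forcing". The sketch below illustrates one plausible reading of such a strategy, offered as an assumption rather than the paper's definition: the agent must spend a minimum number of turns on visual search before it is allowed to emit a final answer.

```python
# Hypothetical illustration of a "visual forcing" constraint: this is an assumed
# reading of MVF, not the definition used in the paper. The agent must spend at
# least MIN_VISUAL_TURNS turns on cropped-image search before it may answer.

MIN_VISUAL_TURNS = 3  # assumed value, for illustration only


def enforce_visual_forcing(proposed_action, visual_turns_so_far):
    """Override a premature 'answer' action with another visual-search turn."""
    if proposed_action == "answer" and visual_turns_so_far < MIN_VISUAL_TURNS:
        return "cropped_image_search"
    return proposed_action


print(enforce_visual_forcing("answer", visual_turns_so_far=1))  # -> cropped_image_search
print(enforce_visual_forcing("answer", visual_turns_so_far=3))  # -> answer
```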

BibTeX

@article{huang2026vision,
  title={Vision-DeepResearch: Incentivizing DeepResearch Capability in Multimodal Large Language Models},
  author={Huang, Wenxuan and Zeng, Yu and Wang, Qiuchen and Fang, Zhen and Cao, Shaosheng and Chu, Zheng and Yin, Qingyu and Chen, Shuang and Yin, Zhenfei and Chen, Lin and others},
  journal={arXiv preprint arXiv:2601.22060},
  year={2026}
}

@article{vdr-bench,
  title={VDR-Bench: Rethinking Visual and Textual Search for Multimodal Large Language Models},
  author={Zeng, Yu and Huang, Wenxuan and Fang, Zhen and Chen, Shuang and Shen, Yufan and Cai, Yishuo and Wang, Xiaoman and Yin, Zhenfei and Chen, Lin and Chen, Zehui and Huang, Shiting and Zhao, Yiming and Hu, Yao and Torr, Philip and Ouyang, Wanli and Cao, Shaosheng},
  journal={preprint},
  year={2026}
}