Vision-DeepResearch: Incentivizing DeepResearch Capability in Multimodal Large Language Models

1CUHK MMLab 2East China Normal University 3University of Science and Technology of China 4Xiaohongshu Inc. 5Harbin Institute of Technology 6Zhejiang University 7University of California, Los Angeles 8University of Oxford 9Shenzhen Loop Area Institute
*: Equal Contribution †: Project Leader ✉: Corresponding Author

🚀 The first long-horizon multimodal deep-research MLLM

🔍 Multi-turn, multi-entity, multi-scale visual & textual search in real noisy web

🏆 SOTA on 6 multimodal factual benchmarks with only 8B / 30B models


Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models

1CUHK MMLab 2University of Science and Technology of China 3East China Normal University 4Xiaohongshu Inc. 5The University of California, Los Angeles 6Zhejiang University 7Peking University 8University of Oxford 9Shenzhen Loop Area Institute
*: Equal Contribution †: Project Leader ✉: Corresponding Author

📊 A vision-centric benchmark for multimodal deep research evaluation

🔬 Requires genuine visual search — not solvable by text-only cues or model priors

🌐 Reflects real-world settings with iterative entity-level localization and multi-hop reasoning

Overview

Vision-DeepResearch
Vision-DeepResearch Teaser

Figure 1. (A) We identify two key limitations of existing multimodal deep-research paradigms: (A.1) Prior methods largely ignore the search-engine hit-rate problem—a single full-image or entity-level query often fails to retrieve the required evidence, and querying different-scale crops of the same entity yields highly variable results. (A.2) Existing methods are constrained in both reasoning depth and retrieval breadth, typically producing only short trajectories. In contrast, our approach supports dozens of reasoning steps and hundreds of engine interactions. (B) Pipeline Overview: We synthesize high-quality VQA instances and multi-turn trajectories, then integrate multimodal deep-research capabilities into an MLLM via SFT and RL training, enabling long-horizon reasoning with multi-turn, multi-entity, and multi-scale visual and textual search. (C) Performance Comparison: Our model achieves SOTA performance on six benchmarks at a much smaller parameter scale.
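
The long-horizon behavior in Figure 1(B) can be pictured as a ReAct-style loop that alternates between a reasoning step and a tool call until the model commits to an answer. The snippet below is a minimal runnable sketch under assumed names and schemas (call_mllm, image_search, text_search, and the action dictionary are placeholders, not the released implementation).

```python
# A minimal runnable sketch of the ReAct-style loop, with stubbed tools so it runs
# stand-alone. Function names, the action schema, and the step budget are assumptions.

MAX_STEPS = 50  # "dozens of reasoning steps"; each step issues at most one tool call here


def image_search(crop_description: str) -> str:
    # placeholder for a real web-scale reverse-image search over an entity crop
    return f"[image-search results for {crop_description}]"


def text_search(query: str) -> str:
    # placeholder for a real text search engine
    return f"[text-search results for {query!r}]"


def call_mllm(context: list) -> dict:
    # placeholder for one reasoning step of the trained MLLM; a real model would
    # return {"action": "image_search" | "text_search" | "answer", "argument": ...}
    return {"action": "answer", "argument": "stub answer"}


def deep_research(image_path: str, question: str) -> str:
    """Interleave reasoning with visual and textual search until an answer is produced."""
    context = [("user", image_path, question)]
    for _ in range(MAX_STEPS):
        step = call_mllm(context)  # think, then choose one action
        if step["action"] == "image_search":      # multi-entity / multi-scale crop query
            context.append(("tool", image_search(step["argument"])))
        elif step["action"] == "text_search":     # vision-to-text bridged query
            context.append(("tool", text_search(step["argument"])))
        else:                                     # "answer": terminate the trajectory
            return step["argument"]
    return "no answer within the step budget"


print(deep_research("example.jpg", "Which year was this landmark completed?"))
```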

🚀

First Long-Horizon Multimodal Deep-Research MLLM

Dozens of ReAct steps, hundreds of tool calls for deep research

🔎

Multi-Entity Visual Search

Multi-scale, multi-entity cropping greatly improves search hit rate under real web noise

📚

End-to-End Training

30K multimodal trajectories (SFT) + 15K VQA (RL) with real tools

🏆

SOTA Performance

Outperforms GPT-5 / Gemini-2.5-Pro / Claude-4-Sonnet agents


VDR-Bench
VDR-Bench Teaser

Figure 2. Motivation: Existing Vision-DeepResearch benchmarks often fail to measure realistic multimodal search: many questions can be solved via text-only cues or model priors without genuine visual verification, and whole-image search frequently retrieves near-duplicate images with identifying metadata ("perfect retrieval"). VDR-Bench is designed to be visual-search–centric and to reflect real-world settings that require iterative, entity-level localization (e.g., multi-round cropping), cross-modal evidence collection, and multi-hop reasoning.

Data Pipeline

Vision-DeepResearch Data Pipeline

Figure 3. Our Data Pipeline. Top panel: We construct a complete multimodal deep-research synthesis pipeline. Leveraging the capabilities of an MLLM and a text-based DeepResearch foundation LLM, we generate long-horizon, multi-tool trajectories. The process involves multi-entity and multi-scale visual cropping and search (producing visual search trajectories), followed by text-based deep research via vision→text bridging (producing text search trajectories). Bottom panel: We obtain high-quality factual VQA instances via a rigorous verification and obfuscation procedure—including entity-level stringent image verification and filtering, random walks over real search engines and web pages, and joint entity/answer obfuscation—which are then used for trajectory synthesis and RL training.
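
The "multi-entity and multi-scale visual cropping" step in Figure 3 can be illustrated with a short Pillow sketch: for each candidate entity box, crops are generated at several context scales and every crop is sent to image search, since retrieval hit rate varies sharply with how much surrounding context a query image contains. The scale factors, box format, and stand-in inputs below are illustrative assumptions, not the paper's exact settings.

```python
# A sketch of multi-entity, multi-scale cropping before visual search (Figure 3,
# top panel). The scale set, box format, and stand-in inputs are assumptions.

from PIL import Image

SCALES = (1.0, 1.5, 2.5)  # assumed context-expansion factors around each entity box


def multi_scale_crops(image, box):
    """Yield crops of one entity at several context scales for image search."""
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    w, h = x1 - x0, y1 - y0
    for s in SCALES:
        half_w, half_h = s * w / 2, s * h / 2
        yield image.crop((
            max(0, int(cx - half_w)),
            max(0, int(cy - half_h)),
            min(image.width, int(cx + half_w)),
            min(image.height, int(cy + half_h)),
        ))


# Each (entity, scale) crop is sent to the search engine; whichever scale actually
# retrieves matching pages is kept, since hit rate varies sharply with scale.
img = Image.new("RGB", (640, 480))        # stand-in for a real web image
entity_boxes = [(100, 120, 220, 260)]     # stand-in for detected entity regions
crops = [c for b in entity_boxes for c in multi_scale_crops(img, b)]
print(len(crops), "crops ready for image search")
```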

VDR-Bench Data Pipeline

Figure 4. VDR-Bench is constructed via a multi-stage, vision-centric workflow: (Step 1) Annotators manually crop salient regions (objects, logos, landmarks, individuals) and perform web-scale visual search; (Step 2) Candidate entities are extracted from retrieved results and verified through MLLM-assisted and human checking processes; (Step 3) Verified visual entities are used to generate seed VQA pairs that require explicit recognition and grounding; (Step 4) Question difficulty is expanded via knowledge-graph–based multi-hop reasoning through random walks; and (Step 5) Automatic solvability checks and human quality filtering ensure each instance requires visual evidence, remains unambiguous, and avoids trivial or near-duplicate retrieval.
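
Step 4 (difficulty expansion via knowledge-graph random walks) can be sketched with a toy graph: starting from a verified visual entity, the walk follows a few relations, and the resulting chain becomes a multi-hop question whose answer is the final node. The graph, relations, and question template below are made-up illustrations; the real pipeline operates over much larger graphs and is followed by the Step 5 solvability and quality checks.

```python
# Toy sketch of Step 4: expand a seed visual entity into a multi-hop question by
# random-walking a knowledge graph. Entities, relations, and the template are
# made up for illustration; the real graph and templates differ.

import random

# toy knowledge graph: entity -> list of (relation, neighbor)
GRAPH = {
    "Eiffel Tower": [("designed by", "Gustave Eiffel"), ("located in", "Paris")],
    "Gustave Eiffel": [("born in", "Dijon")],
    "Paris": [("capital of", "France")],
    "Dijon": [],
    "France": [],
}


def random_walk(start, hops, rng):
    """Walk up to `hops` edges from a verified visual entity, recording the relations."""
    chain, node = [], start
    for _ in range(hops):
        edges = GRAPH.get(node, [])
        if not edges:
            break
        relation, node = rng.choice(edges)
        chain.append(relation)
    return chain, node


rng = random.Random(0)
relations, answer = random_walk("Eiffel Tower", hops=2, rng=rng)
question = ("Starting from the entity shown in the image, follow the chain: "
            + " -> ".join(relations) + ". Which entity do you reach?")
print(question)
print("answer:", answer)
```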

Performance

Evaluation on 6 multimodal factual benchmarks

Model VDR FVQA MMSearch+ MMSearch LiveVQA BC-VL Avg.
Direct Answer
GPT-5 9.8 57.3 19.1 33.3 57.5 47.2 37.4
Gemini-2.5 Pro 8.0 60.7 14.5 39.8 60.3 43.1 37.7
Claude-4-Sonnet 2.0 35.3 4.0 18.7 38.5 29.3 21.3
Qwen3-VL-8B-Thinking 5.6 24.0 2.7 15.8 43.3 25.1 19.4
Qwen3-VL-30B-A3B-Thinking 4.4 32.7 4.5 19.3 49.0 34.6 24.1
Agent Workflow
GPT-5 20.4 69.0 17.2 63.7 73.3 46.1 48.3
Gemini-2.5 Pro 18.8 68.3 22.2 69.0 76.0 49.9 50.7
Claude-4-Sonnet 13.6 69.0 23.1 67.2 69.7 48.6 48.5
Qwen3-VL-8B-Thinking 17.6 51.3 12.2 45.6 56.3 37.1 36.7
Qwen3-VL-30B-A3B-Thinking 23.2 63.0 13.6 53.2 62.0 44.1 43.2
Multimodal DeepResearch MLLM
MMSearch-R1-7B -- 58.4 -- 53.8 48.4 -- --
WebWatcher-32B -- -- -- 55.3 58.7 26.7 --
Ours
Qwen3-VL-8B-Instruct (Agentic) 17.0 58.7 11.3 52.0 63.0 38.6 40.1
Vision-DeepResearch-8B 29.2 (+12.2) 64.7 (+6.0) 20.4 (+9.1) 69.6 (+17.6) 76.7 (+13.7) 42.6 (+4.0) 50.5 (+10.4)
Qwen3-VL-30B-A3B-Instruct (Agentic) 20.2 57.7 10.0 55.0 60.0 42.6 40.9
Vision-DeepResearch-30B-A3B 37.8 (+17.6) 74.2 (+16.5) 28.5 (+18.5) 69.6 (+14.6) 77.6 (+17.6) 53.7 (+11.1) 56.9 (+16.0)

Table 1. Benchmark results across different settings, with improvement (Δ) relative to the corresponding base MLLM in the agentic-workflow setting. VDR: VDR-Bench, MMSearch+: MMSearch-Plus, BC-VL: BrowseComp-VL. Our Vision-DeepResearch models achieve the best performance among all methods, substantially outperforming both proprietary models (GPT-5, Gemini-2.5-Pro, Claude-4-Sonnet) and existing multimodal deep-research MLLMs (MMSearch-R1, WebWatcher).
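
For reference, the Avg. column matches a plain mean over the six benchmarks, and each Δ is the difference against the corresponding agentic base-model row. The snippet below reproduces the Vision-DeepResearch-8B numbers from Table 1 as a sanity check (benchmark order and scores copied from the table).

```python
# Sanity check for Table 1's Avg. and Δ columns: Avg. is the mean of the six
# benchmark scores, Δ is the gap to the agentic base model. Values are copied
# from the Vision-DeepResearch-8B and Qwen3-VL-8B-Instruct (Agentic) rows.

BENCHMARKS = ["VDR", "FVQA", "MMSearch+", "MMSearch", "LiveVQA", "BC-VL"]
base = [17.0, 58.7, 11.3, 52.0, 63.0, 38.6]  # Qwen3-VL-8B-Instruct (Agentic)
ours = [29.2, 64.7, 20.4, 69.6, 76.7, 42.6]  # Vision-DeepResearch-8B


def avg(xs):
    return round(sum(xs) / len(xs), 1)


deltas = {b: round(o - v, 1) for b, o, v in zip(BENCHMARKS, ours, base)}
print("Avg.:", avg(ours))                             # 50.5
print("Per-benchmark Δ:", deltas)                     # {'VDR': 12.2, 'FVQA': 6.0, ...}
print("Δ on Avg.:", round(avg(ours) - avg(base), 1))  # 10.4
```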

Ablation Studies

Pipeline Ablation
Setting VDR MMS+ BC-VL Avg.
Direct Answer 4.8 3.6 27.6 12.0
WIS (Whole Image Search) 11.8 10.0 26.1 16.0
WIS + TS (Text Search) 16.0 23.5 48.4 29.3
CIS (Cropped Image Search) 15.4 22.7 30.8 23.0
CIS + TS (Full Pipeline) 37.8 28.5 53.7 40.0

Table 2. Ablation study on rollout pipeline. WIS: Whole Image Search, TS: Text Search, CIS: Cropped Image Search (multi-entity, multi-scale). The full pipeline (CIS+TS) achieves the best performance, demonstrating that multi-scale visual cropping and text search are jointly necessary for robust multimodal deep research.
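
One way to read Table 2 is as an ablation over which tools the agent may call at rollout time. The mapping below is a schematic summary of those settings; the tool names are placeholders rather than the actual tool identifiers used in the system.

```python
# Schematic view of the Table 2 settings as tool subsets exposed to the agent at
# rollout time. Tool names here are placeholders, not the system's identifiers.

ABLATION_TOOLSETS = {
    "Direct Answer": [],                                    # no search tools at all
    "WIS": ["whole_image_search"],                          # query with the full image
    "WIS + TS": ["whole_image_search", "text_search"],
    "CIS": ["cropped_image_search"],                        # multi-entity, multi-scale crops
    "CIS + TS": ["cropped_image_search", "text_search"],    # full pipeline
}


def tools_for(setting):
    """Return the tool list the agent may call under a given ablation setting."""
    return ABLATION_TOOLSETS[setting]


print(tools_for("CIS + TS"))
```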

Training Data & Methods Ablation
Model VDR MMS+ BC-VL Avg.
Qwen3-VL-30B-Instruct (Base) 20.2 10.0 42.6 24.3
+ 16K VQA traj. (SFT) 24.4 23.5 50.9 32.9
+ 8K QA traj. (SFT) 27.0 23.5 50.1 33.5
+ 6K fuzzy VQA traj. (SFT) 33.2 26.0 51.4 36.9
+ RL training 37.8 28.5 53.7 40.0

Table 3. Ablation results on training data and methods. Each row adds components incrementally. VQA trajectories provide the foundation, QA trajectories enable text-based deep research transfer, fuzzy multi-hop VQA covers long-tail settings, and RL training refines long-horizon decision making through online interaction.

VDR-Bench Detailed Results

Performance Comparison of Models Across Different Categories (Accuracy %)

Model / Setting People Object Arch. Nature Sci&Tech Art&Music Sports Movie Game Other Overall
Gemini 2.5 Pro
Direct Answer 6.4 9.8 9.8 8.2 12.0 11.8 4.2 2.0 7.7 9.6 8.2
CIS+TS 14.9 15.7 27.5 12.2 24.0 17.6 12.5 10.2 1.9 25.0 16.2
CIS+TS+MVF 38.3 23.5 33.3 24.5 22.0 39.2 25.0 24.5 21.2 48.1 30.0
GPT-5
Direct Answer 4.4 9.8 11.7 12.3 10.0 7.8 8.4 8.2 3.8 13.5 9.5
CIS+TS 20.8 17.6 14.0 16.7 24.5 21.2 12.5 19.3 20.8 25.0 19.2
CIS+TS+MVF 23.4 25.5 23.5 20.4 18.0 27.5 22.9 30.6 30.8 42.3 26.6
Claude-4-Sonnet
Direct Answer 2.1 3.9 7.8 6.2 10.0 7.8 2.2 0.0 3.8 5.6 5.6
CIS+TS 14.9 9.8 19.6 16.3 18.0 11.8 10.4 4.1 3.8 23.1 13.2
CIS+TS+MVF 12.5 17.6 24.0 35.4 15.1 26.9 16.7 12.3 23.1 24.4 20.6
Qwen3-VL-30B-A3B-Instruct
Direct Answer 3.9 3.9 6.1 2.0 4.1 0.0 7.7 3.8 0.0 7.8 3.9
CIS+TS 17.0 19.6 17.6 16.3 20.0 5.9 14.6 10.2 5.8 44.2 17.2
CIS+TS+MVF 25.5 21.6 23.5 18.4 8.0 23.5 16.7 18.4 28.8 26.9 21.2
Qwen3-VL-235B-A22B-Instruct
Direct Answer 6.2 3.9 10.0 22.9 7.5 13.5 6.2 3.5 7.5 7.5 8.8
CIS+TS 25.2 19.5 24.0 21.1 18.5 17.1 10.7 29.1 16.6 31.5 21.2
CIS+TS+MVF 25.0 23.5 30.0 31.2 30.2 28.8 20.8 22.8 30.2 32.5 27.4

Table 4. Performance Comparison of Models Across Different Categories on VDR-Bench. Direct Answer: models directly answer VQA without search tools. CIS+TS: Cropped Image Search + Text Search. MVF: Multi-turn Visual Forcing strategy. The MVF strategy consistently improves performance across all models, with Gemini 2.5 Pro achieving the highest overall score (30.0%) after applying MVF.
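
Table 4 expands MVF only as "Multi-turn Visual Forcing". The sketch below illustrates one plausible reading of such a strategy, offered as an assumption rather than the paper's definition: the agent must spend a minimum number of turns on visual search before it is allowed to emit a final answer.

```python
# Hypothetical illustration of a "visual forcing" constraint: this is an assumed
# reading of MVF, not the definition used in the paper. The agent must spend at
# least MIN_VISUAL_TURNS turns on cropped-image search before it may answer.

MIN_VISUAL_TURNS = 3  # assumed value, for illustration only


def enforce_visual_forcing(proposed_action, visual_turns_so_far):
    """Override a premature 'answer' action with another visual-search turn."""
    if proposed_action == "answer" and visual_turns_so_far < MIN_VISUAL_TURNS:
        return "cropped_image_search"
    return proposed_action


print(enforce_visual_forcing("answer", visual_turns_so_far=1))  # -> cropped_image_search
print(enforce_visual_forcing("answer", visual_turns_so_far=3))  # -> answer
```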

BibTeX

@article{huang2026vision,
  title={Vision-DeepResearch: Incentivizing DeepResearch Capability in Multimodal Large Language Models},
  author={Huang, Wenxuan and Zeng, Yu and Wang, Qiuchen and Fang, Zhen and Cao, Shaosheng and Chu, Zheng and Yin, Qingyu and Chen, Shuang and Yin, Zhenfei and Chen, Lin and others},
  journal={arXiv preprint arXiv:2601.22060},
  year={2026}
}

@article{vdr-bench,
  title={VDR-Bench: Rethinking Visual and Textual Search for Multimodal Large Language Models},
  author={Zeng, Yu and Huang, Wenxuan and Fang, Zhen and Chen, Shuang and Shen, Yufan and Cai, Yishuo and Wang, Xiaoman and Yin, Zhenfei and Chen, Lin and Chen, Zehui and Huang, Shiting and Zhao, Yiming and Hu, Yao and Torr, Philip and Ouyang, Wanli and Cao, Shaosheng},
  journal={preprint},
  year={2026}
}