Wenxuan Huang (黄文轩)  

GitHub · Google Scholar · Semantic Scholar · ORCID

 

Master's Student at East China Normal University

Personal Email: osilly0616 (at) gmail.com

Short Biography [CV]

I am a third-year master's student at East China Normal University, supervised by Prof. Shaohui Lin. I am also a Research Assistant at MMLab@The Chinese University of Hong Kong, working with Prof. Wanli Ouyang. My work focuses on AI research, and I collaborate closely with industrial AI laboratories such as the NLP Team@Xiaohongshu.

If you are interested in collaboration or discussion, please email me.

Research Interest

My research interests broadly lie in the area of Multimodal Large Language Models, especially Multimodal Reasoning Models.

Selected Publications [ Full List ]

(*Co-first Author, **Corresponding Author, Project Leader)
[Reasoning MLLM] Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models
Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Yao Hu, Shaohui Lin
300+ citations & 700+ GitHub stars within six months; Preprint, first author
[ Paper ] [ Code ]
This is the first paper to explore how to effectively apply R1-like RL to MLLMs. It introduces Vision-R1, a reasoning MLLM that leverages cold-start initialization and RL training to incentivize reasoning capability.
[AIGC/Interleaving Reasoning/Unified MLLM] Interleaving Reasoning for Better Text-to-image Generation
Wenxuan Huang, Shuang Chen, Zheyong Xie, Shaosheng Cao, Shixiang Tang, Yufan Shen, Qingyu Yin, Wenbo Hu, Xiaoman Wang, Yuntian Tang, Junbo Qiao, Hangyu Guo, Yao Hu, Zhenfei Yin, Philip Torr, Yu Cheng, Wanli Ouyang, Shaohui Lin
Preprint, first author
[ Paper ] [ Code ]
This is an early exploration that introduces Interleaving Reasoning to the text-to-image generation field, achieving SoTA benchmark performance. It also significantly improves the quality, fine-grained details, and aesthetics of generated images.
[Efficient MLLM] Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic Vision-language Context Sparsification
Wenxuan Huang, Zijie Zhai, Yunhang Shen, Shaosheng Cao, Fei Zhao, Xiangfeng Xu, Zheyu Ye, Shaohui Lin
International Conference on Learning Representations (ICLR), 2025, first author
[ Paper ] [ Code ]
Dynamic-LLaVA is the first MLLM acceleration framework that simultaneously sparsifies both vision and language contexts, unifying inference-efficiency optimization across different MLLM inference modes.
[Transformer Training Acceleration] A General and Efficient Training for Transformer via Token Expansion
Wenxuan Huang, Yunhang Shen, Jiao Xie, Baochang Zhang, Gaoqi He, Ke Li, Xing Sun, Shaohui Lin
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024, first author
[ Paper ] [ Code ]
We propose a plug-and-play Transformer training acceleration framework that leaves the original training hyper-parameters and architecture untouched and introduces no additional training strategies.
[AI4Geophysics] An Intelligent First Arrival Picking Method of Microseismic Signals Based on the Small Sample Expansion
Wenxuan Huang, Guanqun Sheng, Xingong Tang, Kai Ma, Jingyi Lu, Hang Sun
IEEE Transactions on Geoscience and Remote Sensing (TGRS), first author
[ Paper ] [ Code ]
We propose a GAN that generates microseismic samples under unsupervised conditions to expand microseismic datasets with limited samples. We then use an enhanced first-arrival picking network to improve the picking accuracy for low-SNR microseismic signals.
[Reasoning MLLM] Actial: Activate Spatial Reasoning Ability of Multimodal Large Language Models
Xiaoyu Zhan*, Wenxuan Huang*, Hao Sun*, Xinyu Fu, Changfeng Ma, Shaosheng Cao, Bohan Jia, Shaohui Lin, Zhenfei Yin, Lei Bai, Wanli Ouyang, Yuanqi Li, Jie Guo, Yanwen Guo
Conference on Neural Information Processing Systems (NeurIPS), 2025, co-first author (second) & project leader
We address the problem that current MLLMs cannot effectively capture the detailed spatial information required for robust real-world performance, especially cross-view consistency, a key requirement for accurate 3D reasoning.
[MLLM] TimeSoccer: An End-to-End Multimodal Large Language Model for Soccer Commentary Generation
Ling You*, Wenxuan Huang*, Xinni Xie, Xiangyi Wei, Bangyan Li, Shaohui Lin, Yang Li, Changbo Wang
ACM International Conference on Multimedia (ACMMM), 2025, co-first author (second)
[ Paper ] [ Project Page ]
We propose the first end-to-end MLLM for soccer commentary generation, specifically designed for Single-anchor Dense Video Captioning (SDVC) in full-match soccer videos. The model jointly predicts timestamps and generates captions in a single pass, enabling global context modeling over 45-minute matches.
[CNN Inference Acceleration] Filter Pruning for Efficient CNNs via Knowledge-driven Differential Filter Sampler
Shaohui Lin, Wenxuan Huang, Jiao Xie, Baochang Zhang, Yunhang Shen, Zhou Yu, Jungong Han, David Doermann
Preprint, first student author (first author is my advisor)
[ Paper ] [ Code ]
We propose a unified CNN pruning framework directly optimized end-to-end with a global pruning constraint.
[MLLM] LLaVA-RadZ: Can Multimodal Large Language Models Effectively Tackle Zero-shot Radiology Recognition?
Bangyan Li*, Wenxuan Huang**, Zhenkun Gao, Yeqiang Wang, Yunhang Shen, Jingzhong Lin, Ling You, Yuxiang Shen, Shaohui Lin, Wanli Ouyang, Yuling Sun
Preprint, co-first author (second) & corresponding author
[ Paper ]
We convert generative models into discriminative ones to address the limitation that current MLLMs cannot effectively tackle zero-shot radiology recognition.
Contribution note: Wenxuan Huang proposed the main idea, designed the experiments, and contributed to the discussion of this paper. Bangyan Li refined and finalized the idea, implemented the code and experiments, and was responsible for writing the manuscript.
[Image Editing Benchmark] CompBench: Benchmarking Complex Instruction-guided Image Editing
Bohan Jia*, Wenxuan Huang*, Yuntian Tang*, Junbo Qiao, Jincheng Liao, Shaosheng Cao, Fei Zhao, Zhaopeng Feng, Zhouhong Gu, Zhenfei Yin, Lei Bai, Wanli Ouyang, Lin Chen, Zihan Wang, Yuan Xie, Shaohui Lin
Preprint, co-first author (second)
[ Paper ] [ Code ]
We propose the first benchmark for complex instruction-guided image editing.
[CLIP Inference Acceleration] CLIP-Map: Structured Matrix Adaptation for Parameter-Efficient CLIP Compression
Kangjie Zhang*, Wenxuan Huang*, Xin Zhou, Boxiang Zhou, Dejia Song, Yuan Xie, Baochang Zhang, Lizhuang Ma, Nemo Chen, Xu Tang, Yao Hu, Shaohui Lin
Preprint, co-first author (second)
We propose the first mapping-based CLIP compression framework that maps CLIP parameters to a smaller representation, thereby accelerating inference.
Contribution note: Wenxuan Huang proposed the main idea, designed the experiments, and contributed to the discussion of this paper. Kangjie Zhang refined and finalized the idea, implemented the code and experiments, and was responsible for writing the manuscript.
[Agentic RL/Agentic Task] Agentic Jigsaw Interaction Learning for Enhancing Visual Perception and Reasoning in Vision-Language Models
Yu Zeng*, Wenxuan Huang*, Shiting Huang*, Xikun Bao, Yukun Qi, Yiming Zhao, Qiuchen Wang, Lin Chen, Zehui Chen, Huaian Chen, Wanli Ouyang, Feng Zhao
Preprint, co-first author (second)
[ Paper ] [ Code ]
We introduce agentic jigsaw interaction learning to enhance visual perception and reasoning in MLLMs without VQA labels during training, demonstrating strong generalization across 9 general vision tasks.
[Reasoning MLLM] Exploring End-to-End Paradigms for Visual Chinese Grammatical Error Correction
Xiaoman Wang, Wenxuan Huang, Wenbiao Tao, Yike Zhao, Yaohui Liu, Yunshi Lan, Weining Qian
Preprint, second author
We explore how to fine-tune an MLLM to solve this complex perception task.

Honors and Awards

Academic (Undergraduate/Master's/PhD) Honors

  • National Scholarship awarded by the Ministry of Education (Top 0.2%), 2025
  • "Panshi" Scholarship, 2024
  • National Scholarship awarded by the Ministry of Education (Top 0.2%), 2021
  • Yangtze River Power Scholarship, 2020

Last Updated on 11th Oct, 2025
