I am a Master’s student in Applied Math at Fudan University and an incoming CS Ph.D. student at Yale University, advised by Prof. Rex Ying. My current research focuses on Multimodal Structured Reasoning, aiming to build systems that integrate explicit structural priors with dynamic reasoning capabilities. Currently, I am exploring multimodal code generation, post-training strategies for Vision-Language Models (VLMs), and modeling structured reasoning processes. Feel free to reach out if you’d like to learn more about my work, chat, or explore potential collaborations. You can find my publications on my google scholar.

🔥 News

  • 2026.05: BMC accepted to ICML 2026.
  • 2026.03: Two papers (MacTok and GIFT) accepted to CVPR 2026.
  • 2026.02: Joined Microsoft Research Asia (MSRA) as a Research Intern.
  • 2026.02: Two papers (ARTDECO and IVQ) accepted to ICLR 2026.
  • 2025.09: Two papers (GOOD and OrderMind) accepted to NeurIPS 2025.
  • 2025.09: Loong accepted to NeurIPS 2025 Workshop.
  • 2025.06: Dark-ISP accepted to ICCV 2025.
  • 2025.05: Joined Shanghai AILab as a Research Intern, working with Jie Fu.
  • 2025.04: Welcomed two cats into my life: Fellow and Putao 🐱🍇
  • 2025.01: MVP accepted to ICLR 2025.
  • 2024.12: Joined as a Research Intern, working with Prof. Chenyang Si and Prof. Ziwei Liu.

📝 Selected Publications

My current research interests primarily revolve around Multimodal Structured Reasoning. While I have explored a diverse range of directions in the past—from continuous visual generation to physical scene understanding—these endeavors have ultimately converged on a unified goal: solving the core challenges of multimodal alignment, abstraction, and complex reasoning.

ICLR 2025
sym

MVP: Deep Incomplete Multi-view Learning via Cyclic Permutation of VAEs

Xin Gao, Jian Pu

Multi-modal LearningLatent SpaceCross-modal Generation

GitHub Stars GitHub Forks [Project page]

  • MVP introduces a robust generative approach to incomplete multi-view representation learning by leveraging latent space correspondences in Variational Auto-Encoders. This enables the inference of missing modalities and enhances multi-view consistency even with irregularly missing information.
CVPR 2026
sym

MacTok: Robust Continuous Tokenization for Image Generation

Hengyu Zeng#, Xin Gao#, Guanghao Li, Yuxiang Yan, Jiaoyang Ruan, Junpeng Ma, Haoyu Albert Wang, Jian Pu

Continuous TokenizationPosterior CollapseImage Generation
  • MacTok addresses the severe “posterior collapse” failure in continuous image tokenizers. By introducing DINO-guided semantic masking and multi-level representation alignment, it forces the model to infer robust semantics from incomplete visual evidence, yielding state-of-the-art generation fidelity.
ICML 2026
sym

Reasoning on the Manifold: Bidirectional Consistency for Self-Verification in Diffusion Language Models

Jiaoyang Ruan#, Xin Gao#, Yinda Chen, Hengyu Zeng, Liang Du, Guanghao Li, Jie Fu, Jian Pu

Diffusion Language ModelsReasoningSelf-Verification
  • BMC introduces a geometric perspective to reasoning in Diffusion Language Models. By framing verifiable trajectories as stable paths on a high-density manifold, it provides an unsupervised mechanism for logical self-verification, empowering models to dynamically verify generation steps and self-evolve.
NeurIPS 2025
sym

GOOD: Training-Free Guided Diffusion Sampling for Out-of-Distribution Detection

Xin Gao, Jiyao Liu, Guanghao Li, Yueming Lyu, Jianxiong Gao, Weichen Yu, Ningsheng Xu, Liang Wang, Caifeng Shan, Ziwei Liu, Chenyang Si

Diffusion ModelsDiffusion SamplingOOD Detection
  • GOOD is a training-free diffusion guidance framework that shapes a robust OOD/ID classification boundary. It steers sampling with image-level and feature-level gradients to systematically generate diverse, controllable OOD examples for robust decision-making.
NeurIPS 2025
sym

Learning Spatial-Aware Manipulation Ordering

Yuxiang Yan, Zhiyuan Zhou, Xin Gao, Guanghao Li, Shenglin Li, Jiaqi Chen, Qunyan Pu, Jian Pu

Spatial ReasoningVision-Language Models
  • OrderMind is a spatial-aware manipulation ordering framework that learns object priorities from local geometry via a kNN spatial graph and a lightweight module, supervised by VLM-distilled situational knowledge for robotic embodied tasks.
ICLR 2026
sym

Escaping Low-Rank Traps: Interpretable Visual Concept Learning via Implicit Vector Quantization

Shujian Gao, Yuan Wang, Chenglong Ma, Xin Gao, Jiangtao Yan, Junzhi Ning, Cheng Tang, Changkai Ji, Huihui Xu, Wei Li, Ziyan Huang et al.

Visual Concept LearningRepresentation Learning
  • IVQ acts as a structural regularizer to address representational collapse in Concept Bottleneck Models (CBMs). It anchors patch features to learned semantic prototypes continuously, preserving a high-rank, explainable representation space for visual concept learning.
ICCV 2025
sym

Dark-ISP: Enhancing RAW Image Processing for Low-Light Object Detection

Jiasheng Guo#, Xin Gao#, Yuxiang Yan, Guanghao Li, Jian Pu

Image ProcessingPhysical PriorsObject Detection
  • Dark-ISP is a lightweight, self-adaptive ISP plugin that enhances low-light object detection by processing Bayer RAW images. It breaks down traditional pipelines into optimized sub-modules, using physics-informed priors for robust sensory perception.
ICLR 2026
sym

ARTDECO: Toward High-Fidelity On-the-Fly Reconstruction with Hierarchical Gaussian Structure and Feed-Forward Guidance

Guanghao Li, Kerui Ren, Linning Xu, Zhewen Zheng, Changjian Jiang, Xin Gao, Bo Dai, Jian Pu, Mulin Yu, Jiangmiao Pang

Scene Reconstruction3D Foundation Models

GitHub Stars GitHub Forks [Project page]

  • ARTDECO unifies feed-forward 3D foundation priors with SLAM-style global optimization. It utilizes a hierarchical 3D Gaussian scene representation to achieve efficient, robust, and high-fidelity on-the-fly monocular 3D reconstruction.
NeurIPS 2025 Workshop
sym

Loong: Synthesize Long Chain-of-Thoughts at Scale through Verifiers

Xingyue Huang, Rishabh, Gregor Franke, Ziyi Yang, Jiamu Bai, Weijie Bai, Jinhe Bi, Zifeng Ding, Yiqun Duan, Chengyu Fan, Wendong Fan, Xin Gao et al.

LLMReasoningRLVR

GitHub Stars GitHub Forks [Project page]

  • Loong is an open-source framework for scalable, verifiable Chain-of-Thought (CoT) reasoning data generation. It pairs a large benchmark with a modular generator forming an agent–environment loop for broad-domain reasoning training and verification.

🎖 Honors and Awards

  • 2023.09 Fudan University Zhicheng Freshman Second Prize Scholarship (Top 5%)
  • 2023.06 Outstanding Graduate of Shanghai
  • 2022.11 Second Prize in the Chinese Mathematics Competitions (Category A)
  • 2021.12 National Scholarship, China
  • 2021.09 National Second Prize in the China Undergraduate Mathematical Contest in Modeling
  • 2020.12 Shanghai Scholarship

👨‍💼 Academic Service

  • Conference Reviewer: NeurIPS 2025, ICCV 2025, ICLR 2026, ICML 2026 (Silver Reviewer), NeurIPS 2026

📖 Educations

  • 2023.09 - 2026.06 (now), Master of Applied Mathematics, Fudan University, Shanghai, China.
  • 2019.09 - 2023.06, Bachelor of Mathematics, Donghua Univeristy, Shanghai, China.

💻 Internships

📚 Learning Materials

😁 If you want the following material without watermarks, please contact me using the email address and specify your intended use.

Material 1: Frontiers in Diffusion Model Technologies (1)

Material 1 Image

This document provides an overview of key concepts related to diffusion models, particularly focusing on the theoretical foundations, development timeline, and recent advancements in the field. The content includes detailed discussions on VAE, DDPM, DDIM, SDE, and ODE, as well as conditional guidance. It also covers the evolution of stable diffusion, including topics like Latent Diffusion, VQ-VAE, and DiT. Lastly, the document highlights the latest methodology, IC-Light, set to be presented at ICLR 2025.

Material 2: Tutorial of Information Geometry and t3-VAE

Material 2 Image

This document introduces the t3-Variational Autoencoder (ICLR 2024), which uses Student’s t-distributions to model heavy-tailed data distributions and improve latent variable representations. It also explores the framework of Information Geometry, focusing on how generative models can be understood through statistical manifolds, divergences, and Riemannian metrics, providing a deeper understanding of probability distributions and their applications in machine learning, signal processing, and neuroscience.

Material 3: EM Algorithm and X-metric

Material 3 Image

This document introduces the X-metric framework (PAMI 2023), an N-dimensional information-theoretic approach designed for groupwise registration and deep combined computing, with applications in advanced machine learning tasks. It also covers the theoretical foundations, including entropy, mutual information, and the MLE algorithm, alongside the framework’s modifications for deep computing and network training.