Zihui Xue

Zihui (Sherry) Xue

Hi, I am Zihui Xue (薛子慧), and I usually go by Sherry. I am final-year Ph.D. candidate at UT Austin, advised by Prof. Kristen Grauman. I am currently a Student Researcher at Google DeepMind in London. Previously, I was a visiting researcher at FAIR, Meta AI (2023-2025) and had the pleasure of collaborating with Prof. Hang Zhao on multimodal learning (2020-2021). I obtained my bachelor's degree from Fudan University in 2020.

My research interests lie in video understanding and multimodal learning, with a recent focus on video-language models. I am on the 2026 job market and seeking full-time positions.

Email | CV | Scholar | Github

News

[Nov. 2025] Check out our work on exploring the role of camera pose trajectory in video perception: Seeing without Pixels 📸.

[Oct. 2025] Selected as MIT EECS Rising Star 🌟.

[Sep. 2025] AoT and VER got accepted by NeurIPS. See you in San Diego 🌴.

[May 2025] Excited to spend the summer in London 🇬🇧 working at Google DeepMind as a Student Researcher.

[Feb. 2025] Two accepted papers at CVPR: ProgressCaptioner and Viewpoint Rosetta Stone (oral).

[Sep. 2024] HOI-Swap is accepted by NeurIPS'24. See you in Vancouver 🎿.

[Jul. 2024] Two accepted papers at ECCV'24: Action2Sound (oral) and Exo2Ego.

[Feb. 2024] Three papers (one first-author) got accepted by CVPR'24. See you in Seattle ☕️.

Selected Projects

Video Learning

Seeing without Pixels: Perception from Camera Trajectories

Zihui Xue, Kristen Grauman, Dima Damen, Andrew Zisserman, Tengda Han
arXiv, 2025 [paper] [webpage]

Seeing the Arrow of Time in Large Multimodal Models

Zihui Xue, Mi Luo, Kristen Grauman
NeurIPS, 2025 [paper] [webpage]

When Thinking Drifts: Evidential Grounding for Robust Video Reasoning

Mi Luo, Zihui Xue, Alex Dimakis, Kristen Grauman
NeurIPS, 2025 [paper] [webpage]

Vid2Coach: Transforming How-To Videos into Task Assistants

Mina Huh, Zihui Xue, Ujjaini Das, Kumar Ashutosh, Kristen Grauman, Amy Pavel
UIST, 2025 [paper] [webpage]

Progress-Aware Video Frame Captioning

Zihui Xue, Joungbin An, Xitong Yang, Kristen Grauman
CVPR, 2025 [paper] [webpage]

Viewpoint Rosetta Stone: Unlocking Unpaired Ego-Exo Videos for View-invariant Representation Learning

Mi Luo, Zihui Xue, Alex Dimakis, Kristen Grauman
CVPR, 2025 (Oral) [paper] [webpage]

HOI-Swap: Swapping Objects in Videos with Hand-Object Interaction Awareness

Zihui Xue, Mi Luo, Changan Chen, Kristen Grauman
NeurIPS, 2024 [paper] [webpage]

Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos

Changan Chen*, Puyuan Peng*, Ami Baid, Zihui Xue, Wei-Ning Hsu, David Harwath, Kristen Grauman
ECCV, 2024 (Oral) [paper] [webpage]

Put Myself in Your Shoes: Lifting the Egocentric Perspective from Exocentric Videos

Mi Luo, Zihui Xue, Alex Dimakis, Kristen Grauman
ECCV, 2024 [paper]

Learning Object State Changes in Videos: An Open-World Perspective

Zihui Xue, Kumar Ashutosh, Kristen Grauman
CVPR, 2024 [paper] [webpage]

Ego-Exo4D: Understanding Skilled Human Activity from First-and Third-Person Perspectives

Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, ..., Zihui Xue, et al.
CVPR, 2024 (Oral) [paper] [webpage] [blog]

Detours for Navigating Instructional Videos

Kumar Ashutosh, Zihui Xue, Tushar Nagarajan, Kristen Grauman
CVPR, 2024 (Highlight) [paper]

Learning Fine-grained View-Invariant Representations from Unpaired Ego-Exo Videos via Temporal Alignment

Zihui Xue, Kristen Grauman
NeurIPS, 2023 [paper] [webpage]

Egocentric Video Task Translation

Zihui Xue, Yale Song, Kristen Grauman, Lorenzo Torresani
CVPR 2023 (Hightlight) [paper] [webpage]

Multimodal perception and self-supervised learning

The Modality Focusing Hypothesis: Towards Understanding Crossmodal Knowledge Distillation

Zihui Xue*, Zhengqi Gao* Sucheng Ren*, Hang Zhao
ICLR, 2023 (top-5%) [paper] [webpage]
When is crossmodal knowledge distillation helpful?

Dynamic Multimodal Fusion

Zihui Xue, Radu Marculescu
CVPR MULA workshop, 2023 [paper]
Adaptively fuse multimodal data and generate data-dependent forward paths during inference time.

What Makes Multi-Modal Learning Better than Single (Provably)

Yu Huang, Chenzhuang Du, Zihui Xue, Xuanyao Chen, Hang Zhao, Longbo Huang
NeurIPS, 2021 [paper]
Can multimodal learning provably perform better than unimodal?

Multimodal Knowledge Expansion

Zihui Xue, Sucheng Ren, Zhengqi Gao, Hang Zhao
ICCV, 2021 [paper] [webpage]
A knowledge distillation-based framework to effectively utilize multimodal data without requiring labels.

On Feature Decorrelation in Self-Supervised Learning

Tianyu Hua, Wenxiao Wang, Zihui Xue, Sucheng Ren, Yue Wang, Hang Zhao
ICCV, 2021 (Oral, Acceptance Rate 3.0%) [paper] [webpage]
Reveal the connection between model collapse and feature correlations!