Event Details

A Decade of First-Person AI: From Egocentric Vision Systems to Embodied Intelligence

Date: 2026/05/07

Abstract

The cameras we wear see what we see — and they record signals that no third-person dataset can offer: where our eyes go, what our hands reach for, and how a complex task unfolds, step by step. Over the past decade, egocentric computer vision has grown from a niche curiosity into a foundational discipline for understanding human behavior and, increasingly, for teaching machines to act in the physical world.

In this talk, I will trace this trajectory through three interlocking threads from my own research. First, data: large-scale egocentric datasets such as Ego4D and EgoExoLearn that reframed what perception models should learn — not merely object categories, but attention, intention, and procedural structure. Second, algorithms: our work on hand–object interaction, gaze prediction, and IMU–vision multimodal fusion, demonstrating how first-person priors close perceptual gaps that purely visual models cannot. Third, systems: wearable AI prototypes deployed in real-world settings, where on-device inference, sensor co-design, and human-centered evaluation matter as much as model accuracy.

I will close by sketching what I see as the natural next step: using the egocentric record of human experience as scalable supervision for embodied AI — agents that learn dexterous skills by watching people, plan through interactive world models, and act in the physical world with the situational awareness of a skilled apprentice. The throughline is a single conviction: the path to embodied intelligence runs through the first-person record of how humans actually live and work.


Biography

Yifei Huang is a researcher specializing in egocentric vision and interactive intelligence, with a focus on bridging perception, cognition, and action in AI systems. His work centers on first-person visual understanding, human behavior modeling, and human–AI co-creation, aiming to build agents with long-term memory and adaptive capabilities.

He received his Ph.D. in Information Science and Technology from the University of Tokyo under Prof. Yoichi Sato, and his B.Eng. in Automation from Shanghai Jiao Tong University. Huang has led and contributed to multiple large-scale research initiatives, including the development of the Ego4D and Ego-Exo4D datasets, and his work has been recognized at top-tier venues such as CVPR.

His research spans video prediction, action recognition, intention understanding, and multimodal foundation models. He has served as an area chair and reviewer for leading conferences and journals, including CVPR, ICCV, ICLR, and NeurIPS.