Unveiling Egocentric Reasoning with Spatio-Temporal CoT

Baoqi Pei, Yifei Huang, Jilan Xu, Yuping He, Guo Chen, Fei Wu, Yu Qiao, Jiangmiao Pang

July, 2025

Abstract

Egocentric video reasoning focuses on the unseen, egocentric agent who shapes the scene, demanding inference of hidden intentions and fine-grained interactions—areas where current MLLMs struggle. We present EgoThinker, a framework that equips MLLMs with strong egocentric reasoning via spatio-temporal chain-of-thought supervision and a two-stage curriculum. We build EgoRe-5M, a large-scale QA dataset derived from 13M egocentric clips, featuring multi-minute segments with detailed rationales and dense hand–object grounding. Trained with SFT on EgoRe-5M and refined with RFT for better spatio-temporal localization, EgoThinker outperforms prior methods on multiple egocentric benchmarks and yields substantial gains in fine-grained localization tasks.

Type

Conference

Publication

NeurIPS 2025