Ego4D's Short Term Action Anticipation

Summary

This work was carried out during the 6-month internship of my final year of engineering studies, at I²R (within A*STAR), Singapore, under the supervision of Lai Xing Ng. I first studied the state of the art of action anticipation on egocentric (first-person POV) videos, then took on the Short Term Object Interaction Anticipation Challenge of the Ego4D dataset and benchmark suite¹. I implemented InternVideo's solution², which uses a video Vision Transformer for video feature extraction and replaces the original ROI Pooling operation with a multimodal Transformer for Verb/Time To Contact (TTC) feature extraction (a minimal sketch of this idea follows the table below). This achieved 3rd place on the public leaderboard with the following metrics (not publicly posted):

mAP (%)    Noun     Noun_Verb    TTC     Overall
           26.15    11.25        9.22    4.75
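
To make the architecture change concrete, here is a minimal sketch of the kind of multimodal Transformer head described above, not the actual InternVideo code: features of candidate next-active-object boxes cross-attend over the spatio-temporal tokens of a video Vision Transformer backbone through a small Transformer decoder, which then predicts a verb class and a TTC value per box. All tensor shapes, the verb-vocabulary size, and the module names are illustrative placeholders.

```python
import torch
import torch.nn as nn

class VerbTTCHead(nn.Module):
    """Illustrative multimodal-Transformer head: box features attend over
    video clip tokens to predict a verb class and a time to contact (TTC)."""

    def __init__(self, dim=768, num_verbs=74, num_layers=2, num_heads=8):
        # num_verbs is a placeholder for the size of the verb taxonomy
        super().__init__()
        decoder_layer = nn.TransformerDecoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True
        )
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers=num_layers)
        self.verb_head = nn.Linear(dim, num_verbs)  # verb logits per box
        self.ttc_head = nn.Linear(dim, 1)           # TTC regression per box

    def forward(self, box_feats, clip_tokens):
        # box_feats:   (B, N_boxes, dim)  features of detected candidate objects
        # clip_tokens: (B, N_tokens, dim) spatio-temporal tokens from the video ViT
        x = self.decoder(tgt=box_feats, memory=clip_tokens)
        return self.verb_head(x), self.ttc_head(x).squeeze(-1)

# Toy usage with random tensors standing in for real backbone outputs.
head = VerbTTCHead()
boxes = torch.randn(2, 5, 768)      # 2 clips, 5 candidate boxes each
tokens = torch.randn(2, 1568, 768)  # e.g. 8x14x14 tokens from a video ViT
verb_logits, ttc = head(boxes, tokens)
print(verb_logits.shape, ttc.shape)  # torch.Size([2, 5, 74]) torch.Size([2, 5])
```

The point of the cross-attention design is that each detected box can gather verb/TTC evidence from the whole clip rather than only from the pooled region around the box, which is what the original ROI Pooling head restricted it to.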

Because of a delay in the delivery of my Singapore work pass, I performed an unexpected additional two months of this work remotely from France, on top of the first remote month already planned for the theoretical preliminaries; during those two months I adapted the practical tasks to the remote setup.


  1. Grauman, Kristen, et al. “Ego4D: Around the World in 3,000 Hours of Egocentric Video.” 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 18973-18990. https://arxiv.org/abs/2110.07058

  2. Chen, Guo, et al. “InternVideo-Ego4D: A Pack of Champion Solutions to Ego4D Challenges.” arXiv preprint, 2022. https://arxiv.org/abs/2211.09529

Joceran Gouneau