Ego4D's Short Term Action Anticipation

Summary

This work was carried out during the 6-month internship of my final year of engineering studies, at I²R (within A*STAR), Singapore, under the supervision of Lai Xing Ng. I first studied the state of the art of action anticipation on egocentric (first-person POV) videos, then took on the Short Term Object Interaction Anticipation Challenge of the Ego4D dataset and benchmark suite¹. I implemented InternVideo's solution², which uses a video Vision Transformer for video feature extraction and replaces the original ROI Pooling operation with a multimodal Transformer for Verb/Time To Contact (TTC) feature extraction (a minimal sketch of this idea follows the table below). This achieved 3rd place on the public leaderboard with the following metrics (not publicly posted):

mAP (%)    Noun     Noun_Verb    TTC     Overall
           26.15    11.25        9.22    4.75
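
To make the architecture change concrete, here is a minimal sketch of the kind of multimodal Transformer head described above, not the actual InternVideo code: features of candidate next-active-object boxes cross-attend over the spatio-temporal tokens of a video Vision Transformer backbone through a small Transformer decoder, which then predicts a verb class and a TTC value per box. All tensor shapes, the verb-vocabulary size, and the module names are illustrative placeholders.

```python
import torch
import torch.nn as nn

class VerbTTCHead(nn.Module):
    """Illustrative multimodal-Transformer head: box features attend over
    video clip tokens to predict a verb class and a time to contact (TTC)."""

    def __init__(self, dim=768, num_verbs=74, num_layers=2, num_heads=8):
        # num_verbs is a placeholder for the size of the verb taxonomy
        super().__init__()
        decoder_layer = nn.TransformerDecoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True
        )
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers=num_layers)
        self.verb_head = nn.Linear(dim, num_verbs)  # verb logits per box
        self.ttc_head = nn.Linear(dim, 1)           # TTC regression per box

    def forward(self, box_feats, clip_tokens):
        # box_feats:   (B, N_boxes, dim)  features of detected candidate objects
        # clip_tokens: (B, N_tokens, dim) spatio-temporal tokens from the video ViT
        x = self.decoder(tgt=box_feats, memory=clip_tokens)
        return self.verb_head(x), self.ttc_head(x).squeeze(-1)

# Toy usage with random tensors standing in for real backbone outputs.
head = VerbTTCHead()
boxes = torch.randn(2, 5, 768)      # 2 clips, 5 candidate boxes each
tokens = torch.randn(2, 1568, 768)  # e.g. 8x14x14 tokens from a video ViT
verb_logits, ttc = head(boxes, tokens)
print(verb_logits.shape, ttc.shape)  # torch.Size([2, 5, 74]) torch.Size([2, 5])
```

The point of the cross-attention design is that each detected box can gather verb/TTC evidence from the whole clip rather than only from the pooled region around the box, which is what the original ROI Pooling head restricted it to.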

Because of a delay in the delivery of my Singapore work pass, I performed an unexpected additional two months of this work remotely from France, on top of the first remote month already planned for the theoretical preliminaries; during those two months I adapted the practical tasks to the remote setup.


  1. Grauman, Kristen, et al. “Ego4D: Around the World in 3,000 Hours of Egocentric Video.” 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 18973-18990. https://arxiv.org/abs/2110.07058

  2. Chen, Guo, et al. “InternVideo-Ego4D: A Pack of Champion Solutions to Ego4D Challenges.” arXiv preprint, 2022. https://arxiv.org/abs/2211.09529

Joceran Gouneau