Seminar

Title

Beyond Action Recognition: Detailed Video Modeling

03/12/2021

Speaker

Gül Varol - École des Ponts ParisTech

Abstract

In this talk, I will present some of our recent work on a variety of tasks in computer vision, focusing in particular on detailed video modeling. Action recognition has been a standard problem in the video research community. However, there is more to learn from videos than a closed set of pre-defined semantic action categories. This talk will cover three directions towards a more detailed understanding of dynamic visual content. (i) First, we will look at our end-to-end text-to-video retrieval approach, which learns to map videos and textual descriptions into a joint space, and see the advantages of joint image and video training using transformers. (ii) Then, we will explore the more fine-grained problem of localising text in sign language videos, using weakly-aligned subtitles in sign language interpretation data, again in conjunction with transformers. (iii) Finally, we will go beyond semantics and look at 3D reconstruction from video data for recovering detailed hand-object interactions. Here, we will discuss the limitations of learning-based methods due to the lack of data, and opt for an optimization-based approach.

Bain et al. “Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval”, ICCV 2021.
Varol et al. “Read and Attend: Temporal Localisation in Sign Language Videos”, CVPR 2021.
Bull et al. “Aligning Subtitles in Sign Language Videos”, ICCV 2021.
Hasson et al. “Towards unconstrained joint hand-object reconstruction from RGB videos”, 3DV 2021.

Bio

Gül Varol is a research faculty member in the IMAGINE team of École des Ponts ParisTech. Previously, she was a postdoctoral researcher at the University of Oxford (VGG). She obtained her PhD from the WILLOW team of Inria Paris and École Normale Supérieure (ENS). Her thesis received the ELLIS PhD Award. During her PhD, she spent time at MPI, Adobe, and Google. Her research focuses on human understanding in videos, specifically action recognition, body shape and motion analysis, and sign languages.

When

2021-12-03

Where

Remote, @UniGE