Gesture Recognition in Large Video Corpora


The Red Hen Lab is a global, transdisciplinary network of researchers studying multimodal communication at big-data scale. To that end, the lab develops a host of data science tools for analyzing multimodal communication in vast corpora, such as videos of news broadcasts or talk shows. As a student research assistant, I worked on a pipeline that automatically detects hand gestures in large video datasets. The pipeline combines OpenPose for pose estimation with person detection, person tracking, scene detection, and further custom processing stages. I deployed it on an HPC cluster using Snakemake and Singularity to build a multimodal communication corpus from The Ellen DeGeneres Show.
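To make the deployment concrete, here is a minimal Snakemake sketch of one such stage running inside a Singularity container on the HPC. The file layout, container image, and OpenPose invocation are assumptions for illustration, not the lab's actual configuration; the workflow would be started with `snakemake --use-singularity`.

    # Minimal Snakemake sketch (assumed paths, image, and flags) for one
    # pipeline stage: extract OpenPose keypoints for every episode inside
    # a Singularity container. Run with: snakemake --use-singularity --cores 4

    EPISODES, = glob_wildcards("videos/{episode}.mp4")

    rule all:
        input:
            expand("keypoints/{episode}", episode=EPISODES)

    rule openpose_keypoints:
        # Write per-frame OpenPose JSON keypoints for one episode into a directory.
        input:
            "videos/{episode}.mp4"
        output:
            directory("keypoints/{episode}")
        container:
            "docker://cwaffles/openpose"  # placeholder image; substitute the lab's own .sif
        shell:
            "openpose.bin --video {input} --write_json {output} "
            "--display 0 --render_pose 0"

Each downstream stage (person tracking, scene detection, gesture classification) can be added as a further rule, letting Snakemake schedule the whole corpus across HPC jobs while Singularity keeps every stage's dependencies reproducible.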

The Ellen DeGeneres Project

The Ellen DeGeneres Show dataset consists of 30 videos collected by Red Hen Lab from various online streaming platforms. The videos were independently annotated by professional gesture annotator Yao Tong using the ELAN software. A typical episode features Ellen DeGeneres in conversation with US celebrities, including musicians, politicians, and actors. While speaking, the participants frequently produce hand movements that correlate with their speech. These hand movements are a central point of interest for our research, since they relate to speech in many ways and can serve as the basis for automated gesture annotations. [1] https://sites.google.com/case.edu/techne-public-site/datasets/gsoc21-ellen-gestures

  • This Thesis is the first work on a gesture recognition pipeline from Red Hen Lab
  • The Video Processing Pipeline is described here
  • Manual annotation and inspection of annotations were performed using ELAN (Instructions)
  • How Singularity is used
  • Case’s HPC Intro
  • The tool Skelshop and its Docs
  • This appears to be Red Hen's latest work on the Ellen pipeline (a hedged model sketch follows after this list):

    "A benchmark deep learning model was developed on this dataset to identify hand gestures.
    In this project, the model has a 2-step process. Firstly, OpenPose keypoints are generated for the
    persons in the video. Then they are parsed and divided into 3D time-series data of the form
    (no-keypoints X no-of-persons X window-size). This data is fed into a model consisting of
    2D Convolutional-LSTM layers (ConvLSTM2D in Keras) and 3D Convolutional layers […]
    The code is deployed in a Singularity container available on the Case HPC Cluster.
    The best performance obtained was Accuracy: 0.7207, Precision: 0.7581, and Recall: 0.6684."
    (https://sites.google.com/case.edu/techne-public-site/datasets/gsoc21-ellen-gestures)
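The description above only fixes the broad architecture, namely windows of OpenPose keypoints fed into ConvLSTM2D and Conv3D layers, so the following Keras sketch is an illustration rather than the published model: the window length, number of persons, channel layout, layer sizes, and the binary gesture/no-gesture output are all assumptions.

    # Hedged sketch of a ConvLSTM2D + Conv3D gesture classifier in the spirit
    # of the benchmark above; hyperparameters and output head are assumed.
    import numpy as np
    from tensorflow import keras
    from tensorflow.keras import layers

    WINDOW_SIZE = 30    # frames per window (assumed)
    NUM_KEYPOINTS = 25  # OpenPose BODY_25 keypoints
    NUM_PERSONS = 2     # persons tracked per window (assumed)
    CHANNELS = 1        # one value per keypoint, e.g. confidence (assumed)

    model = keras.Sequential([
        # One training sample: (time, keypoints, persons, channels)
        keras.Input(shape=(WINDOW_SIZE, NUM_KEYPOINTS, NUM_PERSONS, CHANNELS)),
        # Convolutional LSTM over the keypoint/person grid at each time step
        layers.ConvLSTM2D(32, kernel_size=(3, 1), padding="same", return_sequences=True),
        layers.BatchNormalization(),
        # 3D convolution mixes information across time and keypoints
        layers.Conv3D(16, kernel_size=(3, 3, 1), padding="same", activation="relu"),
        layers.GlobalAveragePooling3D(),
        layers.Dense(1, activation="sigmoid"),  # gesture vs. no gesture in this window
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy", keras.metrics.Precision(), keras.metrics.Recall()])

    # Shape check with random data: 8 windows in, 8 probabilities out
    dummy = np.random.rand(8, WINDOW_SIZE, NUM_KEYPOINTS, NUM_PERSONS, CHANNELS)
    print(model.predict(dummy).shape)  # (8, 1)

In this sketch the ConvLSTM2D layer captures how keypoint configurations evolve over the window, while the Conv3D layer and global pooling condense that spatio-temporal feature map into a single per-window gesture probability.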