This talk and its corresponding dissertation address the problem of learning video representations, defined here as transforming a video so that its essential structure becomes more visible or accessible for action recognition and quantification. In the literature, a video can be represented as a set of images, as a model of motion or temporal dynamics, or as a 3D graph with pixels as nodes. This dissertation contributes a set of models to localize, track, segment, recognize, and assess actions: (1) an image-set model that aggregates subset features produced by regularizing normalized Convolutional Neural Networks (CNNs); (2) an image-set model based on inter-frame principal recovery and sparse coding of residual actions; (3) temporally local models, with spatially global motion estimated by robust feature matching and local motion estimated by action detection augmented with a motion model; and (4) spatiotemporal models, in which actions are segmented by 3D graph cuts and quantified by segmental 3D CNNs.
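The inter-frame principal recovery in contribution (2) is in the spirit of robust PCA: a matrix of vectorized frames is decomposed into a low-rank part (the shared, principal appearance) plus a sparse part (the residual actions). The sketch below is a generic inexact augmented-Lagrangian solver for that decomposition, not the dissertation's exact formulation; the function names, parameter defaults, and synthetic demo are illustrative assumptions.

```python
import numpy as np

def shrink(X, tau):
    """Elementwise soft-thresholding (prox of the l1 norm)."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def svd_threshold(X, tau):
    """Singular-value soft-thresholding (prox of the nuclear norm)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * shrink(s, tau)) @ Vt

def rpca(D, lam=None, tol=1e-7, max_iter=500):
    """Decompose D into low-rank L + sparse S via an inexact
    augmented-Lagrangian scheme (a standard robust-PCA solver,
    used here as a stand-in for inter-frame principal recovery)."""
    m, n = D.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(m, n))      # usual PCP weight
    mu = 1.25 / np.linalg.norm(D, 2)        # initial penalty
    mu_bar, rho = mu * 1e7, 1.5             # penalty growth schedule
    S = np.zeros_like(D)
    Y = np.zeros_like(D)                    # Lagrange multipliers
    norm_D = np.linalg.norm(D, 'fro')
    for _ in range(max_iter):
        L = svd_threshold(D - S + Y / mu, 1.0 / mu)   # low-rank update
        S = shrink(D - L + Y / mu, lam / mu)          # sparse update
        Z = D - L - S                                  # residual
        Y = Y + mu * Z                                 # dual ascent
        mu = min(mu * rho, mu_bar)
        if np.linalg.norm(Z, 'fro') / norm_D < tol:
            break
    return L, S

# Synthetic demo: rank-5 "principal" matrix plus 5% sparse "residual actions".
np.random.seed(0)
L0 = np.random.randn(40, 5) @ np.random.randn(5, 40)
S0 = np.zeros((40, 40))
mask = np.random.rand(40, 40) < 0.05
S0[mask] = 10 * np.random.randn(mask.sum())
L, S = rpca(L0 + S0)
rel_err = np.linalg.norm(L - L0, 'fro') / np.linalg.norm(L0, 'fro')
```

In the video setting, each column of `D` would be a vectorized frame; the sparse component then isolates what changes between frames, which is what gets sparsely coded as the action residual.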
State-of-the-art performance has been achieved on tasks such as quantifying facial-pain actions and human diving. The primary conclusions of this dissertation are as follows: (i) image-set models can effectively capture facial actions as a collective representation; (ii) sparse and low-rank representations of facial actions can disentangle expression, identity, and pose cues, and can be learned with either an image-set model or a linear model; (iii) experiments with face networks show that feature norm is related to recognizability, and that the choice of similarity metric and loss function matters; (iv) combining a Multiple-Instance-Learning-based boosting tracker with a Particle Filtering motion model yields a good trade-off between appearance similarity and motion consistency; (v) segmenting an object locally makes it amenable to shape priors, and such knowledge can feasibly be learned online from rich Web data with weak supervision; (vi) representing videos as 3D graphs operates locally in both space and time, and 3D CNNs work effectively when given temporally meaningful clips as input.
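The tracker-plus-motion-model combination in conclusion (iv) can be illustrated with a bootstrap particle filter: particles are diffused by a motion model and reweighted by an appearance-similarity score (here a placeholder for the MIL boosting tracker's classifier response). This is a minimal generic sketch, not the dissertation's implementation; the state (2D position), noise levels, and toy scoring function are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def particle_filter_step(particles, weights, observe, motion_std=2.0):
    """One bootstrap-filter iteration: diffuse particles with a random-walk
    motion model, reweight by an appearance score, then resample."""
    # Motion model: random-walk diffusion of each particle's (x, y) state.
    particles = particles + rng.normal(0.0, motion_std, particles.shape)
    # Measurement update: `observe` returns a similarity score per particle
    # (standing in for the MIL boosting tracker's appearance response).
    weights = weights * observe(particles)
    weights = weights / weights.sum()
    # Resampling concentrates the particle set on likely target states.
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    particles = particles[idx]
    weights = np.full(len(particles), 1.0 / len(particles))
    return particles, weights

# Toy demo: the "appearance score" is a Gaussian around the true target,
# which drifts across a 100x100 frame.
n = 500
particles = rng.uniform(0, 100, size=(n, 2))
weights = np.full(n, 1.0 / n)
target = np.array([20.0, 60.0])
for _ in range(30):
    target = target + np.array([1.0, 0.5])  # target drifts each frame
    def score(p, c=target):
        return np.exp(-np.sum((p - c) ** 2, axis=1) / (2 * 5.0 ** 2))
    particles, weights = particle_filter_step(particles, weights, score)

estimate = particles.mean(axis=0)  # posterior-mean track estimate
```

The trade-off noted in (iv) shows up directly in the parameters: a larger `motion_std` trusts the appearance score more, while a sharper score trusts the motion prior more.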
It is hoped that the proposed models will serve as working components in face and gesture recognition systems. In addition, the models proposed for videos can be adapted to other modalities of sequential images, such as hyperspectral images and volumetric medical images, which are not covered in this talk or the dissertation. A model for supervised hashing by jointly learning embedding and quantization is included in the dissertation but will not be presented in the talk in the interest of time.
Xiang Xiang has been a PhD student in Computer Science at Johns Hopkins University since 2012, initially advised by Gregory D. Hager, and a PhD candidate since 2014, with Trac D. Tran (primary advisor since 2014) and Gregory D. Hager (co-advisor since 2014) listed as his official advisors. He received the B.S. degree from the School of Computer Science at Wuhan University, Wuhan, China, in 2009; the M.S. degree from the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China, in 2012; and the M.S.E. degree from Johns Hopkins University in 2014. His research interests are computer vision and machine learning, with a focus on representation learning for video understanding, facial analysis, affective computing, and biomedical applications. He has been an active member of the DSP Lab (ECE), CIRL, LCSR, MCEH, CCVL, and CIS.
Professional plans after Hopkins: Xiang Xiang will join Stefano Soatto's team at Amazon AI as an Applied Research Scientist.