
Skeleton-based Action Recognition in Non-contextual, In-the-wild and Dense Joint Scenarios


Neel Trivedi

Abstract

Human action recognition, with its undeniable and varied use cases across surveillance, robotics, human-object interaction analysis and many more, has gained critical importance and attention in the field of computer vision. Traditionally based entirely on RGB sequences, the action recognition domain has in recent years shifted its focus towards skeleton sequences, owing to the easy availability of skeleton-capturing apparatus and the release of large-scale datasets. Skeleton-based human action recognition, which is superior to traditional RGB-based action recognition in terms of privacy, robustness and computational efficiency, is the primary focus of this thesis. Ever since the release of the large-scale skeleton action datasets NTU RGB+D and NTU RGB+D 120, the community has focused on developing increasingly complex approaches, ranging from CNNs to GCNs and, more recently, transformers, to achieve the best classification accuracy on these datasets. However, in this race for state-of-the-art performance, the community has overlooked a major drawback at the data level which bottlenecks even the most sophisticated approaches. This drawback is where we start our explorations in this thesis. The pose tree provided in the NTU RGB+D datasets contains only 25 joints, of which only 6 (3 per hand) are finger joints. This is a major limitation, since 3 finger-level joints are not sufficient to distinguish between action categories such as "Thumbs up" and "Thumbs down", or "Make ok sign" and "Make victory sign". To address this bottleneck, we introduce two new pose-based human action datasets, NTU60-X and NTU120-X, which extend the largest existing action recognition dataset, NTU RGB+D. In addition to the 25 body joints per skeleton, NTU60-X and NTU120-X include finger and facial joints, enabling a richer skeleton representation.
We appropriately modify state-of-the-art approaches to enable training on the introduced datasets. Our results demonstrate the effectiveness of the NTU-X datasets in overcoming the aforementioned bottleneck and improving state-of-the-art performance, both overall and on previously worst-performing action categories. Pose-based action recognition is predominantly tackled by approaches that treat the input skeleton in a monolithic fashion, i.e. joints in the pose tree are processed as a whole. However, such approaches ignore the fact that action categories are often characterized by localized action dynamics involving only small subsets of part-joint groups, such as hands (e.g. 'Thumbs up') or legs (e.g. 'Kicking'). Although part-grouping based approaches exist, they do not consider each part group within the global pose frame, causing such methods to fall short. Further, conventional approaches employ independent modality streams (e.g. joint, bone, joint velocity, bone velocity) and train their networks multiple times on these streams, which massively increases the number of training parameters. To address these issues, we introduce PSUMNet, a novel approach for scalable and efficient pose-based action recognition. At the representation level, we propose a global-frame-based part stream approach as opposed to conventional modality-based streams. Within each part stream, the associated data from multiple modalities is unified and consumed by the processing pipeline. Experimentally, PSUMNet achieves state-of-the-art performance on the widely used NTU RGB+D 60/120 datasets and the dense-joint skeleton datasets NTU60-X/120-X. PSUMNet is highly efficient and outperforms competing methods that use 100%-400% more parameters. PSUMNet also generalizes to the SHREC hand gesture dataset with competitive performance.
Overall, PSUMNet's scalability, performance and efficiency make it an attractive choice for action recognition and for deployment on compute-restricted embedded and edge devices. Finally, we conclude this thesis by exploring new and more challenging frontiers under the umbrella of skeleton action recognition, namely "in the wild" and "non-contextual" skeleton action recognition. We introduce Skeletics-152, a curated 3D pose dataset derived from the RGB videos in the larger Kinetics-700 dataset, to explore in-the-wild skeleton action recognition. We further introduce Skeleton-Mimetics, a 3D pose dataset derived from the recently introduced non-contextual action dataset Mimetics. By benchmarking and analysing various approaches on these two new datasets, we lay the ground for future exploration of these two challenging problems within skeleton action recognition. Overall, in this thesis we draw attention to prevailing drawbacks in existing skeleton action datasets and introduce extensions of these datasets to counter their shortcomings. We also introduce a novel, efficient and highly reliable skeleton action recognition approach dubbed PSUMNet. Finally, we explore the more challenging tasks of in-the-wild and non-contextual action recognition.
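The unified part-stream representation described above can be sketched roughly as follows. The function name, hand-joint indices, and the simplified bone computation (real implementations derive bones from a parent-joint table) are hypothetical illustrations, not the actual PSUMNet code:

```python
import numpy as np

def unify_modalities(joints, part_idx):
    """Build a unified multi-modality tensor for one part stream.

    joints:   (T, V, C) array of 3D joint coordinates over T frames,
              kept in the global pose frame.
    part_idx: indices of the joints in this part group (hypothetical).
    """
    part = joints[:, part_idx, :]                        # joint modality (T, P, C)
    bone = part - part.mean(axis=1, keepdims=True)       # bone-like offsets (simplified)
    joint_vel = np.diff(part, axis=0, prepend=part[:1])  # joint velocity
    bone_vel = np.diff(bone, axis=0, prepend=bone[:1])   # bone velocity
    # Concatenate all four modalities along the channel axis so one network
    # pass consumes them together, instead of four separately trained streams.
    return np.concatenate([part, bone, joint_vel, bone_vel], axis=-1)

x = np.random.rand(64, 25, 3)    # 64 frames, 25 joints, xyz coordinates
hand_idx = [21, 22, 23, 24]      # hypothetical hand-joint indices
u = unify_modalities(x, hand_idx)
print(u.shape)                   # (64, 4, 12)
```

Because the four modalities share one channel axis, a single network and a single training run replace the four independently trained modality streams, which is the source of the parameter savings mentioned above.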

Year of completion: September 2022
Advisor: Ravi Kiran Sarvadevabhatla

Related Publications


    Downloads

    thesis

    Casual Scene Capture and Editing for AR/VR Applications


    Pulkit Gera

    Abstract

Augmented Reality and Virtual Reality (AR/VR) applications could become far more widespread if they could photo-realistically capture our surroundings and modify them in different ways: editing the scene's lighting, changing the objects' materials, or augmenting virtual objects onto the scene. There has been a significant amount of work in this domain. However, most of these works capture data in controlled settings involving expensive setups such as light stages. These methods are impractical and cannot scale. Thus, we must design solutions that capture scenes casually, using off-the-shelf devices commonly available to the public. Further, the user should be able to interact with the captured scenes and modify them in exciting directions, such as editing the material or augmenting new objects into the scene. In this thesis, we study how to produce novel views of a casually captured scene and modify them in interesting ways. First, we present a neural rendering framework for simultaneous novel view synthesis and appearance editing of a casually captured scene, shot with off-the-shelf smartphone cameras under known illumination. Existing approaches cannot both perform novel view synthesis and edit the materials of scene objects. We propose a method that explicitly disentangles appearance from lighting while estimating radiance, and learns an independent lighting estimate of the scene. This allows us to generalize to arbitrary changes in the scene's materials while performing novel view synthesis. We demonstrate our results on synthetic and real scenes. Next, we present PanoHDR-NeRF, a neural representation of an indoor scene's high dynamic range (HDR) radiance field that can be captured casually, without elaborate setups or complex capture protocols. First, a user captures a low dynamic range (LDR) omnidirectional video of the scene by freely waving an off-the-shelf camera around the scene.
Then, an LDR2HDR network converts the captured LDR frames to HDR, which are subsequently used to train a modified NeRF++ model. The resulting PanoHDR-NeRF representation can synthesize full HDR panoramas from any location in the scene. We also show that the HDR images produced by PanoHDR-NeRF can synthesize correct lighting effects, enabling the augmentation of indoor scenes with synthetic objects that are lit correctly. Through these works, we demonstrate how we can casually capture scenes for AR/VR applications that the user can further edit.

Year of completion: August 2022
Advisors: P J Narayanan, Jean-François Lalonde

    Related Publications


      Downloads

      thesis

      Towards Understanding Deep Saliency Prediction


      Navyasri M

      Abstract

Learning computational models for visual attention (saliency estimation) is an effort to inch machines/robots closer to human visual cognitive abilities. Data-driven efforts have dominated the landscape since the introduction of deep neural network architectures. In deep learning research, architecture design choices are often empirical and frequently lead to more complex models than necessary. The complexity, in turn, hinders deployment in applications. In this work, we identify four key components of saliency models: input features, multi-level integration, readout architecture, and loss functions. We review existing state-of-the-art models along these four components and propose novel, simpler alternatives. As a result, we propose two novel end-to-end architectures, SimpleNet and MDNSal, which are neater, minimal, more interpretable, and achieve state-of-the-art performance on public saliency benchmarks. SimpleNet is an optimized encoder-decoder architecture and brings notable performance gains on the SALICON dataset (the largest saliency benchmark). MDNSal is a parametric model that directly predicts the parameters of a GMM distribution and aims to bring more interpretability to the prediction maps. The proposed saliency models can run inference at 25 fps, making them suitable for real-time applications. We also explore improving saliency prediction in videos by building on image saliency models and existing work.
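The parametric readout of a model like MDNSal can be illustrated with a small sketch that renders a dense saliency map from predicted GMM parameters. The function names and the diagonal-covariance simplification are assumptions for illustration, not the actual MDNSal code:

```python
import numpy as np

def render_gmm_saliency(means, covs, weights, h=48, w=64):
    """Render a saliency map from predicted GMM parameters.

    means:   (K, 2) component centers in normalized [0, 1] x,y coordinates.
    covs:    (K, 2) diagonal variances per component (simplification).
    weights: (K,) mixture weights summing to 1.
    """
    ys, xs = np.mgrid[0:h, 0:w]
    x = xs / (w - 1)
    y = ys / (h - 1)
    sal = np.zeros((h, w))
    for (mx, my), (vx, vy), pi in zip(means, covs, weights):
        # Axis-aligned 2D Gaussian evaluated on the pixel grid.
        g = np.exp(-((x - mx) ** 2 / (2 * vx) + (y - my) ** 2 / (2 * vy)))
        sal += pi * g / (2 * np.pi * np.sqrt(vx * vy))
    return sal / sal.max()  # normalize to [0, 1] for visualization

means = np.array([[0.5, 0.4], [0.2, 0.7]])   # two hypothetical fixations
covs = np.array([[0.02, 0.02], [0.01, 0.03]])
weights = np.array([0.7, 0.3])
m = render_gmm_saliency(means, covs, weights)
print(m.shape)   # (48, 64)
```

The interpretability claim follows from this structure: a handful of means, variances, and weights summarize the whole prediction, whereas a dense decoder output has no such compact reading.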

Year of completion: May 2021
Advisor: Vineet Gandhi

      Related Publications


        Downloads

        thesis

        Retinal Image Synthesis


        Anurag Anil Deshmukh

        Abstract

Medical imaging aids the diagnosis and treatment of diseases by creating visual representations of the interior of the human body. Experts hand-mark these images for abnormalities and diagnosis. Supplementing experts with these rich visualizations has enabled detailed clinical analysis and rapid medical intervention. However, deep learning-based methods rely on abundantly large volumes of data for training. Procuring data for medical imaging applications is especially difficult because abnormal cases are, by definition, rare and the data generally requires experts for labelling. For deep learning algorithms, data with high class imbalance or insufficient variability leads to poor classification performance. Thus, alternate approaches, such as using generative modelling to artificially generate more data, have been of interest. Most of these methods are GAN [11] based approaches. While they can help with data imbalance, they still require a lot of data to generate realistic images. Additionally, many of these methods have been demonstrated on natural images, where the images are relatively noise-free and small artifacts are not as damaging. This thesis therefore aims to provide synthesis methods that overcome the limitations of small datasets and noisy image profiles. We do this for two different modalities: fundus imaging and Optical Coherence Tomography (OCT). First, we present a fundus image synthesis method aimed at providing paired Optic Cup (OC) and image data for OC segmentation. The synthesis method works well on small datasets by minimising the information to be learnt, leveraging domain-specific knowledge and providing most of the structural information to the network. We demonstrate this method's advantages over a more direct synthesis method, and show how leveraging domain-specific knowledge can provide higher-quality images and annotations.
Including these generated images and their annotations in the training of an OC segmentation model showed a significant improvement in performance, demonstrating their reliability. Second, we present a novel unpaired image-to-image translation method which can introduce an abnormality (drusen) into OCT images while avoiding artifacts and preserving the noise profile. Comparison with other state-of-the-art image-to-image translation methods shows that our method is significantly better at preserving the noise profile and at generating morphologically accurate structures.

Year of completion: April 2021
Advisor: Jayanthi Sivaswamy

        Related Publications


          Downloads

          thesis

          Leveraging Structural Cues for Better Training and Deployment in Computer Vision


          Shyamgopal Karthik

          Abstract

With the growing use of computer vision tools in wide-ranging applications, it becomes imperative to understand and resolve issues in computer vision models when they are used in production settings. In particular, it is essential to understand that a model can be wrong quite frequently during deployment, and developing a better understanding of its mistakes can help mitigate and handle them without catastrophic consequences. To investigate the severity of mistakes, we first explore a simple classification setting. Even here, the severity of mistakes is difficult to quantify, especially since manually defining pairwise costs does not scale to large classification datasets. Therefore, most works have used class taxonomies/hierarchies, which allow pairwise costs to be defined using graph distances. There has been increasing interest in building deep hierarchy-aware classifiers that aim to quantify and reduce the severity of mistakes, not just count the number of errors. However, most of these works require the hierarchy to be available during training and cannot adapt to new hierarchies, or even small modifications to the existing hierarchy, without re-training the model. We explore a different direction for hierarchy-aware classification: amending mistakes through post-hoc corrections based on classical Conditional Risk Minimization (CRM). Surprisingly, we find that this is a far more suitable alternative than deep hierarchy-aware classification; CRM preserves the base model's top-1 accuracy, brings the most likely predictions of the model closer to the ground truth, and provides reliable probability estimates, unlike hierarchy-aware classifiers. We believe this serves as a very strong and useful baseline for future exploration in this direction.
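The post-hoc CRM correction can be sketched in a few lines: take the softmax output of an already-trained flat classifier and pick the class minimizing expected hierarchical cost. The toy cost matrix and names below are hypothetical illustrations, not the thesis implementation:

```python
import numpy as np

def crm_predict(probs, cost):
    """Post-hoc Conditional Risk Minimization over a class hierarchy.

    probs: (N, C) softmax outputs of a trained flat classifier.
    cost:  (C, C) pairwise cost matrix, e.g. graph distance between
           classes in a taxonomy (here a hypothetical toy matrix).
    Returns, for each sample, the class minimizing expected cost.
    """
    risk = probs @ cost        # expected cost of predicting each class
    return risk.argmin(axis=1)

# Toy 3-class hierarchy: classes 0 and 1 are siblings (distance 1),
# class 2 is far from both (distance 3). Diagonal is zero.
cost = np.array([[0, 1, 3],
                 [1, 0, 3],
                 [3, 3, 0]], dtype=float)
probs = np.array([[0.4, 0.35, 0.25]])   # flat classifier is unsure
print(crm_predict(probs, cost))          # [0]
```

Note that the hierarchy only enters at inference time through `cost`, so swapping in a new or modified taxonomy requires no re-training, which is the adaptability advantage discussed above.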
We then turn our attention to a crucial problem in many video processing pipelines: visual (single) object tracking. In particular, we explore the long-term tracking scenario: given a target in the first frame of a video, the goal is to track the object throughout a (long) video during which it may undergo occlusion, vary in appearance, or go out of view. The temporal aspect of videos also makes this an ideal scenario for understanding the accumulation of errors that would not be seen if every image were independent. We hypothesize that there are three crucial abilities a tracker must possess to be effective in the long-term scenario: Re-Detection, Recovery, and Reliability. To be of practical utility, a tracker must be able to re-detect the target when it leaves the scene and returns, must recover from failure, and must track an object contiguously. We propose a set of novel and comprehensive experiments to probe each of these aspects, giving a thorough understanding of the strengths and limitations of various state-of-the-art tracking algorithms. We finally visit the problem of multi-object tracking. Unlike single-object tracking, where the target is initialized in the first frame, the goal here is to track all objects of a particular category (such as pedestrians, vehicles, or animals). Since this problem does not require user initialization, it has found use in wide-ranging real-time applications such as autonomous driving. The typical multi-object tracking pipeline follows the tracking-by-detection paradigm: an object detector is first used to detect all the objects in the scene, and these detections are then linked together into the final trajectories using a combination of spatio-temporal features and appearance/Re-Identification (ReID) features. The appearance features are extracted using a Convolutional Neural Network (CNN) trained on a corpus of labelled videos.
Our central insight is that only the appearance model requires labelled videos; the rest of the pipeline can be trained with just image-level supervision. Inspired by recent successes in unsupervised contrastive learning, which enforces similarity in feature space between an image and its augmented version, we use a simple method that leverages the spatio-temporal consistency in videos to generate "natural" augmentations, which are then used as pseudo-labels to train the appearance model. When integrated into the overall tracking pipeline, this unsupervised appearance model matches the performance of its supervised counterparts in reducing the identity switches in the trajectories, thereby saving costly video annotations, which are impractical to scale up, without sacrificing performance.
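The spatio-temporal pseudo-labelling idea can be sketched as follows: detections in adjacent frames whose boxes overlap strongly are treated as the same object, and each such pair becomes a positive pair for contrastive training. The IoU threshold, function names, and example boxes are illustrative assumptions, not the actual pipeline:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def pseudo_label_pairs(dets_t, dets_t1, thresh=0.7):
    """Pair detections in adjacent frames whose boxes overlap strongly.

    Each returned (i, j) pair acts as a 'natural augmentation' positive
    for contrastive training of the appearance model; no identity labels
    are needed, only image-level detections."""
    return [(i, j)
            for i, a in enumerate(dets_t)
            for j, b in enumerate(dets_t1)
            if iou(a, b) >= thresh]

frame_t  = [(0, 0, 10, 10), (20, 20, 30, 30)]   # hypothetical detections
frame_t1 = [(1, 0, 11, 10), (50, 50, 60, 60)]   # frame t+1: first box moved slightly
print(pseudo_label_pairs(frame_t, frame_t1))     # [(0, 0)]
```

The crops behind each matched pair differ by small motion, pose, and lighting changes, which is exactly the kind of "augmentation" a contrastive loss would otherwise have to simulate synthetically.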

Year of completion: April 2021
Advisor: Vineet Gandhi

          Related Publications


            Downloads

            thesis
