Rawal Khirodkar
email

| CV | Google Scholar |
| Github | Twitter |

I am a Postdoctoral Research Scientist at Meta Reality Labs Research. I received my Ph.D. in Robotics (2017-2023) from Carnegie Mellon University for my work on multi-human 3D reconstruction from in-the-wild videos, advised by Prof. Kris Kitani. I obtained my B.Tech. (2013-2017) in Computer Science from IIT Bombay, where I was advised by Prof. Ganesh Ramakrishnan. My Ph.D. research was supported in part by a Government of India Fellowship and an Amazon Fellowship.

My research focuses on reconstructing humans in 3D from natural videos captured in the real world. Building such systems requires unlocking web-scale 3D data of humans in the wild. My current projects therefore span methods and infrastructure for obtaining ground-truth data for estimating the 2D/3D pose and mesh of multiple humans in unchoreographed, dynamic settings. These projects are part of my longer-term goal of constructing in-the-wild 3D human avatars, complete with details such as hair, clothing, and teeth, from monocular videos.

Publications

Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives
Kristen Grauman et al.
CVPR 2024 (Oral). Oral/Accepted: 90/2719 = 3.3%.

project page | pdf | abstract | blog | video

We present Ego-Exo4D, a diverse, large-scale multimodal multiview video dataset and benchmark challenge. Ego-Exo4D centers around simultaneously-captured egocentric and exocentric video of skilled human activities (e.g., sports, music, dance, bike repair). More than 800 participants from 13 cities worldwide performed these activities in 131 different natural scene contexts, yielding long-form captures from 1 to 42 minutes each and 1,422 hours of video combined. The multimodal nature of the dataset is unprecedented: the video is accompanied by multichannel audio, eye gaze, 3D point clouds, camera poses, IMU, and multiple paired language descriptions -- including a novel "expert commentary" done by coaches and teachers and tailored to the skilled-activity domain. To push the frontier of first-person video understanding of skilled human activity, we also present a suite of benchmark tasks and their annotations, including fine-grained activity understanding, proficiency estimation, cross-view translation, and 3D hand/body pose. All resources will be open sourced to fuel new research in the community.

          @article{grauman2023ego,
            title={Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives},
            author={Grauman, Kristen and Westbury, Andrew and Torresani, Lorenzo and Kitani, Kris and Malik, Jitendra and Afouras, Triantafyllos and Ashutosh, Kumar and Baiyya, Vijay and Bansal, Siddhant and Boote, Bikram and others},
            journal={arXiv preprint arXiv:2311.18259},
            year={2023}
          }
        

Real-Time Simulated Avatar from Head-Mounted Sensors
Zhengyi Luo, Jinkun Cao, Rawal Khirodkar, Alexander Winkler, Jing Huang, Kris Kitani, Weipeng Xu
CVPR 2024 (Poster Highlight). Highlight/Accepted: 324/2719 = 11.9%.


Multi-Person 3D Pose Estimation from Multi-View Uncalibrated Depth Cameras
Yu-Jhe Li, Yan Xu, Rawal Khirodkar, Jinhyung Park, Kris Kitani
arXiv, 2024

pdf | abstract | bibtex

We tackle the task of multi-view, multi-person 3D human pose estimation from a limited number of uncalibrated depth cameras. Recently, many approaches have been proposed for 3D human pose estimation from multi-view RGB cameras. However, these works (1) assume the number of RGB camera views is large enough for 3D reconstruction, (2) assume the cameras are calibrated, and (3) rely on ground truth 3D poses for training their regression model. In this work, we propose to leverage sparse, uncalibrated depth cameras providing RGBD video streams for 3D human pose estimation. We present a simple pipeline for Multi-View Depth Human Pose Estimation (MVD-HPE) for jointly predicting the camera poses and 3D human poses without training a deep 3D human pose regression model. This framework utilizes 3D Re-ID appearance features from RGBD images to formulate more accurate correspondences (for deriving camera positions) compared to using RGB-only features. We further propose (1) depth-guided camera-pose estimation by leveraging 3D rigid transformations as guidance and (2) depth-constrained 3D human pose estimation by utilizing depth-projected 3D points as an alternative objective for optimization. In order to evaluate our proposed pipeline, we collect three video sets of RGBD videos recorded from multiple sparse-view depth cameras, and ground truth 3D poses are manually annotated. Experiments show that our proposed method outperforms the current 3D human pose regression-free pipelines in terms of both camera pose estimation and 3D human pose estimation.

          @article{li2024multi,
            title={Multi-Person 3D Pose Estimation from Multi-View Uncalibrated Depth Cameras},
            author={Li, Yu-Jhe and Xu, Yan and Khirodkar, Rawal and Park, Jinhyung and Kitani, Kris},
            journal={arXiv preprint arXiv:2401.15616},
            year={2024}
          }
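
The "depth-guided camera-pose estimation" step above amounts to aligning matched 3D points seen by different depth cameras with a rigid transform. Below is a minimal NumPy sketch of that sub-step (a standard Kabsch/SVD alignment), assuming 3D correspondences are already given, e.g. via the Re-ID features mentioned in the abstract; it is an illustration, not the paper's pipeline.

    import numpy as np

    def rigid_transform(src, dst):
        """Estimate R, t (Kabsch / SVD) such that dst ~= src @ R.T + t.
        src, dst: (N, 3) arrays of corresponding 3D points, e.g. back-projected
        joints seen by two depth cameras and matched via appearance features."""
        src_mean, dst_mean = src.mean(axis=0), dst.mean(axis=0)
        H = (src - src_mean).T @ (dst - dst_mean)                     # 3x3 cross-covariance
        U, _, Vt = np.linalg.svd(H)
        D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])   # avoid reflections
        R = Vt.T @ D @ U.T
        t = dst_mean - R @ src_mean
        return R, t

    # Toy check: recover a known relative pose from noisy correspondences.
    rng = np.random.default_rng(0)
    P = rng.normal(size=(50, 3))
    R_true, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(R_true) < 0:                                     # force a proper rotation
        R_true[:, 0] *= -1
    t_true = np.array([0.5, -0.2, 1.0])
    Q = P @ R_true.T + t_true + 0.01 * rng.normal(size=P.shape)
    R_est, t_est = rigid_transform(P, Q)
    print(np.allclose(R_est, R_true, atol=0.05), np.round(t_est, 2))  # True [ 0.5 -0.2  1. ]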
        

Multi-Human 3D Reconstruction from In-the-Wild Videos
Ph.D. Thesis, Carnegie Mellon University, 2023
Committee members: Kris Kitani, Deva Ramanan, Angjoo Kanazawa, Shubham Tulsiani


EgoHumans: An Egocentric 3D Multi-Human Benchmark
Rawal Khirodkar, Aayush Bansal, Lingni Ma, Richard Newcombe, Minh Vo, Kris Kitani
ICCV 2023 (Oral). Oral/Accepted: 195/2160 = 9.0%.

project page | pdf | abstract | bibtex | code

We present EgoHumans, a new multi-view multi-human video benchmark to advance the state-of-the-art of egocentric human 3D pose estimation and tracking. Existing egocentric benchmarks either capture single subject or indoor-only scenarios, which limit the generalization of computer vision algorithms for real-world applications. We propose a novel 3D capture setup to construct a comprehensive egocentric multi-human benchmark in the wild with annotations to support diverse tasks such as human detection, tracking, 2D/3D pose estimation, and mesh recovery. We leverage consumer-grade wearable camera-equipped glasses for the egocentric view, which enables us to capture dynamic activities like playing soccer, fencing, volleyball, etc. Furthermore, our multi-view setup generates accurate 3D ground truth even under severe or complete occlusion. The dataset consists of more than 125k egocentric images, spanning diverse scenes with a particular focus on challenging and unchoreographed multi-human activities and fast-moving egocentric views. We rigorously evaluate existing state-of-the-art methods and highlight their limitations in the egocentric scenario, specifically on multi-human tracking. To address such limitations, we propose EgoFormer, a novel approach with a multi-stream transformer architecture and explicit 3D spatial reasoning to estimate and track the human pose. EgoFormer significantly outperforms prior art by 13.6 IDF1 and 9.3 HOTA on the EgoHumans dataset.

            @article{khirodkar2023egohumans,
              title={EgoHumans: An Egocentric 3D Multi-Human Benchmark},
              author={Khirodkar, Rawal and Bansal, Aayush and Ma, Lingni and Newcombe, Richard and Vo, Minh and Kitani, Kris},
              journal={arXiv preprint arXiv:2305.16487},
              year={2023}
            }
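
A large part of the "accurate 3D ground truth" claim above rests on triangulating 2D keypoints across the calibrated exocentric views. Below is a minimal, generic DLT triangulation sketch in NumPy, assuming calibrated 3x4 projection matrices and matched 2D detections are given; it is not the dataset's actual annotation pipeline.

    import numpy as np

    def triangulate(points_2d, projections):
        """Linear (DLT) triangulation of one 3D point from >= 2 calibrated views.
        points_2d:   list of (x, y) pixel observations of the same keypoint.
        projections: list of 3x4 camera projection matrices P = K [R | t]."""
        A = []
        for (x, y), P in zip(points_2d, projections):
            A.append(x * P[2] - P[0])
            A.append(y * P[2] - P[1])
        _, _, Vt = np.linalg.svd(np.stack(A))
        X = Vt[-1]
        return X[:3] / X[3]                                   # dehomogenize

    # Toy check: two cameras with a 1 m baseline observing a point at (0, 0, 5).
    K = np.array([[500.0, 0, 320], [0, 500, 240], [0, 0, 1]])
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = K @ np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
    X_true = np.array([0.0, 0.0, 5.0, 1.0])
    obs = [(P @ X_true)[:2] / (P @ X_true)[2] for P in (P1, P2)]
    print(np.round(triangulate(obs, [P1, P2]), 3))            # ~[0. 0. 5.]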
          

Observation-Centric SORT: Rethinking SORT for Robust Multi-Object Tracking
Jinkun Cao, Xinshuo Weng, Rawal Khirodkar, Jiangmiao Pang, Kris Kitani
CVPR 2023

pdf | abstract | bibtex | code

Multi-Object Tracking (MOT) has rapidly progressed with the development of object detection and re-identification. However, motion modeling, which facilitates object association by forecasting short-term trajectories with past observations, has been relatively under-explored in recent years. Current motion models in MOT typically assume that the object motion is linear in a small time window and needs continuous observations, so these methods are sensitive to occlusions and non-linear motion and require high frame-rate videos. In this work, we show that a simple motion model can obtain state-of-the-art tracking performance without other cues like appearance. We emphasize the role of "observation" when recovering tracks from being lost and reducing the error accumulated by linear motion models during the lost period. We thus name the proposed method as Observation-Centric SORT, OC-SORT for short. It remains simple, online, and real-time but improves robustness over occlusion and non-linear motion. It achieves 63.2 and 62.1 HOTA on MOT17 and MOT20, respectively, surpassing all published methods. It also sets new states of the art on KITTI Pedestrian Tracking and DanceTrack where the object motion is highly non-linear.

         @article{cao2022observation,
            title={Observation-centric sort: Rethinking sort for robust multi-object tracking},
            author={Cao, Jinkun and Weng, Xinshuo and Khirodkar, Rawal and Pang, Jiangmiao and Kitani, Kris},
            journal={arXiv preprint arXiv:2203.14360},
            year={2022}
          }
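
The "observation-centric" idea above is that when a lost track is re-associated, the filter is replayed along a virtual trajectory interpolated between the last and newest observations, instead of keeping the error that pure linear prediction accumulated during occlusion. The toy sketch below illustrates that re-update on a constant-velocity Kalman filter over box centers; it is a schematic illustration only, not the released OC-SORT code (which tracks full box states and also uses observation-centric momentum during association).

    import numpy as np

    class ConstantVelocityKF:
        """Toy constant-velocity Kalman filter over a 2D box centre.
        State x = [cx, cy, vx, vy]; observation z = [cx, cy]; dt = 1 frame."""
        def __init__(self, z0):
            self.x = np.array([z0[0], z0[1], 0.0, 0.0])
            self.P = np.eye(4)
            self.F = np.eye(4); self.F[0, 2] = self.F[1, 3] = 1.0   # position += velocity
            self.H = np.eye(2, 4)                                   # observe position only
            self.Q, self.R = 0.01 * np.eye(4), 0.1 * np.eye(2)

        def predict(self):
            self.x = self.F @ self.x
            self.P = self.F @ self.P @ self.F.T + self.Q

        def update(self, z):
            y = np.asarray(z, dtype=float) - self.H @ self.x
            S = self.H @ self.P @ self.H.T + self.R
            K = self.P @ self.H.T @ np.linalg.inv(S)
            self.x = self.x + K @ y
            self.P = (np.eye(4) - K @ self.H) @ self.P

    def observation_centric_reupdate(kf, z_last, z_new, gap):
        """On re-association after `gap` missed frames, replay the filter along a
        virtual trajectory interpolated between the last and newest observations,
        instead of keeping `gap` frames of error-accumulating pure prediction."""
        z_last, z_new = np.asarray(z_last, dtype=float), np.asarray(z_new, dtype=float)
        for k in range(1, gap + 1):
            kf.predict()
            kf.update(z_last + (z_new - z_last) * k / gap)

    kf = ConstantVelocityKF([10.0, 10.0])
    for z in ([11.0, 10.0], [12.0, 10.0]):                  # two normally tracked frames
        kf.predict(); kf.update(z)
    observation_centric_reupdate(kf, [12.0, 10.0], [18.0, 13.0], gap=3)  # re-found after occlusion
    print(np.round(kf.x, 2))   # position near (18, 13), velocity reflecting the observed motion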
        

Sequential Ensembling for Semantic Segmentation
Rawal Khirodkar, Brandon Smith, Siddhartha Chandra, Amit Agrawal, Antonio Criminisi
arXiv, 2022

pdf | abstract | bibtex

Ensemble approaches for deep-learning-based semantic segmentation remain insufficiently explored despite the proliferation of competitive benchmarks and downstream applications. In this work, we explore and benchmark the popular ensembling approach of combining predictions of multiple, independently-trained, state-of-the-art models at test time on popular datasets. Furthermore, we propose a novel method inspired by boosting to sequentially ensemble networks that significantly outperforms the naive ensemble baseline. Our approach trains a cascade of models conditioned on class probabilities predicted by the previous model as an additional input. A key benefit of this approach is that it allows for dynamic computation offloading, which helps deploy models on mobile devices. Our proposed novel ADaptive modulatiON (ADON) block allows spatial feature modulation at various layers using previous-stage probabilities. Our approach does not require sophisticated sample selection strategies during training and works with multiple neural architectures. We significantly improve over the naive ensemble baseline on challenging datasets such as Cityscapes, ADE-20K, COCO-Stuff, and PASCAL-Context and set a new state-of-the-art.

         @article{khirodkar2022sequential,
              title={Sequential Ensembling for Semantic Segmentation},
              author={Khirodkar, Rawal and Smith, Brandon and Chandra, Siddhartha and Agrawal, Amit and Criminisi, Antonio},
              journal={arXiv preprint arXiv:2210.05387},
              year={2022}
            }
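
The core mechanism above, conditioning each stage of the cascade on the previous stage's class probabilities by modulating intermediate features, can be sketched as a FiLM/SPADE-style block. The module below is a hypothetical, simplified stand-in for the ADON idea; the layer choices and names are assumptions, not the published architecture.

    import torch
    import torch.nn as nn

    class ProbabilityModulation(nn.Module):
        """Spatially modulate a feature map using the class-probability map
        predicted by the previous model in the cascade (scale-and-shift style)."""
        def __init__(self, num_classes: int, feat_channels: int):
            super().__init__()
            self.gamma = nn.Conv2d(num_classes, feat_channels, kernel_size=1)
            self.beta = nn.Conv2d(num_classes, feat_channels, kernel_size=1)

        def forward(self, feats: torch.Tensor, prev_probs: torch.Tensor) -> torch.Tensor:
            # prev_probs: (B, num_classes, H, W) softmax output of the previous stage,
            # resized to the feature resolution before this call.
            return feats * (1 + self.gamma(prev_probs)) + self.beta(prev_probs)

    # Toy usage inside a second-stage backbone block.
    feats = torch.randn(2, 64, 32, 32)
    prev_probs = torch.softmax(torch.randn(2, 19, 32, 32), dim=1)   # e.g. 19 Cityscapes classes
    out = ProbabilityModulation(num_classes=19, feat_channels=64)(feats, prev_probs)
    print(out.shape)   # torch.Size([2, 64, 32, 32])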
        

Occluded Human Mesh Recovery
Rawal Khirodkar, Shashank Tripathi, Kris Kitani
CVPR 2022

project page | pdf | abstract | bibtex

Top-down methods for monocular human mesh recovery have two stages: (1) detect human bounding boxes; (2) treat each bounding box as an independent single-human mesh recovery task. Unfortunately, the single-human assumption does not hold in images with multi-human occlusion and crowding. Consequently, top-down methods have difficulties in recovering accurate 3D human meshes under severe person-person occlusion. To address this, we present Occluded Human Mesh Recovery (OCHMR) - a novel top-down mesh recovery approach that incorporates image spatial context to overcome the limitations of the single-human assumption. The approach is conceptually simple and can be applied to any existing top-down architecture. Along with the input image, we condition the top-down model on spatial context from the image in the form of body-center heatmaps. To reason from the predicted body centermaps, we introduce Contextual Normalization (CoNorm) blocks to adaptively modulate intermediate features of the top-down model. The contextual conditioning helps our model disambiguate between two severely overlapping human bounding-boxes, making it robust to multi-person occlusion. Compared with state-of-the-art methods, OCHMR achieves superior performance on challenging multi-person benchmarks like 3DPW, CrowdPose and OCHuman. Specifically, our proposed contextual reasoning architecture applied to the SPIN model with ResNet-50 backbone results in 75.2 PMPJPE on 3DPW-PC, 23.6 AP on CrowdPose and 37.7 AP on OCHuman datasets, a significant improvement of 6.9 mm, 6.4 AP and 20.8 AP respectively over the baseline.

         @article{khirodkar2022occluded,
          title={Occluded Human Mesh Recovery},
          author={Khirodkar, Rawal and Tripathi, Shashank and Kitani, Kris},
          journal={arXiv preprint arXiv:2203.13349},
          year={2022}
        }
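
The spatial context OCHMR conditions on is a body-center heatmap covering the people overlapping a detected box. The helper below sketches how such a conditioning channel can be rendered from 2D centers; it is a hypothetical illustration (in the paper the centermaps are predicted, and the conditioning is applied through CoNorm blocks inside the network).

    import numpy as np

    def render_center_heatmap(centers, hw, sigma=8.0):
        """Render a single-channel Gaussian body-centre heatmap for the centres
        that fall inside a bounding-box crop of size hw = (H, W)."""
        H, W = hw
        ys, xs = np.mgrid[0:H, 0:W]
        heat = np.zeros((H, W), dtype=np.float32)
        for cx, cy in centers:
            g = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
            heat = np.maximum(heat, g)          # keep the strongest response per pixel
        return heat

    # Two overlapping people inside one crop: the target and an occluder.
    heat = render_center_heatmap([(64, 80), (90, 70)], hw=(192, 128))
    print(heat.shape, round(float(heat.max()), 2))   # (192, 128) 1.0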
        

Multi-Instance Pose Networks: Rethinking Top-Down Pose Estimation
Rawal Khirodkar, Visesh Chari, Amit Agrawal, Ambrish Tyagi
ICCV 2021

project page | Amazon Science | pdf | abstract | bibtex | code

A key assumption of top-down human pose estimation approaches is their expectation of having a single person/instance present in the input bounding box. This often leads to failures in crowded scenes with occlusions. We propose a novel solution to overcome the limitations of this fundamental assumption. Our Multi-Instance Pose Network (MIPNet) allows for predicting multiple 2D pose instances within a given bounding box. We introduce a Multi-Instance Modulation Block (MIMB) that can adaptively modulate channel-wise feature responses for each instance and is parameter efficient. We demonstrate the efficacy of our approach by evaluating on COCO, CrowdPose, and OCHuman datasets. Specifically, we achieve 70.0 AP on CrowdPose and 42.5 AP on OCHuman test sets, a significant improvement of 2.4 AP and 6.5 AP over the prior art, respectively. When using ground truth bounding boxes for inference, MIPNet achieves an improvement of 0.7 AP on COCO, 0.9 AP on CrowdPose, and 9.1 AP on OCHuman validation sets compared to HRNet. Interestingly, when fewer, high confidence bounding boxes are used, HRNet's performance degrades (by 5 AP) on OCHuman, whereas MIPNet maintains a relatively stable performance (drop of 1 AP) for the same inputs.

         @InProceedings{Khirodkar_2021_ICCV,
              author    = {Khirodkar, Rawal and Chari, Visesh and Agrawal, Amit and Tyagi, Ambrish},
              title     = {Multi-Instance Pose Networks: Rethinking Top-Down Pose Estimation},
              booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
              month     = {October},
              year      = {2021},
              pages     = {3122-3131}
          }
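
The key mechanism above is that a single bounding-box crop is decoded into multiple pose instances by modulating shared backbone features with an instance index. The block below is a hypothetical, much-simplified stand-in for the MIMB (channel-wise gating from an instance embedding); the exact layers are assumptions, not the published architecture.

    import torch
    import torch.nn as nn

    class InstanceModulation(nn.Module):
        """Re-weight the same backbone features channel-wise by an embedding of the
        instance index, so one forward pass per index yields a different pose
        prediction from a single bounding box."""
        def __init__(self, channels: int, max_instances: int = 2):
            super().__init__()
            self.embed = nn.Embedding(max_instances, channels)

        def forward(self, feats: torch.Tensor, instance_idx: torch.Tensor) -> torch.Tensor:
            # feats: (B, C, H, W); instance_idx: (B,) integers in [0, max_instances)
            scale = torch.sigmoid(self.embed(instance_idx))          # (B, C)
            return feats * scale[:, :, None, None]

    feats = torch.randn(1, 256, 64, 48)
    block = InstanceModulation(channels=256)
    pose_feats_0 = block(feats, torch.tensor([0]))   # "primary" person in the box
    pose_feats_1 = block(feats, torch.tensor([1]))   # occluded second person
    print(pose_feats_0.shape, torch.allclose(pose_feats_0, pose_feats_1))  # ... False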
        

RePOSE: Fast 6D Object Pose Refinement via Deep Texture Rendering
Shun Iwase, Xingyu Liu, Rawal Khirodkar, Rio Yokota, Kris M. Kitani
ICCV 2021

pdf | abstract | bibtex | code

The use of iterative pose refinement is a critical processing step for 6D object pose estimation, and its performance depends greatly on one's choice of image representation. Image representations learned via deep convolutional neural networks (CNN) are currently the method of choice as they are able to robustly encode object keypoint locations. However, CNN-based image representations are computationally expensive to use for iterative pose refinement, as they require that image features are extracted using a deep network, once for the input image and multiple times for rendered images during the refinement process. Instead of using a CNN to extract image features from a rendered RGB image, we propose to directly render a deep feature image. We call this deep texture rendering, where a shallow multi-layer perceptron is used to directly regress a view invariant image representation of an object. Using an estimate of the pose and deep texture rendering, our system can render an image representation in under 1ms. This image representation is optimized such that it makes it easier to perform nonlinear 6D pose estimation by adding a differentiable Levenberg-Marquardt optimization network and back-propagating the 6D pose alignment error. We call our method, RePOSE, a Real-time Iterative Rendering and Refinement algorithm for 6D POSE estimation. RePOSE runs at 71 FPS and achieves state-of-the-art accuracy of 51.6% on the Occlusion LineMOD dataset - a 4.1% absolute improvement over the prior art, and comparable performance on the YCB-Video dataset with a much faster runtime than the other pose refinement methods.

          @InProceedings{Iwase_2021_ICCV,
              author    = {Iwase, Shun and Liu, Xingyu and Khirodkar, Rawal and Yokota, Rio and Kitani, Kris M.},
              title     = {RePOSE: Fast 6D Object Pose Refinement via Deep Texture Rendering},
              booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
              month     = {October},
              year      = {2021},
              pages     = {3303-3312}
          }
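
The refinement above is driven by a differentiable Levenberg-Marquardt update on a pose-dependent feature residual. The NumPy sketch below shows the generic LM step (fixed damping) on a toy alignment problem, assuming residual and Jacobian functions are supplied; in the paper the residual compares rendered and observed deep feature images and the parameters are the 6D pose.

    import numpy as np

    def lm_step(residual_fn, jac_fn, params, lam=1e-2):
        """One Levenberg-Marquardt update: delta = (J^T J + lam I)^{-1} J^T r."""
        r = residual_fn(params)                  # (M,) residual vector
        J = jac_fn(params)                       # (M, N) Jacobian d r / d params
        A = J.T @ J + lam * np.eye(J.shape[1])
        return params - np.linalg.solve(A, J.T @ r)

    # Toy problem: fit the 2D translation that aligns two point sets.
    target = np.array([[1.0, 2.0], [3.0, 0.5], [-1.0, 1.0]])
    source = target - np.array([0.7, -0.3])      # ground-truth shift is (0.7, -0.3)
    residual = lambda t: (source + t - target).ravel()
    jacobian = lambda t: np.tile(np.eye(2), (len(target), 1))
    t = np.zeros(2)
    for _ in range(5):
        t = lm_step(residual, jacobian, t)
    print(np.round(t, 3))                        # ~[ 0.7 -0.3]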
        

Adversarial Domain Randomization
Rawal Khirodkar, Kris M. Kitani
arXiv, 2018

pdf | abstract | bibtex

Domain Randomization (DR) is known to require a significant amount of training data for good performance. We argue that this is due to DR's strategy of random data generation using a uniform distribution over simulation parameters, as a result, DR often generates samples which are uninformative for the learner. In this work, we theoretically analyze DR using ideas from multi-source domain adaptation. Based on our findings, we propose Adversarial Domain Randomization (ADR) as an efficient variant of DR which generates adversarial samples with respect to the learner during training. We implement ADR as a policy whose action space is the quantized simulation parameter space. At each iteration, the policy's action generates labeled data and the reward is set as negative of learner's loss on this data. As a result, we observe ADR frequently generates novel samples for the learner like truncated and occluded objects for object detection and confusing classes for image classification. We perform evaluations on datasets like CLEVR, Syn2Real, and VIRAT for various tasks where we demonstrate that ADR outperforms DR by generating fewer data samples.

          @article{DBLP:journals/corr/abs-1812-00491,
            author    = {Rawal Khirodkar and
                         Donghyun Yoo and
                         Kris M. Kitani},
            title     = {{VADRA:} Visual Adversarial Domain Randomization and Augmentation},
            journal   = {CoRR},
            volume    = {abs/1812.00491},
            year      = {2018},
            url       = {http://arxiv.org/abs/1812.00491},
            archivePrefix = {arXiv},
            eprint    = {1812.00491},
            timestamp = {Tue, 01 Jan 2019 15:01:25 +0100},
            biburl    = {https://dblp.org/rec/journals/corr/abs-1812-00491.bib},
            bibsource = {dblp computer science bibliography, https://dblp.org}
          }
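
ADR is described above as a policy over the quantized simulation-parameter space, rewarded according to the learner's loss on the data each action generates. The sketch below shows a minimal REINFORCE-style update for such a categorical policy; the reward function is a stub and the rendering/learner-training step is abstracted away, so this illustrates only the policy update, not the paper's full training loop.

    import numpy as np

    def adversarial_param_sampler(num_bins, reward_fn, iters=500, lr=0.1, seed=0):
        """REINFORCE over a categorical policy whose actions are quantized
        simulation-parameter bins. `reward_fn(a)` stands in for rendering data
        for bin `a`, training/evaluating the learner, and scoring the result."""
        rng = np.random.default_rng(seed)
        logits = np.zeros(num_bins)
        for _ in range(iters):
            probs = np.exp(logits - logits.max()); probs /= probs.sum()
            a = rng.choice(num_bins, p=probs)        # pick a simulation-parameter bin
            reward = reward_fn(a)
            grad = -probs; grad[a] += 1.0            # d log pi(a) / d logits
            logits += lr * reward * grad             # policy-gradient ascent
        return probs

    # Stub reward: pretend bin 3 currently yields the highest-reward data for the policy.
    probs = adversarial_param_sampler(num_bins=8, reward_fn=lambda a: 1.0 if a == 3 else 0.2)
    print(np.argmax(probs))                          # mass concentrates on the highest-reward bin (3)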
        

Domain Randomization for Scene-Specific Car Detection and Pose Estimation
Rawal Khirodkar, Donghyun Yoo, Kris M. Kitani
WACV 2019

pdf | abstract | bibtex | code

We address the issue of domain gap when making use of synthetic data to train a scene-specific object detector and pose estimator. While previous works have shown that the constraints of learning a scene-specific model can be leveraged to create geometrically and photometrically consistent synthetic data, care must be taken to design synthetic content which is as close as possible to the real-world data distribution. In this work, we propose to solve domain gap through the use of appearance randomization to generate a wide range of synthetic objects to span the space of realistic images for training. An ablation study of our results is presented to delineate the individual contribution of different components in the randomization process. We evaluate our method on VIRAT, UA-DETRAC, EPFL-Car datasets, where we demonstrate that using scene specific domain randomized synthetic data is better than fine-tuning off-the-shelf models on limited real data.

          @inproceedings{khirodkar2019domain,
            title={Domain randomization for scene-specific car detection and pose estimation},
            author={Khirodkar, Rawal and Yoo, Donghyun and Kitani, Kris},
            booktitle={2019 IEEE Winter Conference on Applications of Computer Vision (WACV)},
            pages={1932--1940},
            year={2019},
            organization={IEEE}
          }
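
Appearance randomization here means sampling rendering parameters (textures, lighting, object poses, distractors) from broad distributions so the synthetic data spans the space of realistic images. The toy config sampler below is purely illustrative; the parameter names and ranges are assumptions, not taken from the paper.

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_render_config():
        """Sample one randomized synthetic render for a fixed, scene-specific camera."""
        return {
            "car_texture_hue": rng.uniform(0.0, 1.0),          # random re-texturing
            "light_azimuth_deg": rng.uniform(0.0, 360.0),
            "light_intensity": rng.uniform(0.3, 1.5),
            "camera_jitter_m": rng.normal(0.0, 0.05, size=3),  # small perturbation of the scene camera
            "car_yaw_deg": rng.uniform(0.0, 360.0),            # pose label comes for free from the renderer
            "distractor_count": int(rng.integers(0, 6)),
        }

    print(sample_render_config())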
        
Professional Services
  • CVPR (2018, 2020, 2021, 2022, 2023, 2024)
  • ICCV (2019, 2021, 2023)
  • ECCV (2020, 2022)
  • NeurIPS (2022, 2023)
  • ICLR (2022, 2023, 2024)
  • IJCV (2022)
  • TPAMI (2023)
  • WACV (2021, 2022, 2023, 2024)
  • ACCV (2018, 2020, 2021)

Template modified from this and this