Rawal Khirodkar

Rawal Khirodkar
email:

I am a Research Scientist at Meta. I received Ph.D. in Robotics (2017-2023) from Carnegie Mellon University advised by Prof. Kris Kitani. I did Bachelors (2013-2017) in Computer Science from IIT Bombay.

My research is focused on reconstructing photorealistic humans in 3D from real world videos. If you are interested in interning at Meta with me, please send an email with your resume and research interests.

News

March, 2025. Pippo accepted to CVPR 2025.
Jan, 2025. Co-organizing the 2nd Edition of Humans Workshop at CVPR 2025.
Sept, 2024. Harmony4D accepted to NeurIPS 2024.
Aug, 2024. Sapiens accepted to ECCV 2024 - Best Paper Candidate.
July, 2024. URAvatar accepted to SIGGRAPH Asia 2024.
June, 2024. EgoHumans received the EgoVis Distinguished Paper Award at CVPR 2024.

Selected Publications

	Pippo : High-Res Multi-View Humans from a Single Image Yash Kant, Ethan Weber, Jin Kyu Kim, Rawal Khirodkar, Su Zhaoen, Julieta Martinez, Igor Gilitschenski, Shunsuke Saito, Timur Bagautdinov CVPR 2025. project page \| pdf \| abstract \| code We present Pippo, a generative model capable of producing 1K resolution dense turnaround videos of a person from a single casually clicked photo. Pippo is a multi-view diffusion transformer and does not require any additional inputs - e.g., a fitted parametric model or camera parameters of the input image. We pre-train Pippo on 3B human images without captions, and conduct multi-view mid-training and post-training on studio captured humans. During mid-training, to quickly absorb the studio dataset, we denoise several (up to 48) views at low-resolution, and encode target cameras coarsely using a shallow MLP. During post-training, we denoise fewer views at high-resolution and use pixel-aligned controls (e.g., Spatial anchor and Plucker rays) to enable 3D consistent generations. At inference, we propose an attention biasing technique that allows Pippo to simultaneously generate greater than 5 times as many views as seen during training. Finally, we also introduce an improved metric to evaluate 3D consistency of multi-view generations, and show that Pippo outperforms existing works on multi-view human generation from a single image. @article{kant2025pippo, title={Pippo: High-Resolution Multi-View Humans from a Single Image}, author={Kant, Yash and Weber, Ethan and Kim, Jin Kyu and Khirodkar, Rawal and Zhaoen, Su and Martinez, Julieta and Gilitschenski, Igor and Saito, Shunsuke and Bagautdinov, Timur}, journal={arXiv preprint arXiv:2502.07785}, year={2025} }
	Harmony4D: A Video Dataset for In-The-Wild Close Human Interactions Rawal Khirodkar, Jyun-Ting Song, Jinkun Cao, Zen Luo, Kris Kitani NeurIPS 2024, Datasets and Benchmarks Track. project page \| pdf \| abstract \| code \| data Understanding how humans interact with each other is key to building realistic multi-human virtual reality systems. This area remains relatively unexplored due to the lack of large-scale datasets. Recent datasets focusing on this issue mainly consist of activities captured entirely in controlled indoor environments with choreographed actions, significantly affecting their diversity. To address this, we introduce Harmony4D, a multi-view video dataset for human-human interaction featuring in-the-wild activities such as wrestling, dancing, MMA, and more. We use a flexible multi-view capture system to record these dynamic activities and provide annotations for human detection, tracking, 2D/3D pose estimation, and mesh recovery for closely interacting subjects. We propose a novel markerless algorithm to track 3D human poses in severe occlusion and close interaction to obtain our annotations with minimal manual intervention. Harmony4D consists of 1.66 million images and 3.32 million human instances from more than 20 synchronized cameras with 208 video sequences spanning diverse environments and 24 unique subjects. We rigorously evaluate existing state-of-the-art methods for mesh recovery and highlight their significant limitations in modeling close interaction scenarios. Additionally, we fine-tune a pre-trained HMR2.0 model on Harmony4D and demonstrate an improved performance of 54.8% PVE in scenes with severe occlusion and contact. @article{khirodkar2024harmony4d, title={Harmony4D: A Video Dataset for In-The-Wild Close Human Interactions}, author={Khirodkar, Rawal and Song, Jyun-Ting and Cao, Jinkun and Luo, Zhengyi and Kitani, Kris}, journal={arXiv preprint arXiv:2410.20294}, year={2024} }
	URAvatar: Universal Relightable Gaussian Codec Avatars Junxuan Li, Chen Cao, Gabriel Schwartz, Rawal Khirodkar, Christian Richardt, Tomas Simon, Yaser Sheikh, and Shunsuke Saito SIGGRAPH ASIA 2024. project page \| pdf \| abstract \| code We present a new approach to creating photorealistic and relightable head avatars from a phone scan with unknown illumination. The reconstructed avatars can be animated and relit in real time with the global illumination of diverse environments. Unlike existing approaches that estimate parametric reflectance parameters via inverse rendering, our approach directly models learnable radiance transfer that incorporates global light transport in an efficient manner for real-time rendering. However, learning such a complex light transport that can generalize across identities is non-trivial. A phone scan in a single environment lacks sufficient information to infer how the head would appear in general environments. To address this, we build a universal relightable avatar model represented by 3D Gaussians. We train on hundreds of high-quality multi-view human scans with controllable point lights. High-resolution geometric guidance further enhances the reconstruction accuracy and generalization. Once trained, we finetune the pretrained model on a phone scan using inverse rendering to obtain a personalized relightable avatar. Our experiments establish the efficacy of our design, outperforming existing approaches while retaining real-time rendering capability. @inproceedings{li2024uravatar, title={URAvatar: Universal Relightable Gaussian Codec Avatars}, author={Li, Junxuan and Cao, Chen and Schwartz, Gabriel and Khirodkar, Rawal and Richardt, Christian and Simon, Tomas and Sheikh, Yaser and Saito, Shunsuke}, booktitle={SIGGRAPH Asia 2024 Conference Papers}, pages={1--11}, year={2024} }
	Sapiens: Foundation for Human Vision Models Rawal Khirodkar, Timur Bagautdinov, Julieta Martinez, Su Zhaoen, Austin James, Peter Selednik, Stuart Anderson, Shunsuke Saito ECCV 2024 (Best Paper Candidate). Top 15 papers. project page \| pdf \| abstract \| code We present Sapiens, a family of models for four fundamental human-centric vision tasks -- 2D pose estimation, body-part segmentation, depth estimation, and surface normal prediction. Our models natively support 1K high-resolution inference and are extremely easy to adapt for individual tasks by simply fine-tuning models pretrained on over 300 million in-the-wild human images. We observe that, given the same computational budget, self-supervised pretraining on a curated dataset of human images significantly boosts the performance for a diverse set of human-centric tasks. The resulting models exhibit remarkable generalization to in-the-wild data, even when labeled data is scarce or entirely synthetic. Our simple model design also brings scalability -- model performance across tasks improves as we scale the number of parameters from 0.3 to 2 billion. Sapiens consistently surpasses existing baselines across various human-centric benchmarks. We achieve significant improvements over the prior state-of-the-art on Humans-5K (pose) by 7.6 mAP, Humans-2K (part-seg) by 17.1 mIoU, Hi4D (depth) by 22.4% relative RMSE, and THuman2 (normal) by 53.5% relative angular error @inproceedings{khirodkar2025sapiens, title={Sapiens: Foundation for human vision models}, author={Khirodkar, Rawal and Bagautdinov, Timur and Martinez, Julieta and Zhaoen, Su and James, Austin and Selednik, Peter and Anderson, Stuart and Saito, Shunsuke}, booktitle={European Conference on Computer Vision}, pages={206--228}, year={2025}, organization={Springer} }
	Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives Kristen Grauman et al. CVPR 2024 (Oral). Oral/Accepted: 3.3%. project page \| pdf \| abstract \| blog \| video We present Ego-Exo4D, a diverse, large-scale multimodal multiview video dataset and benchmark challenge. Ego-Exo4D centers around simultaneously-captured egocentric and exocentric video of skilled human activities (e.g., sports, music, dance, bike repair). More than 800 participants from 13 cities worldwide performed these activities in 131 different natural scene contexts, yielding long-form captures from 1 to 42 minutes each and 1,422 hours of video combined. The multimodal nature of the dataset is unprecedented: the video is accompanied by multichannel audio, eye gaze, 3D point clouds, camera poses, IMU, and multiple paired language descriptions -- including a novel "expert commentary" done by coaches and teachers and tailored to the skilled-activity domain. To push the frontier of first-person video understanding of skilled human activity, we also present a suite of benchmark tasks and their annotations, including fine-grained activity understanding, proficiency estimation, cross-view translation, and 3D hand/body pose. All resources will be open sourced to fuel new research in the community. @article{grauman2023ego, title={Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives}, author={Grauman, Kristen and Westbury, Andrew and Torresani, Lorenzo and Kitani, Kris and Malik, Jitendra and Afouras, Triantafyllos and Ashutosh, Kumar and Baiyya, Vijay and Bansal, Siddhant and Boote, Bikram and others}, journal={arXiv preprint arXiv:2311.18259}, year={2023} }
	Real-Time Simulated Avatar from Head-Mounted Sensors Zhengyi Luo, Jinkun Cao, Rawal Khirodkar, Alexander Winkler, Jing Huang, Kris Kitani, Weipeng Xu CVPR 2024 (Highlight). Highlight/Accepted: 11.9%. project page \| code
	Multi-Human 3D Reconstruction from In-the-Wild Videos Committee members: Kris Kitani, Deva Ramanan, Angjoo Kanazawa, Shubham Tulsiani Ph.D. Thesis, Carnegie Mellon University, 2023
	EgoHumans: An Egocentric 3D Multi-Human Benchmark Rawal Khirodkar, Aayush Bansal, Lingni Ma, Richard Newcombe, Minh Vo, Kris Kitani ICCV 2023 (Oral). Oral/Accepted: 9.0%. EgoVis 2023 Distinguished Paper Award. project page \| pdf \| abstract \| bibtex \| code We present EgoHumans, a new multi-view multi-human video benchmark to advance the state-of-the-art of egocentric human 3D pose estimation and tracking. Existing egocentric benchmarks either capture single subject or indoor-only scenarios, which limit the generalization of computer vision algorithms for real-world applications. We propose a novel 3D capture setup to construct a comprehensive egocentric multi-human benchmark in the wild with annotations to support diverse tasks such as human detection, tracking, 2D/3D pose estimation, and mesh recovery. We leverage consumer-grade wearable camera-equipped glasses for the egocentric view, which enables us to capture dynamic activities like playing soccer, fencing, volleyball, etc. Furthermore, our multi-view setup generates accurate 3D ground truth even under severe or complete occlusion. The dataset consists of more than 125k egocentric images, spanning diverse scenes with a particular focus on challenging and unchoreographed multi-human activities and fast-moving egocentric views. We rigorously evaluate existing state-of-the-art methods and highlight their limitations in the egocentric scenario, specifically on multi-human tracking. To address such limitations, we propose EgoFormer, a novel approach with a multi-stream transformer architecture and explicit 3D spatial reasoning to estimate and track the human pose. EgoFormer significantly outperforms prior art by 13.6 IDF1 and 9.3 HOTA on the EgoHumans dataset. @article{khirodkar2023egohumans, title={EgoHumans: An Egocentric 3D Multi-Human Benchmark}, author={Khirodkar, Rawal and Bansal, Aayush and Ma, Lingni and Newcombe, Richard and Vo, Minh and Kitani, Kris}, journal={arXiv preprint arXiv:2305.16487}, year={2023} }
	Observation-Centric SORT: Rethinking SORT for Robust Multi-Object Tracking Jinkun Cao, Xinshuo Weng, Rawal Khirodkar, Jiangmiao Pang, Kris Kitani CVPR 2023 pdf \| abstract \| bibtex \| code Multi-Object Tracking (MOT) has rapidly progressed with the development of object detection and re-identification. However, motion modeling, which facilitates object association by forecasting short-term trajectories with past observations, has been relatively under-explored in recent years. Current motion models in MOT typically assume that the object motion is linear in a small time window and needs continuous observations, so these methods are sensitive to occlusions and non-linear motion and require high frame-rate videos. In this work, we show that a simple motion model can obtain state-of-the-art tracking performance without other cues like appearance. We emphasize the role of "observation" when recovering tracks from being lost and reducing the error accumulated by linear motion models during the lost period. We thus name the proposed method as Observation-Centric SORT, OC-SORT for short. It remains simple, online, and real-time but improves robustness over occlusion and non-linear motion. It achieves 63.2 and 62.1 HOTA on MOT17 and MOT20, respectively, surpassing all published methods. It also sets new states of the art on KITTI Pedestrian Tracking and DanceTrack where the object motion is highly non-linear. @article{cao2022observation, title={Observation-centric sort: Rethinking sort for robust multi-object tracking}, author={Cao, Jinkun and Weng, Xinshuo and Khirodkar, Rawal and Pang, Jiangmiao and Kitani, Kris}, journal={arXiv preprint arXiv:2203.14360}, year={2022} }
	Occluded Human Mesh Recovery Rawal Khirodkar, Shashank Tripathi, Kris Kitani CVPR 2022 project page \| pdf \| abstract \| bibtex Top-down methods for monocular human mesh recovery have two stages: (1) detect human bounding boxes; (2) treat each bounding box as an independent single-human mesh recovery task. Unfortunately, the single-human assumption does not hold in images with multi-human occlusion and crowding. Consequently, top-down methods have difficulties in recovering accurate 3D human meshes under severe person-person occlusion. To address this, we present Occluded Human Mesh Recovery (OCHMR) - a novel top-down mesh recovery approach that incorporates image spatial context to overcome the limitations of the single-human assumption. The approach is conceptually simple and can be applied to any existing top-down architecture. Along with the input image, we condition the top-down model on spatial context from the image in the form of body-center heatmaps. To reason from the predicted body centermaps, we introduce Contextual Normalization (CoNorm) blocks to adaptively modulate intermediate features of the top-down model. The contextual conditioning helps our model disambiguate between two severely overlapping human bounding-boxes, making it robust to multi-person occlusion. Compared with state-of-the-art methods, OCHMR achieves superior performance on challenging multi-person benchmarks like 3DPW, CrowdPose and OCHuman. Specifically, our proposed contextual reasoning architecture applied to the SPIN model with ResNet-50 backbone results in 75.2 PMPJPE on 3DPW-PC, 23.6 AP on CrowdPose and 37.7 AP on OCHuman datasets, a significant improvement of 6.9 mm, 6.4 AP and 20.8 AP respectively over the baseline. @article{khirodkar2022occluded, title={Occluded Human Mesh Recovery}, author={Khirodkar, Rawal and Tripathi, Shashank and Kitani, Kris}, journal={arXiv preprint arXiv:2203.13349}, year={2022} }
	Multi-Instance Pose Networks: Rethinking Top-Down Pose Estimation Rawal Khirodkar, Visesh Chari, Amit Agrawal, Ambrish Tyagi ICCV 2021 project page \| Amazon Science \| pdf \| abstract \| bibtex \| code A key assumption of top-down human pose estimation approaches is their expectation of having a single person/instance present in the input bounding box. This often leads to failures in crowded scenes with occlusions. We propose a novel solution to overcome the limitations of this fundamental assumption. Our Multi-Instance Pose Network (MIPNet) allows for predicting multiple 2D pose instances within a given bounding box. We introduce a Multi-Instance Modulation Block (MIMB) that can adaptively modulate channel-wise feature responses for each instance and is parameter efficient. We demonstrate the efficacy of our approach by evaluating on COCO, CrowdPose, and OCHuman datasets. Specifically, we achieve 70.0 AP on CrowdPose and 42.5 AP on OCHuman test sets, a significant improvement of 2.4 AP and 6.5 AP over the prior art, respectively. When using ground truth bounding boxes for inference, MIPNet achieves an improvement of 0.7 AP on COCO, 0.9 AP on CrowdPose, and 9.1 AP on OCHuman validation sets compared to HRNet. Interestingly, when fewer, high confidence bounding boxes are used, HRNet's performance degrades (by 5 AP) on OCHuman, whereas MIPNet maintains a relatively stable performance (drop of 1 AP) for the same inputs. @InProceedings{Khirodkar_2021_ICCV, author = {Khirodkar, Rawal and Chari, Visesh and Agrawal, Amit and Tyagi, Ambrish}, title = {Multi-Instance Pose Networks: Rethinking Top-Down Pose Estimation}, booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)}, month = {October}, year = {2021}, pages = {3122-3131} }
	RePOSE: Fast 6D Object Pose Refinement via Deep Texture Rendering Shun Iwase, Xingyu Liu, Rawal Khirodkar, Rio Yokota, Kris M. Kitani ICCV 2021 pdf \| abstract \| bibtex \| code The use of iterative pose refinement is a critical processing step for 6D object pose estimation, and its performance depends greatly on one's choice of image representation. Image representations learned via deep convolutional neural networks (CNN) are currently the method of choice as they are able to robustly encode object keypoint locations. However, CNN-based image representations are computational expensive to use for iterative pose refinement, as they require that image features are extracted using a deep network, once for the input image and multiple times for rendered images during the refinement process. Instead of using a CNN to extract image features from a rendered RGB image, we propose to directly render a deep feature image. We call this deep texture rendering, where a shallow multi-layer perceptron is used to directly regress a view invariant image representation of an object. Using an estimate of the pose and deep texture rendering, our system can render an image representation in under 1ms. This image representation is optimized such that it makes it easier to perform nonlinear 6D pose estimation by adding a differentiable Levenberg-Marquardt optimization network and back-propagating the 6D pose alignment error. We call our method, RePOSE, a Real-time Iterative Rendering and Refinement algorithm for 6D POSE estimation. RePOSE runs at 71 FPS and achieves state-of-the-art accuracy of 51.6% on the Occlusion LineMOD dataset - a 4.1% absolute improvement over the prior art, and comparable performance on the YCB-Video dataset with a much faster runtime than the other pose refinement methods. @InProceedings{Iwase_2021_ICCV, author = {Iwase, Shun and Liu, Xingyu and Khirodkar, Rawal and Yokota, Rio and Kitani, Kris M.}, title = {RePOSE: Fast 6D Object Pose Refinement via Deep Texture Rendering}, booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)}, month = {October}, year = {2021}, pages = {3303-3312} }

Professional Services
	CVPR (2018, 2020, 2021, 2022, 2023, 2024, 2025) ICCV (2019, 2021, 2023) ECCV (2020, 2022, 2024) NeurIPS (2022, 2023, 2024) ICLR (2022, 2023, 2024, 2025) IJCV (2022) TPAMI (2023) WACV (2021, 2022, 2023, 2024) ACCV (2018, 2020, 2021)

Template modified from here