Model Family
Pretraining
Sapiens2 is pretrained on a curated corpus of 1 billion human images covering broad demographics, poses, and contexts.
Sapiens2 adopts a unified pretraining objective that combines dense and sparse contrastive losses with masked pixel reconstruction, yielding human representations that are both semantic and faithful to fine image detail.
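One way to read the combined objective is as a weighted sum of its three terms. The form below is only a sketch: the weights \lambda_d, \lambda_s, \lambda_r, the mask set M, and the mean-squared-error reconstruction term (in the style of masked autoencoders) are assumptions made for illustration and are not stated on this page.

```latex
% Assumed overall form: weighted sum of the three pretraining terms.
\mathcal{L}
  = \lambda_{d}\,\mathcal{L}_{\text{dense}}
  + \lambda_{s}\,\mathcal{L}_{\text{sparse}}
  + \lambda_{r}\,\mathcal{L}_{\text{rec}}

% Assumed reconstruction term: mean squared error over the masked patches M,
% where x_i is the original pixel patch and \hat{x}_i its reconstruction.
\mathcal{L}_{\text{rec}}
  = \frac{1}{|M|} \sum_{i \in M} \lVert \hat{x}_i - x_i \rVert_2^2
```

Under this reading, the contrastive terms would push the features toward semantic consistency, while the reconstruction term would keep them tied to pixel-level detail.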
Emergent Capabilities
Without any task supervision, Sapiens2 learns dense features that align semantically across subjects: the cosine similarity between a patch selected on one subject and the patches of the other subjects peaks at the corresponding patches.
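The cross-subject lookup described above can be sketched in a few lines of Python. This is a minimal illustration, not the Sapiens2 code: the function names, the (row, col) patch indexing, and the tiny 2x2 grids of 3-dimensional features are all hypothetical stand-ins for real dense features extracted by the model.

```python
import math


def l2_normalize(vec):
    """Scale a feature vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm == 0.0:
        return list(vec)
    return [v / norm for v in vec]


def cosine_similarity_map(query, feature_grid):
    """Cosine similarity between one query feature and every patch of a grid.

    feature_grid is a list of rows, each row a list of per-patch feature vectors.
    """
    q = l2_normalize(query)
    return [
        [sum(a * b for a, b in zip(q, l2_normalize(feat))) for feat in row]
        for row in feature_grid
    ]


def best_match(sim_map):
    """Return the (row, col) of the patch with the highest similarity."""
    coords = [(r, c) for r, row in enumerate(sim_map) for c in range(len(row))]
    return max(coords, key=lambda rc: sim_map[rc[0]][rc[1]])


# Toy features (hypothetical): in practice these would be dense patch
# features produced by the pretrained encoder for two different subjects.
subject_a = [
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    [[0.0, 0.0, 1.0], [1.0, 1.0, 0.0]],
]
subject_b = [
    [[0.0, 0.9, 0.1], [0.8, 0.1, 0.0]],
    [[0.7, 0.7, 0.0], [0.1, 0.0, 0.9]],
]

query = subject_a[0][1]  # patch selected on subject A at (row 0, col 1)
sim_map = cosine_similarity_map(query, subject_b)
match = best_match(sim_map)
print("best matching patch on subject B:", match)
```

With real features, the same similarity map could be rendered as a heatmap over the second subject; an array library would normally replace the nested lists for grids of realistic size.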
Segmentation
Normal
@inproceedings{khirodkar2026sapiens2,
  title     = {Sapiens2: Foundation for Human Vision},
  author    = {Khirodkar, Rawal and Wen, He and Martinez, Julieta and Dong, Yuan and Zhaoen, Su and Saito, Shunsuke},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2026}
}