Sapiens2

ICLR 2026

Rawal Khirodkar  ·  He Wen  ·  Julieta Martinez  ·  Yuan Dong  ·  Su Zhaoen  ·  Shunsuke Saito
Meta

Model Family


Pretraining

Sapiens2 is pretrained on a curated corpus of 1 billion human images covering broad demographics, poses, and contexts.

Sapiens: 300 million images
Sapiens2: 1 billion images

Sapiens2 adopts a unified pretraining objective that combines dense and sparse contrastive losses with masked pixel reconstruction — yielding human representations that are simultaneously semantic and faithful to fine image detail.

Figure: pretraining pipeline. Each image yields two views, View1 and View2.

Student Encoder: receives the masked View1 and is updated by ℒ_mae + ℒ_cl.
Teacher Encoder: EMA-updated from the student; provides contrastive targets.
Patch Decoder: reconstructs masked patches in pixel space (MAE objective, ℒ_mae).
Student CLS Decoder: projects to the contrastive embedding (ℒ_cl).
Teacher CLS Decoder: EMA-updated target for the student, with a stop-gradient on the teacher side.
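The student/teacher arrangement described above can be illustrated with a toy sketch in plain Python. It shows only the two mechanics the text names: a reconstruction error computed over masked patches, and an exponential moving average (EMA) update of the teacher from the student. The function names, the dictionary-of-lists weight layout, and the momentum value 0.99 are assumptions made for illustration and are not taken from Sapiens2. The contrastive term is passed in as a plain number because its exact form is not specified here.

```python
def masked_mse(pred, target, mask):
    """Mean squared error over masked patches only.

    pred, target: lists of patches, each patch a list of pixel values.
    mask: list of 0/1 flags; 1 marks a patch that was hidden from the student.
    """
    total, count = 0.0, 0
    for p, t, m in zip(pred, target, mask):
        if m:
            total += sum((a - b) ** 2 for a, b in zip(p, t))
            count += len(p)
    return total / count if count else 0.0


def combined_loss(loss_mae, loss_cl):
    """Student objective as stated in the text: L_mae + L_cl (equal weights assumed)."""
    return loss_mae + loss_cl


def ema_update(teacher, student, momentum=0.99):
    """Move teacher weights toward the student; the teacher gets no gradients.

    teacher, student: dicts mapping a parameter name to a list of floats.
    momentum: assumed value, closer to 1.0 means a slower-moving teacher.
    """
    for name, w_student in student.items():
        teacher[name] = [
            momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher[name], w_student)
        ]
    return teacher


# Toy usage: one masked patch out of two, then one EMA step.
loss_mae = masked_mse([[1.0, 2.0], [0.0, 0.0]], [[0.0, 0.0], [5.0, 5.0]], [1, 0])
loss = combined_loss(loss_mae, loss_cl=0.5)
teacher = ema_update({"w": [0.0, 1.0]}, {"w": [1.0, 1.0]}, momentum=0.9)
```

In a real training loop the student would be updated by backpropagating the combined loss, and the EMA step would run after each optimizer step.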

Emergent Capabilities

Without any task supervision, Sapiens2 learns dense features that align semantically across subjects: in the interactive demo, hovering over a patch on any subject highlights, by cosine similarity, the corresponding patches on the others.
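A minimal sketch of the patch-matching idea, assuming each subject is represented by a grid of per-patch feature vectors. The helper names and the toy two-dimensional features are hypothetical; real features would come from the encoder and have many more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_map(query, patch_grid):
    """Cosine similarity of one query patch feature against every patch in a grid."""
    return [[cosine_similarity(query, feat) for feat in row] for row in patch_grid]


def best_match(query, patch_grid):
    """Return ((row, col), similarity) for the most similar patch in the grid."""
    best_pos, best_sim = None, -2.0  # cosine similarity is never below -1
    for r, row in enumerate(similarity_map(query, patch_grid)):
        for c, sim in enumerate(row):
            if sim > best_sim:
                best_pos, best_sim = (r, c), sim
    return best_pos, best_sim


# Toy usage: a query feature from subject A matched against a 2x2 grid from subject B.
subject_b = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 1.0], [-1.0, 0.0]],
]
position, score = best_match([0.0, 2.0], subject_b)
```

Because cosine similarity ignores vector length, the query `[0.0, 2.0]` matches the patch `[0.0, 1.0]` exactly even though their magnitudes differ.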

Segmentation


Normal


@inproceedings{khirodkar2026sapiens2,
  title     = {Sapiens2: Foundation for Human Vision},
  author    = {Khirodkar, Rawal and Wen, He and Martinez, Julieta and Dong, Yuan and Zhaoen, Su and Saito, Shunsuke},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2026}
}