AnyThermal uses two ViT-B/14 DINOv2 encoders: a frozen RGB teacher and a trainable thermal student, both initialized with pre-trained RGB weights.
Thermal images are converted to three channels and passed through the student; a contrastive loss aligns the CLS-token embeddings of corresponding RGB–thermal pairs, encouraging shared global semantics while relaxing pixel-perfect alignment.
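The exact contrastive objective is not specified here, so the following is a minimal sketch under an assumption: a symmetric InfoNCE loss over a batch of paired CLS embeddings, where matching RGB–thermal pairs sit on the diagonal of the similarity matrix. The function name `cls_contrastive_loss` and the temperature value are illustrative, not from the paper.

```python
import numpy as np

def cls_contrastive_loss(rgb_cls, thermal_cls, temperature=0.07):
    """Symmetric InfoNCE over paired CLS embeddings (illustrative sketch).

    rgb_cls:     (B, D) CLS tokens from the frozen RGB teacher.
    thermal_cls: (B, D) CLS tokens from the trainable thermal student.
    Row i of each array is assumed to come from the same RGB-thermal pair.
    """
    # L2-normalize so the dot product is cosine similarity.
    r = rgb_cls / np.linalg.norm(rgb_cls, axis=1, keepdims=True)
    t = thermal_cls / np.linalg.norm(thermal_cls, axis=1, keepdims=True)

    logits = r @ t.T / temperature          # (B, B); true pairs on the diagonal
    labels = np.arange(len(r))

    def cross_entropy(lg):
        # Numerically stable log-softmax over each row.
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[np.arange(len(lg)), labels].mean()

    # Average the RGB->thermal and thermal->RGB directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Because only the global CLS tokens are compared, the loss pulls corresponding images together in embedding space without demanding dense, pixel-level correspondence between the two modalities.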
Distillation is performed across multiple RGB–thermal datasets, enabling the backbone to learn environment-agnostic thermal representations that generalize across tasks and domains.
In particular, AnyThermal is trained on five datasets spanning diverse environments: the urban driving datasets ViVID++ (outdoor sequences), STheREo, and Freiburg; the Boson Nighttime Dataset for aerial scenarios; and TartanRGBT (ours), which covers urban driving as well as indoor and off-road environments.