Abstract
Causal representation learning has attracted significant research interest in recent years as a means of improving model generalization and robustness. Causal representations of interventional image pairs (also called "actionable counterfactuals" in the literature) have the property that only the variables corresponding to scene elements affected by the intervention / action change between the start state and the end state. While most work in this area has focused on identifying and representing the variables of the scene under a causal model, fewer efforts have focused on representations of the interventions themselves. In this work, we show that an effective strategy for improving out-of-distribution (OOD) robustness is to focus on the representation of actionable counterfactuals in the latent space. Specifically, we propose that an intervention can be represented by a Causal Delta Embedding that is invariant to the visual scene and sparse in terms of the causal variables it affects. Leveraging this insight, we propose a method for learning causal representations from image pairs without any additional supervision. Experiments on the Causal Triplet challenge demonstrate that Causal Delta Embeddings are highly effective in OOD settings, significantly exceeding baseline performance on both synthetic and real-world benchmarks.
Method
Our approach computes a Causal Delta Embedding (CDE) as the element-wise difference between the post- and pre-intervention latent representations produced by a Vision Transformer encoder. The delta embedding serves as the sole input to an action classifier, trained with a combination of cross-entropy, supervised contrastive, and sparsity losses.
Figure 1. (A) Global Causal Delta Embedding model using CLS tokens. (B) Patch-wise Delta Embedding model with Top-K aggregation.
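To make the pipeline concrete, below is a minimal PyTorch sketch of both variants. The frozen-backbone setup, loss weights, Top-K value, and all function names are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeltaEmbeddingClassifier(nn.Module):
    """Global CDE variant (panel A): classify from CLS-token deltas."""

    def __init__(self, encoder: nn.Module, embed_dim: int, num_actions: int):
        super().__init__()
        self.encoder = encoder  # frozen pretrained ViT (e.g. DINO / CLIP / MAE)
        self.head = nn.Linear(embed_dim, num_actions)

    def forward(self, img_pre, img_post):
        with torch.no_grad():                 # keep the backbone frozen
            z_pre = self.encoder(img_pre)     # (B, D) CLS-token features
            z_post = self.encoder(img_post)
        delta = z_post - z_pre                # the Causal Delta Embedding
        return self.head(delta), delta

def sparsity_loss(delta):
    # L1 penalty: an intervention should touch few latent dimensions.
    return delta.abs().mean()

def supcon_loss(delta, labels, temperature=0.1):
    # Supervised contrastive loss: deltas of the same action attract,
    # deltas of different actions repel.
    z = F.normalize(delta, dim=1)
    sim = z @ z.T / temperature                                  # (B, B)
    pos_mask = labels.unsqueeze(0).eq(labels.unsqueeze(1)).float()
    pos_mask.fill_diagonal_(0)                                   # drop self-pairs
    logits = sim - sim.max(dim=1, keepdim=True).values.detach()  # stability
    not_self = 1.0 - torch.eye(len(z), device=z.device)
    denom = (logits.exp() * not_self).sum(dim=1, keepdim=True)
    log_prob = logits - torch.log(denom + 1e-8)
    per_sample = (pos_mask * log_prob).sum(1) / pos_mask.sum(1).clamp(min=1)
    return -per_sample.mean()

def training_step(model, img_pre, img_post, actions, lam_con=0.5, lam_sp=0.01):
    # Combined objective: cross-entropy + supervised contrastive + sparsity.
    logits, delta = model(img_pre, img_post)
    return (F.cross_entropy(logits, actions)
            + lam_con * supcon_loss(delta, actions)
            + lam_sp * sparsity_loss(delta))

def topk_patch_delta(patch_pre, patch_post, k=8):
    # Patch-wise variant (panel B): keep the K patch deltas with the largest
    # magnitude and average them, focusing on the intervened region.
    delta = patch_post - patch_pre                       # (B, N, D)
    idx = delta.norm(dim=-1).topk(k, dim=1).indices      # (B, K)
    idx = idx.unsqueeze(-1).expand(-1, -1, delta.size(-1))
    return delta.gather(1, idx).mean(dim=1)              # (B, D)
```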
Dataset
We evaluate on the Causal Triplet benchmark, which tests causal reasoning under two types of distribution shift (illustrated with a toy example after this list):
- Compositional shift: The model encounters unseen action-object combinations at test time, while object classes remain the same as in training.
- Systematic shift: The model must generalize to entirely novel object categories absent from training.
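As a toy illustration of the two splits (the specific actions and objects below are hypothetical, not the benchmark's label set):

```python
# Hypothetical (action, object) pairs illustrating the two shift types.
train = [("open", "fridge"), ("close", "drawer"), ("open", "microwave")]

# Compositional shift: familiar objects, but in action-object
# combinations never seen during training.
test_compositional = [("close", "fridge"), ("open", "drawer")]

# Systematic shift: object categories entirely absent from training.
test_systematic = [("open", "cabinet"), ("close", "cabinet")]
```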
The benchmark includes synthetic scenes from ProcTHOR (single- and multi-object) and real-world egocentric video from Epic-Kitchens, with increasing visual complexity across settings.
Figure 2. Example interventional image pairs from the three benchmark settings: compositional shift (ProcTHOR) with novel action-object pairs at test time; systematic shift (ProcTHOR) with novel objects at test time; and systematic shift (Epic-Kitchens) with real-world egocentric kitchen activities and novel objects at test time. ■ Blue = IID (train), ■ Red = OOD (test).
Results
Causal Delta Embeddings improve out-of-distribution generalization on the Causal Triplet benchmark. Our method reduces the OOD generalization gap (the drop from IID to OOD accuracy) from 0.56 to 0.21 under systematic distribution shift and achieves near-perfect compositional generalization. The learned delta embeddings also reveal interpretable structure, with opposing actions producing near-anti-parallel embedding vectors.
Single-Object ProcTHOR
| Method | IID Acc. | OOD Comp. | OOD Syst. | Gap (↓) |
|---|---|---|---|---|
| Vanilla-R | 0.96±0.01 | 0.36±0.13 | 0.48±0.08 | 0.48 |
| Vanilla-V | 0.95±0.01 | 0.34±0.27 | 0.47±0.11 | 0.48 |
| ICM-R | 0.95±0.01 | 0.41±0.15 | 0.50±0.09 | 0.45 |
| ICM-V | 0.95±0.01 | 0.38±0.26 | 0.49±0.01 | 0.46 |
| SMS-R | 0.96±0.01 | 0.47±0.18 | 0.54±0.07 | 0.42 |
| SMS-V | 0.95±0.01 | 0.34±0.27 | 0.39±0.04 | 0.56 |
| Ours (ViT-CLIP) | 0.97±0.01 | 0.91±0.03 | 0.72±0.02 | 0.25 |
| Ours (ViT-DINO) | 0.96±0.01 | 0.91±0.02 | 0.75±0.02 | 0.21 |
| Ours (ViT-MAE) | 0.96±0.01 | 0.95±0.01 | 0.71±0.02 | 0.25 |
Table 1. Single-object ProcTHOR results. Gap = IID accuracy − OOD systematic accuracy. Our CDE models improve OOD generalization under both compositional and systematic shifts. R: ResNet-18 backbone, V: ViT-DINO backbone.
Multi-Object & Real-World
| Dataset | Method | IID Acc. | OOD Acc. | Gap (↓) |
|---|---|---|---|---|
| ProcTHOR | ResNet | 0.83±0.01 | 0.30±0.08 | 0.53 |
| | Oracle-mask | 0.90±0.01 | 0.42±0.06 | 0.48 |
| | Slot-avg | 0.49±0.01 | 0.15±0.01 | 0.34 |
| | Slot-dense | 0.51±0.01 | 0.19±0.03 | 0.32 |
| | Slot-match | 0.66±0.01 | 0.21±0.01 | 0.45 |
| | Ours (ViT-MAE) | 0.91±0.01 | 0.30±0.02 | 0.61 |
| | Ours (ViT-DINO) | 0.92±0.00 | 0.45±0.03 | 0.47 |
| | Ours (ViT-CLIP) | 0.94±0.00 | 0.48±0.07 | 0.46 |
| Epic-Kitchens | ResNet | 0.42±0.03 | 0.17±0.03 | 0.25 |
| | CLIP | 0.45±0.02 | 0.24±0.02 | 0.21 |
| | Group-avg | 0.47±0.03 | 0.24±0.03 | 0.23 |
| | Group-dense | 0.50±0.04 | 0.26±0.03 | 0.24 |
| | Group-token | 0.52±0.03 | 0.27±0.03 | 0.25 |
| | Ours (ViT-MAE) | 0.50±0.02 | 0.30±0.02 | 0.20 |
| | Ours (ViT-DINO) | 0.54±0.01 | 0.33±0.00 | 0.21 |
| | Ours (ViT-CLIP) | 0.59±0.03 | 0.34±0.01 | 0.25 |
Table 2. Multi-object ProcTHOR and real-world Epic-Kitchens results under systematic distribution shift. Gap = IID accuracy − OOD accuracy. Oracle-mask uses ground-truth masks to isolate the intervened object.
Learned Embedding Structure
Pairwise cosine similarities between action delta embeddings reveal interpretable structure: opposing actions (e.g., open/close) produce near-anti-parallel vectors.
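A minimal sketch of this analysis, assuming the learned deltas and their action labels are available as tensors (the function name is illustrative):

```python
import torch
import torch.nn.functional as F

def action_similarity_matrix(deltas, labels):
    # deltas: (N, D) learned delta embeddings; labels: (N,) integer action ids.
    actions = labels.unique()
    means = torch.stack([deltas[labels == a].mean(dim=0) for a in actions])
    means = F.normalize(means, dim=1)   # unit-norm per-action prototypes
    # Entry (i, j) is the cosine similarity between actions i and j;
    # opposing actions such as open/close should sit near -1.
    return actions, means @ means.T
```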
BibTeX
@article{alimisis2025learning,
  title={Learning Robust Intervention Representations with Delta Embeddings},
  author={Alimisis, Panagiotis and Diou, Christos},
  journal={arXiv preprint arXiv:2508.04492},
  year={2025}
}