Learning Robust Intervention Representations with Delta Embeddings

Harokopio University of Athens
Causal Delta Embeddings overview: scene-invariant, sparse intervention representations learned from image pairs

Causal Delta Embeddings represent interventions as scene-invariant, sparse vectors in latent space, enabling robust out-of-distribution generalization for intervention classification from image pairs.

Abstract

Causal representation learning has attracted significant research interest in the past few years as a means of improving model generalization and robustness. Causal representations of interventional image pairs (also called "actionable counterfactuals" in the literature) have the property that only the variables corresponding to scene elements affected by the intervention / action change between the start state and the end state. While most work in this area has focused on identifying and representing the variables of the scene under a causal model, fewer efforts have focused on representations of the interventions themselves. In this work, we show that an effective strategy for improving out-of-distribution (OOD) robustness is to focus on the representation of actionable counterfactuals in the latent space. Specifically, we propose that an intervention can be represented by a Causal Delta Embedding that is invariant to the visual scene and sparse in terms of the causal variables it affects. Leveraging this insight, we propose a method for learning causal representations from image pairs without any additional supervision. Experiments on the Causal Triplet challenge demonstrate that Causal Delta Embeddings are highly effective in OOD settings, significantly exceeding baseline performance in both synthetic and real-world benchmarks.

Method

Our approach computes a Causal Delta Embedding as the element-wise difference between post- and pre-intervention latent representations from a Vision Transformer encoder. The delta embedding serves as the sole input to an action classifier, trained with a combination of cross-entropy, supervised contrastive, and sparsity losses.
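The computation above can be sketched as follows. This is a minimal NumPy stand-in, not the paper's implementation: the helper names, the linear classifier, and the 0.1 sparsity weight are illustrative assumptions, and the supervised contrastive term is omitted for brevity.

```python
import numpy as np

def delta_embedding(z_pre, z_post):
    """Causal Delta Embedding: element-wise difference of latent representations."""
    return z_post - z_pre

def sparsity_loss(delta):
    """L1 penalty encouraging the delta to affect few latent dimensions."""
    return np.abs(delta).mean()

def cross_entropy(logits, label):
    """Standard cross-entropy on the action-classifier logits."""
    logits = logits - logits.max()  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[label]

# Toy example: 8-dim latents standing in for ViT encoder outputs (placeholder values).
rng = np.random.default_rng(0)
z_pre, z_post = rng.normal(size=8), rng.normal(size=8)
delta = delta_embedding(z_pre, z_post)

W = rng.normal(size=(4, 8))  # hypothetical linear action classifier over 4 actions
loss = cross_entropy(W @ delta, label=2) + 0.1 * sparsity_loss(delta)
```

Note that the classifier never sees the raw pre/post latents, only their difference, which is what makes the representation scene-invariant.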

Model architecture: Global Causal Delta Embedding and Patch-Wise Delta Embedding models

(A) Global Causal Delta Embedding model using CLS tokens. (B) Patch-Wise Delta Embedding model with Top-K aggregation.
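The patch-wise variant (B) can be sketched as below. This assumes the Top-K aggregation ranks patches by the norm of their delta and mean-pools the K largest; the exact aggregation in the paper may differ.

```python
import numpy as np

def topk_patch_delta(patches_pre, patches_post, k=4):
    """Aggregate per-patch deltas by averaging the k patches that change most.

    patches_pre / patches_post: (num_patches, dim) ViT patch-token embeddings.
    """
    deltas = patches_post - patches_pre         # per-patch differences
    norms = np.linalg.norm(deltas, axis=1)      # magnitude of change per patch
    top = np.argsort(norms)[-k:]                # indices of the k largest changes
    return deltas[top].mean(axis=0)             # pooled delta embedding

# Toy scene with 16 patches, where only patch 3 is affected by the intervention.
rng = np.random.default_rng(1)
pre = rng.normal(size=(16, 8))
post = pre.copy()
post[3] += 5.0
agg = topk_patch_delta(pre, post, k=1)          # recovers the delta of patch 3
```

The Top-K step enforces spatial sparsity: patches untouched by the intervention contribute (near-)zero deltas and are excluded from the pooled embedding.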

Dataset

We evaluate on the Causal Triplet benchmark, which tests causal reasoning under two types of distribution shift:

  • Compositional shift: The model encounters unseen action-object combinations at test time, while object classes remain the same as in training.
  • Systematic shift: The model must generalize to entirely novel object categories absent from training.

The benchmark includes synthetic scenes from ProcTHOR (single- and multi-object) and real-world egocentric video from Epic-Kitchens, with increasing visual complexity across settings.

Compositional distribution shift in ProcTHOR: IID (blue) and OOD (red) action-object combinations

Compositional shift (ProcTHOR): novel action-object pairs at test time.

Systematic distribution shift in ProcTHOR: IID training objects vs novel OOD objects

Systematic shift (ProcTHOR): novel objects at test time.

Systematic distribution shift in Epic-Kitchens: IID vs OOD object splits

Systematic shift (Epic-Kitchens): real-world egocentric kitchen activities with novel objects at test time.

■ Blue = IID (train), ■ Red = OOD (test).

Results

Causal Delta Embeddings improve out-of-distribution generalization on the Causal Triplet benchmark. Our method reduces the OOD generalization gap from 0.56 to 0.21 on systematic distribution shifts and achieves near-perfect compositional generalization. The learned delta embeddings also reveal interpretable structure, with opposing actions producing near-anti-parallel embedding vectors.

Single-Object ProcTHOR

| Method          | IID Acc.  | OOD Comp. | OOD Syst. | Gap (↓) |
|-----------------|-----------|-----------|-----------|---------|
| Vanilla-R       | 0.96±0.01 | 0.36±0.13 | 0.48±0.08 | 0.48    |
| Vanilla-V       | 0.95±0.01 | 0.34±0.27 | 0.47±0.11 | 0.48    |
| ICM-R           | 0.95±0.01 | 0.41±0.15 | 0.50±0.09 | 0.45    |
| ICM-V           | 0.95±0.01 | 0.38±0.26 | 0.49±0.01 | 0.46    |
| SMS-R           | 0.96±0.01 | 0.47±0.18 | 0.54±0.07 | 0.42    |
| SMS-V           | 0.95±0.01 | 0.34±0.27 | 0.39±0.04 | 0.56    |
| Ours (ViT-CLIP) | 0.97±0.01 | 0.91±0.03 | 0.72±0.02 | 0.25    |
| Ours (ViT-DINO) | 0.96±0.01 | 0.91±0.02 | 0.75±0.02 | 0.21    |
| Ours (ViT-MAE)  | 0.96±0.01 | 0.95±0.01 | 0.71±0.02 | 0.25    |

Table 1. Single-object ProcTHOR results. Our CDE models improve OOD generalization under both compositional and systematic shifts. Gap is IID accuracy minus OOD systematic accuracy. R: ResNet-18, V: ViT-DINO backbone.

Multi-Object & Real-World

| Dataset       | Method          | IID Acc.  | OOD Acc.  | Gap (↓) |
|---------------|-----------------|-----------|-----------|---------|
| ProcTHOR      | ResNet          | 0.83±0.01 | 0.30±0.08 | 0.53    |
|               | Oracle-mask     | 0.90±0.01 | 0.42±0.06 | 0.48    |
|               | Slot-avg        | 0.49±0.01 | 0.15±0.01 | 0.34    |
|               | Slot-dense      | 0.51±0.01 | 0.19±0.03 | 0.32    |
|               | Slot-match      | 0.66±0.01 | 0.21±0.01 | 0.45    |
|               | Ours (ViT-MAE)  | 0.91±0.01 | 0.30±0.02 | 0.61    |
|               | Ours (ViT-DINO) | 0.92±0.00 | 0.45±0.03 | 0.47    |
|               | Ours (ViT-CLIP) | 0.94±0.00 | 0.48±0.07 | 0.46    |
| Epic-Kitchens | ResNet          | 0.42±0.03 | 0.17±0.03 | 0.25    |
|               | CLIP            | 0.45±0.02 | 0.24±0.02 | 0.21    |
|               | Group-avg       | 0.47±0.03 | 0.24±0.03 | 0.23    |
|               | Group-dense     | 0.50±0.04 | 0.26±0.03 | 0.24    |
|               | Group-token     | 0.52±0.03 | 0.27±0.03 | 0.25    |
|               | Ours (ViT-MAE)  | 0.50±0.02 | 0.30±0.02 | 0.20    |
|               | Ours (ViT-DINO) | 0.54±0.01 | 0.33±0.00 | 0.21    |
|               | Ours (ViT-CLIP) | 0.59±0.03 | 0.34±0.01 | 0.25    |

Table 2. Multi-object ProcTHOR and real-world Epic-Kitchens results under systematic distribution shift. Oracle-mask uses ground truth masks to isolate the intervened object.

Learned Embedding Structure

Heatmap of pairwise cosine similarities between delta embeddings showing anti-parallel structure for opposing actions

Pairwise cosine similarities between action delta embeddings reveal interpretable structure: opposing actions (e.g., open/close) produce near-anti-parallel vectors.
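The similarity analysis in the figure can be reproduced schematically as follows. The action deltas here are toy vectors chosen so that "close" is the exact negation of "open"; in practice the learned embeddings are only near-anti-parallel.

```python
import numpy as np

def pairwise_cosine(embeddings):
    """Cosine similarity matrix between per-action delta embeddings (rows)."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return normed @ normed.T

# Illustrative 4-dim delta embeddings for three hypothetical actions.
open_delta = np.array([1.0, 0.0, 2.0, 0.0])
close_delta = -open_delta                      # opposing action: negated delta
toggle_delta = np.array([0.0, 3.0, 0.0, 1.0])  # unrelated action

sims = pairwise_cosine(np.stack([open_delta, close_delta, toggle_delta]))
# sims[0, 1] → -1.0: opposing actions are anti-parallel in the delta space.
```

An anti-parallel pair means the two interventions move the same latent variables in opposite directions, which is exactly the structure one would expect from inverse actions such as open/close.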

BibTeX

@article{alimisis2025learning,
  title={Learning Robust Intervention Representations with Delta Embeddings},
  author={Alimisis, Panagiotis and Diou, Christos},
  journal={arXiv preprint arXiv:2508.04492},
  year={2025}
}