When to Align, When to Predict?

A Phase Diagram for Multimodal Self-Supervised Learning

Ilay Kamai1*  ·  Hugues Van Assel2  ·  Aviv Regev2  ·  Hagai B. Perets1  ·  Randall Balestriero3,4

1Technion  ·  2Genentech  ·  3Brown University  ·  4Meta AI, FAIR
*Corresponding author: ilay.kamai@campus.technion.ac.il

Phase diagram animation sweeping noise parameters

Abstract

Cross-modal alignment (CA) and cross-modal prediction (CP) are the dominant paradigms for multimodal representation learning, yet there is no systematic understanding of when each succeeds, when each fails, and when cross-modal training helps at all. This gap leaves practitioners unable to diagnose why standard methods underperform the best single modality, especially in scientific domains such as biomedicine or astrophysics, which involve heterogeneous instruments and multiple levels of organization and measurement. We develop a unified linear framework that addresses both questions. Under a spiked signal-plus-noise model with structured cross-modal nuisance correlation, we derive separation ratios for both objectives that expose complementary failure modes: alignment whitens each modality and fails when nuisance is strongly correlated across views; prediction encodes whatever is cross-predictable through a one-sided whitening, with recovery governed by source-modality quality. The resulting phase diagram partitions multimodal problems into four regimes: Both, CA only, CP only, and Neither. We present a data-driven procedure to locate real-world datasets in this diagram using a small labeled subsample, identifying the preferred objective and prediction direction before any cross-modal training. Experiments on synthetic data, stereo-vision benchmarks, image–caption pairs, and real astrophysical data validate the predictions in the nonlinear regime, including the Neither regime where cross-modal training is actively harmful. Our framework lets practitioners diagnose their multimodal problem and choose the right objective before committing to training.

Approach

We study two objectives for multimodal representation learning. The first is cross-modal alignment (CA), which aligns paired samples in a shared latent space. The second is cross-modal prediction (CP), which predicts one modality from the other through an encoder–decoder factorization. Both are formalized below:


Let \(f_X : \mathbb{R}^{d_x} \to \mathbb{R}^k\) and \(f_Y : \mathbb{R}^{d_y} \to \mathbb{R}^k\) be encoders producing latent codes \(\mathbf{z}_x^{(i)} := f_X(\mathbf{x}_i)\) and \(\mathbf{z}_y^{(i)} := f_Y(\mathbf{y}_i)\) in a shared latent space of dimension \(k\). Let \(f_D : \mathbb{R}^k \to \mathbb{R}^{d_y}\) be a decoder. The two objectives are

\[ \begin{aligned} \text{(CA)} \quad &\min_{f_X,\, f_Y} \; \tfrac{1}{n} \textstyle\sum_i \| f_X(\mathbf{x}_i) - f_Y(\mathbf{y}_i) \|_2^2 \quad \text{s.t.} \quad \tfrac{1}{n} \textstyle\sum_i f_X(\mathbf{x}_i)\, f_Y(\mathbf{y}_i)^\top = \mathbf{I}_k, \\[6pt] \text{(CP)} \quad &\min_{f_X,\, f_D} \; \tfrac{1}{n} \textstyle\sum_i \| \mathbf{y}_i - f_D(f_X(\mathbf{x}_i)) \|_2^2. \end{aligned} \]

With linear encoders \(f_X(\mathbf{x}) = W\mathbf{x}\), \(f_Y(\mathbf{y}) = V\mathbf{y}\) for CA, and linear encoder \(f_X(\mathbf{x}) = E\mathbf{x}\) and decoder \(f_D(\mathbf{z}) = D\mathbf{z}\) for CP, both objectives admit closed-form solutions expressible through the SVDs of two modality-coupling matrices:

\[ C := \Sigma_{xx}^{-1/2}\, \Sigma_{xy}\, \Sigma_{yy}^{-1/2}, \qquad A := \Sigma_{yx}\, \Sigma_{xx}^{-1/2}, \]

where \(\Sigma_{xx}, \Sigma_{yy}, \Sigma_{xy}\) denote the (population) (cross-)covariances.
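As an illustration of the two coupling matrices, the following is a minimal NumPy sketch that estimates \(C\) and \(A\) from paired samples and inspects their singular values. It is not the authors' implementation: the helper names `inv_sqrt_psd` and `coupling_matrices`, the `eps` regularizer, and the toy data are assumptions made for this example, and empirical covariances stand in for the population ones.

```python
import numpy as np

def inv_sqrt_psd(S, eps=1e-8):
    """Inverse square root of a symmetric PSD matrix via eigendecomposition."""
    w, U = np.linalg.eigh(S)
    w = np.maximum(w, eps)  # eps is an assumed floor for near-singular covariances
    return (U / np.sqrt(w)) @ U.T  # U diag(w^{-1/2}) U^T

def coupling_matrices(X, Y, eps=1e-8):
    """Estimate C = Sxx^{-1/2} Sxy Syy^{-1/2} and A = Syx Sxx^{-1/2} from paired rows."""
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Sxx = Xc.T @ Xc / n
    Syy = Yc.T @ Yc / n
    Sxy = Xc.T @ Yc / n
    Sxx_is = inv_sqrt_psd(Sxx, eps)
    Syy_is = inv_sqrt_psd(Syy, eps)
    C = Sxx_is @ Sxy @ Syy_is   # two-sided whitening (used by CA)
    A = Sxy.T @ Sxx_is          # one-sided whitening (used by CP, x -> y)
    return C, A

# Toy data (assumed): one shared coordinate, three independent nuisance coordinates.
rng = np.random.default_rng(0)
n, d = 20000, 4
s = rng.standard_normal((n, 1))
X = np.hstack([s, rng.standard_normal((n, d - 1))])
Y = np.hstack([s + 0.5 * rng.standard_normal((n, 1)), rng.standard_normal((n, d - 1))])

C, A = coupling_matrices(X, Y)
sv_C = np.linalg.svd(C, compute_uv=False)  # canonical correlations
sv_A = np.linalg.svd(A, compute_uv=False)
```

In this toy setting the nuisance coordinates are uncorrelated across modalities, so each matrix should show one dominant singular value and the rest close to zero.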


To analyze recovery, we posit a signal-plus-noise model in which each modality decomposes into \(k\) shared signal coordinates and \(d - k\) modality-specific nuisance coordinates.

The singular values of \(C\) and \(A\) split into signal values \(\{\rho_i,\,\tau_i\}_{i=1}^{k}\) and nuisance values \(\{\nu_j,\,\xi_j\}_{j=1}^{d-k}\), given by

\[ \rho_i = \frac{\kappa_i^2}{\sqrt{(\kappa_i^2 + \gamma_i^x)(\kappa_i^2 + \gamma_i^y)}}, \quad \tau_i = \frac{\kappa_i^2}{\sqrt{\kappa_i^2 + \gamma_i^x}}, \quad \nu_j = \frac{\eta_j}{\sqrt{\tilde{\gamma}_j^x\, \tilde{\gamma}_j^y}}, \quad \xi_j = \frac{\eta_j}{\sqrt{\tilde{\gamma}_j^x}}. \]
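The closed forms above can be evaluated directly. The sketch below does so for a hypothetical parameter setting; the function name and the numbers are assumptions chosen for illustration. It also relies on one reading of the symbols, inferred from the structure of the formulas rather than stated in the text: \(\kappa_i^2\) as the shared signal variance, \(\gamma_i^x, \gamma_i^y\) as per-modality noise variances on the signal coordinates, \(\eta_j\) as the cross-modal nuisance covariance, and \(\tilde{\gamma}_j^x, \tilde{\gamma}_j^y\) as the nuisance variances.

```python
import numpy as np

def singular_values(kappa, gamma_x, gamma_y, eta, gt_x, gt_y):
    """Signal (rho, tau) and nuisance (nu, xi) singular values of C and A."""
    kappa2 = np.asarray(kappa, dtype=float) ** 2
    gamma_x = np.asarray(gamma_x, dtype=float)
    gamma_y = np.asarray(gamma_y, dtype=float)
    eta = np.asarray(eta, dtype=float)
    gt_x = np.asarray(gt_x, dtype=float)  # nuisance variances in modality x
    gt_y = np.asarray(gt_y, dtype=float)  # nuisance variances in modality y
    rho = kappa2 / np.sqrt((kappa2 + gamma_x) * (kappa2 + gamma_y))  # signal values of C
    tau = kappa2 / np.sqrt(kappa2 + gamma_x)                         # signal values of A
    nu = eta / np.sqrt(gt_x * gt_y)                                  # nuisance values of C
    xi = eta / np.sqrt(gt_x)                                         # nuisance values of A
    return rho, tau, nu, xi

# Hypothetical setting: two signal coordinates, one strongly correlated nuisance coordinate.
rho, tau, nu, xi = singular_values(
    kappa=[2.0, 2.0], gamma_x=[1.0, 1.0], gamma_y=[1.0, 1.0],
    eta=[0.9, 0.1], gt_x=[1.0, 1.0], gt_y=[1.0, 1.0],
)
```

Here the signal values of \(C\) are bounded by one because of the two-sided whitening, so a nuisance correlation of 0.9 exceeds them, while the signal values of \(A\) scale with signal strength and stay above the same nuisance.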

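CA (resp. CP) fully recovers the shared signal subspace whenever all of its signal singular values exceed all of its nuisance singular values. We call the ratio of the smallest signal singular value to the largest nuisance singular value the separation ratio of an objective, written \(\Delta_{\mathrm{CA}}\) for CA and \(\Delta_{\mathrm{CP}}\) for CP; recovery corresponds to a ratio above one. The two ratios define four recovery regimes (Both, CA only, CP only, and Neither), as described in the following table: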

Region CA recovers? CP recovers?
Both Yes Yes
CA only Yes No
CP only No Yes
Neither No No
Phase diagram with CA and CP recovery boundaries

Each method shines and fails under different scenarios: CA is symmetric and preferable when modality-specific noise is large or uncertain. CP is asymmetric, requires the correct source-target direction, and is preferable when the signal is strong and the modality-specific noise is weak.
Importantly, the Neither regime is characteristic of real-world problems with complementary modalities and low signal-to-noise ratio. This regime requires novel alignment methods beyond simple CA or CP.
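The regime assignment can be sketched in a few lines. This is an illustrative helper, not the authors' code: `separation_ratio` and `regime` are hypothetical names, the ratio is taken as the smallest signal singular value over the largest nuisance singular value, and the example arrays are made up, except for the final call, which uses the Kepler estimates reported further down this page.

```python
import numpy as np

def separation_ratio(signal_sv, nuisance_sv):
    """Smallest signal singular value divided by largest nuisance singular value."""
    return float(np.min(signal_sv) / np.max(nuisance_sv))

def regime(delta_ca, delta_cp):
    """Map the two separation ratios to one of the four recovery regimes."""
    ca_ok = delta_ca > 1.0
    cp_ok = delta_cp > 1.0
    if ca_ok and cp_ok:
        return "Both"
    if ca_ok:
        return "CA only"
    if cp_ok:
        return "CP only"
    return "Neither"

# Hypothetical singular values: nuisance is strongly correlated across modalities.
rho, nu = np.array([0.8, 0.8]), np.array([0.9, 0.1])      # spectrum of C
tau, xi = np.array([1.79, 1.79]), np.array([0.9, 0.1])    # spectrum of A

delta_ca = separation_ratio(rho, nu)
delta_cp = separation_ratio(tau, xi)
label = regime(delta_ca, delta_cp)

# Ratios reported for LAMOST x Kepler on this page.
kepler_label = regime(1.13, 2.22)
```

With these made-up values the strongly correlated nuisance pushes the CA ratio below one while the CP ratio stays above it, so the example lands in the CP only regime.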

Synthetic Experiments

We validate on dSprites and Shapes3D stereo-view datasets, sweeping jitter (de-alignment) and noise parameters to move across the phase diagram. CA (VICReg) and CP (MSE reconstruction) are trained from scratch; downstream accuracy is measured with a linear probe on the frozen encoder.

CA vs CP accuracy across dSprites and Shapes3D configurations

Linear probe accuracy vs. nuisance alignment \(\nu_{\max} = 1 - \sigma_{\text{jitter}}\). Left: Stereo-dSprites (3-class, grayscale, 100k samples). Right: Stereo-3DShapes (4-class, RGB, 100k samples). In both panels, color encodes the (weak) noise level, and the trade-off between the two methods is clearly visible: CA (solid, circles) peaks at moderate-to-low alignment and collapses at full alignment, whereas CP (dashed, squares) shows the opposite pattern. The crossover at \(\nu_{\max} \approx 0.8\) is consistent across datasets. The lower absolute ceilings in 3DShapes reflect the harder discrimination task.

dSprites UMAP embeddings

UMAP embeddings of the representations learned in the stereo-dSprites experiment (color = shape, intensity = position). From left to right: CA with aligned noise (\(\sigma_{\text{jitter}}=0\)), CA with misaligned noise (\(\sigma_{\text{jitter}}=0.5\)), CP with aligned noise, and CP with misaligned noise. All experiments use the same modality-specific noise (\(\sigma_{\text{noise}}=0.5\)). Each method succeeds exactly where the other fails, and when a method fails, its representation encodes the nuisance instead.

Real-World Experiments: COCO (Image–Caption)

COCO style-noise experiment results

Top-1 accuracy vs. image style-transform strength for the MS-COCO experiment. CP is asymmetric: predicting the image from text performs similarly to CA, whereas predicting text from the image performs much better. Both approaches converge to the same accuracy when image noise is high.

Real-World Experiments: Astrophysics (LAMOST × Kepler & TESS)

Astrophysical cross-modal results, mean ± std over 5 seeds. Best per row in bold (ties within seed std co-bolded). Kepler (Both, \(\hat{\Delta}_{\mathrm{CA}} = 1.13\), \(\hat{\Delta}_{\mathrm{CP}} = 2.22\)): at least one cross-modal method matches or beats the best unimodal baseline on every target; CP preserves LAMOST's ceiling where LAMOST dominates, and CA captures photometric signal where LAMOST is weak. TESS (Neither, \(\hat{\Delta}_{\mathrm{CA}}, \hat{\Delta}_{\mathrm{CP}} < 1\)): no cross-modal method beats LAMOST-only on any target. CPrev (photometry → spectra) fails on tasks where spectra carry the more direct signal (\(\log g\), binarity), but outperforms forward CP on age, where photometric rotation provides a more direct gyrochronological signal: the same source-quality principle applies in both directions.
Task CA CP CPrev LAMOST Photometry
LAMOST × Kepler — Both regime
Binarity (bal. acc.) 0.802±0.009 0.814±0.006 0.751±0.004 0.814 0.731
log g (R²) 0.956±0.003 0.976±0.001 0.639±0.004 0.977 0.542
Age (R²) 0.620±0.001 0.434±0.006 0.497±0.039 0.431 0.470
LAMOST × TESS — Neither regime
Binarity (bal. acc.) 0.756±0.022 0.763±0.011 0.626±0.010 0.779 0.604
log g (R²) 0.929±0.005 0.939±0.001 −0.312±0.001 0.942 −0.312
Age (R²) 0.431±0.029 0.396±0.064 −0.072±0.004 0.503 −0.037
Astrophysical phase diagnostics: CCA and CP spectra

Each panel shows the sorted singular values (gray bars, left axis) and per-component \(R^2\) against \(\log g\) (teal line, right axis) used by our phase estimation algorithm to classify components as signal (green shading) or nuisance (red shading). Dashed horizontal line: nuisance floor \(\max_j \hat{\nu}_j\) (CCA panels) or \(\max_j \hat{\xi}_j\) (\(A\)-SVD panels); \(\hat{\Delta} > 1\) iff every classified signal singular value exceeds the nuisance floor. Top: LAMOST × Kepler. Both decompositions have signal components above the nuisance floor (\(\hat{\Delta}_{\mathrm{CA}} = 1.13\), \(\hat{\Delta}_{\mathrm{CP}} = 2.22\); Both regime). Bottom: LAMOST × TESS. No CCA component predicts \(\log g\) above noise (\(R^2 \approx 0\) across all components, zero signal detected); the \(A\)-SVD has one candidate signal component, but it lies below the nuisance floor. Both ratios fall below one (Neither regime). The contrast between the two rows (same LAMOST encoder, same protocol, different photometric instrument) shows that instrument quality determines the signal–nuisance separation and hence the regime placement.
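The diagnostic described in this caption can be approximated as follows. This is a rough sketch under stated assumptions, not the authors' phase estimation algorithm: the function names, the \(R^2\) threshold of 0.1 used to call a component signal, the use of a univariate linear fit for the per-component \(R^2\), and the conventions for the edge cases (ratio 0 when no signal component is found, infinity when no nuisance component is found) are all choices made for this example.

```python
import numpy as np

def component_r2(scores, target):
    """R^2 of a univariate linear fit of target on each column of scores."""
    s = scores - scores.mean(axis=0)
    t = target - target.mean()
    num = (s * t[:, None]).sum(axis=0) ** 2
    den = (s ** 2).sum(axis=0) * (t ** 2).sum()
    return num / den  # squared Pearson correlation per component

def estimate_separation(sv, r2, r2_threshold=0.1):
    """Estimated separation ratio from singular values and per-component R^2."""
    sv = np.asarray(sv, dtype=float)
    is_signal = np.asarray(r2, dtype=float) > r2_threshold  # assumed decision rule
    if not is_signal.any():
        return 0.0           # assumed convention: zero signal detected
    if is_signal.all():
        return float("inf")  # assumed convention: no nuisance component
    nuisance_floor = sv[~is_signal].max()
    return float(sv[is_signal].min() / nuisance_floor)

# Toy labeled subsample (assumed): only the first component predicts the label.
rng = np.random.default_rng(0)
scores = rng.standard_normal((2000, 3))
target = scores[:, 0] + 0.1 * rng.standard_normal(2000)
r2 = component_r2(scores, target)

sv = [0.9, 0.5, 0.3]  # sorted singular values (made up)
delta_above = estimate_separation(sv, r2)           # signal component above the floor
delta_below = estimate_separation(sv, r2[::-1])     # signal component below the floor
delta_none = estimate_separation(sv, np.zeros(3))   # no component predicts the label
```

The three calls mirror the qualitative cases in the figure: a signal component above the nuisance floor, a candidate signal component below it, and no detected signal at all.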

Try it interactively in the Colab notebook.

Citation

@article{,
  title   = {When to Align, When to Predict? A Phase Diagram for Multimodal Self-Supervised Learning},
  author  = {Kamai, Ilay and Van Assel, Hugues and Regev, Aviv and Perets, Hagai B. and Balestriero, Randall},
  journal = {arXiv preprint},
  year    = {2026}
}