Foundation v5.10

Multi-modal VAE

Compresses a patient's full molecular profile into a biological fingerprint

Architecture
Hierarchical Bayesian Deep VAE with Additive Decoder Architecture

Prolif Correlation: 0.96
z_prolif correlation with proliferation markers
Target: > 0.90

Orthogonality (R²): < 0.001
Statistical orthogonality: z_ctx_clean to proliferation
Target: < 0.10

Epigenetic Variance: 0.607
Recovered z_meth variance. Enables detection of resistance mechanisms invisible to standard models.
Target: > 0.50

Reconstruction Loss: 1.155
Combined NB + MSE + BCE likelihood
Target: < 1.5

Overview

A neural network that compresses a patient's full molecular profile (thousands of gene expression values, mutations, copy number changes, and methylation patterns) into a compact mathematical fingerprint: 328 numbers that capture the essential biology. Similar tumors end up close together in this space, and dissimilar tumors end up far apart. The model handles missing data gracefully: if only gene expression is available, it still produces a reliable fingerprint, with wider uncertainty.

Latent Space Structure

Total: 328 dims
z_prolif
1 dim
0.3%

Proliferation rate latent. Supervised on MKI67/PCNA/TOP2A/BUB1/PLK1. Target r > 0.90.

z_pathway
200 dims
61.0%

50 MSigDB Hallmark pathways × 4 dimensions each. Biologically interpretable pathway activities.

z_ctx_clean
31 dims
9.5%

Proliferation-free biological context. Residualized post-hoc against proliferation to remove leakage (measured R² to proliferation < 0.001).

z_residual
16 dims
4.9%

Captures variation not in curated pathways. Non-pathway-specific biological signal.

z_meth
48 dims
14.6%

Epigenetic patterns from methylation encoder. Correlates with differentiation state.

z_cnv_spatial
32 dims
9.8%

Chromosomal instability patterns from 1D CNN on copy number data.
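
The group sizes listed above add up to 328 (1 + 200 + 31 + 16 + 48 + 32). As a minimal sketch of how a downstream consumer might slice the vector into named groups, the snippet below assumes the groups are concatenated in the order they are listed here. That ordering, the `LATENT_GROUPS` table, and the `split_latent` helper are illustrative assumptions, not part of the documented interface.

```python
import numpy as np

# Group sizes as listed in the latent space table. The ordering inside the
# 328-d vector is an assumption for illustration; the source does not state it.
LATENT_GROUPS = {
    "z_prolif": 1,
    "z_pathway": 200,      # 50 Hallmark pathways x 4 dims each
    "z_ctx_clean": 31,
    "z_residual": 16,
    "z_meth": 48,
    "z_cnv_spatial": 32,
}


def split_latent(z):
    """Split a (..., 328) array into a dict of named sub-vectors (hypothetical helper)."""
    total = sum(LATENT_GROUPS.values())
    if z.shape[-1] != total:
        raise ValueError(f"expected last dim {total}, got {z.shape[-1]}")
    parts, start = {}, 0
    for name, size in LATENT_GROUPS.items():
        parts[name] = z[..., start:start + size]
        start += size
    return parts


z = np.zeros((4, 328))                           # toy batch of 4 latent vectors
parts = split_latent(z)
pathway = parts["z_pathway"].reshape(4, 50, 4)   # one 4-d block per pathway
```

Reshaping `z_pathway` to `(batch, 50, 4)` makes the "50 pathways × 4 dimensions" layout explicit, again assuming the four dimensions of each pathway are stored contiguously.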

Inputs

4 inputs
RNA-seq: 2,579 genes

Log1p-transformed TME Boost gene list + proliferation markers

Source: TCGA/GDC

CNV: 1,886 genes

Continuous copy number values, z-score standardized

Source: TCGA/GDC

DNA Mutations: 500 genes

Binary mutation indicators with zero-inflated likelihood

Source: TCGA/GDC

Methylation: 1,000 probes

Beta values [0,1], standardized

Source: TCGA/GDC
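
The per-modality transforms described above can be sketched as follows. This is a toy illustration on random data, not the project's preprocessing code: the `zscore` helper, the dictionary keys, and the choice to standardize per feature across samples are assumptions.

```python
import numpy as np


def zscore(x, eps=1e-8):
    """Standardize each feature (column) to zero mean and unit variance."""
    return (x - x.mean(axis=0)) / (x.std(axis=0) + eps)


rng = np.random.default_rng(0)
n = 8  # toy sample count

# Random stand-ins with the feature counts listed above.
rna_counts = rng.poisson(5.0, size=(n, 2579)).astype(float)
cnv = rng.normal(size=(n, 1886))
mutations = (rng.random((n, 500)) < 0.05).astype(float)
meth_beta = rng.random((n, 1000))  # beta values in [0, 1]

inputs = {
    "rna": np.log1p(rna_counts),   # log1p-transformed expression
    "cnv": zscore(cnv),            # continuous copy number, z-scored
    "mut": mutations,              # binary mutation indicators
    "meth": zscore(meth_beta),     # beta values, standardized
}
```

In a real pipeline the standardization statistics would normally be fit on the training split only and reused for validation and inference.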

Outputs

2 outputs
z_for_ode_v1: 328 dims

Canonical latent tensor for downstream integration

Consumers: driver-gat, hypernet, txresponse, dsn

Reconstructions: Per-modality

Reconstructed omics for validation

Mathematical Formulation

ELBO

Evidence Lower Bound with β-annealing
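
The page gives no formula for this objective. One common way to write a β-weighted ELBO for a multi-modal VAE is shown below. The symbols are assumptions introduced here: $\mathcal{M}$ is the set of observed modalities, $x_m$ the data for modality $m$, $q_\phi$ the encoder, $p_\theta$ the decoder, and $\beta_t$ the KL weight at epoch $t$.

```latex
\mathcal{L}_t(x)
  = \sum_{m \in \mathcal{M}}
      \mathbb{E}_{q_\phi(z \mid x)}\!\left[ \log p_\theta(x_m \mid z) \right]
  \;-\; \beta_t \, \mathrm{KL}\!\left( q_\phi(z \mid x) \,\|\, p(z) \right),
\qquad
\beta_t = \min\!\left(1, \tfrac{t}{50}\right)
```

Here $\beta_t$ follows the linear 0→1 warmup over 50 epochs listed under Hyperparameters. The reconstruction term would use the per-modality likelihoods named in the metrics (NB, MSE, BCE); which likelihood is paired with which modality is not stated on this page.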

PoE Fusion

Product-of-Experts for multi-modal fusion
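
For Gaussian experts, a product of experts reduces to precision-weighted averaging, which is why a missing modality widens the posterior instead of breaking the model. The sketch below is a generic illustration of that idea, not the model's actual fusion code. The `poe_fuse` helper and the inclusion of a standard-normal prior expert are assumptions.

```python
import numpy as np


def poe_fuse(mus, logvars):
    """Fuse diagonal-Gaussian experts by precision weighting.

    A standard-normal prior expert N(0, I) is always included (an assumption
    of this sketch), so the result is defined even with a single modality.
    """
    shape = mus[0].shape
    precisions = [np.ones(shape)] + [np.exp(-lv) for lv in logvars]
    means = [np.zeros(shape)] + list(mus)
    total_precision = sum(precisions)
    mu = sum(m * p for m, p in zip(means, precisions)) / total_precision
    var = 1.0 / total_precision
    return mu, var


# Two toy experts with unit variance (logvar = 0).
mu_rna, lv_rna = np.array([1.0, 1.0]), np.zeros(2)
mu_meth, lv_meth = np.array([3.0, 3.0]), np.zeros(2)

mu_all, var_all = poe_fuse([mu_rna, mu_meth], [lv_rna, lv_meth])
mu_rna_only, var_rna_only = poe_fuse([mu_rna], [lv_rna])
```

With both experts present the fused variance is 1/3 per dimension. With RNA alone it is 1/2, so dropping a modality yields a wider posterior, matching the "wider uncertainty" behavior described in the Overview.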

Proliferation Loss

Correlation supervision with proliferation markers
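
One plausible form of correlation supervision is to minimize 1 − r between z_prolif and a proliferation score built from the marker genes. The NumPy sketch below only illustrates the arithmetic. Averaging the five markers into one score, and the `pearson_r` and `proliferation_loss` names, are assumptions. A training implementation would use a differentiable framework.

```python
import numpy as np


def pearson_r(a, b, eps=1e-8):
    """Pearson correlation between two 1-D arrays."""
    a = a - a.mean()
    b = b - b.mean()
    return float((a * b).sum() / (np.sqrt((a * a).sum() * (b * b).sum()) + eps))


def proliferation_loss(z_prolif, marker_expr):
    """1 - r between z_prolif and the mean marker expression (hypothetical form).

    marker_expr: (n_samples, n_markers) log-expression of the marker genes,
    e.g. MKI67, PCNA, TOP2A, BUB1, PLK1.
    """
    score = marker_expr.mean(axis=1)
    return 1.0 - pearson_r(z_prolif, score)


rng = np.random.default_rng(0)
markers = rng.normal(size=(64, 5))          # toy expression for 5 markers
score = markers.mean(axis=1)

loss_aligned = proliferation_loss(2.0 * score + 1.0, markers)  # perfectly correlated
loss_flipped = proliferation_loss(-score, markers)             # perfectly anti-correlated
```

The loss is near 0 when z_prolif tracks the marker score and near 2 when it is anti-correlated, so minimizing it pushes r toward the stated target of > 0.90.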

Key Features

  • Product-of-Experts (PoE) fusion for graceful missing modality handling
  • Additive decoder architecture for interpretable reconstruction
  • Pathway-guided factorization using MSigDB Hallmark gene sets
  • Supervised proliferation encoding with marker correlation loss
  • Post-hoc residualization for clean context separation
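
Post-hoc residualization is commonly done by regressing each context dimension on the proliferation latent and keeping the residuals. The sketch below assumes a linear fit with an intercept, which is not confirmed by this page. The `residualize` and `r_squared` helpers and the synthetic data are illustrative.

```python
import numpy as np


def residualize(z_ctx, z_prolif):
    """Remove the part of each z_ctx dimension linearly predictable from z_prolif."""
    X = np.column_stack([np.ones(len(z_prolif)), z_prolif])  # intercept + prolif
    coef, *_ = np.linalg.lstsq(X, z_ctx, rcond=None)
    return z_ctx - X @ coef


def r_squared(y, x):
    """R^2 of a univariate linear fit of y on x (squared Pearson correlation)."""
    return float(np.corrcoef(x, y)[0, 1] ** 2)


rng = np.random.default_rng(0)
z_prolif = rng.normal(size=500)
# Synthetic context latents that deliberately leak proliferation signal.
z_ctx = rng.normal(size=(500, 31)) + 0.8 * z_prolif[:, None]

z_ctx_clean = residualize(z_ctx, z_prolif)
r2_before = max(r_squared(z_ctx[:, j], z_prolif) for j in range(31))
r2_after = max(r_squared(z_ctx_clean[:, j], z_prolif) for j in range(31))
```

On the samples used for the fit, the residuals are uncorrelated with z_prolif up to floating-point error. This removes only linear dependence; any nonlinear relationship, or leakage on unseen samples, would need to be checked separately.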

Key Innovations

  1. Structured latent space with biological interpretability
  2. Statistical orthogonality (R² < 0.001) between subspaces
  3. Solved latent collapse: recovered epigenetic variance (0.607 vs ~0)
  4. Multi-modal fusion robust to 40%+ missing data

Hyperparameters

Learning Rate: 1e-3
Batch Size: 128
β Schedule: Linear warmup 0→1 over 50 epochs
Free Bits: 0.5 per latent group
Gradient Clip: 1.0
Optimizer: AdamW (weight_decay=1e-4)
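
The β schedule and free-bits settings interact in the KL term, so a small sketch may help. It assumes a standard-normal prior, that the free-bits floor is applied to the batch-averaged KL of each latent group, and that 0.5 is measured in nats. The helper names and the two toy groups are hypothetical.

```python
import numpy as np


def beta_at(epoch, warmup_epochs=50):
    """Linear warmup of the KL weight from 0 to 1 over `warmup_epochs`."""
    return min(1.0, epoch / warmup_epochs)


def gaussian_kl(mu, logvar):
    """Per-dimension KL( N(mu, exp(logvar)) || N(0, 1) )."""
    return 0.5 * (np.exp(logvar) + mu ** 2 - 1.0 - logvar)


def free_bits_kl(groups, free_bits=0.5):
    """Sum of per-group KLs, each floored at `free_bits`.

    groups: dict mapping group name -> (mu, logvar), each of shape (batch, dims).
    """
    total = 0.0
    for mu, logvar in groups.values():
        kl_group = float(gaussian_kl(mu, logvar).mean(axis=0).sum())
        total += max(free_bits, kl_group)
    return total


groups = {
    # Posterior equals the prior: KL = 0, so the floor of 0.5 applies.
    "collapsed": (np.zeros((4, 3)), np.zeros((4, 3))),
    # Mean shifted by 1 in two dims: KL = 0.5 per dim = 1.0 total.
    "active": (np.ones((4, 2)), np.zeros((4, 2))),
}

kl_term = free_bits_kl(groups)          # 0.5 + 1.0
weighted_kl = beta_at(25) * kl_term     # halfway through warmup
```

Flooring the KL per group means the optimizer gains nothing by pushing a group's KL below 0.5, which is one common way to discourage a group from collapsing onto the prior.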

Training Details

Trained on 9,415 TCGA samples across 33 cancer types. Training is two-phase: Phase 1 freezes z_rna_private, and Phase 2 unfreezes it. GroupWiseKL and CrossReconstructionLoss are used to prevent latent collapse. Early stopping monitors the validation ELBO.

Pipeline Position