CROSS-MODAL VAE
Biological Prediction
Key Metrics
0.75
PEARSON CORRELATION
10,000
RNA FEATURES
25
ADT PROTEINS
64
LATENT DIMENSION
4
DISCRIMINATORS
2nd Place
COMPETITION RANK
OVERVIEW
Grand Valley's Machine Learning course (CIS 678) challenged students to build a variational autoencoder capable of predicting multimodal single-cell sequencing data — specifically, linking RNA gene expression profiles to ADT (Antibody-Derived Tag) protein expression. Derived from dissociated tissue samples such as blood, these single-cell measurements produce matrices capturing gene expression (10,000 RNA features) and surface protein abundance (25 ADT features). The core challenge: the RNA and ADT samples are not paired — they come from independent measurements, so the model must learn the underlying biological relationships between modalities without ever seeing matched observations.
Performance was evaluated by Pearson Correlation between predicted and actual ADT expression on a held-out test set of 5,000 cells, scored through a class-wide Kaggle competition. Anything below 0.75 was considered indistinguishable from noise given the inherent variability in single-cell data.
EXPLORATORY ANALYSIS
Initial analysis revealed extreme sparsity across both modalities. The RNA training data contained over 45.6 million zero entries, while ADT had 829 zeros across its 25 features. Histograms of non-zero values highlighted a key distributional divergence: the RNA distribution exhibited a relatively smooth log-normal decay, while ADT showed a sharper peak and steeper decline — indicating a more concentrated expression profile across fewer features.
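The zero-counting and non-zero histogram analysis above can be sketched as follows. This is a minimal illustration on synthetic zero-inflated counts, assuming dense NumPy arrays as stand-ins for the actual RNA and ADT matrices; the array names and toy distributions are not from the project's code.

```python
# Sparsity check for the two modalities, using synthetic zero-inflated
# counts as a stand-in for the real RNA (cells x 10,000) and ADT
# (cells x 25) matrices.
import numpy as np

rng = np.random.default_rng(0)
rna = rng.poisson(0.05, size=(1_000, 10_000)).astype(float)  # very sparse
adt = rng.poisson(3.0, size=(1_000, 25)).astype(float)       # far denser

def sparsity_report(x, name):
    zeros = int((x == 0).sum())
    frac = zeros / x.size
    print(f"{name}: {zeros:,} zeros ({frac:.1%} of entries)")
    return zeros, frac

rna_zeros, rna_frac = sparsity_report(rna, "RNA")
adt_zeros, adt_frac = sparsity_report(adt, "ADT")

# Log-transformed non-zero values are what the histograms compare:
# RNA shows a smooth log-normal decay, ADT a sharper, narrower peak.
rna_nonzero = np.log1p(rna[rna > 0])
adt_nonzero = np.log1p(adt[adt > 0])
```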
This divergence in distributional shape and scale underscored why a shared latent space must not only capture the core biological signal across modalities but also account for modality-specific sparsity patterns. The zero-inflation in both datasets reinforced the need for an architecture capable of learning robust, denoised representations rather than simply memorizing sparse input patterns.
ARCHITECTURE
The team developed a fully modular, object-oriented framework in Python using PyTorch Lightning, opting for structured Python scripts over notebook-style development for maintainability and scalability.
The architecture implements a dual-branch variational autoencoder with six core components:
- RNAEncoder (10,000 → 1,024 → 512 → 64): Compresses high-dimensional gene expression into a modality-specific embedding
- ADTEncoder (25 → 64 → 32 → 64): Maps the smaller protein feature space into the same embedding dimensionality
- SharedEncoder (64 → 1,024 → 512 → 256 → 128 → 128): Maps modality-specific embeddings into a unified latent space, outputting mean (μ) and variance (σ²) for reparameterization
- SharedDecoder: Mirrors the shared encoder, reconstructing from the latent space back to modality-specific embeddings
- RNADecoder and ADTDecoder: Reconstruct the original feature spaces from decoded embeddings
Reparameterization occurs in the shared latent space (z = μ + σ · ε), enabling gradient flow through the stochastic sampling step. Batch normalization, layer normalization, and dropout are applied selectively throughout the network. Kaiming initialization is used for all linear layers, and Gaussian noise is injected into inputs during training for a denoising effect.
TensorBoard served as the primary experiment tracking platform, providing real-time visualization of training metrics, latent space embeddings via PCA and UMAP projections, and parallel coordinates plots for hyperparameter comparison.
TRAINING STRATEGY
Training proceeded in three deliberate phases, each building on the prior checkpoint:
Phase 1 — Autoencoder Reconstruction: Each modality (RNA and ADT) was trained to reconstruct itself through the shared latent space, minimizing MSE reconstruction loss plus a KL divergence penalty: L = MSE(x, x̂) + β · KL(q(z|x) ‖ p(z)). The KL weight (β) was managed through cyclical logistic annealing using an open-source Annealer library, preventing latent space collapse by alternating between reconstruction focus and regularization. Integration score (negative mean Euclidean distance between RNA and ADT latent embeddings) and silhouette score were tracked alongside reconstruction metrics to monitor cross-modal alignment.
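The Phase 1 objective can be sketched as below. The logistic cyclical schedule approximates the behavior of the open-source Annealer library mentioned above; its exact shape, cycle length, and steepness here are assumptions.

```python
# Phase 1 loss: MSE reconstruction plus a KL term whose weight beta
# follows a cyclical logistic annealing schedule.
import math
import torch
import torch.nn.functional as F

def kl_divergence(mu, logvar):
    # KL(q(z|x) || N(0, I)), summed over latent dims, averaged over batch.
    return (-0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=1)).mean()

def cyclical_logistic_beta(step, cycle_len=1_000, beta_max=5e-5):
    # Logistic ramp from near 0 up to beta_max, restarting every cycle,
    # alternating reconstruction focus with KL regularization.
    t = (step % cycle_len) / cycle_len  # position in [0, 1)
    return beta_max / (1 + math.exp(-12 * (t - 0.5)))

def phase1_loss(x, x_hat, mu, logvar, step):
    recon = F.mse_loss(x_hat, x)
    beta = cyclical_logistic_beta(step)
    return recon + beta * kl_divergence(mu, logvar)
```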
Phase 2 — Cross-Modal Translation: Loading the best autoencoder checkpoint, the model was fine-tuned for cross-modal prediction — RNA→ADT and ADT→RNA translation. The same architecture was used, with the encoder for one modality feeding into the decoder for the other. This phase achieved the team's best Kaggle score of 0.75 Pearson Correlation.
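The Phase 2 routing — one modality's encoder feeding the other's decoder — can be sketched as follows. The module widths follow the architecture section, but each component is collapsed to a placeholder layer; the reparameterization step inside the shared path is omitted for brevity.

```python
# Cross-modal translation path: RNA encoder -> shared latent ->
# shared decoder -> ADT decoder.
import torch
import torch.nn as nn

rna_encoder = nn.Sequential(nn.Linear(10_000, 1_024), nn.ReLU(), nn.Linear(1_024, 64))
shared_encoder = nn.Linear(64, 64)   # stand-in for shared encoder + reparameterization
shared_decoder = nn.Linear(64, 64)   # stand-in for the mirrored shared decoder
adt_decoder = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 25))

def rna_to_adt(x_rna):
    # Encode RNA into the shared latent space, decode out through
    # the ADT branch: the RNA -> ADT prediction scored on Kaggle.
    z = shared_encoder(rna_encoder(x_rna))
    return adt_decoder(shared_decoder(z))

pred_adt = rna_to_adt(torch.randn(4, 10_000))
```

The reverse direction (ADT→RNA) swaps the branch roles over the same shared modules.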
Phase 3 — Adversarial Refinement: To push beyond the cross-modal plateau, the team introduced adversarial training with up to four discriminators: a latent discriminator (post-reparameterization), a pre-latent discriminator (pre-reparameterization on shared encoder output), an RNA output discriminator, and an ADT output discriminator. A Gradient Reversal Layer (GRL) reversed gradients during backpropagation, encouraging the encoders to produce modality-invariant latent representations. Managing the adversarial dynamics with four discriminators — each with its own loss, accuracy, GRL lambda, and cumulative training pressure — proved to be the project's most complex engineering challenge, requiring careful calibration of discriminator training frequency, weight balancing, and KL recalibration to prevent latent space collapse.
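The Gradient Reversal Layer at the heart of Phase 3 can be sketched in a few lines: identity on the forward pass, gradient negated and scaled by λ on the backward pass. The λ value below is illustrative, not a tuned setting from the project.

```python
# Gradient Reversal Layer: the discriminator trains normally on the
# forward pass, while reversed gradients push the encoder toward
# modality-invariant latent representations.
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)  # identity

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse and scale the gradient flowing back to the encoder.
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

z = torch.randn(4, 64, requires_grad=True)
grad_reverse(z, lam=0.5).sum().backward()
# z.grad now holds -0.5 everywhere: the unit gradient, reversed and scaled.
```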
HYPERPARAMETER TUNING
Hyperparameter optimization leveraged Meta's Ax platform (a wrapper around BoTorch) for multi-objective Bayesian optimization — a significant upgrade from the manual grid search used in prior projects.
The Gaussian Process-based optimization simultaneously maximized two objectives: ADT Pearson Correlation (reconstruction quality) and integration score (latent space alignment). This multi-objective approach was critical because maximizing one objective often degraded the other — a model with excellent reconstruction might maintain modality-separated clusters, while a well-integrated latent space might sacrifice reconstruction fidelity.
Correlation analysis of hyperparameters against objectives revealed that batch size showed a 0.65 correlation with ADT Pearson, while latent dimensionality drove integration scores with a 0.71 correlation. The Pareto frontier identified four optimal solutions balancing the trade-off between integration and correlation.
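Extracting that Pareto frontier amounts to keeping the non-dominated trials: a configuration survives if no other trial is at least as good on both objectives and strictly better on one. The sketch below uses synthetic (Pearson, integration) pairs, not the project's actual trial data.

```python
# Pareto-front extraction for two objectives, both maximized:
# ADT Pearson correlation and integration score.
import numpy as np

def pareto_front(points):
    """Return a boolean mask of non-dominated rows in an (n, 2) array."""
    n = len(points)
    keep = np.ones(n, dtype=bool)
    for i in range(n):
        # Trial i is dominated if some trial is >= on both objectives
        # and strictly > on at least one.
        dominated = np.all(points >= points[i], axis=1) & np.any(points > points[i], axis=1)
        if dominated.any():
            keep[i] = False
    return keep

# Synthetic trials: (pearson, integration score). Integration is a
# negative mean distance, so larger (closer to 0) is better.
trials = np.array([[0.70, -3.0], [0.75, -4.0], [0.72, -2.5], [0.68, -5.0]])
mask = pareto_front(trials)
```

In practice Ax returns the frontier directly; this standalone version just makes the dominance criterion explicit.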
However, quantitative metrics alone proved insufficient for model selection. Visual inspection of UMAP latent space embeddings was essential — integration scores could miss latent space collapse, while high Pearson scores sometimes masked poorly integrated clusters. The final model selection combined Pareto analysis with manual inspection of latent geometry.
Key hyperparameters for the final autoencoder: latent dimension = 64, batch size = 64, learning rate = 0.0005, cyclical logistic KL annealing over 12 epochs, with carefully tuned per-modality KL weights (RNA: 5e-5, ADT: 7.9e-6). Additional optimization strategies included AdamW with cosine annealing of the learning rate and L2 regularization (1e-8) for cross-modal and adversarial training.
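The optimizer setup above translates roughly to the following, assuming the L2 term was applied via AdamW's weight decay (AdamW's decoupled decay is a close but not identical stand-in for classical L2) and using a placeholder model and epoch horizon.

```python
# AdamW with cosine-annealed learning rate, matching the reported
# settings: lr = 5e-4, weight decay = 1e-8.
import torch

model = torch.nn.Linear(64, 64)  # placeholder for the full VAE
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=1e-8)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=30)

for epoch in range(3):   # abbreviated loop
    optimizer.step()     # real training step would go here
    scheduler.step()     # learning rate follows a cosine curve
```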
RESULTS
The final model achieved a 0.75 Pearson Correlation on the Kaggle leaderboard — surpassing the 0.73 target threshold and earning 2nd place in the class-wide competition.
The autoencoder phase produced an excellent UMAP embedding with distinct substructure while showing significant overlap between RNA and ADT modalities in the latent space — confirming the model's success in learning modality-agnostic representations. Training converged at epoch 28, with the best model selected based on a combination of ADT Pearson, integration score, and visual inspection of latent geometry.
The cross-modal phase built on this foundation to achieve accurate RNA→ADT translation. The team also achieved a peak Pearson of 0.7668 during experimentation, but this result came from a model that did not use the proper shared encoder/decoder structure or correct reparameterization placement — a configuration the team deliberately chose not to pursue, as it violated the foundational VAE architecture.
The adversarial extension with four discriminators showed promising convergence dynamics but required more training time than the competition timeline allowed. The single-discriminator GAN approach achieved stable training with discriminator accuracy converging to an appropriate equilibrium, while the WGAN-GP variant provided enhanced stability at the cost of significantly higher computational expense.
FUTURE WORK
Several directions were identified for further development. The adversarial training framework, while functional, could benefit from extended training time and more sophisticated discriminator scheduling — the four-discriminator configuration showed clear potential but was constrained by the competition deadline. The WGAN-GP approach, despite its computational cost, offered superior training stability and could yield better results with sufficient GPU resources.
Additional improvements could include attention mechanisms in the shared encoder to better capture cross-modal feature interactions, curriculum learning strategies that gradually increase cross-modal task difficulty, and contrastive learning objectives to improve latent space structure beyond what adversarial training alone achieves.
Details
Team
Steve Meadows, Lauryn Davis, Brooke Walters
Course
CIS 678 — Machine Learning
Timeline
Winter 2025