NEURAL NETWORK
ADT Protein Prediction
Key Metrics
- Pearson correlation: 0.858
- Improvement vs. baseline (MLR): +7.0%
- ADT proteins: 25
- RNA features: 639
- Architectures tested: 11
- Competition rank: 1st place
OVERVIEW
Grand Valley's Machine Learning course (CIS 678) challenged students to predict Antibody-Derived Tag (ADT) protein expression from RNA sequencing data — a core problem in computational biology where direct protein measurement is expensive but RNA is readily available. The training dataset contained 4,000 cell observations across 639 RNA gene expression features, with the goal of predicting expression levels for 25 ADT surface proteins. Model performance was evaluated by Pearson Correlation on a held-out test set of 1,000 cells, scored through a class-wide Kaggle competition.
The problem is fundamentally one of capturing complex, non-linear relationships between gene expression and protein abundance — relationships that traditional linear methods can approximate but not fully exploit. This motivated the team to build a custom feedforward neural network from scratch in R using matrix algebra, progressively layering in modern deep learning techniques to push beyond the baseline.
EXPLORATORY ANALYSIS
Principal Component Analysis on the 639 RNA features revealed that PC1 alone explains ~64% of the total variance, with PC2 contributing only 3.4%. This extreme concentration suggests that many RNA features share common expression patterns, with a dominant latent structure driving the data.
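As a point of reference, the variance check is a few lines with base R's prcomp. The matrix name `rna` is a hypothetical stand-in for the 4,000 × 639 training matrix, and centering/scaling before the decomposition is an assumption, not a documented detail of the team's analysis:

```r
# PCA on the 639 RNA features (cells in rows); center and scale first
pca <- prcomp(rna, center = TRUE, scale. = TRUE)

# Proportion of total variance explained by each principal component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained[1:5], 3)      # PC1 dominates; the tail drops off sharply

# Cumulative variance, useful if a PCA-based reduction is pursued later
cumsum(var_explained)[1:10]
```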
The dense clustering observed in the PCA scatter plot confirmed this shared structure, while the steep variance dropoff implied that dimensionality reduction could compress the feature space without significant information loss. However, given the strong results achieved through network tuning alone, the team ultimately did not pursue PCA-based feature reduction — the neural network proved capable of learning its own effective internal representations from the raw 639 features.
APPROACH
The project followed a deliberate progression from traditional statistics to deep learning, using each stage to establish baselines and build understanding.
Baseline: Multiple Linear Regression — The initial approach implemented OLS regression via matrix algebra (solving B̂ = (X'X)⁻¹X'Y) to predict all 25 ADT proteins simultaneously. This baseline achieved a 0.802 Pearson Correlation — a strong starting point that validated the relationship between RNA and protein expression, but one that could not capture the non-linear dynamics inherent in the data.
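For reference, the whole baseline fits in a few lines of matrix algebra. `X`, `Y`, `X_test`, and `Y_test` are hypothetical names for the training and held-out matrices, and solving the normal equations with solve() stands in for forming the explicit inverse:

```r
# Design matrix with intercept; solve the normal equations (X'X) B = X'Y
# for all 25 ADT targets at once rather than fitting 25 separate models.
X1    <- cbind(1, X)                             # 4000 x 640
B_hat <- solve(t(X1) %*% X1, t(X1) %*% Y)        # 640 x 25 coefficient matrix

# Held-out predictions and mean per-protein Pearson correlation
Y_hat <- cbind(1, X_test) %*% B_hat
mean(sapply(seq_len(ncol(Y)), function(j) cor(Y_hat[, j], Y_test[, j])))
```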
Custom Neural Network (From Scratch) — Rather than importing a framework, the team built a fully connected feedforward neural network from the ground up in R using matrix algebra. This included hand-coded implementations of forward propagation, backpropagation, and gradient descent. The function-oriented codebase was later refactored into an object-oriented paradigm using R's R6 class system for cleaner modularity.
Key implementations built from scratch (a few of these are sketched below):
- Custom activation functions (Swish, ReLU, Softplus, Sigmoid) with hand-derived gradients
- ADAM optimizer with bias-corrected first and second moment estimates
- Dropout regularization with inverted scaling
- Mini-batch gradient descent with configurable batch sizes
- L2 regularization with weight decay
- Gradient clipping to prevent exploding gradients
- Composite early stopping criteria combining loss improvement and gradient norm stability
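A minimal sketch of a few of those pieces, written for illustration rather than copied from the team's R6 classes:

```r
# Illustrative from-scratch pieces (not the team's exact R6 code)

sigmoid <- function(x) 1 / (1 + exp(-x))

# Swish activation f(x) = x * sigmoid(x) and its hand-derived gradient
swish      <- function(x) x * sigmoid(x)
swish_grad <- function(x) { s <- sigmoid(x); s + x * s * (1 - s) }

# Inverted dropout: zero units with probability p, scale survivors by 1/(1-p)
dropout_forward <- function(a, p = 0.30) {
  mask <- matrix(rbinom(length(a), 1, 1 - p), nrow(a), ncol(a))
  a * mask / (1 - p)
}

# One bias-corrected ADAM update for a weight matrix W with gradient g;
# state carries the running first (m) and second (v) moments and step count t
adam_update <- function(W, g, state, lr = 1e-3,
                        beta1 = 0.9, beta2 = 0.999, eps = 1e-8) {
  state$m <- beta1 * state$m + (1 - beta1) * g
  state$v <- beta2 * state$v + (1 - beta2) * g^2
  m_hat   <- state$m / (1 - beta1^state$t)
  v_hat   <- state$v / (1 - beta2^state$t)
  state$W <- W - lr * m_hat / (sqrt(v_hat) + eps)
  state
}
```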
Torch Integration — With a deep understanding of the mechanics, the team transitioned to R's Torch library (analogous to PyTorch) to leverage GPU-accelerated training, built-in batch normalization, and streamlined optimizer APIs — enabling the scale of experimentation needed for systematic hyperparameter tuning.
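A stripped-down version of that torch workflow is sketched below, with random tensors standing in for the real data, mean squared error as an assumed loss, and a throwaway placeholder model (the tuned architecture is defined in the next section):

```r
library(torch)

# Random tensors as stand-ins for the real (scaled) matrices:
# 4,000 cells x 639 RNA features and 4,000 x 25 ADT targets
x_train <- torch_randn(4000, 639)
y_train <- torch_randn(4000, 25)

device <- if (cuda_is_available()) torch_device("cuda") else torch_device("cpu")
dl     <- dataloader(tensor_dataset(x_train, y_train), batch_size = 48, shuffle = TRUE)

# Placeholder model; the selected architecture appears in the next section
model <- nn_sequential(nn_linear(639, 128), nn_relu(), nn_linear(128, 25))$to(device = device)
opt   <- optim_adam(model$parameters, lr = 0.001)

for (epoch in 1:50) {                 # epoch count chosen arbitrarily for the sketch
  model$train()
  coro::loop(for (b in dl) {
    opt$zero_grad()
    loss <- nnf_mse_loss(model(b[[1]]$to(device = device)), b[[2]]$to(device = device))
    loss$backward()
    opt$step()
  })
}
```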
ARCHITECTURE
The team tested 11 network architectures ranging from shallow single-hidden-layer designs (639→128→25) to deep six-hidden-layer configurations (639→1024→512→256→128→64→32→25), evaluated across two learning rates using a custom composite score that weighted R², Pearson correlation, cosine similarity, validation loss, and the generalization gap.
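The exact weights in that composite score are not reproduced here; a hypothetical scoring function in the same spirit (rewarding fit metrics, penalizing validation loss and the training-validation gap) might look like:

```r
# Cosine similarity between predicted and observed expression vectors
cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Hypothetical composite score (higher is better); the weights are illustrative,
# not the values the team actually used.
composite_score <- function(r2, pearson, cosine, val_loss, gen_gap,
                            w = c(r2 = 0.2, pearson = 0.3, cosine = 0.1,
                                  val_loss = 0.2, gen_gap = 0.2)) {
  unname(w["r2"] * r2 + w["pearson"] * pearson + w["cosine"] * cosine -
           w["val_loss"] * val_loss - w["gen_gap"] * gen_gap)
}
```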
A key insight emerged: deeper is not always better. While the deepest networks occasionally produced the highest individual scores, they exhibited large generalization gaps between training and validation loss — a hallmark of overfitting. The moderately sized architectures, particularly 639→512→256→128→25 (four layers) and 639→512→256→128→64→25 (five layers), consistently dominated across metrics.
The final architecture selected was the four-layer 639→512→256→128→25 network with ReLU activation across all hidden layers. This design balanced capacity (enough neurons to capture complex RNA-protein relationships) with generalizability (avoiding the noise memorization seen in deeper configurations).
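Sketched in torch, the selected network might look like the module below. The dropout rate (0.30, taken from the later batch-size experiments) and the placement of batch normalization before each ReLU are assumptions about ordering, not a verbatim copy of the team's module:

```r
library(torch)

adt_net <- nn_module(
  "ADTNet",
  initialize = function(p_drop = 0.30) {
    self$fc1  <- nn_linear(639, 512); self$bn1 <- nn_batch_norm1d(512)
    self$fc2  <- nn_linear(512, 256); self$bn2 <- nn_batch_norm1d(256)
    self$fc3  <- nn_linear(256, 128); self$bn3 <- nn_batch_norm1d(128)
    self$out  <- nn_linear(128, 25)
    self$drop <- nn_dropout(p = p_drop)
  },
  forward = function(x) {
    x <- self$drop(nnf_relu(self$bn1(self$fc1(x))))
    x <- self$drop(nnf_relu(self$bn2(self$fc2(x))))
    x <- self$drop(nnf_relu(self$bn3(self$fc3(x))))
    self$out(x)
  }
)

device    <- if (cuda_is_available()) torch_device("cuda") else torch_device("cpu")
model     <- adt_net()$to(device = device)
optimizer <- optim_adam(model$parameters, lr = 0.001)
```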
HYPERPARAMETER TUNING
Hyperparameter optimization followed a systematic, research-driven schedule — tuning one dimension at a time, carrying forward the best settings, and visualizing results at every stage with interactive Plotly plots and heatmaps.
Learning Rate — Tested across a logarithmic range. Learning rates of 0.01 and 0.001 demonstrated the strongest performance; smaller rates prolonged training without meaningful gains.
Dropout — Exhaustive grid search across all 11 architectures with dropout rates from 0 to 0.50 in 0.05 increments, at both learning rates. Heatmap analysis revealed that mid-range dropout (0.20–0.40) benefited moderately sized networks, while shallow networks were highly sensitive to low rates and deep networks degraded sharply at high rates. Dropout lifted validation Pearson by 0.02–0.05 across the best architectures.
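The scaffolding for that sweep is a straightforward cross of settings; `train_eval()` below is a hypothetical placeholder for a function that trains one configuration and returns its validation Pearson:

```r
# Placeholder: train one configuration and return validation Pearson correlation
train_eval <- function(arch, dropout, lr, batch_size = 64) NA_real_

grid <- expand.grid(
  arch    = seq_len(11),               # index into the 11 candidate architectures
  dropout = seq(0, 0.50, by = 0.05),
  lr      = c(0.01, 0.001)
)
grid$val_pearson <- mapply(train_eval, grid$arch, grid$dropout, grid$lr)
# Results were then examined as dropout-by-architecture heatmaps at each learning rate.
```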
Batch Size — Seven batch sizes (32, 48, 64, 82, 120, 140, 180) were tested across the top architectures and dropout configurations. The best result — 0.872 Pearson — was achieved with the 512→256→128 architecture at LR=0.001, dropout=0.30, batch size=48.
Activation Functions — Compared ReLU, Sigmoid, Softplus, and Swish across layers. ReLU performed most consistently and was adopted as the primary activation.
Normalization — Batch normalization outperformed layer normalization across all tracked metrics and was retained for subsequent experiments.
L2 Regularization — Testing across a range from 0 to 1e-2 showed that while L2 reduced the generalization gap, it came at the cost of significantly degraded R² and Pearson scores. The team opted to rely on dropout as the primary regularizer.
ADAM Tuning — Default β₁=0.9 and β₂=0.999 proved near-optimal. Slight adjustments to β₂ yielded marginal improvements insufficient to justify changing defaults.
Bayesian Optimization — The ParBayesianOptimization package automated a 200-iteration search across learning rate (0.0001–0.01), dropout (0–0.5), batch size (32–256), and architecture. The Gaussian process identified the global optimum at iteration 51: LR=0.0013, dropout=0.19, batch size=140, architecture 639→512→256→128→25. Crucially, the Bayesian analysis confirmed that the shallower network more consistently achieved low loss — validating the team's manual tuning conclusions.
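A sketch of that search with ParBayesianOptimization, reusing the hypothetical train_eval() helper from the grid-search sketch; encoding the architecture as an integer index and the number of initial points are assumptions:

```r
library(ParBayesianOptimization)

# bayesOpt maximizes the returned Score (validation Pearson here)
score_fn <- function(lr, dropout, batch_size, arch_id) {
  list(Score = train_eval(arch = arch_id, dropout = dropout,
                          lr = lr, batch_size = batch_size))
}

opt <- bayesOpt(
  FUN        = score_fn,
  bounds     = list(lr         = c(1e-4, 1e-2),
                    dropout    = c(0, 0.5),
                    batch_size = c(32L, 256L),
                    arch_id    = c(1L, 11L)),
  initPoints = 10,        # assumed; sets the initial random sample
  iters.n    = 200        # the 200-iteration search described above
)

getBestPars(opt)          # global optimum (iteration 51 in the team's run)
getLocalOptimums(opt)     # promising local optima, e.g. for a future ensemble
```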
RESULTS
The final model achieved a 0.858 Pearson Correlation — a 7% improvement over the baseline MLR model and enough to earn 1st place in the class-wide Kaggle competition. The progression from baseline to final model illustrates the cumulative impact of each optimization:
- Baseline MLR (matrix algebra OLS): Pearson = 0.802
- Custom neural network (no regularization): Pearson ≈ 0.84
- + Dropout regularization: Pearson = 0.869
- + Batch size tuning: Pearson = 0.872
- + Batch normalization + ADAM: Pearson = 0.858 (Kaggle, best generalization)
An important finding was that the highest validation Pearson (~0.872) did not always translate to the best Kaggle score. The team discovered that minimizing the generalization gap between training and validation loss was a stronger predictor of Kaggle performance than maximizing any single validation metric. This insight shifted the optimization strategy toward configurations that balanced accuracy with generalizability.
The Bayesian optimization confirmed that the shallower 639→512→256→128→25 architecture consistently outperformed the deeper variant in loss distribution, even though the deeper network occasionally produced individual peak scores. This validated the principle that model simplicity, when paired with careful tuning, outperforms raw capacity.
FUTURE WORK
Several directions were identified for further improvement. PCA-based dimensionality reduction could compress the 639 RNA features into a lower-dimensional space before feeding into the network, potentially reducing noise and training time. The Bayesian optimization's getLocalOptimums() function could identify multiple promising hyperparameter configurations for ensemble learning — combining the strengths of several well-tuned models rather than relying on a single global optimum.
The composite scoring function used during architecture evaluation could be refined by assigning greater weight to the training-validation loss gap, which proved to be the strongest predictor of generalization performance. Additionally, more sophisticated cross-validation strategies (such as k-fold) could replace the fixed 90/10 split to provide more robust performance estimates across the hyperparameter search.
Tech Stack
R, torch for R, R6, ParBayesianOptimization, Plotly
Details
Team: Steve Meadows, Lauryn Davis, Brooke Walters
Course: CIS 678 — Machine Learning
Timeline: Fall 2024