WINE AI
Predicting Tasting Notes from Climate
Key Metrics
0.01873
FINAL MAE
10,000+
WINES ANALYZED
64
GRAPE VARIETALS
1,416
TASTING KEYWORDS
33→21
FEATURE REDUCTION
2nd Place
COMPETITION RANK
OVERVIEW
Grand Valley's Machine Learning course (CIS 678) challenged students to predict the probability of specific tasting note keywords appearing in a wine review, using only climate data and grape varietal as inputs. The dataset comprised over 10,000 wines across 64 grape varietals, each associated with daily climate data from the 2022 vintage year — minimum and maximum temperatures, rainfall, and sunshine duration spanning March 1 to November 1 (246 days × 4 modalities = 984 raw climate features). The target variable was a matrix of 1,416 tasting note keyword probabilities derived from smoothed word count data.
By modeling these relationships, the project explores how environmental factors influence the language used to describe wine — offering insights into how climate conditions shape sensory perception and market reception. Performance was evaluated by Mean Absolute Error (MAE) on a held-out test set of 1,001 wines, scored through a class-wide Kaggle competition.
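For concreteness, a minimal sketch of the evaluation setup under the shapes described above, using synthetic data and a naive mean baseline (all values here are illustrative, not the project's actual data):

```python
import numpy as np

# Toy stand-in for the held-out evaluation: 1,001 wines x 1,416 keyword probabilities.
rng = np.random.default_rng(0)
y_true = rng.random((1001, 1416)) * (rng.random((1001, 1416)) < 0.2)  # sparse ground truth
y_pred = np.full_like(y_true, y_true.mean())                          # naive mean-prediction baseline

# Mean Absolute Error averaged over every (wine, keyword) cell.
mae = np.mean(np.abs(y_true - y_pred))
print(f"MAE: {mae:.5f}")
```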
EXPLORATORY ANALYSIS
Distribution analysis of the four climate modalities revealed distinct statistical profiles. Maximum and minimum temperatures followed approximately normal distributions centered around seasonal norms, while rainfall was heavily right-skewed with many zero-rain days, and sunshine duration exhibited a bimodal pattern reflecting cloud-cover variability. These distributional differences motivated modality-specific normalization: standard scaling for temperature features, log-transformation for rainfall (to compress the heavy tail), and log + standard scaling for sunshine.
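A minimal sketch of this modality-specific preprocessing, assuming wide-format daily columns named by modality prefix (the column layout and toy data are assumptions, not the project's actual schema):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Toy daily climate table (the real one has 246 columns per modality).
rng = np.random.default_rng(0)
days = range(1, 6)
climate_df = pd.DataFrame(
    {f"tmax_{d:03d}": rng.normal(22, 5, 100) for d in days}
    | {f"tmin_{d:03d}": rng.normal(10, 4, 100) for d in days}
    | {f"rain_{d:03d}": rng.exponential(2, 100) * rng.binomial(1, 0.4, 100) for d in days}
    | {f"sun_{d:03d}": rng.uniform(0, 14, 100) for d in days}
)

temp_cols = [c for c in climate_df if c.startswith(("tmax_", "tmin_"))]
rain_cols = [c for c in climate_df if c.startswith("rain_")]
sun_cols = [c for c in climate_df if c.startswith("sun_")]

log1p = FunctionTransformer(np.log1p)  # compresses the heavy right tail

preprocess = ColumnTransformer([
    ("temps", StandardScaler(), temp_cols),                     # ~normal: z-score
    ("rain", log1p, rain_cols),                                 # zero-inflated, right-skewed
    ("sun", make_pipeline(log1p, StandardScaler()), sun_cols),  # log, then z-score
])

X_climate = preprocess.fit_transform(climate_df)
```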
The word probability matrix was extremely sparse — approximately 80% zeros — with most tasting notes appearing infrequently across the corpus. This sparsity became a central design constraint, influencing loss function selection (focal loss for the neural network, Tweedie loss for LightGBM) and motivating the development of cluster-based features that aggregate sparse word signals into denser, more predictive representations.
FEATURE ENGINEERING
Feature engineering was the backbone of this project, contributing more to model performance than architecture selection. The team developed five categories of engineered features:
Climate Aggregations — Moving averages at 7, 14, and 30-day windows for each modality, plus seasonal averages (spring, summer, fall rainfall), number of rainy days, maximum weekly temperature drop, rainfall type indicators (dry/moderate/wet), and climate interaction terms (rain × temperature, rain × spring frequency).
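A minimal sketch of these aggregations for a single wine, assuming a long-format daily table; the column names, the rainfall-type thresholds, and the way rolling series are collapsed to scalars are assumptions:

```python
import numpy as np
import pandas as pd

# Toy daily series for one wine's vintage window (Mar 1 - Nov 1, 2022).
rng = np.random.default_rng(0)
dates = pd.date_range("2022-03-01", "2022-11-01", freq="D")
daily = pd.DataFrame({
    "date": dates,
    "tmax": rng.normal(22, 6, len(dates)),
    "rain": rng.exponential(2.0, len(dates)) * rng.binomial(1, 0.4, len(dates)),
})

feats = {}

# Moving averages at 7/14/30-day windows (collapsed to their means here;
# how the rolling series was summarized is an assumption).
for w in (7, 14, 30):
    feats[f"tmax_ma{w}"] = daily["tmax"].rolling(w).mean().mean()
    feats[f"rain_ma{w}"] = daily["rain"].rolling(w).mean().mean()

# Seasonal rainfall averages and rainy-day count.
month = daily["date"].dt.month
feats["spring_rain"] = daily.loc[month.isin([3, 4, 5]), "rain"].mean()
feats["summer_rain"] = daily.loc[month.isin([6, 7, 8]), "rain"].mean()
feats["fall_rain"] = daily.loc[month.isin([9, 10, 11]), "rain"].mean()
feats["num_rainy_days"] = int((daily["rain"] > 0).sum())

# Maximum week-over-week temperature drop.
weekly_tmax = daily.set_index("date")["tmax"].resample("W").mean()
feats["max_weekly_temp_drop"] = float((-weekly_tmax.diff()).max())

# Rainfall type indicator (thresholds hypothetical) and an interaction term.
total_rain = daily["rain"].sum()
feats["rain_type"] = "dry" if total_rain < 150 else "moderate" if total_rain < 400 else "wet"
feats["rain_x_temp"] = daily["rain"].mean() * daily["tmax"].mean()
```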
K-Means Word Cluster Probabilities — The 1,416 tasting note keywords were grouped into 9 clusters using K-means clustering on co-occurrence patterns. For each varietal, cluster membership probabilities were computed by averaging word probabilities within each cluster, producing a compact 9-dimensional representation that captured coarse semantic groupings. A scree plot justified the choice of 9 clusters at the elbow point of within-cluster sum of squares.
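A minimal sketch of the word-clustering step, assuming keywords are described by their probability profile across wines as the co-occurrence signal; the project aggregated cluster probabilities per varietal, shown per wine here for brevity:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy stand-in for the smoothed word-probability matrix (wines x 1,416 keywords).
rng = np.random.default_rng(0)
word_probs = rng.random((500, 1416)) * (rng.random((500, 1416)) < 0.2)

# Describe each keyword by the wines it appears in, and group keywords into 9 clusters.
kmeans = KMeans(n_clusters=9, n_init=10, random_state=42)
word_cluster = kmeans.fit_predict(word_probs.T)            # one cluster label per keyword

# Cluster probabilities: average the word probabilities within each cluster.
cluster_probs = np.column_stack(
    [word_probs[:, word_cluster == k].mean(axis=1) for k in range(9)]
)                                                          # shape (n_wines, 9)

# Elbow / scree check behind the choice of 9 clusters.
inertias = [KMeans(n_clusters=k, n_init=5, random_state=42).fit(word_probs.T).inertia_
            for k in range(2, 13)]
```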
DistilBERT PCA Embeddings — For each varietal, the top 10 most probable tasting note words were concatenated into a pseudo-text and processed through DistilBERT's feature extraction pipeline to obtain 768-dimensional contextual embeddings. PCA reduced these to 5 components, capturing a substantial portion of variance while maintaining computational efficiency.
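A minimal sketch of this step using the Hugging Face feature-extraction pipeline and scikit-learn PCA; the keyword lists and the mean-pooling of token vectors are assumptions:

```python
import numpy as np
from sklearn.decomposition import PCA
from transformers import pipeline

# Hypothetical top-10 keyword lists for two varietals, joined into pseudo-texts.
top_words = {
    "Riesling": ["crisp", "citrus", "apple", "acidity", "mineral",
                 "refreshing", "lime", "floral", "honey", "clean"],
    "Syrah": ["dark", "spicy", "bold", "blackberry", "pepper",
              "tannins", "smoke", "plum", "oak", "rich"],
}
texts = [" ".join(words) for words in top_words.values()]

extractor = pipeline("feature-extraction", model="distilbert-base-uncased")

embeddings = []
for text in texts:
    token_vectors = np.array(extractor(text)[0])    # (n_tokens, 768)
    embeddings.append(token_vectors.mean(axis=0))   # mean-pool to one 768-dim vector
embeddings = np.vstack(embeddings)

# 5 PCA components in the real pipeline; capped here by the toy sample count.
pca = PCA(n_components=min(5, len(embeddings)))
varietal_embed = pca.fit_transform(embeddings)
print(pca.explained_variance_ratio_)
```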
Word Entropy — Shannon entropy of each sample's word probability distribution (with smoothing), measuring the diversity of language used in reviews. Higher entropy indicated more complex or nuanced descriptions.
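A minimal sketch of the entropy feature; the smoothing constant is an assumption:

```python
import numpy as np

def word_entropy(probs: np.ndarray, eps: float = 1e-9) -> float:
    """Shannon entropy of one sample's keyword distribution (with smoothing)."""
    p = probs + eps                 # smoothing so zero entries don't blow up the log
    p = p / p.sum()                 # renormalize to a proper distribution
    return float(-(p * np.log(p)).sum())

# A flatter distribution (more diverse vocabulary) scores higher.
print(word_entropy(np.array([0.5, 0.5, 0.0, 0.0])))       # ~0.69
print(word_entropy(np.array([0.25, 0.25, 0.25, 0.25])))   # ~1.39
```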
Feature Importance Analysis — Mutual information regression identified the most predictive features, enabling a reduction from 33 to 21 features. This elimination step alone improved MAE by 9% (from ~0.022 to ~0.020) without any hyperparameter tuning — demonstrating that thoughtful feature selection can outperform brute-force inclusion.
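A minimal sketch of mutual-information-based selection with scikit-learn; how the 1,416 target columns were summarized into a single score for scoring features is an assumption here:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
# Toy stand-in: 33 engineered features and a single summary target column.
X = rng.normal(size=(800, 33))
y = 0.8 * X[:, 0] + 0.4 * X[:, 5] + rng.normal(scale=0.5, size=800)

# Score each feature against the target; keep the 21 highest-MI features.
mi = mutual_info_regression(X, y, random_state=42)
top_21 = np.argsort(mi)[::-1][:21]
X_reduced = X[:, top_21]
```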
FACTOR ANALYSIS
To validate that the DistilBERT embeddings captured genuine semantic structure, the team performed a Promax-rotated factor analysis on the 5 PCA components. The rotation enhanced interpretability by allowing correlated factors, revealing five distinct wine profile dimensions:
- Factor 1 — Clean, Crisp & White: Weighted heavily on notes like *refreshing*, *clean*, *crisp*, and *white* — corresponding to drier, crisp white wines
- Factor 2 — Refreshing & Fruity: More *fruity* yet also *refreshing*, suggesting sweeter, fruity white or rosé wines
- Factor 3 — Balanced & Structured: Common descriptors for drinkable, well-made red wines
- Factor 4 — Red, Fruity & Sweet: Indicated *bold* yet *sweet* and *fruity* reds
- Factor 5 — Bold, Spicy & Dark: Notes pointing to *spicy*, *dark*, and *bold* reds — notably lacking *acidity*, which was ubiquitous across all other factors
The absence of *acidity* in Factor 5 provided meaningful differentiation and demonstrated that the embeddings went beyond surface-level word co-occurrence. Merging the rotation data with varietal labels confirmed the theoretical groupings — Riesling and Sauvignon Blanc loaded on Factor 1, while Syrah and Malbec dominated Factor 5. Six high-frequency words (palate, flavors, finish, aromas, notes, wine) appeared in 70%+ of all samples and were removed from the factor analysis as uninformative.
This analysis confirmed that DistilBERT embeddings, combined with PCA and factor rotation, captured interpretable semantic structure in tasting vocabulary — providing the model with a principled representation of varietal-level flavor profiles.
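A minimal sketch of a Promax-rotated factor analysis using the third-party factor_analyzer package; the team's actual tooling and input layout are not specified, so both are assumptions, and toy data with a planted five-factor structure stands in for the real per-varietal representation:

```python
import numpy as np
from factor_analyzer import FactorAnalyzer  # third-party; supports Promax (oblique) rotation

rng = np.random.default_rng(0)
# Toy data with a planted 5-factor structure: 64 varietals x 15 observed columns.
latent = rng.normal(size=(64, 5))
mixing = rng.normal(size=(5, 15))
observed = latent @ mixing + rng.normal(scale=0.3, size=(64, 15))

fa = FactorAnalyzer(n_factors=5, rotation="promax")   # oblique: factors may correlate
fa.fit(observed)

print(np.round(fa.loadings_, 2))                    # which columns load on which factor
print(np.round(fa.get_factor_variance()[1], 2))     # proportion of variance per factor
```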
MODELING APPROACH
Three complementary modeling strategies were developed, each contributing different strengths:
Multi-Resolution LightGBM Ensemble — A gradient boosting approach using Tweedie loss, well-suited for the zero-inflated, continuous target distribution. K-means clustering grouped semantically similar target words at multiple resolutions (30, 50, 70, 90 clusters), with a separate LightGBM model trained per cluster at each resolution. Predictions were averaged across resolutions, capturing both coarse and fine-grained semantic patterns. The climate-only variant achieved MAE 0.02541; adding word cluster probabilities, DistilBERT embeddings, word entropy, and interaction features brought this down to MAE 0.02061.
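A minimal sketch of the multi-resolution, per-cluster Tweedie setup; predicting each cluster's mean probability and broadcasting it back to member words is a simplification of the real setup, and only two resolutions are shown:

```python
import numpy as np
import lightgbm as lgb
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_wines, n_feats, n_words = 400, 21, 1416
X = rng.normal(size=(n_wines, n_feats))                                       # engineered features
Y = rng.random((n_wines, n_words)) * (rng.random((n_wines, n_words)) < 0.2)   # sparse targets

resolutions = (30, 90)        # two of the four resolutions (30/50/70/90) for brevity
preds = np.zeros((len(resolutions), n_wines, n_words))

for r, n_clusters in enumerate(resolutions):
    # Group target words by their probability profile (a stand-in for semantic clustering).
    labels = KMeans(n_clusters=n_clusters, n_init=3, random_state=42).fit_predict(Y.T)
    for k in range(n_clusters):
        cols = np.where(labels == k)[0]
        target = Y[:, cols].mean(axis=1)
        # One Tweedie-loss model per word cluster at this resolution.
        model = lgb.LGBMRegressor(objective="tweedie", tweedie_variance_power=1.2,
                                  n_estimators=50, learning_rate=0.05)
        model.fit(X, target)
        preds[r][:, cols] = model.predict(X)[:, None]

# Average across resolutions to blend coarse and fine-grained semantic patterns.
final_pred = preds.mean(axis=0)
```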
PyTorch Encoder Transformer — A custom transformer-based architecture (ClimateEncoderTransformer) inspired by BERT but tailored for multi-label regression. The model processes a 2-token sequence — one for the varietal (via nn.Embedding) and one for climate features (via linear projection) — through a transformer encoder with self-attention. The architecture used learned positional embeddings, GELU activation, and extracted the climate token's representation after attention to produce 1,416 keyword probability predictions via sigmoid. MAE loss was augmented with a mean penalty term to stabilize the output distribution around the dataset mean (~0.0356).
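A minimal sketch of the two-token encoder and the MAE-plus-mean-penalty loss; layer sizes follow the write-up where stated, while the input feature count and the lambda weighting are assumptions:

```python
import torch
import torch.nn as nn

class ClimateEncoderTransformer(nn.Module):
    """Two-token encoder: [varietal token, climate token] -> 1,416 keyword probabilities."""

    def __init__(self, n_varietals=64, n_climate_feats=21, n_words=1416,
                 d_model=512, n_heads=4, n_layers=2):
        super().__init__()
        self.varietal_embed = nn.Embedding(n_varietals, d_model)   # token 1: varietal
        self.climate_proj = nn.Linear(n_climate_feats, d_model)    # token 2: climate features
        self.pos_embed = nn.Parameter(torch.zeros(1, 2, d_model))  # learned positional embeddings
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=4096,
                                           activation="gelu", batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, n_words)

    def forward(self, varietal_ids, climate):
        tokens = torch.stack([self.varietal_embed(varietal_ids),
                              self.climate_proj(climate)], dim=1) + self.pos_embed
        encoded = self.encoder(tokens)
        return torch.sigmoid(self.head(encoded[:, 1]))   # climate token's representation

# MAE loss plus a penalty keeping the batch mean near the dataset mean (~0.0356);
# the lambda_mean weight is an assumption.
def loss_fn(pred, target, lambda_mean=0.1, dataset_mean=0.0356):
    mae = torch.mean(torch.abs(pred - target))
    mean_penalty = torch.abs(pred.mean() - dataset_mean)
    return mae + lambda_mean * mean_penalty

model = ClimateEncoderTransformer()
varietal_ids = torch.randint(0, 64, (8,))
climate = torch.randn(8, 21)
target = torch.rand(8, 1416) * (torch.rand(8, 1416) < 0.2)
loss = loss_fn(model(varietal_ids, climate), target)
loss.backward()
```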
Feedforward Neural Network — A 512→256 two-hidden-layer network with LayerNorm, ReLU, and 45% dropout, trained with focal loss to address the 80% label sparsity. While not the strongest performer, its deterministic architecture made it ideal for SHAP-based explainability analysis, revealing how individual features drove specific tasting note predictions.
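A minimal sketch of the feedforward network and a focal-style loss for soft targets; the exact focal formulation and the input dimensionality are assumptions:

```python
import torch
import torch.nn as nn

class FeedforwardNet(nn.Module):
    """512 -> 256 two-hidden-layer regressor with LayerNorm, ReLU, and 45% dropout."""

    def __init__(self, n_inputs=21, n_words=1416):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_inputs, 512), nn.LayerNorm(512), nn.ReLU(), nn.Dropout(0.45),
            nn.Linear(512, 256), nn.LayerNorm(256), nn.ReLU(), nn.Dropout(0.45),
            nn.Linear(256, n_words),
        )

    def forward(self, x):
        return torch.sigmoid(self.net(x))

def focal_loss(pred, target, gamma=2.0, eps=1e-7):
    """Focal-style BCE for soft targets: down-weights the many easy near-zero labels.
    The exact formulation the team used is an assumption."""
    pred = pred.clamp(eps, 1 - eps)
    bce = -(target * torch.log(pred) + (1 - target) * torch.log(1 - pred))
    modulator = (target - pred).abs() ** gamma
    return (modulator * bce).mean()

model = FeedforwardNet()
x = torch.randn(8, 21)
y = torch.rand(8, 1416) * (torch.rand(8, 1416) < 0.2)
loss = focal_loss(model(x), y)
loss.backward()
```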
HYPERPARAMETER TUNING
LightGBM — Bayesian optimization tuned learning rate, number of leaves, max depth, estimators, feature fraction, L1/L2 regularization, min child weight, min split gain, and Tweedie variance power. The best configuration used learning rate 0.028, 70 leaves, depth 16, 1,000 estimators, and cluster resolutions of 40/80/110. Post-prediction sparsity thresholding aligned outputs with the test data's distributional characteristics.
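The reported best configuration expressed as an LGBMRegressor setup; values not stated in the write-up are placeholders, not the team's tuned settings:

```python
import lightgbm as lgb

best_params = dict(
    objective="tweedie",
    tweedie_variance_power=1.2,   # placeholder: tuned value not reported
    learning_rate=0.028,
    num_leaves=70,
    max_depth=16,
    n_estimators=1000,
    colsample_bytree=0.8,         # "feature fraction"; placeholder
    reg_alpha=0.1,                # L1 regularization; placeholder
    reg_lambda=0.1,               # L2 regularization; placeholder
    min_child_weight=1e-3,        # placeholder
    min_split_gain=0.0,           # placeholder
)
model = lgb.LGBMRegressor(**best_params)
```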
Transformer — Hyperparameters were tuned across three categories using Bayesian optimization via Ax/BoTorch, with correlation heatmaps guiding the search:
- Network architecture: d_model (512), mlp_mult (8 → FFN dim 4,096), num_heads (4–8), num_layers (2–3), learned vs. sinusoidal positional embeddings
- Logits/output: lambda_mean (mean penalty weight), learning rate, temperature scaling
- AdamW optimizer: β₁, β₂, epsilon, weight decay for L2 regularization
The best single transformer model (2 layers, 4 heads, d_model=512) achieved MAE 0.01879. Ensembling with a second model (3 layers, 8 heads, d_model=512) reduced this to MAE 0.01873.
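A minimal sketch of a Bayesian optimization loop with Ax's Service API, of the kind described above; the search ranges shown are assumptions beyond the values listed, and the training call is a stand-in:

```python
from ax.service.ax_client import AxClient, ObjectiveProperties

ax_client = AxClient()
ax_client.create_experiment(
    name="climate_transformer_tuning",
    parameters=[
        {"name": "lr", "type": "range", "bounds": [1e-5, 1e-3], "log_scale": True},
        {"name": "num_heads", "type": "choice", "values": [4, 8]},
        {"name": "num_layers", "type": "range", "bounds": [2, 3]},
        {"name": "lambda_mean", "type": "range", "bounds": [0.0, 1.0]},
        {"name": "weight_decay", "type": "range", "bounds": [1e-6, 1e-2], "log_scale": True},
    ],
    objectives={"val_mae": ObjectiveProperties(minimize=True)},
)

def train_and_eval(params):
    # Stand-in for the real training loop; returns a synthetic validation MAE
    # so the sketch runs end to end.
    return 0.02 + 0.05 * params["lr"] + 0.001 * abs(params["num_layers"] - 2)

for _ in range(20):
    params, trial_index = ax_client.get_next_trial()
    ax_client.complete_trial(trial_index=trial_index, raw_data=train_and_eval(params))

best_parameters, metrics = ax_client.get_best_parameters()
```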
Feedforward NN — Randomized grid search across dropout rates, learning rates, batch sizes, and hidden dimensions, with 5-fold cross-validation during early experimentation. Training was monitored via validation MAE, F1, and mean probability tracking.
EXPLAINABILITY
The team applied SHAP (SHapley Additive exPlanations) to the feedforward neural network to understand how individual features influenced specific tasting note predictions — going beyond global feature importance to per-prediction explanations.
SHAP analysis compared predictions for two tasting notes with different confidence levels (0.82 vs. 0.41 predicted probability). Key findings:
- Aggregated features dominated: Engineered features like num_rainy_days and varietal_embedding consistently had larger SHAP values than granular daily weather measurements, validating the feature engineering strategy
- Rainfall frequency was the strongest signal: num_rainy_days was the most influential climate feature, reflecting its impact on grape development, disease risk, and harvest timing
- Feature engineering improved interpretability alongside accuracy: The model's preference for aggregated inputs confirmed that reducing dimensionality through thoughtful engineering — rather than feeding raw daily data — improved both performance and transparency
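A minimal sketch of per-prediction SHAP attribution on a small stand-in network, using KernelExplainer; the team's choice of explainer is not specified, and the feature names and data are illustrative:

```python
import numpy as np
import shap
import torch
import torch.nn as nn

# Toy stand-in for the trained feedforward network; a single tasting-note output
# is explained here for clarity.
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(21, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())
feature_names = [f"feat_{i}" for i in range(21)]  # e.g., num_rainy_days, varietal_embedding, ...

def predict_keyword(x: np.ndarray) -> np.ndarray:
    with torch.no_grad():
        return model(torch.tensor(x, dtype=torch.float32)).numpy().ravel()

rng = np.random.default_rng(0)
background = rng.normal(size=(50, 21))    # reference sample for the expected value
samples = rng.normal(size=(2, 21))        # e.g., the two contrasted predictions

explainer = shap.KernelExplainer(predict_keyword, background)
shap_values = explainer.shap_values(samples, nsamples=200)

# Which engineered features pushed each prediction up or down?
for i, sv in enumerate(shap_values):
    top = np.argsort(np.abs(sv))[::-1][:5]
    print(f"sample {i}:", [(feature_names[j], round(float(sv[j]), 4)) for j in top])
```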
This explainability analysis reinforced a broader principle the team emphasized: as AI systems are increasingly used in decision-making, tools like SHAP provide the context necessary for transparency and accountability — whether predicting wine flavors or informing higher-stakes policy decisions.
RESULTS
The final ensemble achieved an MAE of 0.01873 — earning 2nd place in the class-wide Kaggle competition. The progression from baseline to final model demonstrates the cumulative impact of feature engineering and model selection:
- LightGBM (climate-only): MAE = 0.02541
- + Feature importance reduction (33→21 features): MAE ≈ 0.020 (9% improvement)
- LightGBM (climate + word/DistilBERT features): MAE = 0.02061
- Transformer encoder (single model, 2-layer): MAE = 0.01879
- Transformer ensemble (2-layer + 3-layer): MAE = 0.01873 (final, 2nd place)
Beyond the competition score, the project produced several notable findings. The mutual information-based feature importance analysis proved that less is more — removing 12 low-signal features improved performance without any model changes. The DistilBERT PCA embeddings, validated through Promax factor analysis, demonstrated that pretrained language models can extract interpretable semantic structure from domain-specific vocabulary — the five recovered wine profile dimensions aligned precisely with established wine categorization (crisp whites through bold reds). And SHAP analysis confirmed that the most impactful predictors were carefully engineered aggregations, not raw granular data — a lesson in the enduring value of thoughtful feature design alongside modern deep learning architectures.
Tech Stack
Details
Team
Steve Meadows, Lauryn Davis, Brooke Walters
Course
CIS 678 — Machine Learning
Timeline
Winter 2025