WINE AI
Predicting Tasting Notes from Climate
Key Metrics
0.01873
FINAL MAE
10,000+
WINES ANALYZED
64
GRAPE VARIETALS
1,416
TASTING KEYWORDS
33→21
FEATURE REDUCTION
2nd Place
COMPETITION RANK
OVERVIEW
Grand Valley's Machine Learning course (CIS 678) challenged students to predict the probability of specific tasting note keywords appearing in a wine review, using only climate data and grape varietal as inputs. The dataset comprised over 10,000 wines across 64 grape varietals, each associated with daily climate data from the 2022 vintage year — minimum and maximum temperatures, rainfall, and sunshine duration spanning March 1 to November 1 (246 days × 4 modalities = 984 raw climate features). The target variable was a matrix of 1,416 tasting note keyword probabilities derived from smoothed word count data.
By modeling these relationships, the project explores how environmental factors influence the language used to describe wine — offering insights into how climate conditions shape sensory perception and market reception. Performance was evaluated by Mean Absolute Error (MAE) on a held-out test set of 1,001 wines, scored through a class-wide Kaggle competition.
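For concreteness, a minimal sketch of the evaluation setup under the shapes described above, using synthetic data and a naive mean baseline (all values here are illustrative, not the project's actual data):

```python
import numpy as np

# Toy stand-in for the held-out evaluation: 1,001 wines x 1,416 keyword probabilities.
rng = np.random.default_rng(0)
y_true = rng.random((1001, 1416)) * (rng.random((1001, 1416)) < 0.2)  # sparse ground truth
y_pred = np.full_like(y_true, y_true.mean())                          # naive mean-prediction baseline

# Mean Absolute Error averaged over every (wine, keyword) cell.
mae = np.mean(np.abs(y_true - y_pred))
print(f"MAE: {mae:.5f}")
```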
EXPLORATORY ANALYSIS
Distribution analysis of the four climate modalities revealed distinct statistical profiles. Maximum and minimum temperatures followed approximately normal distributions centered around seasonal norms, while rainfall was heavily right-skewed with many zero-rain days, and sunshine duration exhibited a bimodal pattern reflecting cloud-cover variability. These distributional differences motivated modality-specific normalization: standard scaling for temperature features, log-transformation for rainfall (to compress the heavy tail), and log + standard scaling for sunshine.
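A minimal sketch of this modality-specific preprocessing, assuming wide-format daily columns named by modality prefix (the column layout and toy data are assumptions, not the project's actual schema):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Toy daily climate table (the real one has 246 columns per modality).
rng = np.random.default_rng(0)
days = range(1, 6)
climate_df = pd.DataFrame(
    {f"tmax_{d:03d}": rng.normal(22, 5, 100) for d in days}
    | {f"tmin_{d:03d}": rng.normal(10, 4, 100) for d in days}
    | {f"rain_{d:03d}": rng.exponential(2, 100) * rng.binomial(1, 0.4, 100) for d in days}
    | {f"sun_{d:03d}": rng.uniform(0, 14, 100) for d in days}
)

temp_cols = [c for c in climate_df if c.startswith(("tmax_", "tmin_"))]
rain_cols = [c for c in climate_df if c.startswith("rain_")]
sun_cols = [c for c in climate_df if c.startswith("sun_")]

log1p = FunctionTransformer(np.log1p)  # compresses the heavy right tail

preprocess = ColumnTransformer([
    ("temps", StandardScaler(), temp_cols),                     # ~normal: z-score
    ("rain", log1p, rain_cols),                                 # zero-inflated, right-skewed
    ("sun", make_pipeline(log1p, StandardScaler()), sun_cols),  # log, then z-score
])

X_climate = preprocess.fit_transform(climate_df)
```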
The word probability matrix was extremely sparse — approximately 80% zeros — with most tasting notes appearing infrequently across the corpus. This sparsity became a central design constraint, influencing loss function selection (focal loss for the neural network, Tweedie loss for LightGBM) and motivating the development of cluster-based features that aggregate sparse word signals into denser, more predictive representations.
FEATURE ENGINEERING
Feature engineering was the backbone of this project, contributing more to model performance than architecture selection. The team developed five categories of engineered features:
Climate Aggregations — Moving averages at 7, 14, and 30-day windows for each modality, plus seasonal averages (spring, summer, fall rainfall), number of rainy days, maximum weekly temperature drop, rainfall type indicators (dry/moderate/wet), and climate interaction terms (rain × temperature, rain × spring frequency).
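A minimal sketch of these aggregations for a single wine, assuming a long-format daily table; the column names, the rainfall-type thresholds, and the way rolling series are collapsed to scalars are assumptions:

```python
import numpy as np
import pandas as pd

# Toy daily series for one wine's vintage window (Mar 1 - Nov 1, 2022).
rng = np.random.default_rng(0)
dates = pd.date_range("2022-03-01", "2022-11-01", freq="D")
daily = pd.DataFrame({
    "date": dates,
    "tmax": rng.normal(22, 6, len(dates)),
    "rain": rng.exponential(2.0, len(dates)) * rng.binomial(1, 0.4, len(dates)),
})

feats = {}

# Moving averages at 7/14/30-day windows (collapsed to their means here;
# how the rolling series was summarized is an assumption).
for w in (7, 14, 30):
    feats[f"tmax_ma{w}"] = daily["tmax"].rolling(w).mean().mean()
    feats[f"rain_ma{w}"] = daily["rain"].rolling(w).mean().mean()

# Seasonal rainfall averages and rainy-day count.
month = daily["date"].dt.month
feats["spring_rain"] = daily.loc[month.isin([3, 4, 5]), "rain"].mean()
feats["summer_rain"] = daily.loc[month.isin([6, 7, 8]), "rain"].mean()
feats["fall_rain"] = daily.loc[month.isin([9, 10, 11]), "rain"].mean()
feats["num_rainy_days"] = int((daily["rain"] > 0).sum())

# Maximum week-over-week temperature drop.
weekly_tmax = daily.set_index("date")["tmax"].resample("W").mean()
feats["max_weekly_temp_drop"] = float((-weekly_tmax.diff()).max())

# Rainfall type indicator (thresholds hypothetical) and an interaction term.
total_rain = daily["rain"].sum()
feats["rain_type"] = "dry" if total_rain < 150 else "moderate" if total_rain < 400 else "wet"
feats["rain_x_temp"] = daily["rain"].mean() * daily["tmax"].mean()
```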
K-Means Word Cluster Probabilities — The 1,416 tasting note keywords were grouped into 9 clusters using K-means clustering on co-occurrence patterns. For each varietal, cluster membership probabilities were computed by averaging word probabilities within each cluster, producing a compact 9-dimensional representation that captured coarse semantic groupings. A scree plot justified the choice of 9 clusters at the elbow point of within-cluster sum of squares.
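A minimal sketch of the word-clustering step, assuming keywords are described by their probability profile across wines as the co-occurrence signal; the project aggregated cluster probabilities per varietal, shown per wine here for brevity:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy stand-in for the smoothed word-probability matrix (wines x 1,416 keywords).
rng = np.random.default_rng(0)
word_probs = rng.random((500, 1416)) * (rng.random((500, 1416)) < 0.2)

# Describe each keyword by the wines it appears in, and group keywords into 9 clusters.
kmeans = KMeans(n_clusters=9, n_init=10, random_state=42)
word_cluster = kmeans.fit_predict(word_probs.T)            # one cluster label per keyword

# Cluster probabilities: average the word probabilities within each cluster.
cluster_probs = np.column_stack(
    [word_probs[:, word_cluster == k].mean(axis=1) for k in range(9)]
)                                                          # shape (n_wines, 9)

# Elbow / scree check behind the choice of 9 clusters.
inertias = [KMeans(n_clusters=k, n_init=5, random_state=42).fit(word_probs.T).inertia_
            for k in range(2, 13)]
```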
DistilBERT PCA Embeddings — For each varietal, the top 10 most probable tasting note words were concatenated into a pseudo-text and processed through DistilBERT's feature extraction pipeline to obtain 768-dimensional contextual embeddings. PCA reduced these to 5 components, capturing a substantial portion of variance while maintaining computational efficiency.
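A minimal sketch of this step using the Hugging Face feature-extraction pipeline and scikit-learn PCA; the keyword lists and the mean-pooling of token vectors are assumptions:

```python
import numpy as np
from sklearn.decomposition import PCA
from transformers import pipeline

# Hypothetical top-10 keyword lists for two varietals, joined into pseudo-texts.
top_words = {
    "Riesling": ["crisp", "citrus", "apple", "acidity", "mineral",
                 "refreshing", "lime", "floral", "honey", "clean"],
    "Syrah": ["dark", "spicy", "bold", "blackberry", "pepper",
              "tannins", "smoke", "plum", "oak", "rich"],
}
texts = [" ".join(words) for words in top_words.values()]

extractor = pipeline("feature-extraction", model="distilbert-base-uncased")

embeddings = []
for text in texts:
    token_vectors = np.array(extractor(text)[0])    # (n_tokens, 768)
    embeddings.append(token_vectors.mean(axis=0))   # mean-pool to one 768-dim vector
embeddings = np.vstack(embeddings)

# 5 PCA components in the real pipeline; capped here by the toy sample count.
pca = PCA(n_components=min(5, len(embeddings)))
varietal_embed = pca.fit_transform(embeddings)
print(pca.explained_variance_ratio_)
```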
Word Entropy — Shannon entropy of each sample's word probability distribution (with smoothing), measuring the diversity of language used in reviews. Higher entropy indicated more complex or nuanced descriptions.
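A minimal sketch of the entropy feature; the smoothing constant is an assumption:

```python
import numpy as np

def word_entropy(probs: np.ndarray, eps: float = 1e-9) -> float:
    """Shannon entropy of one sample's keyword distribution (with smoothing)."""
    p = probs + eps                 # smoothing so zero entries don't blow up the log
    p = p / p.sum()                 # renormalize to a proper distribution
    return float(-(p * np.log(p)).sum())

# A flatter distribution (more diverse vocabulary) scores higher.
print(word_entropy(np.array([0.5, 0.5, 0.0, 0.0])))       # ~0.69
print(word_entropy(np.array([0.25, 0.25, 0.25, 0.25])))   # ~1.39
```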
Feature Importance Analysis — Mutual information regression identified the most predictive features, enabling a reduction from 33 to 21 features. This elimination step alone improved MAE by 9% (from ~0.022 to ~0.020) without any hyperparameter tuning — demonstrating that thoughtful feature selection can outperform brute-force inclusion.
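A minimal sketch of mutual-information-based selection with scikit-learn; how the 1,416 target columns were summarized into a single score for scoring features is an assumption here:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
# Toy stand-in: 33 engineered features and a single summary target column.
X = rng.normal(size=(800, 33))
y = 0.8 * X[:, 0] + 0.4 * X[:, 5] + rng.normal(scale=0.5, size=800)

# Score each feature against the target; keep the 21 highest-MI features.
mi = mutual_info_regression(X, y, random_state=42)
top_21 = np.argsort(mi)[::-1][:21]
X_reduced = X[:, top_21]
```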
FACTOR ANALYSIS
To validate that the DistilBERT embeddings captured genuine semantic structure, the team performed a Promax-rotated factor analysis on the 5 PCA components. The rotation enhanced interpretability by allowing correlated factors, revealing five distinct wine profile dimensions:
- Factor 1 — Clean, Crisp & White: Weighted heavily on notes like *refreshing*, *clean*, *crisp*, and *white* — corresponding to drier, crisp white wines
- Factor 2 — Refreshing & Fruity: More *fruity* yet also *refreshing*, suggesting sweeter, fruity white or rosé wines
- Factor 3 — Balanced & Structured: Common descriptors for drinkable, well-made red wines
- Factor 4 — Red, Fruity & Sweet: Indicated *bold* yet *sweet* and *fruity* reds
- Factor 5 — Bold, Spicy & Dark: Notes pointing to *spicy*, *dark*, and *bold* reds — notably lacking *acidity*, which was ubiquitous across all other factors
The absence of *acidity* in Factor 5 provided meaningful differentiation and demonstrated that the embeddings went beyond surface-level word co-occurrence. Merging the rotation data with varietal labels confirmed the theoretical groupings — Riesling and Sauvignon Blanc loaded on Factor 1, while Syrah and Malbec dominated Factor 5. Six high-frequency words (palate, flavors, finish, aromas, notes, wine) appeared in 70%+ of all samples and were removed from the factor analysis as uninformative.
This analysis confirmed that DistilBERT embeddings, combined with PCA and factor rotation, captured interpretable semantic structure in tasting vocabulary — providing the model with a principled representation of varietal-level flavor profiles.
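A minimal sketch of a Promax-rotated factor analysis using the third-party factor_analyzer package; the team's actual tooling and input layout are not specified, so both are assumptions, and toy data with a planted five-factor structure stands in for the real per-varietal representation:

```python
import numpy as np
from factor_analyzer import FactorAnalyzer  # third-party; supports Promax (oblique) rotation

rng = np.random.default_rng(0)
# Toy data with a planted 5-factor structure: 64 varietals x 15 observed columns.
latent = rng.normal(size=(64, 5))
mixing = rng.normal(size=(5, 15))
observed = latent @ mixing + rng.normal(scale=0.3, size=(64, 15))

fa = FactorAnalyzer(n_factors=5, rotation="promax")   # oblique: factors may correlate
fa.fit(observed)

print(np.round(fa.loadings_, 2))                    # which columns load on which factor
print(np.round(fa.get_factor_variance()[1], 2))     # proportion of variance per factor
```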
MODELING APPROACH
Three complementary modeling strategies were developed, each contributing different strengths:
Multi-Resolution LightGBM Ensemble — A gradient boosting approach using Tweedie loss, well-suited for the zero-inflated, continuous target distribution. K-means clustering grouped semantically similar target words at multiple resolutions (30, 50, 70, 90 clusters), with a separate LightGBM model trained per cluster at each resolution. Predictions were averaged across resolutions, capturing both coarse and fine-grained semantic patterns. The climate-only variant achieved MAE 0.02541; adding word cluster probabilities, DistilBERT embeddings, word entropy, and interaction features brought this down to MAE 0.02061.
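A minimal sketch of the multi-resolution, per-cluster Tweedie setup; predicting each cluster's mean probability and broadcasting it back to member words is a simplification of the real setup, and only two resolutions are shown:

```python
import numpy as np
import lightgbm as lgb
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_wines, n_feats, n_words = 400, 21, 1416
X = rng.normal(size=(n_wines, n_feats))                                       # engineered features
Y = rng.random((n_wines, n_words)) * (rng.random((n_wines, n_words)) < 0.2)   # sparse targets

resolutions = (30, 90)        # two of the four resolutions (30/50/70/90) for brevity
preds = np.zeros((len(resolutions), n_wines, n_words))

for r, n_clusters in enumerate(resolutions):
    # Group target words by their probability profile (a stand-in for semantic clustering).
    labels = KMeans(n_clusters=n_clusters, n_init=3, random_state=42).fit_predict(Y.T)
    for k in range(n_clusters):
        cols = np.where(labels == k)[0]
        target = Y[:, cols].mean(axis=1)
        # One Tweedie-loss model per word cluster at this resolution.
        model = lgb.LGBMRegressor(objective="tweedie", tweedie_variance_power=1.2,
                                  n_estimators=50, learning_rate=0.05)
        model.fit(X, target)
        preds[r][:, cols] = model.predict(X)[:, None]

# Average across resolutions to blend coarse and fine-grained semantic patterns.
final_pred = preds.mean(axis=0)
```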
PyTorch Encoder Transformer — A custom transformer-based architecture (ClimateEncoderTransformer) inspired by BERT but tailored for multi-label regression. The model processes a 2-token sequence — one for the varietal (via nn.Embedding) and one for climate features (via linear projection) — through a transformer encoder with self-attention. The architecture used learned positional embeddings, GELU activation, and extracted the climate token's representation after attention to produce 1,416 keyword probability predictions via sigmoid. MAE loss was augmented with a mean penalty term to stabilize the output distribution around the dataset mean (~0.0356).
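A minimal sketch of the two-token encoder and the MAE-plus-mean-penalty loss; layer sizes follow the write-up where stated, while the input feature count and the lambda weighting are assumptions:

```python
import torch
import torch.nn as nn

class ClimateEncoderTransformer(nn.Module):
    """Two-token encoder: [varietal token, climate token] -> 1,416 keyword probabilities."""

    def __init__(self, n_varietals=64, n_climate_feats=21, n_words=1416,
                 d_model=512, n_heads=4, n_layers=2):
        super().__init__()
        self.varietal_embed = nn.Embedding(n_varietals, d_model)   # token 1: varietal
        self.climate_proj = nn.Linear(n_climate_feats, d_model)    # token 2: climate features
        self.pos_embed = nn.Parameter(torch.zeros(1, 2, d_model))  # learned positional embeddings
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=4096,
                                           activation="gelu", batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, n_words)

    def forward(self, varietal_ids, climate):
        tokens = torch.stack([self.varietal_embed(varietal_ids),
                              self.climate_proj(climate)], dim=1) + self.pos_embed
        encoded = self.encoder(tokens)
        return torch.sigmoid(self.head(encoded[:, 1]))   # climate token's representation

# MAE loss plus a penalty keeping the batch mean near the dataset mean (~0.0356);
# the lambda_mean weight is an assumption.
def loss_fn(pred, target, lambda_mean=0.1, dataset_mean=0.0356):
    mae = torch.mean(torch.abs(pred - target))
    mean_penalty = torch.abs(pred.mean() - dataset_mean)
    return mae + lambda_mean * mean_penalty

model = ClimateEncoderTransformer()
varietal_ids = torch.randint(0, 64, (8,))
climate = torch.randn(8, 21)
target = torch.rand(8, 1416) * (torch.rand(8, 1416) < 0.2)
loss = loss_fn(model(varietal_ids, climate), target)
loss.backward()
```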
Feedforward Neural Network — A 512→256 two-hidden-layer network with LayerNorm, ReLU, and 45% dropout, trained with focal loss to address the 80% label sparsity. While not the strongest performer, its deterministic architecture made it ideal for SHAP-based explainability analysis, revealing how individual features drove specific tasting note predictions.
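A minimal sketch of the feedforward network and a focal-style loss for soft targets; the exact focal formulation and the input dimensionality are assumptions:

```python
import torch
import torch.nn as nn

class FeedforwardNet(nn.Module):
    """512 -> 256 two-hidden-layer regressor with LayerNorm, ReLU, and 45% dropout."""

    def __init__(self, n_inputs=21, n_words=1416):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_inputs, 512), nn.LayerNorm(512), nn.ReLU(), nn.Dropout(0.45),
            nn.Linear(512, 256), nn.LayerNorm(256), nn.ReLU(), nn.Dropout(0.45),
            nn.Linear(256, n_words),
        )

    def forward(self, x):
        return torch.sigmoid(self.net(x))

def focal_loss(pred, target, gamma=2.0, eps=1e-7):
    """Focal-style BCE for soft targets: down-weights the many easy near-zero labels.
    The exact formulation the team used is an assumption."""
    pred = pred.clamp(eps, 1 - eps)
    bce = -(target * torch.log(pred) + (1 - target) * torch.log(1 - pred))
    modulator = (target - pred).abs() ** gamma
    return (modulator * bce).mean()

model = FeedforwardNet()
x = torch.randn(8, 21)
y = torch.rand(8, 1416) * (torch.rand(8, 1416) < 0.2)
loss = focal_loss(model(x), y)
loss.backward()
```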
HYPERPARAMETER TUNING
LightGBM — Bayesian optimization tuned learning rate, number of leaves, max depth, estimators, feature fraction, L1/L2 regularization, min child weight, min split gain, and Tweedie variance power. The best configuration used learning rate 0.028, 70 leaves, depth 16, 1,000 estimators, and cluster resolutions of 40/80/110. Post-prediction sparsity thresholding aligned outputs with the test data's distributional characteristics.
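The reported best configuration expressed as an LGBMRegressor setup; values not stated in the write-up are placeholders, not the team's tuned settings:

```python
import lightgbm as lgb

best_params = dict(
    objective="tweedie",
    tweedie_variance_power=1.2,   # placeholder: tuned value not reported
    learning_rate=0.028,
    num_leaves=70,
    max_depth=16,
    n_estimators=1000,
    colsample_bytree=0.8,         # "feature fraction"; placeholder
    reg_alpha=0.1,                # L1 regularization; placeholder
    reg_lambda=0.1,               # L2 regularization; placeholder
    min_child_weight=1e-3,        # placeholder
    min_split_gain=0.0,           # placeholder
)
model = lgb.LGBMRegressor(**best_params)
```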
Transformer — Hyperparameters were tuned across three categories using Bayesian optimization via Ax/BoTorch, with correlation heatmaps guiding the search:
- Network architecture: d_model (512), mlp_mult (8 → FFN dim 4,096), num_heads (4–8), num_layers (2–3), learned vs. sinusoidal positional embeddings
- Logits/output: lambda_mean (mean penalty weight), learning rate, temperature scaling
- AdamW optimizer: β₁, β₂, epsilon, weight decay for L2 regularization
The best single transformer model (2 layers, 4 heads, d_model=512) achieved MAE 0.01879. Ensembling with a second model (3 layers, 8 heads, d_model=512) reduced this to MAE 0.01873.
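A minimal sketch of a Bayesian optimization loop with Ax's Service API, of the kind described above; the search ranges shown are assumptions beyond the values listed, and the training call is a stand-in:

```python
from ax.service.ax_client import AxClient, ObjectiveProperties

ax_client = AxClient()
ax_client.create_experiment(
    name="climate_transformer_tuning",
    parameters=[
        {"name": "lr", "type": "range", "bounds": [1e-5, 1e-3], "log_scale": True},
        {"name": "num_heads", "type": "choice", "values": [4, 8]},
        {"name": "num_layers", "type": "range", "bounds": [2, 3]},
        {"name": "lambda_mean", "type": "range", "bounds": [0.0, 1.0]},
        {"name": "weight_decay", "type": "range", "bounds": [1e-6, 1e-2], "log_scale": True},
    ],
    objectives={"val_mae": ObjectiveProperties(minimize=True)},
)

def train_and_eval(params):
    # Stand-in for the real training loop; returns a synthetic validation MAE
    # so the sketch runs end to end.
    return 0.02 + 0.05 * params["lr"] + 0.001 * abs(params["num_layers"] - 2)

for _ in range(20):
    params, trial_index = ax_client.get_next_trial()
    ax_client.complete_trial(trial_index=trial_index, raw_data=train_and_eval(params))

best_parameters, metrics = ax_client.get_best_parameters()
```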
Feedforward NN — Randomized grid search across dropout rates, learning rates, batch sizes, and hidden dimensions, with 5-fold cross-validation during early experimentation. Training was monitored via validation MAE, F1, and mean probability tracking.
EXPLAINABILITY
The team applied SHAP (SHapley Additive exPlanations) to the feedforward neural network to understand how individual features influenced specific tasting note predictions — going beyond global feature importance to per-prediction explanations.
SHAP analysis compared predictions for two tasting notes with different confidence levels (0.82 vs. 0.41 predicted probability). Key findings:
- Aggregated features dominated: Engineered features like num_rainy_days and varietal_embedding consistently had larger SHAP values than granular daily weather measurements, validating the feature engineering strategy
- Rainfall frequency was the strongest signal: num_rainy_days was the most influential climate feature, reflecting its impact on grape development, disease risk, and harvest timing
- Feature engineering improved interpretability alongside accuracy: The model's preference for aggregated inputs confirmed that reducing dimensionality through thoughtful engineering — rather than feeding raw daily data — improved both performance and transparency
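A minimal sketch of per-prediction SHAP attribution on a small stand-in network, using KernelExplainer; the team's choice of explainer is not specified, and the feature names and data are illustrative:

```python
import numpy as np
import shap
import torch
import torch.nn as nn

# Toy stand-in for the trained feedforward network; a single tasting-note output
# is explained here for clarity.
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(21, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())
feature_names = [f"feat_{i}" for i in range(21)]  # e.g., num_rainy_days, varietal_embedding, ...

def predict_keyword(x: np.ndarray) -> np.ndarray:
    with torch.no_grad():
        return model(torch.tensor(x, dtype=torch.float32)).numpy().ravel()

rng = np.random.default_rng(0)
background = rng.normal(size=(50, 21))    # reference sample for the expected value
samples = rng.normal(size=(2, 21))        # e.g., the two contrasted predictions

explainer = shap.KernelExplainer(predict_keyword, background)
shap_values = explainer.shap_values(samples, nsamples=200)

# Which engineered features pushed each prediction up or down?
for i, sv in enumerate(shap_values):
    top = np.argsort(np.abs(sv))[::-1][:5]
    print(f"sample {i}:", [(feature_names[j], round(float(sv[j]), 4)) for j in top])
```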
This explainability analysis reinforced a broader principle the team emphasized: as AI systems are increasingly used in decision-making, tools like SHAP provide the context necessary for transparency and accountability — whether predicting wine flavors or informing higher-stakes policy decisions.
RESULTS
The final ensemble achieved an MAE of 0.01873 — earning 2nd place in the class-wide Kaggle competition. The progression from baseline to final model demonstrates the cumulative impact of feature engineering and model selection:
- LightGBM (climate-only): MAE = 0.02541
- + Feature importance reduction (33→21 features): MAE ≈ 0.020 (9% improvement)
- LightGBM (climate + word/DistilBERT features): MAE = 0.02061
- Transformer encoder (single model, 2-layer): MAE = 0.01879
- Transformer ensemble (2-layer + 3-layer): MAE = 0.01873 (final, 2nd place)
Beyond the competition score, the project produced several notable findings. The mutual information-based feature importance analysis proved that less is more — removing 12 low-signal features improved performance without any model changes. The DistilBERT PCA embeddings, validated through Promax factor analysis, demonstrated that pretrained language models can extract interpretable semantic structure from domain-specific vocabulary — the five recovered wine profile dimensions aligned precisely with established wine categorization (crisp whites through bold reds). And SHAP analysis confirmed that the most impactful predictors were carefully engineered aggregations, not raw granular data — a lesson in the enduring value of thoughtful feature design alongside modern deep learning architectures.
Tech Stack
Details
Team
Steve Meadows, Lauryn Davis, Brooke Walters
Course
CIS 678 — Machine Learning
Timeline
Winter 2025