kNN RECOMMENDER SYSTEM
Avian Species Classification
Key Metrics
0.0357
FINAL MAE
-25.3%
VS. THRESHOLD
85+
SPECIES COUNT
2,000+
HYPERPARAM COMBINATIONS TESTED
27.2%
MAE IMPROVEMENT
1st Place
COMPETITION RANK
OVERVIEW
Grand Valley's Machine Learning course (CIS 678) challenged students to identify underrepresented bird species in data collected across the United States from eBird, a citizen science platform where birders record their sightings. The dataset includes over 85 species with extreme class imbalance — counts ranging from 300 for rare species to over 160,000 for abundant ones like the Common Myna. The objective: build a recommender system that predicts missing bird species from observation checklists, evaluated by Mean Absolute Error (MAE) on a class-wide Kaggle leaderboard.
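For context, the leaderboard metric is simply the mean absolute difference between predicted and actual values across every checklist/species cell. A minimal R helper, with toy numbers rather than competition data, looks like:

    # Mean Absolute Error: average absolute gap between predictions and truth
    mae <- function(actual, predicted) mean(abs(actual - predicted))

    # Toy example: six checklist/species cells
    actual    <- c(0, 1, 0, 1, 0, 1)
    predicted <- c(0.05, 0.90, 0.10, 0.80, 0.00, 0.95)
    mae(actual, predicted)   # ~0.083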
EXPLORATORY ANALYSIS
Initial analysis revealed severe variability across species counts. Quintile analysis showed that Q5 species exhibited the largest standard deviation, while Q1 had a tightly clustered, narrow range. Over 80% of the dataset consisted of zero values, confirming high sparsity — a critical factor in choosing Manhattan distance as the primary distance metric, given its robustness to sparse, high-dimensional data. This sparsity analysis also motivated the development of a custom soft-zero weighting scheme later in the project.
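As a rough illustration, both the sparsity check and the quintile comparison take only a few lines of base R; the matrix below is synthetic, standing in for the real checklist-by-species counts:

    # Synthetic stand-in: 1,000 checklists x 85 species of sparse counts
    set.seed(42)
    counts <- matrix(rpois(1000 * 85, lambda = 0.3), nrow = 1000, ncol = 85)

    # Overall sparsity: share of zero-valued cells
    mean(counts == 0)

    # Group species into quintiles by total count, then compare their spread
    totals   <- colSums(counts)
    quintile <- cut(totals, quantile(totals, probs = seq(0, 1, 0.2)),
                    labels = paste0("Q", 1:5), include.lowest = TRUE)
    tapply(apply(counts, 2, sd), quintile, mean)   # mean std. dev. per quintile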
APPROACH
The team implemented a k-Nearest Neighbors algorithm built from scratch in R with several custom extensions (a combined code sketch follows the list):
Distance Metric Selection — Manhattan distance was chosen over Euclidean and Cosine alternatives after initial testing showed it outperformed both, likely due to its robustness to outliers and suitability for high-dimensional sparse data.
Decay Functions — Exponential decay (w = e^(-d/σ)) and triangular decay (w = max(0, 1-d/r)) were implemented to weight closer neighbors more heavily. Exponential decay consistently outperformed triangular, providing smoother weight transitions.
Soft-Zero Weighting — A novel contribution: a conditional scalar that down-weights zero-valued features in test columns during distance calculation. This prevents missing observations from artificially inflating distances, allowing the algorithm to focus on meaningful non-zero co-occurrences.
Log Normalization — Log-transformation of the data compressed the distance distribution, preventing high-count species from dominating calculations. This yielded an 18.5% improvement in MAE over non-normalized data.
HPC Parallel Processing — Leveraged the doParallel package across multiple CPU cores, reducing runtime from ~10 minutes to ~4 minutes per iteration.
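A compressed sketch of how these pieces fit together for a single prediction is shown below. The function and variable names are illustrative rather than the team's actual code, and the data is assumed to be log-normalized already (log1p of the raw counts):

    # Weighted kNN prediction for one species on one test checklist.
    # Assumes 'train' (matrix) and 'test_row' (vector) are already log1p-transformed.
    predict_species <- function(train, test_row, target_col,
                                k = 17, sigma = 1.5, soft_zero = 0.25) {
      feats <- setdiff(seq_len(ncol(train)), target_col)   # exclude the target species

      # Soft-zero weighting: features that are zero in the test checklist
      # contribute only a fraction of their usual distance
      w <- ifelse(test_row[feats] == 0, soft_zero, 1)

      # Weighted Manhattan distance to every training checklist
      d <- apply(train[, feats, drop = FALSE], 1,
                 function(r) sum(w * abs(r - test_row[feats])))

      # Exponential decay converts distances into neighbor weights
      nn <- order(d)[seq_len(k)]
      wt <- exp(-d[nn] / sigma)

      # Weighted average of the neighbors' values for the target species
      # (a real pipeline would back-transform with expm1 before submitting)
      sum(wt * train[nn, target_col]) / sum(wt)
    }

    # Toy usage on synthetic, log-normalized data
    set.seed(1)
    train    <- log1p(matrix(rpois(200 * 10, 0.4), nrow = 200))
    test_row <- log1p(rpois(10, 0.4))
    predict_species(train, test_row, target_col = 3)

Swapping exp(-d/sigma) for pmax(0, 1 - d/r) gives the triangular-decay variant the team compared against.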
HYPERPARAMETER TUNING
Hyperparameter tuning proceeded through multiple passes of increasing resolution, visualized with interactive 3D scatter plots (Plotly):
Pass 1 — Broad search: σ ∈ [5, 50] (step 5), scalar ∈ [0.1, 1.0] (step 0.1). Best result: σ=10, scalar=0.6, Kaggle MAE = 0.0447.
Pass 2 — Focused search: σ ∈ [5, 12] (step 0.25), scalar ∈ [0.15, 0.75] (step 0.05). Best result: σ=9, scalar=0.4, MAE = 0.0444.
Pass 3 — High resolution: σ ∈ [7, 9.75] (step 0.05), scalar ∈ [0.3, 0.5] (step 0.01). Revealed interesting trough structures at σ ≈ 9 and σ ≈ 7.75.
Pass 4 (Normalized) — After log-transformation with k=17: σ ∈ [0.5, 5] and scalar ∈ [0.1, 0.8], over 2,000+ combinations. Optimal: σ=1.5, scalar=0.25, MAE = 0.0362.
Final Grid Search — Added k ∈ [3, 71] as a third search dimension (sketched below). Animated 3D plots showed that larger k continued to improve results. Final submission: MAE = 0.0357.
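One way such a final pass could be scripted is sketched below: expand.grid builds the combinations and Plotly renders the results as an interactive 3D scatter. The evaluate_mae function is a placeholder for the internal scoring described in the next paragraph, and the step sizes are illustrative:

    library(plotly)

    # Hypothetical final-pass grid over sigma, scalar and k
    grid <- expand.grid(sigma  = seq(0.5, 5, by = 0.25),
                        scalar = seq(0.1, 0.8, by = 0.05),
                        k      = seq(3, 71, by = 4))

    # 'evaluate_mae' is a placeholder for the internal 80/20 scoring routine
    grid$mae <- apply(grid, 1, function(p)
      evaluate_mae(sigma = p["sigma"], scalar = p["scalar"], k = p["k"]))

    # Each point is one hyperparameter combination, colored by its MAE
    plot_ly(grid, x = ~sigma, y = ~scalar, z = ~k, color = ~mae,
            type = "scatter3d", mode = "markers")

    grid[which.min(grid$mae), ]   # best combination found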
To enable this scale of testing, the team pre-computed a global distance matrix and used an internal 80/20 train/validation split, allowing hundreds of MAE evaluations in minutes without hitting Kaggle's submission limits.
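A minimal sketch of that machinery follows, assuming the target is a single species column and omitting the soft-zero scalar for brevity; a routine like the evaluate_mae placeholder above could be built on top of it. All names are illustrative:

    library(doParallel)
    library(foreach)
    registerDoParallel(cores = parallel::detectCores() - 1)

    target_col <- 3                                  # example species to predict
    X <- train[, -target_col, drop = FALSE]          # log-normalized feature columns
    y <- train[,  target_col]

    val <- sample(nrow(X), round(0.2 * nrow(X)))     # 20% of checklists held out internally
    # All pairwise Manhattan distances, computed once and reused for every pass
    D     <- as.matrix(dist(X, method = "manhattan"))[val, -val]
    y_tr  <- y[-val]
    y_val <- y[val]

    mae_for <- function(k, sigma) {
      preds <- foreach(i = seq_len(nrow(D)), .combine = c) %dopar% {
        nn <- order(D[i, ])[seq_len(k)]              # k nearest training checklists
        wt <- exp(-D[i, nn] / sigma)                 # exponential decay weights
        sum(wt * y_tr[nn]) / sum(wt)
      }
      mean(abs(preds - y_val))                       # internal MAE, no Kaggle submission
    }

    mae_for(k = 17, sigma = 1.5)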
RESULTS
The final model achieved an MAE of 0.0357 — placing 25.3% below the recommended threshold and earning 1st place in the class-wide Kaggle competition. The progression tells the story of systematic optimization:
- Baseline (unweighted, k=19): MAE ≈ 0.049
- + Exponential decay (σ=10): MAE = 0.046
- + Soft-zero weighting (scalar=0.6): MAE = 0.0447
- + Granular tuning (σ=9, scalar=0.4): MAE = 0.0444
- + Log normalization (σ=1.5, scalar=0.25, k=17): MAE = 0.0362 (18.5% improvement)
- + Full grid search (k optimized): MAE = 0.0357 (final, 1st place)
FUTURE WORK
Several extensions were identified for further optimization: K-means clustering to engineer features that capture bird co-occurrence patterns; PCA for dimensionality reduction prior to clustering; and model stacking to combine kNN with the clustering-derived features. The team began exploring this direction, but the substantial gains already delivered by decay weighting, soft-zero weighting, and log normalization made refining those the priority ahead of the competition deadline.
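As a rough sketch only (none of this shipped in the submitted model), that direction might look like PCA on the checklist-by-species matrix followed by k-means, with the resulting cluster id stacked alongside the kNN features; whether to cluster checklists or species profiles is an open choice:

    # Illustrative only: PCA, then k-means, then append the cluster id as a feature
    pca       <- prcomp(train)                       # principal components of the checklists
    scores    <- pca$x[, 1:10]                       # keep the first 10 components
    clusters  <- kmeans(scores, centers = 8, nstart = 25)$cluster
    train_aug <- cbind(train, cluster = clusters)    # cluster id available for stacking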
Tech Stack
R · doParallel · Plotly
Details
Team
Steve Meadows, Lauryn Davis, Brooke Walters
Course
CIS 678 — Machine Learning
Timeline
Fall 2024