kNN RECOMMENDER SYSTEM
Avian Species Classification
Key Metrics
0.0357
FINAL MAE
-25.3%
VS. THRESHOLD
85+
SPECIES COUNT
2,000+
HYPERPARAM COMBINATIONS TESTED
27.2%
MAE IMPROVEMENT
1st Place
COMPETITION RANK
OVERVIEW
Grand Valley's Machine Learning course (CIS 678) challenged students to identify underrepresented bird species in data collected across the United States from eBird, a citizen science platform where birders record their sightings. The dataset includes over 85 species with extreme class imbalance — counts ranging from 300 for rare species to over 160,000 for abundant ones like the Common Myna. The objective: build a recommender system that predicts missing bird species from observation checklists, evaluated by Mean Absolute Error (MAE) on a class-wide Kaggle leaderboard.
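For context, the leaderboard metric is simply the mean absolute difference between predicted and actual values across every checklist/species cell. A minimal R helper, with toy numbers rather than competition data, looks like:

    # Mean Absolute Error: average absolute gap between predictions and truth
    mae <- function(actual, predicted) mean(abs(actual - predicted))

    # Toy example: six checklist/species cells
    actual    <- c(0, 1, 0, 1, 0, 1)
    predicted <- c(0.05, 0.90, 0.10, 0.80, 0.00, 0.95)
    mae(actual, predicted)   # ~0.083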
EXPLORATORY ANALYSIS
Initial analysis revealed severe variability across species counts. Quintile analysis showed that Q5 species exhibited the largest standard deviation, while Q1 had a tightly clustered, narrow range. Over 80% of the dataset consisted of zero values, confirming high sparsity — a critical factor in choosing Manhattan distance as the primary distance metric, given its robustness to sparse, high-dimensional data. This sparsity analysis also motivated the development of a custom soft-zero weighting scheme later in the project.
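As a rough illustration, both the sparsity check and the quintile comparison take only a few lines of base R; the matrix below is synthetic, standing in for the real checklist-by-species counts:

    # Synthetic stand-in: 1,000 checklists x 85 species of sparse counts
    set.seed(42)
    counts <- matrix(rpois(1000 * 85, lambda = 0.3), nrow = 1000, ncol = 85)

    # Overall sparsity: share of zero-valued cells
    mean(counts == 0)

    # Group species into quintiles by total count, then compare their spread
    totals   <- colSums(counts)
    quintile <- cut(totals, quantile(totals, probs = seq(0, 1, 0.2)),
                    labels = paste0("Q", 1:5), include.lowest = TRUE)
    tapply(apply(counts, 2, sd), quintile, mean)   # mean std. dev. per quintile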
APPROACH
The team implemented a k-Nearest Neighbors algorithm built from scratch in R with several custom extensions (a combined code sketch follows the list):
Distance Metric Selection — Manhattan distance was chosen over Euclidean and Cosine alternatives after initial testing showed it outperformed both, likely due to its robustness to outliers and suitability for high-dimensional sparse data.
Decay Functions — Exponential decay (w = e^(-d/σ)) and triangular decay (w = max(0, 1-d/r)) were implemented to weight closer neighbors more heavily. Exponential decay consistently outperformed triangular, providing smoother weight transitions.
Soft-Zero Weighting — A novel contribution: a conditional scalar that down-weights zero-valued features in test columns during distance calculation. This prevents missing observations from artificially inflating distances, allowing the algorithm to focus on meaningful non-zero co-occurrences.
Log Normalization — Log-transformation of the data compressed the distance distribution, preventing high-count species from dominating calculations. This yielded an 18.5% improvement in MAE over non-normalized data.
HPC Parallel Processing — Leveraged the doParallel package across multiple CPU cores, reducing runtime from ~10 minutes to ~4 minutes per iteration.
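A compressed sketch of how these pieces fit together for a single prediction is shown below. The function and variable names are illustrative rather than the team's actual code, and the data is assumed to be log-normalized already (log1p of the raw counts):

    # Weighted kNN prediction for one species on one test checklist.
    # Assumes 'train' (matrix) and 'test_row' (vector) are already log1p-transformed.
    predict_species <- function(train, test_row, target_col,
                                k = 17, sigma = 1.5, soft_zero = 0.25) {
      feats <- setdiff(seq_len(ncol(train)), target_col)   # exclude the target species

      # Soft-zero weighting: features that are zero in the test checklist
      # contribute only a fraction of their usual distance
      w <- ifelse(test_row[feats] == 0, soft_zero, 1)

      # Weighted Manhattan distance to every training checklist
      d <- apply(train[, feats, drop = FALSE], 1,
                 function(r) sum(w * abs(r - test_row[feats])))

      # Exponential decay converts distances into neighbor weights
      nn <- order(d)[seq_len(k)]
      wt <- exp(-d[nn] / sigma)

      # Weighted average of the neighbors' values for the target species
      # (a real pipeline would back-transform with expm1 before submitting)
      sum(wt * train[nn, target_col]) / sum(wt)
    }

    # Toy usage on synthetic, log-normalized data
    set.seed(1)
    train    <- log1p(matrix(rpois(200 * 10, 0.4), nrow = 200))
    test_row <- log1p(rpois(10, 0.4))
    predict_species(train, test_row, target_col = 3)

Swapping exp(-d/sigma) for pmax(0, 1 - d/r) gives the triangular-decay variant the team compared against.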
HYPERPARAMETER TUNING
Hyperparameter tuning proceeded through multiple passes of increasing resolution, visualized with interactive 3D scatter plots (Plotly):
Pass 1 — Broad search: σ ∈ [5, 50] (step 5), scalar ∈ [0.1, 1.0] (step 0.1). Best result: σ=10, scalar=0.6, Kaggle MAE = 0.0447.
Pass 2 — Focused search: σ ∈ [5, 12] (step 0.25), scalar ∈ [0.15, 0.75] (step 0.05). Best result: σ=9, scalar=0.4, MAE = 0.0444.
Pass 3 — High resolution: σ ∈ [7, 9.75] (step 0.05), scalar ∈ [0.3, 0.5] (step 0.01). Revealed interesting trough structures at σ ≈ 9 and σ ≈ 7.75.
Pass 4 (Normalized) — After log-transformation with k=17: σ ∈ [0.5, 5] and scalar ∈ [0.1, 0.8], over 2,000+ combinations. Optimal: σ=1.5, scalar=0.25, MAE = 0.0362.
Final Grid Search — Added k ∈ [3, 71] as a third search dimension (sketched below). Animated 3D plots showed that larger k continued to improve results. Final submission: MAE = 0.0357.
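One way such a final pass could be scripted is sketched below: expand.grid builds the combinations and Plotly renders the results as an interactive 3D scatter. The evaluate_mae function is a placeholder for the internal scoring described in the next paragraph, and the step sizes are illustrative:

    library(plotly)

    # Hypothetical final-pass grid over sigma, scalar and k
    grid <- expand.grid(sigma  = seq(0.5, 5, by = 0.25),
                        scalar = seq(0.1, 0.8, by = 0.05),
                        k      = seq(3, 71, by = 4))

    # 'evaluate_mae' is a placeholder for the internal 80/20 scoring routine
    grid$mae <- apply(grid, 1, function(p)
      evaluate_mae(sigma = p["sigma"], scalar = p["scalar"], k = p["k"]))

    # Each point is one hyperparameter combination, colored by its MAE
    plot_ly(grid, x = ~sigma, y = ~scalar, z = ~k, color = ~mae,
            type = "scatter3d", mode = "markers")

    grid[which.min(grid$mae), ]   # best combination found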
To enable this scale of testing, the team pre-computed a global distance matrix and used an internal 80/20 train/validation split, allowing hundreds of MAE evaluations in minutes without hitting Kaggle's submission limits.
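A minimal sketch of that machinery follows, assuming the target is a single species column and omitting the soft-zero scalar for brevity; a routine like the evaluate_mae placeholder above could be built on top of it. All names are illustrative:

    library(doParallel)
    library(foreach)
    registerDoParallel(cores = parallel::detectCores() - 1)

    target_col <- 3                                  # example species to predict
    X <- train[, -target_col, drop = FALSE]          # log-normalized feature columns
    y <- train[,  target_col]

    val <- sample(nrow(X), round(0.2 * nrow(X)))     # 20% of checklists held out internally
    # All pairwise Manhattan distances, computed once and reused for every pass
    D     <- as.matrix(dist(X, method = "manhattan"))[val, -val]
    y_tr  <- y[-val]
    y_val <- y[val]

    mae_for <- function(k, sigma) {
      preds <- foreach(i = seq_len(nrow(D)), .combine = c) %dopar% {
        nn <- order(D[i, ])[seq_len(k)]              # k nearest training checklists
        wt <- exp(-D[i, nn] / sigma)                 # exponential decay weights
        sum(wt * y_tr[nn]) / sum(wt)
      }
      mean(abs(preds - y_val))                       # internal MAE, no Kaggle submission
    }

    mae_for(k = 17, sigma = 1.5)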
RESULTS
The final model achieved an MAE of 0.0357 — placing 25.3% below the recommended threshold and earning 1st place in the class-wide Kaggle competition. The progression tells the story of systematic optimization:
- Baseline (unweighted, k=19): MAE ≈ 0.049
- + Exponential decay (σ=10): MAE = 0.046
- + Soft-zero weighting (scalar=0.6): MAE = 0.0447
- + Granular tuning (σ=9, scalar=0.4): MAE = 0.0444
- + Log normalization (σ=1.5, scalar=0.25, k=17): MAE = 0.0362 (18.5% improvement)
- + Full grid search (k optimized): MAE = 0.0357 (final, 1st place)
FUTURE WORK
Several extensions were identified for further optimization: K-means clustering to engineer features that capture bird co-occurrence patterns; PCA for dimensionality reduction prior to clustering; and model stacking to combine kNN with the clustering-derived features. The team began exploring this direction, but the substantial gains already delivered by decay weighting, soft-zero weighting, and log normalization made refining those the priority ahead of the competition deadline.
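As a rough sketch only (none of this shipped in the submitted model), that direction might look like PCA on the checklist-by-species matrix followed by k-means, with the resulting cluster id stacked alongside the kNN features; whether to cluster checklists or species profiles is an open choice:

    # Illustrative only: PCA, then k-means, then append the cluster id as a feature
    pca       <- prcomp(train)                       # principal components of the checklists
    scores    <- pca$x[, 1:10]                       # keep the first 10 components
    clusters  <- kmeans(scores, centers = 8, nstart = 25)$cluster
    train_aug <- cbind(train, cluster = clusters)    # cluster id available for stacking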
Tech Stack
R · doParallel · Plotly
Details
Team
Steve Meadows, Lauryn Davis, Brooke Walters
Course
CIS 678 — Machine Learning
Timeline
Fall 2024