SVM & DIM. REDUCTION

Deconstructing Student Achievement

SVM, PCAmix, Dimensionality Reduction, Bayesian Optimization, Factor Analysis, Class Imbalance

Key Metrics

  • At-risk recall: ~90%
  • Test PR-AUC: 0.634
  • ROC-AUC: 0.924
  • Overall accuracy: 81.5%
  • Latent factors: 10
  • Class weight ratio: 6.6:1

OVERVIEW

This project investigates the drivers of secondary student academic success using data from the 2019 Parent and Family Involvement (PFI) Survey — a nationally representative dataset administered by the U.S. Census Bureau on behalf of the National Center for Education Statistics. The dataset contains high-dimensional, mixed-type survey data spanning family demographics, parental engagement behaviors, school climate perceptions, and student outcomes for grades 6–12.

The central question: which latent profiles of family involvement — from household structure to daily homework routines — are most strongly associated with academic success? The target variable was derived by recoding self-reported grades into a binary indicator: students reporting mostly A's/B's were labeled "high success," while those reporting mostly C's or lower were labeled "low success." This binary framing introduced severe class imbalance (~6.6:1 ratio favoring high-success), making minority-class detection the core modeling challenge.

Unlike traditional linear approaches, this analysis addresses the complexity of mixed numeric and categorical variables through a robust machine learning framework: PCAmix for dimensionality reduction, Varimax-rotated factor analysis for interpretability, and a Radial Basis Function SVM tuned via Bayesian Optimization — prioritizing the detection of at-risk students above all else.

DATA & PREPROCESSING

The raw PFI dataset required substantial preprocessing before modeling. Grade variables were standardized so that kindergarten categories collapsed into a single value and grades 1–12 mapped onto a consistent scale. The sample was restricted to secondary students (grades 6–12), and the outcome variable was cleaned by treating special codes as missing before constructing the binary success indicator.
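The outcome recoding described above can be sketched as a small mapping. This is a minimal illustration in plain Python rather than the project's R code, and the response labels (`mostly_As` and so on) are hypothetical stand-ins, not the actual PFI codes.

```python
# Hypothetical response labels; the real PFI variable uses its own codes.
HIGH_SUCCESS = {"mostly_As", "mostly_Bs"}
LOW_SUCCESS = {"mostly_Cs", "mostly_Ds_or_lower"}


def success_label(grade_response):
    """Map a self-reported grade response to the binary success indicator.

    Anything outside the two known sets (special codes, skips) is treated
    as missing and returned as None.
    """
    if grade_response in HIGH_SUCCESS:
        return "high_success"
    if grade_response in LOW_SUCCESS:
        return "low_success"
    return None


responses = ["mostly_As", "mostly_Cs", "no_letter_grades", "mostly_Bs"]
labels = [success_label(r) for r in responses]
print(labels)
```

Returning None for unrecognized responses keeps special codes out of both classes, which mirrors the "treat special codes as missing first" ordering described in the text.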

A critical preprocessing decision involved handling the value -1, which the PFI survey uses to indicate a "valid skip" rather than a substantive response. Left in place, -1 would be treated as a genuine numeric value and would distort the means and covariance structure of the quantitative variables. All -1 entries in numeric variables were therefore recoded as NA, which PCAmix replaces with the column mean. For qualitative variables, PCAmix's disjunctive (indicator) matrix treatment handles missing entries by replacing them with zeros, consistent with the absence of a selected response category.
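A small numeric example shows why the valid-skip code matters. The sketch below is plain Python rather than the project's R code, and the homework-hours values are invented for illustration. It compares the column mean when -1 is kept as a number with the mean after recoding to missing and mean-imputing.

```python
from statistics import mean

VALID_SKIP = -1  # PFI "valid skip" code, as described in the text


def recode_valid_skip(values):
    """Replace the valid-skip code with None so it counts as missing."""
    return [None if v == VALID_SKIP else v for v in values]


def mean_impute(values):
    """Fill missing entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Invented weekly homework hours; -1 marks a valid skip.
homework_hours = [4, 6, -1, 10, -1, 8]

naive_mean = mean(homework_hours)  # wrongly treats -1 as a real value
cleaned = mean_impute(recode_valid_skip(homework_hours))
clean_mean = mean(cleaned)

print(naive_mean, clean_mean)
```

With these invented values the naive mean is pulled down to about 4.33, while the recoded and imputed column keeps the observed mean of 7.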

The dataset contained seven continuous variables (homework hours, parent age, household size, work hours, months worked, sibling count, and school participation frequency) alongside dozens of categorical survey responses. This mix of variable types motivated the use of PCAmix, which integrates PCA for quantitative variables with an MCA-like treatment for qualitative variables within a unified component solution.
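The indicator coding of a qualitative variable, including the all-zero row for a missing response, can be sketched as follows. This is plain Python rather than R, the survey item and its levels are hypothetical, and only the raw coding step is shown. Any centering or weighting PCAmix applies afterwards is omitted.

```python
def indicator_matrix(values):
    """Build a disjunctive (one-hot) coding of a categorical variable.

    Returns the sorted category levels and one row per observation.
    A missing value (None) produces an all-zero row.
    """
    levels = sorted({v for v in values if v is not None})
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows


# Hypothetical survey item with one missing response.
responses = ["often", "never", None, "often"]
levels, rows = indicator_matrix(responses)

print(levels)
for row in rows:
    print(row)
```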

DIMENSIONALITY REDUCTION

PCAmix was applied to reduce the high-dimensional mixed-type predictor space into interpretable principal components. A scree plot of the first 30 eigenvalues revealed a steep decline through Components 1–10 (eigenvalues > ~1.9), followed by a markedly flattened curve where additional components provided diminishing returns.

While Kaiser's criterion (eigenvalues > 1) would have suggested retaining significantly more components, this heuristic is known to over-retain in large mixed-data sets containing many categorical indicators. Instead, priority was given to components that (a) lie above the discernible elbow in the scree plot, (b) exhibit noticeably larger eigenvalue magnitudes, and (c) support interpretable rotated factors.

A 10-component solution was selected as the optimal balance between parsimony and fidelity — preserving major structural dimensions while excluding weak, noise-driven components. The retained components were projected into both training and test sets using PCAmix's predict function, creating a compact 10-dimensional feature space for downstream modeling.

FACTOR ANALYSIS

To enhance interpretability, a Varimax rotation was applied to the 10 retained components. The rotation pushes each variable's loadings toward either zero or a large absolute value, producing a "simple structure" where each variable associates primarily with a single latent factor. A dominant loading approach retained only the strongest factor loading per variable (threshold > 0.30), yielding ten distinct latent constructs:

  • Factor 1 — Household Structure: Two-parent presence, marital status, household size
  • Factor 2 — School Climate: Satisfaction with academic standards, discipline, teachers
  • Factor 3 — Home Engagement: Parental homework help frequency, arts/crafts, games
  • Factor 4 — Parent Demographics: Caregiver relationship, employment, age
  • Factor 5 — Residual School Climate: Secondary variance from school perception variables
  • Factor 6 — SES & Culture: Income, parental education, home language, race/ethnicity
  • Factor 7 — Homework Routines: Student homework hours, frequency, workload perception
  • Factor 8 — Parental Gender: Sex of primary caregiver, weekly work hours
  • Factor 9 — Academic Orientation & Challenges: Expected attainment, disability, health, absenteeism
  • Factor 10 — School Participation: Volunteering, fundraising, committee service, sports attendance
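The dominant-loading rule used to build the factor list above can be sketched in a few lines. This is an illustration in plain Python rather than the project's R code. The variable names and loading values are invented, and comparing the absolute loading against the 0.30 threshold is an assumption about how the rule was applied.

```python
def dominant_loadings(loadings, threshold=0.30):
    """Assign each variable to the factor with its strongest loading.

    Returns {variable: (factor_number, loading)}; variables whose strongest
    absolute loading does not exceed the threshold map to None.
    """
    assignment = {}
    for variable, row in loadings.items():
        best = max(range(len(row)), key=lambda j: abs(row[j]))
        if abs(row[best]) > threshold:
            assignment[variable] = (best + 1, row[best])
        else:
            assignment[variable] = None
    return assignment


# Invented rotated loadings on three factors, for illustration only.
loadings = {
    "two_parent_household": [0.82, 0.05, -0.10],
    "teacher_satisfaction": [0.08, 0.71, 0.12],
    "homework_help_days": [0.03, -0.15, 0.64],
    "library_visits": [0.21, 0.18, -0.25],
}
result = dominant_loadings(loadings)
for variable, assigned in result.items():
    print(variable, assigned)
```

In this toy input, `library_visits` never exceeds the threshold on any factor, so it is left unassigned rather than forced onto its nearest factor.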

A dumbbell plot of mean standardized factor scores by success group revealed which factors most sharply discriminate between high- and low-performing students, with Household Structure (Factor 1) and Academic Orientation & Challenges (Factor 9) exhibiting the widest separation.

KEY FINDINGS

Analysis of the factor profiles uncovered three critical themes with real-world policy implications:

The "Reactive Paradox": The most striking finding was the inversion between Factors 3 and 7. Students receiving the most frequent parental homework help (3+ days/week) exhibited the highest probability of low success, while students who independently completed homework 5+ days/week were significantly more likely to be in the high-success group. Because the data are cross-sectional, this pattern cannot establish causation, but it suggests that intensive parental assistance is often a lagging indicator of existing academic deficits: a reactive intervention rather than a proactive driver of achievement. If so, effective support strategies should prioritize fostering student autonomy over direct remediation.

Structure as the Baseline: Household stability (Factor 1) exhibited the largest magnitude of separation between groups. Students in two-parent households showed substantially higher success rates, consistent with a resource-based account in the spirit of "Resource Dilution" theory, in which two guardians increase the time and financial resources available per child. This structural advantage may act as a multiplier for other factors: a student with high structural stability may weather challenges that would derail a student lacking those resources.

Systemic Stratification: Persistent gaps across income, parental education, and race/ethnicity (Factor 6) indicated that socioeconomic context establishes a foundational gradient for achievement. A notable jump in success rates appeared for parents holding at least a Bachelor's degree, plausibly reflecting both economic stability and familiarity with the educational system. These non-academic barriers remain potent inhibitors of achievement and likely interact with behavioral factors in complex, non-linear ways.

APPROACH

The modeling pipeline coupled PCAmix dimensionality reduction with a cost-sensitive Radial Basis Function (RBF) SVM classifier. The RBF kernel implicitly maps input vectors into a high-dimensional feature space in which non-linear class boundaries can become linearly separable. This matters for this dataset, where t-SNE visualization showed multi-modal cluster structure with distinct "islands" of success rather than simple linear separation. The pattern suggests that student success is heterogeneous: multiple distinct parental profiles (unique combinations of resources, engagement levels, and household structures) can independently lead to high achievement.
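The kernel itself is simple to write down: K(x, z) = exp(-γ‖x − z‖²). The sketch below, in plain Python rather than the project's R code, evaluates it on invented 10-dimensional points using the gamma = 0.01 value reported in this write-up, to show how slowly similarity decays at that setting.

```python
import math


def rbf_kernel(x, z, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


gamma = 0.01  # tuned value reported in this write-up

# Invented points in a 10-dimensional component space.
origin = [0.0] * 10
near = [1.0] * 10  # squared distance 10 from origin
far = [5.0] * 10   # squared distance 250 from origin

k_same = rbf_kernel(origin, origin, gamma)
k_near = rbf_kernel(origin, near, gamma)
k_far = rbf_kernel(origin, far, gamma)
print(k_same, k_near, k_far)
```

With a small gamma, a point one unit away in every dimension still has similarity of about 0.90 (exp(-0.1)), so each training point influences a wide neighborhood and the resulting boundary is smooth.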

To address the severe class imbalance (6.6:1 ratio), class weights proportional to inverse class frequencies were incorporated directly into the SVM's penalty term. Misclassifying a minority "low-success" case incurred roughly 6.6× the penalty of misclassifying a "high-success" case, preventing the classifier from defaulting to the majority class.
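Inverse-frequency weighting can be sketched as follows, in plain Python rather than the project's R code. Two assumptions are made here. The counts are taken from the test-set confusion matrix reported in the results (247 low-success, 1,622 high-success) as a stand-in for the training counts, and the n_total / (n_classes × n_class) formula is one common convention, not necessarily the exact one used.

```python
def inverse_frequency_weights(counts):
    """Weight each class by n_total / (n_classes * n_class)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


# Stand-in counts from the reported test-set confusion matrix:
# low_success = 221 TP + 26 FN, high_success = 320 FP + 1,302 TN.
counts = {"low_success": 247, "high_success": 1622}

weights = inverse_frequency_weights(counts)
ratio = weights["low_success"] / weights["high_success"]
print(weights, round(ratio, 2))
```

Whatever normalization is used, only the ratio between the two weights affects the relative penalty, and it equals the class ratio of about 6.6.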

Bayesian Optimization with an Expected Improvement acquisition function tuned cost (C) and gamma (γ) over 5-fold stratified cross-validation, parallelized across multiple CPU cores. Performance was evaluated using Precision-Recall AUC for the minority class — chosen over ROC-AUC because it provides a more direct measure of the classifier's ability to identify and rank at-risk students without inflation from true negatives. The optimization converged on cost = 0.10 and gamma = 0.01, yielding a smooth, heavily regularized decision boundary — consistent with the finding that overly flexible or localized boundaries overfit the majority class and degrade minority-class ranking.
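For reference, Expected Improvement for a maximization target such as PR-AUC has a closed form when the surrogate's prediction at a candidate point is Gaussian. The sketch below is plain Python rather than the project's R code. The `xi` exploration margin and the example numbers are assumptions for illustration, not settings taken from the project.

```python
import math


def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_improvement(mu, sigma, best, xi=0.0):
    """E[max(f - best - xi, 0)] for f ~ Normal(mu, sigma^2), maximization."""
    improvement = mu - best - xi
    if sigma <= 0.0:
        # No predictive uncertainty: improvement is deterministic.
        return max(improvement, 0.0)
    z = improvement / sigma
    return improvement * normal_cdf(z) + sigma * normal_pdf(z)


# Invented surrogate predictions for two (cost, gamma) candidates,
# both predicted slightly below the current best score.
best_so_far = 0.50
ei_uncertain = expected_improvement(mu=0.40, sigma=0.20, best=best_so_far)
ei_confident = expected_improvement(mu=0.40, sigma=0.05, best=best_so_far)
print(ei_uncertain, ei_confident)
```

The two candidates share the same predicted mean, but the more uncertain one has the larger Expected Improvement. This is how the acquisition function trades exploration against exploitation.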

RESULTS

The final 10-component model achieved ~90% recall for at-risk students — correctly identifying 9 out of 10 low-success students — with a test PR-AUC of 0.634 and ROC-AUC of 0.924. The confusion matrix (221 TP, 320 FP, 26 FN, 1,302 TN) reflects the deliberate design priority of minimizing false negatives over false positives.
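The headline metrics follow directly from the reported confusion matrix, as the short Python check below shows (the project itself was built in R). The prevalence line is added for context: a classifier that ranks students at random has an expected PR-AUC roughly equal to the minority-class prevalence, which gives a baseline for reading the 0.634 figure.

```python
# Confusion matrix reported above (positive class = low success / at-risk).
tp, fp, fn, tn = 221, 320, 26, 1302
total = tp + fp + fn + tn

recall = tp / (tp + fn)               # share of at-risk students caught
precision = tp / (tp + fp)            # share of flags that are truly at-risk
accuracy = (tp + tn) / total          # overall share classified correctly
false_positive_rate = fp / (fp + tn)  # share of high-success students flagged
prevalence = (tp + fn) / total        # at-risk share; random-ranking PR-AUC baseline

reported_pr_auc = 0.634
lift_over_baseline = reported_pr_auc / prevalence

for name, value in [
    ("recall", recall),
    ("precision", precision),
    ("accuracy", accuracy),
    ("false_positive_rate", false_positive_rate),
    ("prevalence", prevalence),
    ("lift_over_baseline", lift_over_baseline),
]:
    print(f"{name}: {value:.3f}")
```

Recall works out to 89.5% and accuracy to 81.5%, matching the figures quoted in this write-up. Precision is about 41%, which is the price paid for the high recall. The reported PR-AUC is roughly 4.8 times the 13.2% prevalence baseline.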

Comparison across dimensionality levels confirmed the 10-component model's superiority:

  • 10 components: Recall = 89.5%, PR-AUC = 0.634, ROC-AUC = 0.924
  • 5 components: Recall = 81.4%, PR-AUC = 0.517, ROC-AUC declined
  • 3 components: Recall = 78.5%, PR-AUC = 0.475, ROC-AUC = 0.849

Aggressive dimensionality reduction degraded minority-class detection, demonstrating that the full 10-factor representation preserved critical mixed-data structure.

In an educational context, the two error types carry asymmetric costs, and the model's elevated false-positive rate is the cheaper error. Students incorrectly flagged as at-risk receive additional mentoring, resources, and support: interventions that reinforce positive behaviors even for students who may not strictly need them. By contrast, a missed at-risk student goes without intervention. This framing treats false positives less as prediction errors than as opportunities to expand support for students on the margins of success, making the model a broad safety net rather than a mere prediction tool.

FUTURE WORK

Future directions include incorporating longitudinal data to determine whether "reactive" parental help eventually transitions into independent success over time — the cross-sectional nature of the 2019 PFI survey limits causal inference. Expanding the feature set to include school-level funding and resource data could help disentangle home environment effects from institutional quality.

The framework could also be extended to multi-class prediction (distinguishing A-students from B-students from C-students) or integrated with ensemble methods for improved precision alongside the existing high recall. Additionally, the factor analysis could be refined with oblique rotation methods that allow correlated factors, potentially revealing relationships between household structure and engagement behaviors that the orthogonal Varimax rotation rules out by construction.

Tech Stack

R, PCAmix, SVM (RBF Kernel), Bayesian Optimization, t-SNE, Parallel Computing, Plotly, ggplot2

Details

Team: Steve Meadows
Course: CIS 678 (Machine Learning)
Timeline: Fall 2024