ORDER HISTORY DFA
Multivariate Time Series Dimensionality Reduction
Key Metrics
- 39 input variables
- 6 latent trends
- 55% variance explained
- 11% MAE improvement
- 3% RMSE improvement
- m = 1–12 models tested
OVERVIEW
During a data science internship at a durable goods manufacturer, the goal was to build a statistical model forecasting monthly order volume. The model incorporated ~39 time series variables — a mix of internal pipeline metrics and external economic indicators spanning unemployment, commercial real estate, business sentiment, monetary policy, industry-specific demand, and professional construction indices.
Inspired by graduate coursework in multivariate statistics, the initial approach explored applying Principal Component Analysis (PCA) and Principal Axis Factoring (PAF) to reduce the dimensionality of the feature space. However, PCA and PAF assume static, non-temporal data — they extract latent factors from a covariance matrix computed across independent observations, which fundamentally breaks down when observations are autocorrelated time series. The extracted components would conflate temporal dynamics with cross-sectional variance, producing misleading and unstable factors.
The solution was Dynamic Factor Analysis (DFA), a state-space modeling approach designed explicitly for multivariate time series. DFA models the observed data as generated by a smaller set of latent dynamic factors that evolve over time according to a stochastic process — specifically, a linear dynamical system. This allows DFA to capture temporal dependencies while simultaneously reducing dimensionality, yielding interpretable latent trends rather than static components.
Implementation used the MARSS (Multivariate Autoregressive State-Space) package in R, which provides robust tools for fitting these models via Expectation-Maximization and BFGS optimization. After careful model selection based on AICc criteria, the analysis settled on 6 latent trends explaining 55% of the variance across the full variable set, with Promax oblique rotation producing interpretable factors that mapped intuitively to real economic constructs.
Due to the proprietary nature of the internship data, a companion Kaggle notebook was created to demonstrate the DFA approach on a public dataset (Bike Sales Data of 100K), validating that the dimensionality reduction produces equal or better prediction accuracy with dramatically fewer features.
DATA PREPARATION
The dataset comprised 39 time series variables spanning internal order pipeline metrics and external economic indicators. Preparing this data for MARSS required a rigorous multi-step stationarity pipeline; a code sketch follows the steps below:
Stationarity Testing — Each variable was tested using both the Augmented Dickey-Fuller (ADF) test (null: unit root present) and the KPSS test (null: series is stationary). Using both tests in tandem provided robust classification, flagging variables where ADF p > 0.05 or KPSS p < 0.05 as non-stationary.
First Differencing — Non-stationary series were first-differenced to remove trends. After differencing, stationarity was re-tested on all variables.
Second Differencing — Variables that remained non-stationary after first differencing received a second round of differencing, with stationarity re-confirmed via ADF and KPSS.
Variance Thresholding — After differencing, a variance cutoff (1e-3) was applied to remove near-zero-variance columns that would provide no useful signal to the model.
Special Handling of the Federal Funds Rate — The federal funds rate exhibited unique behavior as a level variable managed by policy decisions rather than market forces. Rather than differencing it (which destroyed its interpretive value), it was removed from the differencing pipeline, tested separately for stationarity, and reattached to the processed dataset as a raw level variable aligned to the differenced time indices.
Redundant Variable Consolidation — Iterative analysis revealed several groups of highly correlated variables that inflated factor loadings. Return-to-office metropolitan indices were consolidated to a single representative variable, multiple industry stock prices were consolidated, and sparse variables (e.g., infrequent price-change data) were removed entirely.
Z-Score Standardization — The MARSS package requires all input series to be z-scored (zero mean, unit variance), ensuring that variables with different scales contribute equally to the factor estimation. The standardized data matrix was then transposed to MARSS's expected format (series in rows, time points in columns).
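The sketch below illustrates this preparation pipeline, assuming the `tseries` package for ADF/KPSS testing; the object names (`dat`, `fed_funds`) are placeholders for the proprietary series, and the helper functions are illustrative rather than the exact internship code.

```r
library(tseries)

# dat: data frame of the monthly series (time points in rows, series in columns),
# with the federal funds rate already set aside as a separate vector (fed_funds)

is_nonstationary <- function(x) {
  x <- na.omit(x)
  adf_p  <- adf.test(x)$p.value   # H0: unit root (non-stationary)
  kpss_p <- kpss.test(x)$p.value  # H0: series is stationary
  adf_p > 0.05 || kpss_p < 0.05   # flag if either test indicates non-stationarity
}

# Difference each series until it tests stationary (at most twice, per the pipeline above)
make_stationary <- function(x, max_d = 2) {
  d <- 0
  while (is_nonstationary(x) && d < max_d) {
    x <- diff(x)
    d <- d + 1
  }
  x
}

dat_stat <- lapply(dat, make_stationary)
len      <- min(lengths(dat_stat))                         # align lengths after differencing
dat_stat <- as.data.frame(lapply(dat_stat, tail, n = len))

# Drop near-zero-variance columns, then reattach the fed funds rate as a raw level variable
dat_stat <- dat_stat[, sapply(dat_stat, var, na.rm = TRUE) > 1e-3]
dat_stat$fed_funds <- tail(fed_funds, n = len)

# Z-score and transpose to MARSS's expected shape (series in rows, time points in columns)
dat_z <- t(scale(dat_stat))
```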
MODELING
The core of the analysis used the MARSS state-space framework, which models the data through two coupled equations:
- Observation equation: y(t) = Z·x(t) + a + v(t) — the observed variables y are linear combinations of latent states x, plus offsets a and observation noise v
- State equation: x(t) = B·x(t-1) + u + w(t) — the latent states evolve as a random walk with drift u and process noise w
The model was configured with diagonal and unequal observation and process error covariance matrices (R and Q), zero intercepts (A = "zero"), identity state transition (B = "identity" for random walk dynamics), and high initial state uncertainty (V0 = diag(10, m)).
BFGS Optimization — Model fitting used the BFGS quasi-Newton method with stringent convergence criteria: maximum 10,000 iterations and a relative tolerance of 1e-8. This provided more reliable convergence than the default EM algorithm for models with many parameters.
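A minimal sketch of this MARSS specification, assuming `dat_z` from the preparation step (series in rows). The zero constraint on the upper triangle of Z is the standard DFA identifiability convention; the U and x0 settings, which are not spelled out above, are assumptions here.

```r
library(MARSS)

n <- nrow(dat_z)   # number of observed series
m <- 6             # number of latent trends

# Loadings matrix Z as an n x m list matrix: free parameters on/below the
# diagonal, zeros above it (standard DFA identifiability constraint)
Z <- matrix(list(0), n, m)
for (i in seq_len(n)) {
  for (j in seq_len(m)) {
    if (j <= i) Z[i, j] <- paste0("z", i, ".", j)
  }
}

model_list <- list(
  Z  = Z,
  R  = "diagonal and unequal",  # observation error covariance
  Q  = "diagonal and unequal",  # process error covariance
  A  = "zero",                  # no observation intercepts
  B  = "identity",              # random-walk state dynamics
  U  = "zero",                  # assumption: no drift on z-scored data
  x0 = "zero",                  # assumption: states start at zero
  V0 = diag(10, m)              # high initial state uncertainty
)

fit <- MARSS(dat_z,
             model   = model_list,
             method  = "BFGS",
             control = list(maxit = 10000, reltol = 1e-8))
```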
Model Selection — Models were fitted across m = 1 to 12 latent trends, with each model's fit evaluated via AICc (corrected Akaike Information Criterion), which penalizes model complexity to prevent overfitting. The AICc plot indicated that 3–8 trends represented the optimal range, with diminishing returns beyond 6.
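For reference, the small-sample correction is AICc = AIC + 2k(k + 1) / (n − k − 1), where k counts estimated parameters and n the number of observations; with monthly data and up to 12 latent trends the parameter count grows quickly, so this correction matters for the comparison.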
Iterative Refinement — The modeling process was iterative rather than single-pass:
- Initial run (all variables, m = 1–12): Identified m = 6 as optimal by AICc, but Promax rotation revealed only 38% variance explained with overloaded return-to-office factors
- RTO consolidation (reduced correlated variables): Re-ran m = 4–6; AICc favored m = 4, but the loadings showed signs of overfitting
- Final selection (m = 6 after consolidation): Achieved 55% variance explained with balanced, interpretable factor loadings distributed across three dominant trends
Parallel Computing — The `doParallel` and `foreach` packages enabled parallelized model fitting across multiple trend counts, distributing MARSS fits across all available CPU cores to reduce total computation time across the model selection grid.
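A sketch of the parallelized selection grid, again assuming `dat_z`; `fit_dfa()` is an illustrative wrapper around the MARSS call shown earlier, not a MARSS function.

```r
library(doParallel)
library(foreach)
library(MARSS)

# Illustrative wrapper around the MARSS specification above, parameterized by m
fit_dfa <- function(y, m) {
  n <- nrow(y)
  Z <- matrix(list(0), n, m)
  for (i in seq_len(n)) for (j in seq_len(m)) if (j <= i) Z[i, j] <- paste0("z", i, ".", j)
  model_list <- list(Z = Z, R = "diagonal and unequal", Q = "diagonal and unequal",
                     A = "zero", B = "identity", U = "zero", x0 = "zero", V0 = diag(10, m))
  MARSS(y, model = model_list, method = "BFGS",
        control = list(maxit = 10000, reltol = 1e-8), silent = TRUE)
}

cl <- makeCluster(parallel::detectCores() - 1)
registerDoParallel(cl)

# Fit m = 1..12 in parallel and collect AICc for model selection
aicc_grid <- foreach(m = 1:12, .combine = rbind, .packages = "MARSS") %dopar% {
  fit <- fit_dfa(dat_z, m)
  data.frame(m = m, AICc = fit$AICc)
}

stopCluster(cl)
aicc_grid[order(aicc_grid$AICc), ]   # lower AICc is better
```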
FACTOR INTERPRETATION
After fitting the m = 6 model, the raw loadings matrix was rotated to improve interpretability.
Promax Oblique Rotation — Unlike Varimax (orthogonal), Promax allows factors to be correlated, which is more realistic for economic time series where underlying forces often co-move. The Promax-rotated loadings matrix was computed using the `psych` and `GPArotation` packages, producing six interpretable trends (a rotation sketch follows the list below):
- Trend 1 — Unemployment dynamics (6.1% variance): Loaded most heavily on labor market indicators, tracking closely with the unemployment rate
- Trend 2 — Commercial real estate activity (1.0% variance): Captured office leasing and commercial property dynamics
- Trend 3 — Business and consumer sentiment (16.5% variance): The dominant trend, driven by CEO confidence indices and consumer sentiment measures — a "vibes" factor capturing the overall economic mood
- Trend 4 — Monetary policy (13.6% variance): Tracked the federal funds rate and interest rate environment, capturing central bank policy impacts on the industry
- Trend 5 — Industry-specific demand indicators (4.4% variance): Loaded on sector-specific demand metrics and trade association indices
- Trend 6 — Professional construction activity (3.1% variance): Captured professional and institutional construction spending, a leading indicator for durable goods demand
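A sketch of the rotation step, continuing from the fitted m = 6 model of the modeling section; base R's `stats::promax` is shown as a stand-in for the psych/GPArotation workflow used in the project, with the inverse rotation applied to the latent trends so loadings and trends stay paired.

```r
library(MARSS)

# Estimated loadings (Z) and latent states from the fitted m = 6 model
Z_est  <- coef(fit, type = "matrix")$Z
trends <- fit$states

# Oblique Promax rotation of the loadings, with the matching transformation of the trends
rot        <- promax(Z_est)
Z_rot      <- Z_est %*% rot$rotmat          # rotated (pattern) loadings
trends_rot <- solve(rot$rotmat) %*% trends  # rotated latent trends
```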
Trend Overlay Validation — Each Promax-rotated trend was plotted overlaid with its highest-loading observed variable. In every case, the latent trend tracked its corresponding economic indicator closely, providing visual confirmation that the factors captured genuine economic dynamics rather than statistical artifacts.
Correlation Heatmap — The correlation matrix of the six Promax-rotated trends was computed and visualized as a heatmap. The trends exhibited near-independence, confirming that each factor captured a distinct economic dimension — exactly the desirable property for dimensionality reduction where each trend should represent a different underlying force.
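A sketch of these two diagnostics, continuing from `Z_rot` and `trends_rot` above; the plotting choices are illustrative.

```r
# Overlay each rotated trend with its highest-loading observed series
par(mfrow = c(3, 2))
for (k in 1:6) {
  top <- which.max(abs(Z_rot[, k]))              # series with the largest |loading| on trend k
  plot(as.numeric(scale(trends_rot[k, ])), type = "l", lwd = 2,
       xlab = "month", ylab = "z-score",
       main = paste("Trend", k, "vs", rownames(dat_z)[top]))
  lines(dat_z[top, ], col = "grey50")            # observed series is already z-scored
}

# Correlation heatmap of the six rotated trends (near-zero off-diagonals expected)
trend_cor <- cor(t(trends_rot))
heatmap(trend_cor, symm = TRUE, Rowv = NA, Colv = NA,
        main = "Correlation of Promax-rotated trends")
```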
PREDICTION VALIDATION
To demonstrate the practical value of DFA dimensionality reduction without exposing proprietary data, a companion Kaggle notebook was created using the publicly available Bike Sales Data of 100K dataset — chosen for its matching monthly time interval and durable goods domain.
The validation compared two prediction approaches using the statsforecast library in Python:
Setup — The 6 Promax-rotated DFA trends (extracted from the proprietary analysis) were joined with monthly aggregated bike sales data. The dataset was split into 21 training months and 6 test months, with both DFA trends and raw variables tested as exogenous inputs to the same forecasting models.
MFLES Model (Multi-Frequency Locally Estimated Scatterplot Smoothing):
- Using 6 DFA trends: MAE = 719,515 | RMSE = 866,562
- Using 41 raw variables: MAE = 798,884 | RMSE = 892,382
- Result: 11% lower MAE and 3% lower RMSE with DFA trends
MSTL Model (Multiple Seasonal-Trend decomposition using LOESS):
- Using 6 DFA trends: MAE = 741,814 | RMSE = 819,393
- Using 41 raw variables: MAE = 741,814 | RMSE = 819,393
- Result: Identical performance — MSTL relies primarily on seasonal decomposition, so exogenous variables had less influence
The key finding: 6 DFA trends produced equal or better predictions than 41 raw variables, while offering dramatically simpler inputs. The MFLES model — which more heavily leverages exogenous variables — showed a clear advantage for DFA trends, likely because the latent trends filtered noise from the raw features and captured the underlying economic signal more cleanly.
KEY TAKEAWAYS
DFA as an underutilized tool — Dynamic Factor Analysis occupies a unique niche in the data science toolkit: it is specifically designed for the intersection of dimensionality reduction and time series analysis where standard PCA/PAF methods fail. Despite its power, DFA is rarely encountered in applied data science, making it a valuable differentiator.
The efficiency argument — Reducing 39 variables to 6 trends means simpler models, faster training, easier interpretability, and lower risk of overfitting. For production forecasting systems where model retraining occurs regularly, this efficiency compounds over time.
The accuracy argument — The Kaggle validation demonstrated that latent trends can outperform raw features by removing noise and capturing the true underlying economic signal. The 11% MAE improvement with MFLES suggests that DFA trends are not just a compression convenience but can genuinely improve predictive power.
Practical applicability — The methodology generalizes to any domain with multivariate time series: finance, healthcare, supply chain, energy, climate science. Any forecasting problem where numerous correlated time series drive an outcome is a candidate for DFA.
Dual-language pipeline — The project demonstrated a cross-language data science workflow: R (MARSS, psych, GPArotation) for the statistical modeling where R's ecosystem excels, and Python (pandas, statsforecast, matplotlib) for the prediction comparison where Python's forecasting libraries are stronger. This pragmatic language choice — using each tool where it is strongest — reflects the reality of applied data science work.
Tech Stack: R (MARSS, psych, GPArotation, doParallel, foreach); Python (pandas, statsforecast, matplotlib)
Team: Steve Meadows
Timeline: 2024–2025