GUN VIOLENCE ANALYSIS
Geospatial Intelligence
Key Metrics
- Incidents analyzed: ~239K
- Geoid recovery: 99.8%
- Welch's t-test: p = 0.033
- Permutation test: p = 0.011
- Bootstrap CI: [11.6, 48.0]
- St. Louis lethality: 36%
OVERVIEW
This project is a comprehensive geospatial and statistical analysis of gun violence in the United States, completed for STA 418/518 — Introduction to Statistics at Grand Valley State University. The primary dataset comes from the Gun Violence Archive (GVA), a nonprofit that collects verified records of gun-related incidents across the U.S. The raw dataset contains approximately 239,000 incidents spanning January 2013 through March 2018, with each record capturing location (state, city, geoid, latitude/longitude), date, number killed, number injured, and dozens of participant-level attributes.
Because 2013 and 2018 are incomplete in the archive (only 279 incidents are recorded for 2013, and 2018 coverage ends in March), the analysis filters to a clean four-year window, 2014–2017, which supports consistent year-over-year comparisons. The archive is also notably missing the Las Vegas mass shooting, which took place in October 2017 rather than in one of the excluded years. To move beyond raw counts, the project integrates U.S. Census Bureau state-level and county-level demographic data (population, median income, poverty rates, rent costs, housing costs, and gender proportions), enabling per-capita normalization and socioeconomic correlation analysis.
The project proceeds through a deliberate analytical arc: data cleaning and missingness recovery, state-level exploratory analysis with per-capita normalization, city-level drill-down with correlation matrices, interactive geospatial visualization via Leaflet, and rigorous inferential statistics — Welch's t-test, permutation testing, and bootstrap confidence intervals — all converging on a central finding about the relationship between poverty and gun violence lethality.
DATA CLEANING
The raw GVA dataset required several preprocessing steps before analysis. Date strings were parsed to datetime format using lubridate, enabling extraction of year and month for temporal analysis. City names were cleaned by stripping parenthetical qualifiers and trimming whitespace to ensure consistent joins with Census data.
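The two cleaning steps described above can be sketched as follows. The project performed them in R with lubridate; this Python version is only an illustration, and the `YYYY-MM-DD` date format, the function names, and the sample strings are assumptions rather than details taken from the project.

```python
import re
from datetime import datetime


def clean_city(name: str) -> str:
    """Drop parenthetical qualifiers such as '(county)' and trim whitespace."""
    without_parens = re.sub(r"\s*\([^)]*\)", "", name)
    # Collapse any leftover runs of whitespace into single spaces.
    return " ".join(without_parens.split())


def year_month(date_string: str) -> tuple[int, int]:
    """Parse a 'YYYY-MM-DD' string (assumed format) into (year, month)."""
    parsed = datetime.strptime(date_string, "%Y-%m-%d")
    return parsed.year, parsed.month


# Hypothetical inputs, for illustration only.
print(clean_city("  Springfield (Example County) "))
print(year_month("2016-07-04"))
```

Normalizing the city string before any join matters because a single stray space or qualifier is enough to make an otherwise identical key fail to match.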
Missingness Analysis — An initial missingness plot (via naniar's gg_miss_var) revealed that latitude, longitude, and geoid values were the most incomplete fields — and their missingness was correlated: rows missing lat/long were almost always missing geoid as well. A state-level missingness heatmap identified Idaho, Indiana, South Dakota, Virginia, and West Virginia as the states with the highest proportions of missing location data.
Geoid Recovery Strategy — The geoid (geographic identifier) is critical for merging incident data with Census Bureau shapefiles and demographic tables. Over 8,000 rows were missing geoid values. Rather than dropping these records, the project implemented a custom imputation approach: for each incident missing a geoid, the system cross-referenced other incidents sharing the same city, state, and year that had valid geoids, and copied the geoid over. This strategy reduced missing geoid values from 8,000+ to just 443 — achieving 99.8% completeness for the geographic identifier. Post-recovery, only Hawaii, Idaho, Vermont, and West Virginia retained missingness above 1%, and these states contribute minimally to the top incident and death counts.
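The recovery idea can be illustrated with a small Python sketch (the project itself worked in R). The record layout, the placeholder geoid strings, and the choice to reuse the first valid geoid found for a given city, state, and year are all assumptions; the write-up does not say how conflicting geoids within a group were resolved.

```python
def recover_geoids(rows):
    """Fill missing 'geoid' values from rows sharing (city, state, year)."""
    known = {}
    for row in rows:
        if row["geoid"] is not None:
            key = (row["city"], row["state"], row["year"])
            # Assumption: keep the first valid geoid seen for each key.
            known.setdefault(key, row["geoid"])

    recovered = []
    for row in rows:
        key = (row["city"], row["state"], row["year"])
        geoid = row["geoid"] if row["geoid"] is not None else known.get(key)
        recovered.append({**row, "geoid": geoid})
    return recovered


# Hypothetical records with placeholder geoids, for illustration only.
incidents = [
    {"city": "City A", "state": "State X", "year": 2016, "geoid": "0000001"},
    {"city": "City A", "state": "State X", "year": 2016, "geoid": None},
    {"city": "City B", "state": "State Y", "year": 2015, "geoid": None},
]
patched = recover_geoids(incidents)
still_missing = sum(row["geoid"] is None for row in patched)
print(still_missing)
```

A record stays missing only when no other incident in the same city, state, and year carries a geoid, which is consistent with a small residue of unrecoverable rows.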
Post-Cleaning Validation: After recovery, a bar plot of remaining missing geoids by state showed that the highest absolute counts of nulls occurred in high-incident states, where a few hundred missing records out of tens of thousands is negligible. The proportional missingness in smaller states was judged too small to affect downstream analysis, since those states contribute little to the top incident and death counts.
STATE-LEVEL ANALYSIS
State-level analysis began with 2017 — the year with the highest incident count — using raw incident totals. The initial bar chart showed Illinois, California, Florida, and Texas dominating, which tracks intuitively with population. But raw counts are misleading.
Per-Capita Normalization — By merging state-level Census population data and computing incidents per 100,000 people, the rankings shifted dramatically. The District of Columbia emerged as an extreme outlier — but DC is a federal district functioning more like a single city than a state, so it was excluded for fair state-to-state comparison. With DC removed, Alaska took the top spot, followed by Delaware — two states rarely associated with gun violence in popular discourse. Illinois, despite its reputation, ranked substantially lower after population adjustment.
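A minimal sketch of the per-capita calculation, written in Python for illustration (the project used R). The state names and figures are hypothetical placeholders, not values from the GVA or Census data.

```python
def rate_per_100k(count: int, population: int) -> float:
    """Convert a raw count into a rate per 100,000 residents."""
    return count / population * 100_000


# Hypothetical (incident count, population) pairs, for illustration only.
states = {
    "State A": (3_000, 12_000_000),
    "State B": (400, 700_000),
}

rates = {name: rate_per_100k(c, p) for name, (c, p) in states.items()}
ranked = sorted(rates, key=rates.get, reverse=True)
print(ranked)
```

In this toy example the state with far fewer incidents ranks first once population is taken into account, which is the same kind of reversal described above.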
Gun Ownership Tangent — Alaska's per-capita dominance prompted investigation into gun ownership rates. Alaska has an estimated 64.5% gun ownership rate (the highest in the nation), suggesting a plausible correlative factor. However, Delaware — ranked 2nd in per-capita incidents — has only 34.4% gun ownership (10th lowest), and Illinois ranks 7th lowest at 27.8%. This inconsistency between ownership rates and incident rates indicated that gun ownership alone is an insufficient explanatory variable, motivating deeper analysis at the city level.
Interactive Leaflet Choropleth: To visualize the geographic distribution, U.S. Census Bureau cartographic boundary shapefiles (cb_2018_us_state_500k, a generalized product derived from TIGER/Line) were loaded via the sf package, merged with the per-capita incident data on a formatted GEOID, transformed to the WGS84 CRS, and rendered as a Leaflet choropleth map. The map uses population-normalized coloring (PuBu palette with quantile-based bins) and HTML labels that display state name, population, raw incident count, and per-capita rate on hover.
Boxplot of Top 15 States (2014–2017) — A horizontal boxplot visualized the range of per-capita incident rates across all four years for the top 15 states, showing which states had consistent rates versus high year-to-year variability. This temporal stability check confirmed that the 2017 rankings were representative of broader trends, not single-year anomalies.
CITY-LEVEL ANALYSIS
Having established the state-level landscape, the analysis drilled down to the city level to uncover more granular patterns. County-level Census data was joined with incident records on geoid, and the top 15 cities by raw incident count in 2017 were identified.
Incidents vs. Deaths Per Capita — Two side-by-side bar charts compared the top 15 cities by incidents per 100K and deaths per 100K. A critical observation emerged: the rankings shifted significantly between the two metrics. St. Louis, MO stood out with a 36% lethality rate — over a third of all gun violence incidents in St. Louis resulted in death in 2017. This was far above the average for other top-incident cities, signaling that something beyond incident frequency was driving death rates in certain urban centers.
Census Feature Engineering — Three derived features were added to the city-level dataset to enrich the correlation analysis:
- rent_percentage: ratio of median annual rent cost to median annual income
- home_percentage: ratio of median annual mortgage cost to median annual income
- prop_deaths: proportion of gun violence incidents producing a death
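The three derived features listed above can be sketched in Python as follows (the project computed them in R). The field names and example values are hypothetical, and the sketch assumes rent and mortgage costs arrive as monthly medians that must be annualized; the write-up does not state the units of the Census inputs.

```python
def derive_features(city: dict) -> dict:
    """Return the three derived ratios for one city record."""
    income = city["median_annual_income"]
    return {
        # Assumption: housing costs are monthly, so multiply by 12.
        "rent_percentage": 12 * city["median_monthly_rent"] / income,
        "home_percentage": 12 * city["median_monthly_mortgage"] / income,
        # Share of incidents in which at least one person was killed.
        "prop_deaths": city["fatal_incidents"] / city["incidents"],
    }


# Hypothetical record, for illustration only.
example_city = {
    "median_annual_income": 48_000,
    "median_monthly_rent": 1_000,
    "median_monthly_mortgage": 1_400,
    "incidents": 100,
    "fatal_incidents": 36,
}
features = derive_features(example_city)
print(features)
```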
Correlation Matrices — Custom correlation matrices were constructed using ggcorrplot, isolating features correlated with two distinct targets: incidents per 100K and deaths per 100K.
For incidents per 100K, the strongest signals came from population, deaths per 100K, and the proportion of deaths — essentially tautological relationships. A small signal emerged from gender demographics (lower male proportion correlating with slightly fewer incidents). Critically, poverty showed near-zero correlation with incident rates.
For deaths per 100K, a starkly different picture emerged. Poverty showed a moderate positive correlation with death rates. The derived cost-of-living features (rent_percentage, home_percentage) mirrored this pattern, which is expected because both are computed from median income, a variable closely tied to the poverty proportion. This divergence between the two correlation profiles became the central analytical thread.
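The contrast between the two correlation profiles can be illustrated with a plain Pearson correlation. The project built its matrices in R with ggcorrplot; the Python below is only a sketch, and the five data points are invented to show one variable that is unrelated to poverty and one that rises with it.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical city-level values, for illustration only.
poverty = [0.10, 0.14, 0.18, 0.22, 0.26]
incidents_per_100k = [55, 45, 50, 45, 55]
deaths_per_100k = [8, 10, 15, 24, 30]

r_incidents = pearson_r(poverty, incidents_per_100k)
r_deaths = pearson_r(poverty, deaths_per_100k)
print(round(r_incidents, 3), round(r_deaths, 3))
```

With only 15 cities in the real analysis, a coefficient of this kind is sensitive to one or two extreme points, which is one reason the write-up follows the correlations with formal tests.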
Plotly Interactive Scatter Plots: Two interactive scatter plots with custom HTML tooltips (city name, rate value, poverty proportion) and linear regression overlays made the poverty–lethality split vivid. The incidents vs. poverty plot showed a nearly flat regression line: among these cities, poverty has almost no predictive power for how often gun violence occurs. The deaths vs. poverty plot showed a steep positive regression line: higher poverty is associated with higher death rates. Four cities above the 20% poverty threshold drove the relationship, motivating formal hypothesis testing.
STATISTICAL INFERENCE
The visual evidence demanded formal validation. The analysis proceeded through a rigorous three-stage inferential pipeline: parametric testing, non-parametric permutation testing, and bootstrap estimation.
Hypothesis — Based on the observed correlation between poverty and death rates, a one-sided hypothesis was formulated:
- H₀: μ_high ≤ μ_low (high-poverty cities have equal or lower death rates)
- Hₐ: μ_high > μ_low (high-poverty cities have higher death rates)
The top 15 cities were divided into low and high poverty groups based on the median poverty proportion threshold. High-poverty cities had a mean death rate of 29.9 per 100K; low-poverty cities averaged 11.2 per 100K.
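A median split of this kind can be sketched in Python as below (the project used R). The city names and poverty values are hypothetical, and assigning the median city itself to the low-poverty group is an assumption, chosen because it yields the 7 high-poverty and 8 low-poverty cities implied by the group sizes in this write-up.

```python
from statistics import median


def split_by_median(poverty_by_city: dict):
    """Split cities into high/low groups around the median poverty value."""
    cutoff = median(poverty_by_city.values())
    # Assumption: a city exactly at the median counts as low poverty.
    high = [city for city, p in poverty_by_city.items() if p > cutoff]
    low = [city for city, p in poverty_by_city.items() if p <= cutoff]
    return cutoff, high, low


# Fifteen hypothetical cities with distinct poverty proportions.
cities = {f"City {i + 1}": 0.10 + 0.01 * i for i in range(15)}
cutoff, high, low = split_by_median(cities)
print(len(high), len(low))
```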
Assumption Checking — Before parametric testing, both groups were assessed for normality via the Shapiro-Wilk test. Low-poverty cities: W = 0.89, p = 0.22 (fail to reject normality). High-poverty cities: W = 0.86, p = 0.16 (fail to reject normality). Both groups passed, though the small sample size (n = 15 total) warranted caution. Variance comparison revealed severe inequality: high-poverty variance = 485.5, low-poverty variance = 15.8 — a ~30:1 ratio, violating the equal-variance assumption and necessitating Welch's correction.
Welch's t-test: Given the unequal variances, a one-sided Welch's t-test yielded t = 2.22, df = 6.34, p = 0.033. The p-value falls below the α = 0.05 threshold, providing statistically significant evidence that high-poverty cities experience higher gun violence death rates. The one-sided 95% confidence interval for the mean difference has a lower bound of 2.49 deaths per 100K.
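The reported statistic can be checked from the summary figures alone. The sketch below applies the standard Welch formulas to the means and variances quoted in this section, with 7 high-poverty cities and, by subtraction from the 15 total, an assumed 8 low-poverty cities. It is written in Python for illustration (the project used R), and because the inputs are rounded the result agrees with the reported t only approximately.

```python
from math import sqrt


def welch_from_summary(mean1, var1, n1, mean2, var2, n2):
    """Welch t-statistic and Welch-Satterthwaite degrees of freedom."""
    se1_sq, se2_sq = var1 / n1, var2 / n2
    t_stat = (mean1 - mean2) / sqrt(se1_sq + se2_sq)
    df = (se1_sq + se2_sq) ** 2 / (
        se1_sq ** 2 / (n1 - 1) + se2_sq ** 2 / (n2 - 1)
    )
    return t_stat, df


# Rounded summary statistics quoted in the text; n2 = 8 is inferred.
t_stat, df = welch_from_summary(29.9, 485.5, 7, 11.2, 15.8, 8)
print(t_stat, df)
```

The degrees of freedom land close to 6 rather than 13 because almost all of the standard error comes from the high-variance group of 7 cities.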
Permutation Test: To validate the parametric result without distributional assumptions, a 1,000-permutation Monte Carlo test (seed = 1986) was conducted. In each permutation, the poverty-level labels were shuffled randomly among the 15 cities and Welch's t-statistic was recomputed. Only 11 of the 1,000 permuted statistics were at least as large as the observed t-statistic, yielding p = 0.011, a smaller p-value than the parametric test produced. A histogram of the null distribution, with the observed t-statistic marked as a dashed vertical line, showed it falling well into the rejection region, past the 95th-percentile critical value.
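A label-shuffling test of this form can be sketched in a few lines of Python. This is an illustration rather than the project's R code: the death rates are hypothetical, and although the seed value is borrowed from the text, Python's generator will not reproduce R's random stream, so the resulting p-value is not expected to match 0.011.

```python
import random
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch t-statistic for the difference in means (a minus b)."""
    return (mean(a) - mean(b)) / sqrt(
        variance(a) / len(a) + variance(b) / len(b)
    )


def permutation_p_value(high, low, n_perm=1000, seed=1986):
    """One-sided Monte Carlo p-value obtained by shuffling group labels."""
    rng = random.Random(seed)
    observed = welch_t(high, low)
    pooled = list(high) + list(low)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        perm_high, perm_low = pooled[:len(high)], pooled[len(high):]
        if welch_t(perm_high, perm_low) >= observed:
            extreme += 1
    return extreme / n_perm


# Hypothetical death rates per 100K, for illustration only.
hypothetical_high = [12.0, 15.5, 20.0, 28.0, 35.0, 44.0, 54.0]
hypothetical_low = [6.3, 8.1, 9.4, 10.8, 11.9, 13.2, 14.6, 16.7]
p_value = permutation_p_value(hypothetical_high, hypothetical_low)
print(p_value)
```

The p-value is simply the share of shuffled statistics at least as large as the observed one, matching the "11 of 1,000" counting described above.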
Bootstrap Confidence Interval: Finally, a 10,000-sample non-parametric bootstrap estimated the sampling distribution of the median death rate in high-poverty cities. Resampling with replacement from the 7 high-poverty city observations produced a distribution of medians, with the 2.5th and 97.5th percentiles defining a 95% confidence interval of [11.6, 48.0] deaths per 100K. This wide interval reflects the high variance within the high-poverty group. Its lower bound sits only slightly above the low-poverty group mean of 11.2, so the interval is consistent with the earlier tests but does not by itself establish a large gap between the groups.
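The percentile bootstrap described above can be sketched as follows. The Python is illustrative only (the project used R); the seven rates are hypothetical, the seed is arbitrary, and the percentile endpoints are taken as simple order statistics of the sorted medians rather than with any particular interpolation rule.

```python
import random
from statistics import median


def bootstrap_median_ci(values, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the median."""
    rng = random.Random(seed)
    medians = sorted(
        median(rng.choices(values, k=len(values))) for _ in range(n_boot)
    )
    # Order statistics approximating the alpha/2 and 1 - alpha/2 percentiles.
    lower = medians[int(n_boot * alpha / 2)]
    upper = medians[int(n_boot * (1 - alpha / 2)) - 1]
    return lower, upper


# Hypothetical death rates per 100K for seven cities.
hypothetical_rates = [12.0, 15.5, 20.0, 28.0, 35.0, 44.0, 54.0]
low_end, high_end = bootstrap_median_ci(hypothetical_rates)
print(low_end, high_end)
```

With an odd sample size such as 7, every bootstrap median is one of the observed values, so the interval endpoints are themselves observed rates and the interval is necessarily coarse.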
KEY FINDINGS
The Central Finding (Poverty Predicts Lethality, Not Incidence): This is the most important result of the analysis. Poverty has near-zero correlation with how often gun violence occurs, but a moderate positive correlation with how often people die from it. The scatter plots make the distinction stark: a flat regression line for incidents vs. poverty, a steep positive line for deaths vs. poverty. This finding is supported by three complementary analyses of the same 15 cities (Welch's t-test p = 0.033, permutation p = 0.011, bootstrap CI [11.6, 48.0]), and it reframes how we think about the relationship between socioeconomic conditions and gun violence outcomes.
The Healthcare Infrastructure Hypothesis: If poverty is not associated with more gun violence but is associated with more gun violence deaths, the mechanism is likely downstream of the incident itself. One plausible explanation, which this analysis does not test directly, is that impoverished communities have fewer hospitals, poorer emergency medical infrastructure, and longer emergency response times. A gunshot wound that is survivable in a well-resourced city becomes lethal in a medically underserved one. This hypothesis, that poverty kills through healthcare deprivation rather than through violence generation, has direct policy implications for where to target intervention resources.
The Gun Ownership Paradox: Many of the states with the deadliest cities had some of the lowest gun ownership rates in the country. Alaska (64.5% ownership) leads in per-capita incidents, but Delaware (34.4%) ranks second per capita and Illinois (27.8%) is among the leaders in raw incident counts, both despite low ownership. Meanwhile, states with high ownership rates don't necessarily appear in the top incident rankings. This inconsistency suggests that gun ownership rate alone is a poor predictor of gun violence, and that the relationship between access, ownership, and violence is far more nuanced than either side of the policy debate typically acknowledges.
Per-Capita Normalization Changes Everything: Raw incident counts are among the most misleading statistics in gun violence discourse. Illinois appears to have catastrophic gun violence until you adjust for its 12.7 million population. Alaska and Delaware, rarely mentioned in gun violence conversations, emerge as the most incident-prone states per capita. DC, as a federal district functioning as a single city, is an extreme outlier that distorts any state-level comparison. The lesson: always normalize by population before drawing conclusions.
Personal Reflection — The author is from Baltimore, Maryland — the 2nd deadliest city in the 2017 data. This personal connection adds authentic perspective: the statistics are not abstract. The analysis concludes with a call for more careful research and targeted interventions, arguing that treating gun violence deaths as the output variable (rather than addressing the input conditions — poverty, healthcare access, community infrastructure) perpetuates the cycle rather than breaking it.
Details
- Team: Steve Meadows
- Course: STA 418/518, Intro to Statistics
- Timeline: Summer 2024