A Temporal Analysis of Deaths in 122 U.S. Cities

Author: Christian Omar Rosado
GitHub: cor215
 
Introduction and Research Question:
 
Because of my interest in public health and healthcare informatics, I decided to explore the public health records of major U.S. cities.
 
During my search, I found a dataset containing deaths over time for 122 U.S. cities (data.gov). High-level exploration of this dataset sparked the question: “Do trends in deaths exist, and can we tie them to significant causes?”
 
Upon further exploration of the dataset, a visualization of deaths over time by age group allowed me to identify a spike in deaths during the 1980s and 1990s among people ages 25 to 44.
 
Within this subgroup, death counts varied by city. Some cities showed deaths increasing or decreasing over time, some showed sudden spikes and drops, and some remained steady.
 
In this project, I sought to find the significant trends in deaths over time for these 122 U.S. cities and any relationships to significant causes.    
 
Methodology:
 
The methods used in this study include time series event detection via a three-sigma threshold (three standard deviations above and below the mean of deaths for each city) and identifying similarities between cities and their death counts via K-Means and DBSCAN clustering. 

Exploratory Phase: 

I began by visualizing deaths over time for each city. Preliminary findings included two cities with spikes and drops in deaths and one outlier city with much higher death counts, which I identified as New York City. After some research, I learned that New York City has the largest population in the U.S. (roughly twice the size of the next largest U.S. city). The gold line in Figure 1 shows deaths over time for New York City. I concluded that its higher death counts relative to the other cities are reasonable given its population size.
 
Figure 1: Deaths over time for 122 U.S. cities
 
The two spikes shown in Figure 1 above belong to Houston and Saint Louis. I researched both cities for the periods in which the spikes occurred and did not find any major causes of death reported in news articles. I therefore concluded that the spikes were most likely due to data collection errors or to random, independent causes.
 
Figure 3: Deaths over time for Houston

Figure 4: Deaths over time for Saint Louis
 
Moving forward, I visualized deaths over time for each age group and discovered an interesting spike in deaths for the 25 to 44 age group.
 
Figure 5:  Deaths over time by age group
 
Zooming into the 25 to 44 age group subset, I confirmed that the spike in deaths was significant using a three-sigma threshold, plotted in red and green and computed from the mean and standard deviation of recorded deaths from 1962 to 1985. The spike is shown below in the shaded red region, spanning the mid-80s through the late 90s. I decided to make this subset the focus of my project.
 
Figure 6: A significant spike in deaths over time for age group 25 to 44.
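
The three-sigma check described above can be sketched as follows. This is a minimal illustration in plain Python, not the notebook's code: the function name `three_sigma_events`, the yearly granularity, and the death counts are all invented for the example. Only the 1962 to 1985 baseline window comes from the analysis itself.

```python
from statistics import mean, stdev

def three_sigma_events(years, deaths, baseline_start=1962, baseline_end=1985):
    """Flag observations outside mean +/- 3 standard deviations of a baseline period.

    Hypothetical helper: the thresholds are computed only from the baseline
    window, then applied to the whole series.
    """
    baseline = [d for y, d in zip(years, deaths)
                if baseline_start <= y <= baseline_end]
    mu = mean(baseline)
    sigma = stdev(baseline)  # sample standard deviation of the baseline
    lower, upper = mu - 3 * sigma, mu + 3 * sigma
    flagged = [y for y, d in zip(years, deaths) if d < lower or d > upper]
    return lower, upper, flagged

# Made-up counts for illustration only (NOT the real data):
# a stable baseline, a ten-year spike, then a return to baseline levels.
years = list(range(1962, 2000))
deaths = [100 + (y % 3) for y in range(1962, 1986)] + [150] * 10 + [101] * 4

lower, upper, flagged = three_sigma_events(years, deaths)
print(f"thresholds: {lower:.1f} to {upper:.1f}")
print("flagged years:", flagged)
```

Computing the thresholds from an earlier baseline, rather than from the full series, keeps the spike itself from inflating the mean and standard deviation it is being compared against.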
 
Preprocessing Data of People Ages 25-44 Years Old for Analysis:

To perform a clustering analysis on the 122 time series, I normalized each series to a scale of 0 to 1, dropped NaN values (empty cells), and structured the cleaned values into a pandas data frame that clustering algorithms can easily process.
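
A rough sketch of this preprocessing step is shown below, written in plain Python so that it runs without any dependencies. The function name `min_max_normalize`, the city names, and the values are hypothetical, and the order of operations (dropping missing values before rescaling) is an assumption.

```python
import math

def is_missing(value):
    """Treat None and float NaN as empty cells."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def min_max_normalize(series):
    """Drop missing values, then rescale the remaining values to the range 0 to 1."""
    clean = [v for v in series if not is_missing(v)]
    if not clean:
        return []
    lo, hi = min(clean), max(clean)
    if hi == lo:
        return [0.0 for _ in clean]  # constant series: avoid division by zero
    return [(v - lo) / (hi - lo) for v in clean]

# Hypothetical raw counts (NOT the real data), with empty cells.
raw = {
    "City A": [10, 20, float("nan"), 30],
    "City B": [500, None, 400, 450],
}
normalized = {city: min_max_normalize(values) for city, values in raw.items()}
print(normalized)
```

Scaling each city independently puts a very large city and a small one on the same 0 to 1 range, so a clustering algorithm compares the shape of each trend rather than the absolute number of deaths. Note that dropping values shortens a series, so in practice the series would need to stay aligned on the same time points before clustering.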
 
Analysis:
 
Clustering All Deaths for All Cities 
 
Figure 7: Clusters of deaths over time for all cities and all age groups
 
This cluster analysis resulted in three trends (clusters): 1) deaths increasing over time, 2) deaths decreasing over time, and 3) deaths remaining steady over time.
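
To illustrate how K-Means could separate normalized series into trends like these, here is a minimal sketch of the algorithm in plain Python. It is not the notebook's code. The series are invented, and the deterministic farthest-point initialization is an assumption made so the example is reproducible (library implementations typically use random or k-means++ starts).

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length series."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def farthest_point_init(series, k):
    """Start from the first series, then repeatedly add the series farthest from all chosen centroids."""
    centroids = [list(series[0])]
    while len(centroids) < k:
        nxt = max(series, key=lambda s: min(sq_dist(s, c) for c in centroids))
        centroids.append(list(nxt))
    return centroids

def kmeans(series, k, max_iter=100):
    """Lloyd's algorithm: alternate nearest-centroid assignment and centroid (mean) updates."""
    centroids = farthest_point_init(series, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: sq_dist(s, centroids[c]))
                      for s in series]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [s for s, lab in zip(series, labels) if lab == c]
            if members:
                # Each centroid is the point-by-point mean of its member series.
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical normalized series (NOT the real data): rising, falling, steady.
series = [
    [0.0, 0.25, 0.5, 0.75, 1.0],   # rising
    [1.0, 0.75, 0.5, 0.25, 0.0],   # falling
    [0.5, 0.5, 0.5, 0.5, 0.5],     # steady
    [0.0, 0.2, 0.5, 0.8, 1.0],     # rising
    [1.0, 0.8, 0.5, 0.2, 0.0],     # falling
    [0.5, 0.55, 0.5, 0.45, 0.5],   # steady
]
labels, centroids = kmeans(series, k=3)
print(labels)
```

The returned centroids are the per-cluster mean series, which is the kind of summary a cluster-means plot displays.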
 
Clustering City Deaths from Ages 25-44 
 
Figure 8: Clusters of deaths over time for all cities, ages 25 to 44
 
The cluster analysis above resulted in four trends (clusters). The red cluster clearly shows the spike in deaths identified in Figure 6.
 
Figure 9: Cluster means for deaths over time for all cities and all age groups
 
The cluster means below clearly show four trends in city deaths. Some cities recovered from the spike of the 80s and 90s, and some did not.
 
Figure 10: Cluster means for deaths over time for ages 25 to 44
 
Conclusion:
 
After researching the 80s and 90s, I learned that U.S. cities were plagued by high crime rates, drug use, and the HIV epidemic. These three factors were major contributors to high death counts during those two decades. In response, President Bill Clinton signed the 1994 Crime Bill (Violent Crime Control and Law Enforcement Act of 1994) into law.

Via DBSCAN clustering, I identified the top two cities that still suffer from high death counts relative to their own death counts in the 80s and 90s. The cities identified by the clustering algorithm belong to cluster 2.0 in Figure 10. They are Tacoma and Salt Lake City, and their deaths over time are visualized below in Figures 11 and 12 respectively.
 
Figure 11: Deaths over time for Tacoma, ages 25 to 44
 
Figure 12: Deaths over time for Salt Lake City, ages 25 to 44
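
For readers unfamiliar with DBSCAN, the sketch below shows the core of the algorithm in plain Python: series with enough close neighbors form clusters, and isolated series are labeled as noise (-1). It is an illustration under stated assumptions, not the notebook's code. The `eps` and `min_samples` values and the example series are invented, and the parameters used in the actual analysis are not given here.

```python
import math

NOISE = -1

def region_query(points, i, eps):
    """Indices of all points within eps of point i (including i itself)."""
    return [j for j in range(len(points))
            if math.dist(points[i], points[j]) <= eps]

def dbscan(points, eps, min_samples):
    """Minimal DBSCAN: grow a cluster outward from each unvisited core point."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins the cluster, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
    return labels

# Hypothetical normalized series (NOT the real data).
points = [
    [0.0, 0.25, 0.5, 0.75, 1.0],  # rising
    [0.0, 0.2, 0.5, 0.8, 1.0],    # rising
    [1.0, 0.75, 0.5, 0.25, 0.0],  # falling
    [1.0, 0.8, 0.5, 0.2, 0.0],    # falling
    [0.0, 1.0, 0.0, 1.0, 0.0],    # unlike any other series
]
print(dbscan(points, eps=0.3, min_samples=2))
```

Unlike K-Means, DBSCAN does not need the number of clusters in advance and does not force every series into a cluster, which makes it suited to singling out a small group of cities that behave differently from the rest.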

Limitations: 
 
Given data availability constraints and the large number of cities I worked with, I did not analyze crime, drug, and infectious disease data in relation to death counts. I would also have loved to explore relationships between population size, income, and deaths over time.

In future analysis, crime, drug, and infectious disease data could be explored further to identify correlations with the spike in deaths during the 80s and 90s. I would also recommend investigating relationships between population size, income, and deaths over time by city and age group.
 
Data Source:

Violent Crime Control and Law Enforcement Act of 1994:
 
Code (iPython Notebook):