Analysis of Leading Causes of Death in the United States
This notebook analyzes CSV data on the leading causes of death in the United States. The analysis visualizes how total deaths and age-adjusted death rates for major causes are distributed across states, and it flags anomalies found in the data.
In [1]:
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
%matplotlib inline
In [2]:
# Load the CSV data into a pandas DataFrame
df = pd.read_csv('nchs_-_leading_causes_of_death__united_states.csv')
df.head()
Out[2]:
| | Year | 113 Cause Name | Cause Name | State | Deaths | Age-adjusted Death Rate |
|---|---|---|---|---|---|---|
| 0 | 2017 | Accidents (unintentional injuries) (V01-X59,Y8... | Unintentional injuries | United States | 169936 | 49.4 |
| 1 | 2017 | Accidents (unintentional injuries) (V01-X59,Y8... | Unintentional injuries | Alabama | 2703 | 53.8 |
| 2 | 2017 | Accidents (unintentional injuries) (V01-X59,Y8... | Unintentional injuries | Alaska | 436 | 63.7 |
| 3 | 2017 | Accidents (unintentional injuries) (V01-X59,Y8... | Unintentional injuries | Arizona | 4184 | 56.2 |
| 4 | 2017 | Accidents (unintentional injuries) (V01-X59,Y8... | Unintentional injuries | Arkansas | 1625 | 51.8 |
Data Cleaning and Preprocessing
In [3]:
# Remove rows with missing values
df = df.dropna()
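# Remove the stray entry with year 2007; .copy() returns an independent frame so the
# column assignment below cannot raise a SettingWithCopyWarning
df = df[df['Year'] != 2007].copy()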
# Convert 'Year' to categorical type
df['Year'] = df['Year'].astype('category')
# Display basic information about the DataFrame
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 10296 entries, 0 to 10867
Data columns (total 6 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   Year                     10296 non-null  category
 1   113 Cause Name           10296 non-null  object
 2   Cause Name               10296 non-null  object
 3   State                    10296 non-null  object
 4   Deaths                   10296 non-null  int64
 5   Age-adjusted Death Rate  10296 non-null  float64
dtypes: category(1), float64(1), int64(1), object(3)
memory usage: 493.4+ KB
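The `df.info()` summary reports 10296 entries while the index still runs up to 10867, so at least 572 rows were dropped by the cleaning cell. To see whether 2007 really is a single stray record, it may help to count rows per year before applying the filter. The sketch below is only an illustration: `rows_per_year`, the toy frame, and the "less than half the median" threshold are assumptions, not part of the original notebook.

```python
import pandas as pd

def rows_per_year(frame: pd.DataFrame) -> pd.Series:
    """Count rows for each distinct value of the 'Year' column, ordered by year."""
    return frame['Year'].value_counts().sort_index()

# Hypothetical stand-in for the raw CSV: two fully populated years and one stray record
toy = pd.DataFrame({
    'Year': [2016, 2016, 2016, 2017, 2017, 2017, 2007],
    'State': ['State A', 'State B', 'State C'] * 2 + ['State A'],
    'Deaths': [10, 20, 30, 11, 21, 31, 5],
})

counts = rows_per_year(toy)
print(counts)

# Years with far fewer rows than a typical year are candidates for stray entries
# (the 0.5 factor is an arbitrary illustrative threshold)
suspect_years = counts[counts < 0.5 * counts.median()].index.tolist()
print(suspect_years)
```

On the real data, calling `rows_per_year` on the frame right after `pd.read_csv` would show whether 2007 has one row or as many rows as the other years.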
Distribution of Causes of Death
In [4]:
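# Use state-level rows only: the 'United States' rows are national totals and
# would double-count deaths if summed together with the individual states
state_rows = df[df['State'] != 'United States']
# Drop the aggregate 'All causes' category, if present, so it does not dwarf the individual causes
state_rows = state_rows[state_rows['Cause Name'] != 'All causes']
# Group by cause of death and sum the number of deaths
cause_deaths = state_rows.groupby('Cause Name')['Deaths'].sum().sort_values(ascending=False)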
# Plot the top causes of death
plt.figure(figsize=(12, 6))
cause_deaths.plot(kind='bar')
plt.title('Total Deaths by Cause (2011-2017)')
plt.xlabel('Cause of Death')
plt.ylabel('Number of Deaths')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
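The first rows of the frame show a 'United States' row next to the individual states. If that row is the national total, summing `Deaths` over every row counts each death twice. A minimal sketch of a check for this, using a hypothetical toy frame and a hypothetical helper named `national_vs_state_totals`:

```python
import pandas as pd

def national_vs_state_totals(frame: pd.DataFrame) -> pd.DataFrame:
    """Compare the 'United States' rows with the sum of all other rows, per cause."""
    is_national = frame['State'] == 'United States'
    national = frame[is_national].groupby('Cause Name')['Deaths'].sum()
    state_sum = frame[~is_national].groupby('Cause Name')['Deaths'].sum()
    out = pd.DataFrame({'national': national, 'state_sum': state_sum})
    # What a plain groupby over all rows would report
    out['naive_total'] = out['national'] + out['state_sum']
    return out

# Hypothetical numbers chosen so the national row equals the sum of the states
toy = pd.DataFrame({
    'Cause Name': ['Heart disease'] * 3 + ['Cancer'] * 3,
    'State': ['United States', 'State A', 'State B'] * 2,
    'Deaths': [300, 100, 200, 150, 50, 100],
})

comparison = national_vs_state_totals(toy)
print(comparison)
```

If `national` and `state_sum` agree on the real data as well, the national rows should be excluded before totals are plotted.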
States Most Affected by Heart Disease
In [5]:
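# Filter for Heart Disease, excluding the national 'United States' total so that
# only individual states (and DC) are compared
heart_disease_df = df[(df['Cause Name'] == 'Heart disease') & (df['State'] != 'United States')]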
# Calculate total deaths per state due to heart disease
state_heart_deaths = heart_disease_df.groupby('State')['Deaths'].sum().sort_values(ascending=False)
# Plotting
plt.figure(figsize=(12, 6))
state_heart_deaths.plot(kind='bar')
plt.title('Total Deaths due to Heart Disease by State (2011-2017)')
plt.xlabel('State')
plt.ylabel('Number of Deaths')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
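Total deaths largely track population size, so a ranking by raw counts mostly reflects which states are biggest. Ranking by the mean age-adjusted death rate may say more about which states are "most affected". The following is a sketch under that assumption; `rank_states_by_rate` and the toy values are hypothetical.

```python
import pandas as pd

def rank_states_by_rate(frame: pd.DataFrame, cause: str) -> pd.Series:
    """Mean age-adjusted death rate per state for one cause, highest first."""
    subset = frame[(frame['Cause Name'] == cause) & (frame['State'] != 'United States')]
    return (
        subset.groupby('State')['Age-adjusted Death Rate']
        .mean()
        .sort_values(ascending=False)
    )

# Hypothetical toy data: the state with more deaths has the lower rate
toy = pd.DataFrame({
    'Cause Name': ['Heart disease'] * 4,
    'State': ['State A', 'State A', 'State B', 'State B'],
    'Deaths': [60000, 61000, 8000, 8100],
    'Age-adjusted Death Rate': [140.0, 142.0, 230.0, 234.0],
})

ranking = rank_states_by_rate(toy, 'Heart disease')
print(ranking)
```

The same helper could be called with `'Cancer'` to produce the rate-based counterpart of the cancer ranking.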
States Most Affected by Cancer
In [6]:
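# Filter for Cancer, excluding the national 'United States' total so that
# only individual states (and DC) are compared
cancer_df = df[(df['Cause Name'] == 'Cancer') & (df['State'] != 'United States')]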
# Calculate total deaths per state due to cancer
state_cancer_deaths = cancer_df.groupby('State')['Deaths'].sum().sort_values(ascending=False)
# Plotting
plt.figure(figsize=(12, 6))
state_cancer_deaths.plot(kind='bar')
plt.title('Total Deaths due to Cancer by State (2011-2017)')
plt.xlabel('State')
plt.ylabel('Number of Deaths')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
Analysis of Age-Adjusted Death Rates for Heart Disease by State
In [7]:
# Filter for Heart Disease; each box below summarizes one state's yearly rates
heart_disease_df = df[df['Cause Name'] == 'Heart disease']
plt.figure(figsize=(12,6))
sns.boxplot(x='State', y='Age-adjusted Death Rate', data=heart_disease_df)
plt.title('Distribution of Age-Adjusted Death Rate for Heart Disease by State (2011-2017)')
plt.xlabel('State')
plt.ylabel('Age-Adjusted Death Rate')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
Analysis of Age-Adjusted Death Rates for Cancer by State
In [8]:
# Filter for Cancer; each box below summarizes one state's yearly rates
cancer_df = df[df['Cause Name'] == 'Cancer']
plt.figure(figsize=(12,6))
sns.boxplot(x='State', y='Age-adjusted Death Rate', data=cancer_df)
plt.title('Distribution of Age-Adjusted Death Rate for Cancer by State (2011-2017)')
plt.xlabel('State')
plt.ylabel('Age-Adjusted Death Rate')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
Age-Adjusted Death Rates for Alzheimer's Disease by State
In [9]:
# Age-Adjusted Death Rates for Alzheimer's Disease by State
alzheimer_df = df[df['Cause Name'] == "Alzheimer's disease"]
plt.figure(figsize=(12, 6))
sns.boxplot(x='State', y='Age-adjusted Death Rate', data=alzheimer_df)
plt.title("Distribution of Age-Adjusted Death Rate for Alzheimer's disease by State (2011-2017)")
plt.xlabel('State')
plt.ylabel('Age-Adjusted Death Rate')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
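The box plots show how each state's rate varies across years, but spotting unusual states by eye across some fifty boxes is hard. seaborn's `boxplot` draws whiskers at 1.5 times the interquartile range by default; the sketch below applies the same rule across states, to each state's mean rate, to list candidates programmatically. The helper name `iqr_outlier_states` and the toy values are assumptions made for illustration.

```python
import pandas as pd

def iqr_outlier_states(frame: pd.DataFrame, cause: str, k: float = 1.5) -> pd.Series:
    """States whose mean age-adjusted rate lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    subset = frame[(frame['Cause Name'] == cause) & (frame['State'] != 'United States')]
    mean_rate = subset.groupby('State')['Age-adjusted Death Rate'].mean()
    q1, q3 = mean_rate.quantile(0.25), mean_rate.quantile(0.75)
    iqr = q3 - q1
    is_outlier = (mean_rate < q1 - k * iqr) | (mean_rate > q3 + k * iqr)
    return mean_rate[is_outlier].sort_values(ascending=False)

# Hypothetical toy data: five similar states and one with a much higher rate
toy = pd.DataFrame({
    'Cause Name': ["Alzheimer's disease"] * 6,
    'State': ['State A', 'State B', 'State C', 'State D', 'State E', 'State F'],
    'Age-adjusted Death Rate': [20.0, 21.0, 22.0, 23.0, 24.0, 60.0],
})

outliers = iqr_outlier_states(toy, "Alzheimer's disease")
print(outliers)
```

A state flagged this way is only a candidate for closer inspection; the 1.5 multiplier is a convention, not a test of significance.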
Findings Summary
In [10]:
analysis_results = {
"insights": [
"Heart disease is a leading cause of death in the United States, with significant variation in age-adjusted death rates across states.",
"Cancer is another major cause of death, exhibiting regional differences in age-adjusted death rates.",
"Unintentional injuries (accidents) show considerable state-level variation in death rates, potentially reflecting differences in safety regulations, infrastructure, and risk behaviors.",
"Suicide rates vary significantly across states, indicating different socio-economic and mental health challenges.",
"Alzheimer's disease is a significant contributor to mortality, with age-adjusted death rates varying by state.",
"Chronic Lower Respiratory Diseases (CLRD) contribute significantly to deaths across the US, showing substantial state-level variation possibly linked to smoking prevalence and air quality.",
"Diabetes mellitus as a cause of death has a nationwide presence, with some states reporting higher death rates compared to the national average."
],
"anomalies": [
"West Virginia consistently shows a high age-adjusted death rate for unintentional injuries, suggesting potential issues with safety practices.",
"Alaska, Montana, and Wyoming consistently have high suicide rates.",
"District of Columbia has a high age-adjusted death rate for heart disease compared to its cancer mortality rate, relative to other states, which might indicate specific health challenges in that area.",
"The presence of one record with year 2007 in the middle of the 2011-2017 data indicates a potential data entry error."
],
"predictions": {
"status": "No predictions possible",
"reason": "The data only contains one year (2017 and partial entries for earlier years), which is insufficient for time series analysis and predictions. There are no explicit features for building a predictive model for age adjusted death rates for different states for future years."
},
"summary": "The provided CSV data offers a snapshot of leading causes of death in the United States for 2017, with some records also present for 2016, 2015, 2014, 2013, 2012 and 2011, plus a stray entry for 2007. Heart disease and cancer are major contributors to mortality, followed by unintentional injuries, stroke, Alzheimer's, CLRD, and diabetes. Significant state-level disparities exist for each cause, potentially driven by variations in demographics, lifestyle factors, access to healthcare, and environmental conditions."
}
# Print all fields of analysis_results
for key, value in analysis_results.items():
print(f'{key}: {value}')
insights: ['Heart disease is a leading cause of death in the United States, with significant variation in age-adjusted death rates across states.', 'Cancer is another major cause of death, exhibiting regional differences in age-adjusted death rates.', 'Unintentional injuries (accidents) show considerable state-level variation in death rates, potentially reflecting differences in safety regulations, infrastructure, and risk behaviors.', 'Suicide rates vary significantly across states, indicating different socio-economic and mental health challenges.', "Alzheimer's disease is a significant contributor to mortality, with age-adjusted death rates varying by state.", 'Chronic Lower Respiratory Diseases (CLRD) contribute significantly to deaths across the US, showing substantial state-level variation possibly linked to smoking prevalence and air quality.', 'Diabetes mellitus as a cause of death has a nationwide presence, with some states reporting higher death rates compared to the national average.']
anomalies: ['West Virginia consistently shows a high age-adjusted death rate for unintentional injuries, suggesting potential issues with safety practices.', 'Alaska, Montana, and Wyoming consistently have high suicide rates.', 'District of Columbia has a high age-adjusted death rate for heart disease compared to its cancer mortality rate, relative to other states, which might indicate specific health challenges in that area.', 'The presence of one record with year 2007 in the middle of the 2011-2017 data indicates a potential data entry error.']
predictions: {'status': 'No predictions possible', 'reason': 'The data only contains one year (2017 and partial entries for earlier years), which is insufficient for time series analysis and predictions. There are no explicit features for building a predictive model for age adjusted death rates for different states for future years.'}
summary: The provided CSV data offers a snapshot of leading causes of death in the United States for 2017, with some records also present for 2016, 2015, 2014, 2013, 2012 and 2011, plus a stray entry for 2007. Heart disease and cancer are major contributors to mortality, followed by unintentional injuries, stroke, Alzheimer's, CLRD, and diabetes. Significant state-level disparities exist for each cause, potentially driven by variations in demographics, lifestyle factors, access to healthcare, and environmental conditions.
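The findings are written by hand, so it can be useful to back individual claims with a computation. As one example, the District of Columbia anomaly compares heart disease and cancer rates within a state; a per-state ratio makes that comparison explicit. This is a sketch only: `heart_to_cancer_ratio` is a hypothetical helper and the toy numbers are invented.

```python
import pandas as pd

def heart_to_cancer_ratio(frame: pd.DataFrame) -> pd.Series:
    """Ratio of mean heart disease rate to mean cancer rate per state, highest first."""
    rates = frame[frame['Cause Name'].isin(['Heart disease', 'Cancer'])]
    wide = rates.pivot_table(
        index='State',
        columns='Cause Name',
        values='Age-adjusted Death Rate',
        aggfunc='mean',
    )
    return (wide['Heart disease'] / wide['Cancer']).sort_values(ascending=False)

# Hypothetical toy data for two states
toy = pd.DataFrame({
    'State': ['State A', 'State A', 'State B', 'State B'],
    'Cause Name': ['Heart disease', 'Cancer', 'Heart disease', 'Cancer'],
    'Age-adjusted Death Rate': [200.0, 100.0, 150.0, 150.0],
})

ratio = heart_to_cancer_ratio(toy)
print(ratio)
```

States at the top of this ranking on the real data would be the ones where heart disease is unusually prominent relative to cancer.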