COVID-19 Deaths Analysis by Sex and Age
This notebook analyzes provisional COVID-19 death counts by sex and age group for the United States and its individual states. The goal is to explore patterns and potential anomalies in COVID-19 mortality across these demographic factors.
# Import Libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
%matplotlib inline
Load Data
# Load CSV Data
try:
    df = pd.read_csv('provisional_covid-19_deaths_by_sex_and_age.csv')
    print("Data loaded successfully!")
except FileNotFoundError:
    print("Error: CSV file not found. Please ensure 'provisional_covid-19_deaths_by_sex_and_age.csv' is in the correct directory.")
    # To keep the notebook runnable without the file, fall back to an empty DataFrame
    # with the expected columns (remove in real use)
    df = pd.DataFrame(columns=[
        'Data As Of', 'Start Date', 'End Date', 'Group', 'Year', 'Month', 'State', 'Sex', 'Age Group',
        'COVID-19 Deaths', 'Total Deaths', 'Pneumonia Deaths', 'Pneumonia and COVID-19 Deaths', 'Influenza Deaths',
        'Pneumonia, Influenza, or COVID-19 Deaths', 'Footnote'
    ])
except Exception as e:
    print(f"An error occurred while loading the data: {e}")
    df = None  # Later cells check for None and skip their work
Data loaded successfully!
Data Cleaning and Preprocessing
# Data Cleaning
if df is not None:
    # Coerce the count columns to numeric; entries that cannot be parsed become NaN
    cols_to_numeric = ['COVID-19 Deaths', 'Total Deaths', 'Pneumonia Deaths', 'Pneumonia and COVID-19 Deaths', 'Influenza Deaths', 'Pneumonia, Influenza, or COVID-19 Deaths']
    for col in cols_to_numeric:
        df[col] = pd.to_numeric(df[col], errors='coerce')
    # Impute NaN values in the numeric columns with 0 (a simplification; consider other imputation methods if appropriate)
    df[cols_to_numeric] = df[cols_to_numeric].fillna(0)
    # Keep only rows without a Footnote (a footnote indicates data suppression in that row),
    # then drop the Footnote column, which is empty after filtering
    df = df[df['Footnote'].isnull()]
    df = df.drop(columns=['Footnote'])
    print("Data cleaning and preprocessing complete.")
else:
    print("Data cleaning skipped due to loading errors.")
Data cleaning and preprocessing complete.
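Filling missing counts with 0 treats an unknown value as a true zero, which can pull averages down. The sketch below illustrates the effect on a hypothetical three-row frame; the values and the footnote text are invented for illustration and are not taken from the CSV.

```python
import pandas as pd

# Hypothetical toy rows (illustrative only, not from the CDC file).
toy = pd.DataFrame({
    "COVID-19 Deaths": ["12", None, "0"],
    "Footnote": [None, "suppressed (placeholder footnote text)", None],
})

# Same coercion as in the cleaning cell: unparseable or missing entries become NaN.
toy["COVID-19 Deaths"] = pd.to_numeric(toy["COVID-19 Deaths"], errors="coerce")

n_missing = int(toy["COVID-19 Deaths"].isna().sum())       # rows that fillna(0) would overwrite
mean_with_zeros = toy["COVID-19 Deaths"].fillna(0).mean()  # missing treated as 0
mean_ignoring_nan = toy["COVID-19 Deaths"].mean()          # missing left out (pandas default)

print(n_missing, mean_with_zeros, mean_ignoring_nan)
```

Leaving the NaN values in place and letting pandas skip them is one alternative when the missing counts are not known to be zero.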
Data Overview
if df is not None:
    print("DataFrame Head:")
    print(df.head())
    print("\nDataFrame Information:")
    print(df.info())
    print("\nDataFrame Description:")
    print(df.describe())
else:
    print("Data overview skipped due to loading errors.")
DataFrame Head:
   Data As Of  Start Date    End Date     Group  Year  Month          State  \
0  09/27/2023  01/01/2020  09/23/2023  By Total   NaN    NaN  United States
1  09/27/2023  01/01/2020  09/23/2023  By Total   NaN    NaN  United States
2  09/27/2023  01/01/2020  09/23/2023  By Total   NaN    NaN  United States
3  09/27/2023  01/01/2020  09/23/2023  By Total   NaN    NaN  United States
4  09/27/2023  01/01/2020  09/23/2023  By Total   NaN    NaN  United States

         Sex     Age Group  COVID-19 Deaths  Total Deaths  Pneumonia Deaths  \
0  All Sexes      All Ages        1146774.0    12303399.0         1162844.0
1  All Sexes  Under 1 year            519.0       73213.0            1056.0
2  All Sexes    0-17 years           1696.0      130970.0            2961.0
3  All Sexes     1-4 years            285.0       14299.0             692.0
4  All Sexes    5-14 years            509.0       22008.0             818.0

   Pneumonia and COVID-19 Deaths  Influenza Deaths  \
0                       569264.0           22229.0
1                           95.0              64.0
2                          424.0             509.0
3                           66.0             177.0
4                          143.0             219.0

   Pneumonia, Influenza, or COVID-19 Deaths
0                                 1760095.0
1                                    1541.0
2                                    4716.0
3                                    1079.0
4                                    1390.0

DataFrame Information:
<class 'pandas.core.frame.DataFrame'>
Index: 39804 entries, 0 to 137696
Data columns (total 15 columns):
 #   Column                                    Non-Null Count  Dtype
---  ------                                    --------------  -----
 0   Data As Of                                39804 non-null  object
 1   Start Date                                39804 non-null  object
 2   End Date                                  39804 non-null  object
 3   Group                                     39804 non-null  object
 4   Year                                      38499 non-null  float64
 5   Month                                     34947 non-null  float64
 6   State                                     39804 non-null  object
 7   Sex                                       39804 non-null  object
 8   Age Group                                 39804 non-null  object
 9   COVID-19 Deaths                           39804 non-null  float64
 10  Total Deaths                              39804 non-null  float64
 11  Pneumonia Deaths                          39804 non-null  float64
 12  Pneumonia and COVID-19 Deaths             39804 non-null  float64
 13  Influenza Deaths                          39804 non-null  float64
 14  Pneumonia, Influenza, or COVID-19 Deaths  39804 non-null  float64
dtypes: float64(8), object(7)
memory usage: 4.9+ MB
None

DataFrame Description:
               Year         Month  COVID-19 Deaths  Total Deaths  \
count  38499.000000  34947.000000     3.980400e+04  3.980400e+04
mean    2021.225694      6.484419     6.735513e+02  7.215437e+03
std        1.029564      3.412469     9.396709e+03  9.666126e+04
min     2020.000000      1.000000     0.000000e+00  0.000000e+00
25%     2020.000000      4.000000     0.000000e+00  2.200000e+01
50%     2021.000000      7.000000     2.800000e+01  3.090000e+02
75%     2022.000000      9.000000     1.090000e+02  1.357000e+03
max     2023.000000     12.000000     1.146774e+06  1.230340e+07

       Pneumonia Deaths  Pneumonia and COVID-19 Deaths  Influenza Deaths  \
count      3.980400e+04                   39804.000000      39804.000000
mean       6.814865e+02                     336.652924         13.745905
std        9.340308e+03                    4733.412587        184.363712
min        0.000000e+00                       0.000000          0.000000
25%        0.000000e+00                       0.000000          0.000000
50%        3.100000e+01                      13.000000          0.000000
75%        1.190000e+02                      52.000000          0.000000
max        1.162844e+06                  569264.000000      22229.000000

       Pneumonia, Influenza, or COVID-19 Deaths
count                              3.980400e+04
mean                               1.030559e+03
std                                1.416063e+04
min                                0.000000e+00
25%                                0.000000e+00
50%                                4.700000e+01
75%                                1.800000e+02
max                                1.760095e+06
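In the overview, Year is non-null in 38,499 of 39,804 rows and Month in 34,947, while the 'By Total' rows in the head have NaN for both. That pattern suggests the file stacks several aggregation levels in the Group column. A minimal sketch of how one might confirm this follows; the toy rows and the 'By Year' and 'By Month' labels are assumptions for illustration and should be checked against the actual values of df['Group'].

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows mimicking the assumed layout of the Group column.
toy = pd.DataFrame({
    "Group": ["By Total", "By Year", "By Month", "By Month"],
    "Year": [np.nan, 2020, 2020, 2020],
    "Month": [np.nan, np.nan, 1, 2],
})

# count() tallies non-null entries, showing which levels carry a Year or a Month.
level_summary = toy.groupby("Group")[["Year", "Month"]].count()
print(level_summary)
```

Running the same groupby on df would show how many rows belong to each level, which matters because a filter on State, Sex and Age Group alone can match one row per level.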
COVID-19 Deaths by Age Group (United States - All Sexes)
if df is not None:
    # Non-overlapping age bands in display order. Overlapping bands such as
    # '0-17 years', '18-29 years', '30-39 years' and '40-49 years' are left out
    # so that no deaths are shown twice.
    age_order = ['Under 1 year', '1-4 years', '5-14 years', '15-24 years', '25-34 years', '35-44 years',
                 '45-54 years', '55-64 years', '65-74 years', '75-84 years', '85 years and over']
    us_all_sexes_age = df[
        (df['Group'] == 'By Total')  # cumulative totals only, so yearly/monthly rows are not averaged into the bars
        & (df['State'] == 'United States')
        & (df['Sex'] == 'All Sexes')
        & (df['Age Group'].isin(age_order))
    ]
    plt.figure(figsize=(12, 6))
    sns.barplot(x='Age Group', y='COVID-19 Deaths', data=us_all_sexes_age, order=age_order)
    plt.title('COVID-19 Deaths by Age Group in the United States (All Sexes)')
    plt.xlabel('Age Group')
    plt.ylabel('COVID-19 Deaths')
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.show()
else:
    print("Visualization skipped due to loading errors.")
Insight: This bar plot shows how COVID-19 deaths in the United States (all sexes) are distributed across age groups. Deaths rise with age, and the increase is most pronounced in the oldest groups.
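The dataset reports ages on more than one grid (for example, '25-34 years' alongside '30-39 years'), so a list of bands is worth checking for overlap before plotting or summing. The sketch below is one way to do that; `parse_band` and `overlapping_pairs` are hypothetical helpers written for this illustration, and the label formats they recognize are inferred from the labels used in this notebook.

```python
import re

def parse_band(label):
    """Return (low, high) ages for a label, or None if the format is not recognized."""
    if label == "Under 1 year":
        return (0, 0)
    m = re.fullmatch(r"(\d+)-(\d+) years", label)
    if m:
        return (int(m.group(1)), int(m.group(2)))
    m = re.fullmatch(r"(\d+) years and over", label)
    if m:
        return (int(m.group(1)), float("inf"))
    return None

def overlapping_pairs(labels):
    """List every pair of labels whose age ranges intersect."""
    bands = [(lab, parse_band(lab)) for lab in labels]
    bands = [(lab, b) for lab, b in bands if b is not None]
    pairs = []
    for i, (lab_a, (lo_a, hi_a)) in enumerate(bands):
        for lab_b, (lo_b, hi_b) in bands[i + 1:]:
            if lo_a <= hi_b and lo_b <= hi_a:  # closed intervals intersect
                pairs.append((lab_a, lab_b))
    return pairs

candidate = ['Under 1 year', '1-4 years', '5-14 years', '15-24 years', '25-34 years',
             '30-39 years', '35-44 years', '40-49 years', '45-54 years', '55-64 years',
             '65-74 years', '75-84 years', '85 years and over']
print(overlapping_pairs(candidate))
```

An empty result means the bands can be shown side by side without counting any deaths twice.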
COVID-19 Deaths by Sex (United States - All Ages)
if df is not None:
    us_sex_all_ages = df[
        (df['Group'] == 'By Total')  # cumulative totals only, so each sex contributes a single slice
        & (df['State'] == 'United States')
        & (df['Age Group'] == 'All Ages')
        & (df['Sex'] != 'All Sexes')
    ]
    plt.figure(figsize=(8, 8))
    plt.pie(us_sex_all_ages['COVID-19 Deaths'], labels=us_sex_all_ages['Sex'], autopct='%1.1f%%', startangle=90, colors=sns.color_palette('pastel')[0:2])
    plt.title('COVID-19 Deaths Distribution by Sex in the United States (All Ages)')
    plt.tight_layout()
    plt.show()
else:
    print("Visualization skipped due to loading errors.")
Insight: This pie chart shows each sex's share of COVID-19 deaths in the United States across all ages. It compares death counts, not mortality rates, since it does not account for the size of each population.
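The percentages drawn by `autopct` are each value divided by the sum of all values passed to the pie, so the result depends on exactly which rows are selected. A minimal sketch on hypothetical numbers (the counts and the 'By Year' label are invented for illustration) shows the same calculation in tabular form:

```python
import pandas as pd

# Hypothetical toy rows; the counts are illustrative only.
toy = pd.DataFrame({
    "Group": ["By Total", "By Total", "By Year", "By Year"],
    "Sex": ["Male", "Female", "Male", "Female"],
    "COVID-19 Deaths": [550, 450, 200, 180],
})

# Restrict to one aggregation level so each sex appears exactly once.
totals = toy[toy["Group"] == "By Total"].set_index("Sex")["COVID-19 Deaths"]
shares = (totals / totals.sum() * 100).round(1)
print(shares)
```

Without the Group filter, the toy frame would contribute four slices instead of two, which is the kind of duplication to watch for in the real data.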
Top States by COVID-19 Deaths (All Sexes, All Ages)
if df is not None:
    state_deaths_all_sexes_all_ages = df[(df['Group'] == 'By Total') & (df['Sex'] == 'All Sexes') & (df['Age Group'] == 'All Ages') & (df['State'] != 'United States')]
    top_10_states = state_deaths_all_sexes_all_ages.nlargest(10, 'COVID-19 Deaths')
    plt.figure(figsize=(12, 6))
    # Assign x to hue and hide the legend, as the seaborn deprecation warning for a bare `palette` recommends
    sns.barplot(x='State', y='COVID-19 Deaths', hue='State', data=top_10_states, palette='viridis', legend=False)
    plt.title('Top 10 States by COVID-19 Deaths (All Sexes, All Ages)')
    plt.xlabel('State')
    plt.ylabel('COVID-19 Deaths')
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.show()
else:
    print("Visualization skipped due to loading errors.")
Insight: This bar plot displays the 10 states with the most COVID-19 deaths across all sexes and age groups. It identifies the states with the highest death counts; because these are raw counts, more populous states will tend to rank higher.
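One way to compare states on a more even footing, using only columns already in this dataset, is the share of all deaths attributed to COVID-19. The sketch below uses hypothetical state names and counts, invented for illustration; it is not a substitute for a per-capita rate, which would need population data that this file does not contain.

```python
import pandas as pd

# Hypothetical toy rows; names and counts are illustrative only.
toy = pd.DataFrame({
    "State": ["State A", "State B", "State C"],
    "COVID-19 Deaths": [900, 300, 50],
    "Total Deaths": [10000, 2000, 400],
})

# COVID-19 deaths as a percentage of all deaths in the same row.
toy["COVID share (%)"] = (toy["COVID-19 Deaths"] / toy["Total Deaths"] * 100).round(1)

by_count = toy.nlargest(3, "COVID-19 Deaths")["State"].tolist()
by_share = toy.sort_values("COVID share (%)", ascending=False)["State"].tolist()
print(by_count)
print(by_share)
```

In this toy example the state with the most deaths has the smallest share, which shows how the two rankings can diverge.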
Correlation Analysis
if df is not None:
    correlation_matrix = df[cols_to_numeric].corr()
    plt.figure(figsize=(8, 6))
    sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
    plt.title('Correlation Matrix of Death Causes')
    plt.tight_layout()
    plt.show()
else:
    print("Correlation analysis skipped due to loading errors.")
Insight: The heatmap shows the pairwise correlations between the death-count columns: COVID-19 deaths, pneumonia deaths, influenza deaths, their combinations, and total deaths. High positive correlations are expected between 'Pneumonia Deaths', 'COVID-19 Deaths', and 'Pneumonia, Influenza, or COVID-19 Deaths' because these categories overlap; the last one is a combined count that includes the others.
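Because the correlation is computed over every row, national totals and small state-level counts enter the same calculation, and a few very large rows can dominate a Pearson coefficient. The sketch below illustrates this on hypothetical numbers (invented for illustration, with an assumed 'By Month' label) by comparing the coefficient with and without a single large total row.

```python
import pandas as pd

# Hypothetical toy rows: four small rows that move in opposite directions, plus one large total.
toy = pd.DataFrame({
    "Group": ["By Month"] * 4 + ["By Total"],
    "COVID-19 Deaths": [1, 2, 3, 4, 1000],
    "Influenza Deaths": [4, 3, 2, 1, 1000],
})
cols = ["COVID-19 Deaths", "Influenza Deaths"]

# Pearson correlation over all rows versus over a single aggregation level.
r_mixed = toy[cols].corr().loc[cols[0], cols[1]]
r_single_level = toy.loc[toy["Group"] == "By Month", cols].corr().loc[cols[0], cols[1]]
print(round(r_mixed, 4), round(r_single_level, 4))
```

Restricting the real data to one Group level before calling `corr()` is one way to check how much of the heatmap reflects scale rather than a relationship between causes.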