Air Quality Data Analysis¶
This notebook analyzes air quality data for New York City, focusing on Nitrogen dioxide (NO2) and Fine particles (PM 2.5). It explores trends over time, seasonal variations, geographical differences, and potential anomalies based on the provided dataset.
# Import Libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import warnings
warnings.filterwarnings('ignore')
# Set plot style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)
# Load Data
# Assuming 'air_quality.csv' is in the same directory as the notebook
try:
    df = pd.read_csv('air_quality.csv')
    print("CSV loaded successfully.")
    # Display basic info and first few rows
    df.info()  # df.info() prints directly; wrapping it in print() would add a stray 'None'
    display(df.head())
except FileNotFoundError:
    print("Error: 'air_quality.csv' not found. Please ensure the file is in the correct directory.")
except Exception as e:
    print(f"An error occurred while loading the CSV: {e}")
CSV loaded successfully.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18862 entries, 0 to 18861
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   Unique ID       18862 non-null  int64
 1   Indicator ID    18862 non-null  int64
 2   Name            18862 non-null  object
 3   Measure         18862 non-null  object
 4   Measure Info    18862 non-null  object
 5   Geo Type Name   18862 non-null  object
 6   Geo Join ID     18862 non-null  int64
 7   Geo Place Name  18862 non-null  object
 8   Time Period     18862 non-null  object
 9   Start_Date      18862 non-null  object
 10  Data Value      18862 non-null  float64
 11  Message         0 non-null      float64
dtypes: float64(2), int64(3), object(7)
memory usage: 1.7+ MB
| | Unique ID | Indicator ID | Name | Measure | Measure Info | Geo Type Name | Geo Join ID | Geo Place Name | Time Period | Start_Date | Data Value | Message |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 336867 | 375 | Nitrogen dioxide (NO2) | Mean | ppb | CD | 407 | Flushing and Whitestone (CD7) | Winter 2014-15 | 12/01/2014 | 23.97 | NaN |
| 1 | 336741 | 375 | Nitrogen dioxide (NO2) | Mean | ppb | CD | 107 | Upper West Side (CD7) | Winter 2014-15 | 12/01/2014 | 27.42 | NaN |
| 2 | 550157 | 375 | Nitrogen dioxide (NO2) | Mean | ppb | CD | 414 | Rockaway and Broad Channel (CD14) | Annual Average 2017 | 01/01/2017 | 12.55 | NaN |
| 3 | 412802 | 375 | Nitrogen dioxide (NO2) | Mean | ppb | CD | 407 | Flushing and Whitestone (CD7) | Winter 2015-16 | 12/01/2015 | 22.63 | NaN |
| 4 | 412803 | 375 | Nitrogen dioxide (NO2) | Mean | ppb | CD | 407 | Flushing and Whitestone (CD7) | Summer 2016 | 06/01/2016 | 14.00 | NaN |
Data Cleaning and Preparation¶
Convert 'Start_Date' to datetime objects and extract year and season for easier analysis. Clean column names.
# Data Cleaning and Preparation
if 'df' in locals():  # Check if df was loaded successfully
    # Clean column names (strip whitespace, replace spaces/hyphens with underscores)
    df.columns = df.columns.str.strip().str.replace(' ', '_').str.replace('-', '_')

    # Convert Start_Date to datetime; the explicit format avoids day/month ambiguity
    df['Start_Date'] = pd.to_datetime(df['Start_Date'], format='%m/%d/%Y')

    # Extract the analysis year from the Time_Period column:
    # Winter YYYY-YY -> the latter year; Summer YYYY and Annual YYYY -> YYYY.
    def get_analysis_year(row):
        period = str(row['Time_Period'])
        start_year = row['Start_Date'].year
        if 'Winter' in period and '-' in period:
            try:
                return int(period.split('-')[1]) + 2000  # Assumes a two-digit suffix, e.g. '2014-15'
            except (ValueError, IndexError):
                return start_year + 1  # Fallback if the format is unexpected
        elif 'Annual Average' in period:
            # A December start date labels the *following* year's average
            if row['Start_Date'].month == 12:
                return start_year + 1
            return start_year
        else:  # Summer and other cases where the start-date year is correct
            return start_year

    df['Analysis_Year'] = df.apply(get_analysis_year, axis=1)

    # Extract Season from Time_Period
    def get_season(time_period):
        period_str = str(time_period).lower()
        if 'winter' in period_str:
            return 'Winter'
        elif 'summer' in period_str:
            return 'Summer'
        elif 'annual' in period_str:
            return 'Annual'
        return 'Other'

    df['Season'] = df['Time_Period'].apply(get_season)

    # Convert Data_Value to numeric, coercing errors to NaN
    df['Data_Value'] = pd.to_numeric(df['Data_Value'], errors='coerce')

    # Drop rows where Data_Value could not be converted
    original_rows = len(df)
    df.dropna(subset=['Data_Value'], inplace=True)
    if original_rows > len(df):
        print(f"Dropped {original_rows - len(df)} rows with non-numeric Data_Value.")

    # Display cleaned info and a sample
    print("\nCleaned DataFrame Info:")
    df.info()
    display(df[['Start_Date', 'Analysis_Year', 'Season', 'Time_Period', 'Data_Value']].head())
else:
    print("DataFrame 'df' not available for cleaning.")
Cleaned DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18862 entries, 0 to 18861
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   Unique_ID       18862 non-null  int64
 1   Indicator_ID    18862 non-null  int64
 2   Name            18862 non-null  object
 3   Measure         18862 non-null  object
 4   Measure_Info    18862 non-null  object
 5   Geo_Type_Name   18862 non-null  object
 6   Geo_Join_ID     18862 non-null  int64
 7   Geo_Place_Name  18862 non-null  object
 8   Time_Period     18862 non-null  object
 9   Start_Date      18862 non-null  datetime64[ns]
 10  Data_Value      18862 non-null  float64
 11  Message         0 non-null      float64
 12  Analysis_Year   18862 non-null  int64
 13  Season          18862 non-null  object
dtypes: datetime64[ns](1), float64(2), int64(4), object(7)
memory usage: 2.0+ MB
| | Start_Date | Analysis_Year | Season | Time_Period | Data_Value |
|---|---|---|---|---|---|
| 0 | 2014-12-01 | 2015 | Winter | Winter 2014-15 | 23.97 |
| 1 | 2014-12-01 | 2015 | Winter | Winter 2014-15 | 27.42 |
| 2 | 2017-01-01 | 2017 | Annual | Annual Average 2017 | 12.55 |
| 3 | 2015-12-01 | 2016 | Winter | Winter 2015-16 | 22.63 |
| 4 | 2016-06-01 | 2016 | Summer | Summer 2016 | 14.00 |
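As a sanity check, the period-to-year rule can be exercised on representative `Time_Period` strings. The sketch below is a standalone mirror of the notebook's `get_analysis_year` logic (the function name `parse_analysis_year` is introduced here for illustration):

```python
import pandas as pd

def parse_analysis_year(time_period: str, start_date: pd.Timestamp) -> int:
    """Standalone mirror of the notebook's get_analysis_year rule."""
    period = str(time_period)
    if 'Winter' in period and '-' in period:
        try:
            # 'Winter 2014-15' -> latter year, 2015
            return int(period.split('-')[1]) + 2000
        except (ValueError, IndexError):
            return start_date.year + 1
    if 'Annual Average' in period and start_date.month == 12:
        # A December start date labels the following year's average
        return start_date.year + 1
    return start_date.year

print(parse_analysis_year('Winter 2014-15', pd.Timestamp('2014-12-01')))       # 2015
print(parse_analysis_year('Annual Average 2017', pd.Timestamp('2017-01-01')))  # 2017
print(parse_analysis_year('Summer 2016', pd.Timestamp('2016-06-01')))          # 2016
```

The three printed years match the `Analysis_Year` values in the sample rows above.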
Overview of Measured Indicators¶
Identify the different air quality indicators and their frequency in the dataset.
# Overview of Indicators
if 'df' in locals():
    indicator_counts = df['Name'].value_counts()
    print("Indicators Measured:")
    print(indicator_counts)

    # Adjust figure height to the number of indicators
    plt.figure(figsize=(10, max(6, len(indicator_counts) * 0.5)))
    sns.barplot(y=indicator_counts.index, x=indicator_counts.values, palette='viridis', orient='h')
    plt.title('Number of Records per Indicator')
    plt.xlabel('Number of Records')
    plt.ylabel('Indicator Name')
    plt.tight_layout()
    plt.show()
else:
    print("DataFrame 'df' not available for indicator overview.")
Indicators Measured:
Name
Nitrogen dioxide (NO2)                                    6345
Fine particles (PM 2.5)                                   6345
Ozone (O3)                                                2115
Asthma emergency departments visits due to Ozone           480
Asthma emergency department visits due to PM2.5            480
Asthma hospitalizations due to Ozone                       480
Annual vehicle miles traveled                              321
Annual vehicle miles traveled (trucks)                     321
Annual vehicle miles traveled (cars)                       321
Deaths due to PM2.5                                        240
Cardiovascular hospitalizations due to PM2.5 (age 40+)     240
Cardiac and respiratory deaths due to Ozone                240
Respiratory hospitalizations due to PM2.5 (age 20+)        240
Outdoor Air Toxics - Benzene                               203
Outdoor Air Toxics - Formaldehyde                          203
Boiler Emissions- Total NOx Emissions                       96
Boiler Emissions- Total SO2 Emissions                       96
Boiler Emissions- Total PM2.5 Emissions                     96
Name: count, dtype: int64
Analysis of NO2 and PM2.5¶
Filter the dataset to focus on the most frequent pollutant indicators: Nitrogen dioxide (NO2) and Fine particles (PM 2.5).
# Filter for NO2 and PM2.5
if 'df' in locals():
    df_pollutants = df[df['Name'].isin(['Nitrogen dioxide (NO2)', 'Fine particles (PM 2.5)'])].copy()
    print(f"Filtered DataFrame shape for NO2 and PM2.5: {df_pollutants.shape}")
    display(df_pollutants.head())
else:
    print("DataFrame 'df' not available for filtering.")
Filtered DataFrame shape for NO2 and PM2.5: (12690, 14)
| | Unique_ID | Indicator_ID | Name | Measure | Measure_Info | Geo_Type_Name | Geo_Join_ID | Geo_Place_Name | Time_Period | Start_Date | Data_Value | Message | Analysis_Year | Season |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 336867 | 375 | Nitrogen dioxide (NO2) | Mean | ppb | CD | 407 | Flushing and Whitestone (CD7) | Winter 2014-15 | 2014-12-01 | 23.97 | NaN | 2015 | Winter |
| 1 | 336741 | 375 | Nitrogen dioxide (NO2) | Mean | ppb | CD | 107 | Upper West Side (CD7) | Winter 2014-15 | 2014-12-01 | 27.42 | NaN | 2015 | Winter |
| 2 | 550157 | 375 | Nitrogen dioxide (NO2) | Mean | ppb | CD | 414 | Rockaway and Broad Channel (CD14) | Annual Average 2017 | 2017-01-01 | 12.55 | NaN | 2017 | Annual |
| 3 | 412802 | 375 | Nitrogen dioxide (NO2) | Mean | ppb | CD | 407 | Flushing and Whitestone (CD7) | Winter 2015-16 | 2015-12-01 | 22.63 | NaN | 2016 | Winter |
| 4 | 412803 | 375 | Nitrogen dioxide (NO2) | Mean | ppb | CD | 407 | Flushing and Whitestone (CD7) | Summer 2016 | 2016-06-01 | 14.00 | NaN | 2016 | Summer |
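For side-by-side inspection of the two pollutants, the long-format data can also be reshaped so each pollutant becomes a column. A minimal sketch with synthetic values (the `toy` frame is illustrative, not real dataset values):

```python
import pandas as pd

# Toy frame mimicking the filtered pollutant data (synthetic values)
toy = pd.DataFrame({
    'Analysis_Year': [2016, 2016, 2017, 2017],
    'Name': ['Nitrogen dioxide (NO2)', 'Fine particles (PM 2.5)'] * 2,
    'Data_Value': [20.0, 9.0, 18.0, 8.5],
})

# One row per year, one column per pollutant
wide = toy.pivot_table(index='Analysis_Year', columns='Name',
                       values='Data_Value', aggfunc='mean')
print(wide)
```

The same `pivot_table` call applied to `df_pollutants` would yield a year-by-pollutant matrix of citywide means.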
Temporal Trends (Annual Average)¶
Visualize the trend of annual average NO2 and PM2.5 concentrations over the years across NYC. We average the values across all available locations for each year and pollutant.
# Temporal Trends (Annual Average)
if 'df_pollutants' in locals():
    annual_avg = df_pollutants[df_pollutants['Season'] == 'Annual']
    if not annual_avg.empty:
        # Mean and standard deviation across locations for each year/pollutant
        annual_trend = annual_avg.groupby(['Analysis_Year', 'Name'])['Data_Value'].agg(['mean', 'std']).reset_index()

        plt.figure(figsize=(14, 7))
        for pollutant in annual_trend['Name'].unique():
            pollutant_data = annual_trend[annual_trend['Name'] == pollutant]
            plt.errorbar(pollutant_data['Analysis_Year'], pollutant_data['mean'],
                         yerr=pollutant_data['std'], label=pollutant, marker='o', capsize=5)
        plt.title('NYC Annual Average Pollutant Levels Over Time (Mean ± SD across locations)')
        plt.xlabel('Year')
        plt.ylabel('Concentration (units vary by pollutant)')
        plt.legend(title='Pollutant')
        plt.xticks(rotation=45)
        # Force integer tick labels on the year axis
        plt.gca().xaxis.set_major_locator(plt.MaxNLocator(integer=True))
        plt.tight_layout()
        plt.show()
        print("Insight: General decreasing trend observed for both pollutants.")
    else:
        print("No 'Annual' season data found for NO2 or PM2.5.")
else:
    print("Filtered DataFrame 'df_pollutants' not available.")
Insight: General decreasing trend observed for both pollutants.
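The visual trend could be quantified with a simple least-squares fit of the citywide annual means against year. A sketch using hypothetical values (the numbers below are illustrative, not taken from the dataset):

```python
import numpy as np

# Hypothetical annual citywide NO2 means (ppb), for illustration only
years = np.array([2014, 2015, 2016, 2017, 2018])
means = np.array([24.0, 23.1, 22.0, 21.2, 20.1])

# Degree-1 polynomial fit: slope is the average change per year
slope, intercept = np.polyfit(years, means, 1)
print(f"Trend: {slope:.2f} ppb/year")  # negative slope -> declining levels
```

Applying the same fit to each pollutant's `annual_trend` means would give a per-year rate of change to back the "general decreasing trend" observation.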
Seasonal Variation¶
Compare pollutant levels between Winter and Summer periods using box plots.
# Seasonal Variation
if 'df_pollutants' in locals():
    seasonal_data = df_pollutants[df_pollutants['Season'].isin(['Winter', 'Summer'])]
    if not seasonal_data.empty:
        plt.figure(figsize=(12, 7))
        sns.boxplot(data=seasonal_data, x='Name', y='Data_Value', hue='Season', palette='coolwarm')
        plt.title('Seasonal Variation of NO2 and PM2.5 Levels')
        plt.xlabel('Pollutant')
        plt.ylabel('Concentration (units vary)')
        # Consider a log scale if the distributions are highly skewed
        # plt.yscale('log')
        plt.legend(title='Season')
        plt.tight_layout()
        plt.show()
        print("Insight: Concentrations are generally higher in Winter than in Summer.")
    else:
        print("No 'Winter' or 'Summer' season data found for NO2 or PM2.5.")
else:
    print("Filtered DataFrame 'df_pollutants' not available.")
Insight: Concentrations are generally higher in Winter than in Summer.
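The winter-to-summer contrast can also be expressed as a ratio per pollutant. A minimal sketch with synthetic seasonal means (values illustrative only):

```python
import pandas as pd

# Synthetic seasonal means to illustrate the winter/summer comparison
toy = pd.DataFrame({
    'Name': ['NO2', 'NO2', 'PM 2.5', 'PM 2.5'],
    'Season': ['Winter', 'Summer', 'Winter', 'Summer'],
    'Data_Value': [26.0, 16.0, 10.0, 8.0],
})

# One row per pollutant, one column per season, plus a ratio column
seasonal_means = toy.pivot_table(index='Name', columns='Season', values='Data_Value')
seasonal_means['Winter/Summer'] = seasonal_means['Winter'] / seasonal_means['Summer']
print(seasonal_means)
```

A ratio above 1 for every pollutant would confirm the box-plot impression numerically; running this on `seasonal_data` grouped means would give the real figures.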
Geographical Variation (Community Districts)¶
Examine the average annual pollutant levels across different Community Districts (CDs) for the most recent year available in the annual data.
# Geographical Variation (Community Districts - Annual Average for a recent year)
if 'df_pollutants' in locals() and 'annual_avg' in locals() and not annual_avg.empty:
    # Find the most recent year with 'Annual' data
    recent_annual_year = annual_avg['Analysis_Year'].max()
    geo_data_recent = annual_avg[(annual_avg['Analysis_Year'] == recent_annual_year) &
                                 (annual_avg['Geo_Type_Name'] == 'CD')]
    if not geo_data_recent.empty:
        # Separate subplots for NO2 and PM2.5 because their scales differ
        pollutants_to_plot = geo_data_recent['Name'].unique()
        num_pollutants = len(pollutants_to_plot)
        fig, axes = plt.subplots(num_pollutants, 1, figsize=(15, 8 * num_pollutants), sharex=False)
        if num_pollutants == 1:
            axes = [axes]  # Make axes iterable when there is only one subplot
        fig.suptitle(f'Average Annual Pollutant Levels by Community District ({int(recent_annual_year)})',
                     fontsize=16, y=1.02)
        for i, pollutant in enumerate(pollutants_to_plot):
            ax = axes[i]
            data_subset = geo_data_recent[geo_data_recent['Name'] == pollutant].sort_values('Data_Value', ascending=False)
            sns.barplot(data=data_subset, y='Geo_Place_Name', x='Data_Value', ax=ax, palette='coolwarm', orient='h')
            ax.set_title(pollutant)
            unit = data_subset['Measure_Info'].iloc[0] if not data_subset.empty else ''
            ax.set_xlabel(f'Concentration ({unit})')
            ax.set_ylabel('Community District')
        plt.tight_layout()
        plt.show()
        print("Insight: Shows variation in pollutant levels between different Community Districts.")
    else:
        print(f"No annual Community District (CD) data found for the most recent year ({int(recent_annual_year)}).")
elif 'df_pollutants' not in locals():
    print("Filtered DataFrame 'df_pollutants' not available.")
else:
    print("No annual average data available for geographical analysis.")
Insight: Shows variation in pollutant levels between different Community Districts.
Analysis Insights (Provided)¶
- The dataset primarily tracks Nitrogen dioxide (NO2, ID 375) and Fine particles (PM 2.5, ID 365), with some data on Ozone (O3, ID 386), SO2 emissions (ID 640), and related health outcomes (Asthma ED visits - ID 657, Respiratory hospitalizations - ID 650).
- Data spans various NYC geographical units (CD, UHF34, UHF42, Borough) and time periods (Annual Average, Winter, Summer) from approximately 2008 to 2023.
- A clear seasonal trend is observed for NO2 and PM 2.5, with concentrations generally higher during Winter periods compared to Summer periods for the same location and year.
- Ozone (O3) measurements are predominantly available for Summer periods, aligning with its photochemical formation process.
- Geographical disparities are noticeable. Areas like Upper West Side (CD7), South Bronx (UHF 105106107), and Midtown (CD5) often exhibit higher NO2/PM2.5 levels, while coastal areas like Rockaway (CD14) and South Beach - Tottenville (UHF 504) tend to show lower concentrations.
- Analysis of locations with longer data series (e.g., Rockaway and Broad Channel CD14, Upper West Side CD7) suggests a general decreasing trend in annual average NO2 and PM 2.5 levels over the available years, indicating potential air quality improvements.
- Indicators 640 (SO2 Emissions), 657 (Asthma ED Visits), and 650 (Respiratory Hospitalizations) use different units and represent different types of metrics (emissions density, health rates) compared to direct pollutant concentrations, requiring separate interpretation.
Potential Anomalies (Provided)¶
- The 'Start_Date' convention for 'Annual Average YYYY' periods often uses '12/31/YYYY-1' or '01/01/YYYY'. This seems consistent but could be misinterpreted if not considering the 'Time Period' column. (Note: The cleaning step attempted to address this by creating 'Analysis_Year').
- No obvious extreme data entry errors were detected in pollutant values (NO2, PM2.5, O3). Higher values observed in earlier years (e.g., 2008-2011) compared to recent years reflect historical trends rather than data errors.
- The dataset mixes different measurement units (ppb, mcg/m3, number per km2, rate per 100,000) across indicators, making direct comparison between all indicators inappropriate without normalization or careful consideration.
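Because the indicators mix units (ppb, mcg/m3, rates), one common workaround for cross-indicator comparison is to standardize values within each indicator. A minimal sketch on synthetic data (values illustrative only):

```python
import pandas as pd

# Synthetic values in incompatible units (e.g. ppb vs mcg/m3)
toy = pd.DataFrame({
    'Name': ['NO2', 'NO2', 'NO2', 'PM 2.5', 'PM 2.5', 'PM 2.5'],
    'Data_Value': [20.0, 22.0, 24.0, 8.0, 9.0, 10.0],
})

# Z-score within each indicator puts values on a comparable, unitless scale
toy['Z'] = toy.groupby('Name')['Data_Value'].transform(
    lambda s: (s - s.mean()) / s.std())
print(toy)
```

After this transform, a value of +1 means "one standard deviation above that indicator's own mean", which is comparable across indicators; the raw units remain incomparable.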
Predictive Analysis (Provided)¶
- Status: Not Applicable
- Reason: Predictive analysis is not feasible due to data limitations. The dataset is sparse, with inconsistent time series for many location-indicator combinations. Measurements are spread across different geographical units and time granularities (Annual, Winter, Summer) without sufficient continuous data points for reliable forecasting.
Analysis Summary (Provided)¶
The provided air quality data for NYC highlights significant seasonal and geographical variations for pollutants like NO2 and PM 2.5, with higher levels typically in winter and denser urban areas. A general trend of decreasing concentrations is observable from 2008-2023 in areas with sufficient data. The dataset also includes SO2 emissions and health outcome data, though less frequently measured. Data sparsity and inconsistent time series prevent reliable predictive modeling.