Air Quality Data Analysis¶
This notebook analyzes air quality data for New York City, focusing on Nitrogen dioxide (NO2) and Fine particles (PM 2.5). It explores trends over time, seasonal variations, geographical differences, and potential anomalies based on the provided dataset.
# Import Libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import warnings
warnings.filterwarnings('ignore')
# Set plot style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)
# Load Data
# Assuming 'air_quality.csv' is in the same directory as the notebook
try:
    df = pd.read_csv('air_quality.csv')
    print("CSV loaded successfully.")
    # Display basic info and first few rows
    df.info()  # df.info() prints directly; wrapping it in print() would add a stray 'None'
    display(df.head())
except FileNotFoundError:
    print("Error: 'air_quality.csv' not found. Please ensure the file is in the correct directory.")
except Exception as e:
    print(f"An error occurred while loading the CSV: {e}")
CSV loaded successfully.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18862 entries, 0 to 18861
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   Unique ID       18862 non-null  int64
 1   Indicator ID    18862 non-null  int64
 2   Name            18862 non-null  object
 3   Measure         18862 non-null  object
 4   Measure Info    18862 non-null  object
 5   Geo Type Name   18862 non-null  object
 6   Geo Join ID     18862 non-null  int64
 7   Geo Place Name  18862 non-null  object
 8   Time Period     18862 non-null  object
 9   Start_Date      18862 non-null  object
 10  Data Value      18862 non-null  float64
 11  Message         0 non-null      float64
dtypes: float64(2), int64(3), object(7)
memory usage: 1.7+ MB
| | Unique ID | Indicator ID | Name | Measure | Measure Info | Geo Type Name | Geo Join ID | Geo Place Name | Time Period | Start_Date | Data Value | Message |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 336867 | 375 | Nitrogen dioxide (NO2) | Mean | ppb | CD | 407 | Flushing and Whitestone (CD7) | Winter 2014-15 | 12/01/2014 | 23.97 | NaN |
| 1 | 336741 | 375 | Nitrogen dioxide (NO2) | Mean | ppb | CD | 107 | Upper West Side (CD7) | Winter 2014-15 | 12/01/2014 | 27.42 | NaN |
| 2 | 550157 | 375 | Nitrogen dioxide (NO2) | Mean | ppb | CD | 414 | Rockaway and Broad Channel (CD14) | Annual Average 2017 | 01/01/2017 | 12.55 | NaN |
| 3 | 412802 | 375 | Nitrogen dioxide (NO2) | Mean | ppb | CD | 407 | Flushing and Whitestone (CD7) | Winter 2015-16 | 12/01/2015 | 22.63 | NaN |
| 4 | 412803 | 375 | Nitrogen dioxide (NO2) | Mean | ppb | CD | 407 | Flushing and Whitestone (CD7) | Summer 2016 | 06/01/2016 | 14.00 | NaN |
Data Cleaning and Preparation¶
Convert 'Start_Date' to datetime objects and extract year and season for easier analysis. Clean column names.
# Data Cleaning and Preparation
if 'df' in locals():  # Check if df was loaded successfully
    # Clean column names (strip whitespace, replace spaces/hyphens with underscores)
    df.columns = df.columns.str.strip().str.replace(' ', '_').str.replace('-', '_')

    # Convert Start_Date to datetime; the explicit format avoids day/month ambiguity
    df['Start_Date'] = pd.to_datetime(df['Start_Date'], format='%m/%d/%Y')

    # Extract the analysis year from the Time_Period column:
    # Winter YYYY-YY -> the latter year; Summer YYYY and Annual YYYY -> YYYY.
    def get_analysis_year(row):
        period = str(row['Time_Period'])
        start_year = row['Start_Date'].year
        if 'Winter' in period and '-' in period:
            try:
                return int(period.split('-')[1]) + 2000  # Assumes a two-digit suffix, e.g. '2014-15'
            except (ValueError, IndexError):
                return start_year + 1  # Fallback if the format is unexpected
        elif 'Annual Average' in period:
            # A December start date labels the *following* year's average
            if row['Start_Date'].month == 12:
                return start_year + 1
            return start_year
        else:  # Summer and other cases where the start-date year is correct
            return start_year

    df['Analysis_Year'] = df.apply(get_analysis_year, axis=1)

    # Extract Season from Time_Period
    def get_season(time_period):
        period_str = str(time_period).lower()
        if 'winter' in period_str:
            return 'Winter'
        elif 'summer' in period_str:
            return 'Summer'
        elif 'annual' in period_str:
            return 'Annual'
        return 'Other'

    df['Season'] = df['Time_Period'].apply(get_season)

    # Convert Data_Value to numeric, coercing errors to NaN
    df['Data_Value'] = pd.to_numeric(df['Data_Value'], errors='coerce')

    # Drop rows where Data_Value could not be converted
    original_rows = len(df)
    df.dropna(subset=['Data_Value'], inplace=True)
    if original_rows > len(df):
        print(f"Dropped {original_rows - len(df)} rows with non-numeric Data_Value.")

    # Display cleaned info and a sample
    print("\nCleaned DataFrame Info:")
    df.info()
    display(df[['Start_Date', 'Analysis_Year', 'Season', 'Time_Period', 'Data_Value']].head())
else:
    print("DataFrame 'df' not available for cleaning.")
Cleaned DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18862 entries, 0 to 18861
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   Unique_ID       18862 non-null  int64
 1   Indicator_ID    18862 non-null  int64
 2   Name            18862 non-null  object
 3   Measure         18862 non-null  object
 4   Measure_Info    18862 non-null  object
 5   Geo_Type_Name   18862 non-null  object
 6   Geo_Join_ID     18862 non-null  int64
 7   Geo_Place_Name  18862 non-null  object
 8   Time_Period     18862 non-null  object
 9   Start_Date      18862 non-null  datetime64[ns]
 10  Data_Value      18862 non-null  float64
 11  Message         0 non-null      float64
 12  Analysis_Year   18862 non-null  int64
 13  Season          18862 non-null  object
dtypes: datetime64[ns](1), float64(2), int64(4), object(7)
memory usage: 2.0+ MB
| | Start_Date | Analysis_Year | Season | Time_Period | Data_Value |
|---|---|---|---|---|---|
| 0 | 2014-12-01 | 2015 | Winter | Winter 2014-15 | 23.97 |
| 1 | 2014-12-01 | 2015 | Winter | Winter 2014-15 | 27.42 |
| 2 | 2017-01-01 | 2017 | Annual | Annual Average 2017 | 12.55 |
| 3 | 2015-12-01 | 2016 | Winter | Winter 2015-16 | 22.63 |
| 4 | 2016-06-01 | 2016 | Summer | Summer 2016 | 14.00 |
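As a sanity check, the period-to-year rule can be exercised on representative `Time_Period` strings. The sketch below is a standalone mirror of the notebook's `get_analysis_year` logic (the function name `parse_analysis_year` is introduced here for illustration):

```python
import pandas as pd

def parse_analysis_year(time_period: str, start_date: pd.Timestamp) -> int:
    """Standalone mirror of the notebook's get_analysis_year rule."""
    period = str(time_period)
    if 'Winter' in period and '-' in period:
        try:
            # 'Winter 2014-15' -> latter year, 2015
            return int(period.split('-')[1]) + 2000
        except (ValueError, IndexError):
            return start_date.year + 1
    if 'Annual Average' in period and start_date.month == 12:
        # A December start date labels the following year's average
        return start_date.year + 1
    return start_date.year

print(parse_analysis_year('Winter 2014-15', pd.Timestamp('2014-12-01')))       # 2015
print(parse_analysis_year('Annual Average 2017', pd.Timestamp('2017-01-01')))  # 2017
print(parse_analysis_year('Summer 2016', pd.Timestamp('2016-06-01')))          # 2016
```

The three printed years match the `Analysis_Year` values in the sample rows above.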
Overview of Measured Indicators¶
Identify the different air quality indicators and their frequency in the dataset.
# Overview of Indicators
if 'df' in locals():
    indicator_counts = df['Name'].value_counts()
    print("Indicators Measured:")
    print(indicator_counts)

    # Adjust figure height to the number of indicators
    plt.figure(figsize=(10, max(6, len(indicator_counts) * 0.5)))
    sns.barplot(y=indicator_counts.index, x=indicator_counts.values, palette='viridis', orient='h')
    plt.title('Number of Records per Indicator')
    plt.xlabel('Number of Records')
    plt.ylabel('Indicator Name')
    plt.tight_layout()
    plt.show()
else:
    print("DataFrame 'df' not available for indicator overview.")
Indicators Measured:
Name
Nitrogen dioxide (NO2)                                    6345
Fine particles (PM 2.5)                                   6345
Ozone (O3)                                                2115
Asthma emergency departments visits due to Ozone           480
Asthma emergency department visits due to PM2.5            480
Asthma hospitalizations due to Ozone                       480
Annual vehicle miles traveled                              321
Annual vehicle miles traveled (trucks)                     321
Annual vehicle miles traveled (cars)                       321
Deaths due to PM2.5                                        240
Cardiovascular hospitalizations due to PM2.5 (age 40+)     240
Cardiac and respiratory deaths due to Ozone                240
Respiratory hospitalizations due to PM2.5 (age 20+)        240
Outdoor Air Toxics - Benzene                               203
Outdoor Air Toxics - Formaldehyde                          203
Boiler Emissions- Total NOx Emissions                       96
Boiler Emissions- Total SO2 Emissions                       96
Boiler Emissions- Total PM2.5 Emissions                     96
Name: count, dtype: int64
Analysis of NO2 and PM2.5¶
Filter the dataset to focus on the most frequent pollutant indicators: Nitrogen dioxide (NO2) and Fine particles (PM 2.5).
# Filter for NO2 and PM2.5
if 'df' in locals():
    df_pollutants = df[df['Name'].isin(['Nitrogen dioxide (NO2)', 'Fine particles (PM 2.5)'])].copy()
    print(f"Filtered DataFrame shape for NO2 and PM2.5: {df_pollutants.shape}")
    display(df_pollutants.head())
else:
    print("DataFrame 'df' not available for filtering.")
Filtered DataFrame shape for NO2 and PM2.5: (12690, 14)
| | Unique_ID | Indicator_ID | Name | Measure | Measure_Info | Geo_Type_Name | Geo_Join_ID | Geo_Place_Name | Time_Period | Start_Date | Data_Value | Message | Analysis_Year | Season |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 336867 | 375 | Nitrogen dioxide (NO2) | Mean | ppb | CD | 407 | Flushing and Whitestone (CD7) | Winter 2014-15 | 2014-12-01 | 23.97 | NaN | 2015 | Winter |
| 1 | 336741 | 375 | Nitrogen dioxide (NO2) | Mean | ppb | CD | 107 | Upper West Side (CD7) | Winter 2014-15 | 2014-12-01 | 27.42 | NaN | 2015 | Winter |
| 2 | 550157 | 375 | Nitrogen dioxide (NO2) | Mean | ppb | CD | 414 | Rockaway and Broad Channel (CD14) | Annual Average 2017 | 2017-01-01 | 12.55 | NaN | 2017 | Annual |
| 3 | 412802 | 375 | Nitrogen dioxide (NO2) | Mean | ppb | CD | 407 | Flushing and Whitestone (CD7) | Winter 2015-16 | 2015-12-01 | 22.63 | NaN | 2016 | Winter |
| 4 | 412803 | 375 | Nitrogen dioxide (NO2) | Mean | ppb | CD | 407 | Flushing and Whitestone (CD7) | Summer 2016 | 2016-06-01 | 14.00 | NaN | 2016 | Summer |
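For side-by-side inspection of the two pollutants, the long-format data can also be reshaped so each pollutant becomes a column. A minimal sketch with synthetic values (the `toy` frame is illustrative, not real dataset values):

```python
import pandas as pd

# Toy frame mimicking the filtered pollutant data (synthetic values)
toy = pd.DataFrame({
    'Analysis_Year': [2016, 2016, 2017, 2017],
    'Name': ['Nitrogen dioxide (NO2)', 'Fine particles (PM 2.5)'] * 2,
    'Data_Value': [20.0, 9.0, 18.0, 8.5],
})

# One row per year, one column per pollutant
wide = toy.pivot_table(index='Analysis_Year', columns='Name',
                       values='Data_Value', aggfunc='mean')
print(wide)
```

The same `pivot_table` call applied to `df_pollutants` would yield a year-by-pollutant matrix of citywide means.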
Temporal Trends (Annual Average)¶
Visualize the trend of annual average NO2 and PM2.5 concentrations over the years across NYC. We average the values across all available locations for each year and pollutant.
# Temporal Trends (Annual Average)
if 'df_pollutants' in locals():
    annual_avg = df_pollutants[df_pollutants['Season'] == 'Annual']
    if not annual_avg.empty:
        # Mean and standard deviation across locations for each year/pollutant
        annual_trend = annual_avg.groupby(['Analysis_Year', 'Name'])['Data_Value'].agg(['mean', 'std']).reset_index()

        plt.figure(figsize=(14, 7))
        for pollutant in annual_trend['Name'].unique():
            pollutant_data = annual_trend[annual_trend['Name'] == pollutant]
            plt.errorbar(pollutant_data['Analysis_Year'], pollutant_data['mean'],
                         yerr=pollutant_data['std'], label=pollutant, marker='o', capsize=5)
        plt.title('NYC Annual Average Pollutant Levels Over Time (Mean ± SD across locations)')
        plt.xlabel('Year')
        plt.ylabel('Concentration (units vary by pollutant)')
        plt.legend(title='Pollutant')
        plt.xticks(rotation=45)
        # Force integer tick labels on the year axis
        plt.gca().xaxis.set_major_locator(plt.MaxNLocator(integer=True))
        plt.tight_layout()
        plt.show()
        print("Insight: General decreasing trend observed for both pollutants.")
    else:
        print("No 'Annual' season data found for NO2 or PM2.5.")
else:
    print("Filtered DataFrame 'df_pollutants' not available.")
Insight: General decreasing trend observed for both pollutants.
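The visual trend could be quantified with a simple least-squares fit of the citywide annual means against year. A sketch using hypothetical values (the numbers below are illustrative, not taken from the dataset):

```python
import numpy as np

# Hypothetical annual citywide NO2 means (ppb), for illustration only
years = np.array([2014, 2015, 2016, 2017, 2018])
means = np.array([24.0, 23.1, 22.0, 21.2, 20.1])

# Degree-1 polynomial fit: slope is the average change per year
slope, intercept = np.polyfit(years, means, 1)
print(f"Trend: {slope:.2f} ppb/year")  # negative slope -> declining levels
```

Applying the same fit to each pollutant's `annual_trend` means would give a per-year rate of change to back the "general decreasing trend" observation.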
Seasonal Variation¶
Compare pollutant levels between Winter and Summer periods using box plots.
# Seasonal Variation
if 'df_pollutants' in locals():
    seasonal_data = df_pollutants[df_pollutants['Season'].isin(['Winter', 'Summer'])]
    if not seasonal_data.empty:
        plt.figure(figsize=(12, 7))
        sns.boxplot(data=seasonal_data, x='Name', y='Data_Value', hue='Season', palette='coolwarm')
        plt.title('Seasonal Variation of NO2 and PM2.5 Levels')
        plt.xlabel('Pollutant')
        plt.ylabel('Concentration (units vary)')
        # Consider a log scale if the distributions are highly skewed
        # plt.yscale('log')
        plt.legend(title='Season')
        plt.tight_layout()
        plt.show()
        print("Insight: Concentrations are generally higher in Winter than in Summer.")
    else:
        print("No 'Winter' or 'Summer' season data found for NO2 or PM2.5.")
else:
    print("Filtered DataFrame 'df_pollutants' not available.")
Insight: Concentrations are generally higher in Winter than in Summer.
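The winter-to-summer contrast can also be expressed as a ratio per pollutant. A minimal sketch with synthetic seasonal means (values illustrative only):

```python
import pandas as pd

# Synthetic seasonal means to illustrate the winter/summer comparison
toy = pd.DataFrame({
    'Name': ['NO2', 'NO2', 'PM 2.5', 'PM 2.5'],
    'Season': ['Winter', 'Summer', 'Winter', 'Summer'],
    'Data_Value': [26.0, 16.0, 10.0, 8.0],
})

# One row per pollutant, one column per season, plus a ratio column
seasonal_means = toy.pivot_table(index='Name', columns='Season', values='Data_Value')
seasonal_means['Winter/Summer'] = seasonal_means['Winter'] / seasonal_means['Summer']
print(seasonal_means)
```

A ratio above 1 for every pollutant would confirm the box-plot impression numerically; running this on `seasonal_data` grouped means would give the real figures.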
Geographical Variation (Community Districts)¶
Examine the average annual pollutant levels across different Community Districts (CDs) for the most recent year available in the annual data.
# Geographical Variation (Community Districts - Annual Average for a recent year)
if 'df_pollutants' in locals() and 'annual_avg' in locals() and not annual_avg.empty:
    # Find the most recent year with 'Annual' data
    recent_annual_year = annual_avg['Analysis_Year'].max()
    geo_data_recent = annual_avg[(annual_avg['Analysis_Year'] == recent_annual_year) &
                                 (annual_avg['Geo_Type_Name'] == 'CD')]
    if not geo_data_recent.empty:
        # Separate subplots for NO2 and PM2.5 because their scales differ
        pollutants_to_plot = geo_data_recent['Name'].unique()
        num_pollutants = len(pollutants_to_plot)
        fig, axes = plt.subplots(num_pollutants, 1, figsize=(15, 8 * num_pollutants), sharex=False)
        if num_pollutants == 1:
            axes = [axes]  # Make axes iterable when there is only one subplot
        fig.suptitle(f'Average Annual Pollutant Levels by Community District ({int(recent_annual_year)})',
                     fontsize=16, y=1.02)
        for i, pollutant in enumerate(pollutants_to_plot):
            ax = axes[i]
            data_subset = geo_data_recent[geo_data_recent['Name'] == pollutant].sort_values('Data_Value', ascending=False)
            sns.barplot(data=data_subset, y='Geo_Place_Name', x='Data_Value', ax=ax, palette='coolwarm', orient='h')
            ax.set_title(pollutant)
            unit = data_subset['Measure_Info'].iloc[0] if not data_subset.empty else ''
            ax.set_xlabel(f'Concentration ({unit})')
            ax.set_ylabel('Community District')
        plt.tight_layout()
        plt.show()
        print("Insight: Shows variation in pollutant levels between different Community Districts.")
    else:
        print(f"No annual Community District (CD) data found for the most recent year ({int(recent_annual_year)}).")
elif 'df_pollutants' not in locals():
    print("Filtered DataFrame 'df_pollutants' not available.")
else:
    print("No annual average data available for geographical analysis.")
Insight: Shows variation in pollutant levels between different Community Districts.
Analysis Insights (Provided)¶
- The dataset primarily tracks Nitrogen dioxide (NO2, ID 375) and Fine particles (PM 2.5, ID 365), with some data on Ozone (O3, ID 386), SO2 emissions (ID 640), and related health outcomes (Asthma ED visits - ID 657, Respiratory hospitalizations - ID 650).
- Data spans various NYC geographical units (CD, UHF34, UHF42, Borough) and time periods (Annual Average, Winter, Summer) from approximately 2008 to 2023.
- A clear seasonal trend is observed for NO2 and PM 2.5, with concentrations generally higher during Winter periods compared to Summer periods for the same location and year.
- Ozone (O3) measurements are predominantly available for Summer periods, aligning with its photochemical formation process.
- Geographical disparities are noticeable. Areas like Upper West Side (CD7), South Bronx (UHF 105106107), and Midtown (CD5) often exhibit higher NO2/PM2.5 levels, while coastal areas like Rockaway (CD14) and South Beach - Tottenville (UHF 504) tend to show lower concentrations.
- Analysis of locations with longer data series (e.g., Rockaway and Broad Channel CD14, Upper West Side CD7) suggests a general decreasing trend in annual average NO2 and PM 2.5 levels over the available years, indicating potential air quality improvements.
- Indicators 640 (SO2 Emissions), 657 (Asthma ED Visits), and 650 (Respiratory Hospitalizations) use different units and represent different types of metrics (emissions density, health rates) compared to direct pollutant concentrations, requiring separate interpretation.
Potential Anomalies (Provided)¶
- The 'Start_Date' convention for 'Annual Average YYYY' periods often uses '12/31/YYYY-1' or '01/01/YYYY'. This seems consistent but could be misinterpreted if not considering the 'Time Period' column. (Note: The cleaning step attempted to address this by creating 'Analysis_Year').
- No obvious extreme data entry errors were detected in pollutant values (NO2, PM2.5, O3). Higher values observed in earlier years (e.g., 2008-2011) compared to recent years reflect historical trends rather than data errors.
- The dataset mixes different measurement units (ppb, mcg/m3, number per km2, rate per 100,000) across indicators, making direct comparison between all indicators inappropriate without normalization or careful consideration.
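Because the indicators mix units (ppb, mcg/m3, rates), one common workaround for cross-indicator comparison is to standardize values within each indicator. A minimal sketch on synthetic data (values illustrative only):

```python
import pandas as pd

# Synthetic values in incompatible units (e.g. ppb vs mcg/m3)
toy = pd.DataFrame({
    'Name': ['NO2', 'NO2', 'NO2', 'PM 2.5', 'PM 2.5', 'PM 2.5'],
    'Data_Value': [20.0, 22.0, 24.0, 8.0, 9.0, 10.0],
})

# Z-score within each indicator puts values on a comparable, unitless scale
toy['Z'] = toy.groupby('Name')['Data_Value'].transform(
    lambda s: (s - s.mean()) / s.std())
print(toy)
```

After this transform, a value of +1 means "one standard deviation above that indicator's own mean", which is comparable across indicators; the raw units remain incomparable.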
Predictive Analysis (Provided)¶
- Status: Not Applicable
- Reason: Predictive analysis is not feasible due to data limitations. The dataset is sparse, with inconsistent time series for many location-indicator combinations. Measurements are spread across different geographical units and time granularities (Annual, Winter, Summer) without sufficient continuous data points for reliable forecasting.
Analysis Summary (Provided)¶
The provided air quality data for NYC highlights significant seasonal and geographical variations for pollutants like NO2 and PM 2.5, with higher levels typically in winter and denser urban areas. A general trend of decreasing concentrations is observable from 2008-2023 in areas with sufficient data. The dataset also includes SO2 emissions and health outcome data, though less frequently measured. Data sparsity and inconsistent time series prevent reliable predictive modeling.