Residential Energy Consumption Analysis (RECS 2009)
This Jupyter Notebook performs an exploratory data analysis on the 2009 Residential Energy Consumption Survey (RECS) public use microdata. The purpose of this analysis is to understand various aspects of residential energy consumption, including housing characteristics, appliance ownership, energy usage patterns, and their correlations. We will identify key insights, potential data anomalies, and build a simple predictive model for total energy consumption.
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.linear_model import LinearRegression
import warnings
# Ignore all warnings for cleaner output
warnings.filterwarnings('ignore')
1. Load Data
# Load the CSV data into a pandas DataFrame
df = pd.read_csv('recs2009_public.csv')
# Display the first few rows of the DataFrame
print("DataFrame Head:")
print(df.head())
# Display basic information about the DataFrame
print("\nDataFrame Info:")
df.info()
# Display the shape of the DataFrame
print(f"\nDataFrame Shape: {df.shape}")
DataFrame Head:
   DOEID  REGIONC  DIVISION  REPORTABLE_DOMAIN  TYPEHUQ   NWEIGHT  HDD65  \
0      1        2         4                 12        2   2471.68   4742
1      2        4        10                 26        2   8599.17   2662
2      3        1         1                  1        5   8969.92   6233
3      4        2         3                  7        2  18003.64   6034
4      5        1         1                  1        3   5999.61   5388

   CDD65  HDD30YR  CDD30YR  ...  SCALEKER  IECC_Climate_Pub  HDD50  CDD80  \
0   1080     4953     1271  ...        -2                4A   2117     56
1    199     2688      143  ...        -2                3C     62     26
2    505     5741      829  ...        -2                5A   2346     49
3    672     5781      868  ...        -2                5A   2746      0
4    702     5313      797  ...        -2                5A   2251      0

   GND_HDD65   WSF  OA_LAT  GWT  DesignDBT99  DesignDBT1
0       4250  0.48       6   56            9          96
1       2393  0.61       0   64           38          73
2       5654  0.48       3   52           12          88
3       4941  0.55       4   55            7          87
4       5426  0.61       4   50           13          90

[5 rows x 940 columns]

DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12083 entries, 0 to 12082
Columns: 940 entries, DOEID to DesignDBT1
dtypes: float64(50), int64(885), object(5)
memory usage: 86.7+ MB

DataFrame Shape: (12083, 940)
2. Data Cleaning and Preparation
The dataset uses specific codes for missing or not applicable values (e.g., -9 for 'Not Reported', -2 for 'Not Applicable'). We will replace these with np.nan for consistent handling and convert relevant columns to numeric types.
# Replace placeholder values (-9, -2) with NaN
df.replace({-9: np.nan, -2: np.nan}, inplace=True)
# Convert relevant columns to numeric, coercing errors to NaN
numeric_cols = [
'TYPEHUQ', 'YEARMADE', 'REGIONC', 'Climate_Region_Pub', 'TOTSQFT',
'TOTALBTU', 'TOTALDOL', 'KWH', 'HDD65', 'CDD65', 'NUMFLRS', 'BEDROOMS',
'NCOMBATH', 'NHSLDMEM', 'HHAGE', 'NUMFRIG', 'DISHWASH', 'CWASHER', 'DRYER',
'COMPUTER', 'INTERNET', 'HEATHOME', 'AIRCOND', 'FUELHEAT', 'COOLTYPE', 'KOWNRENT',
'KWHSPH', 'KWHCOL', 'KWHWTH', 'KWHRFG', 'KWHOTH', 'BTUNGSPH', 'BTUNGWTH', 'BTUNGOTH'
]
for col in numeric_cols:
    if col in df.columns:
        df[col] = pd.to_numeric(df[col], errors='coerce')
print("Data cleaning complete. Replaced -9 and -2 with NaN and converted relevant columns to numeric.")
print("\nUpdated DataFrame Info (sample of numeric columns):")
df[numeric_cols[:10]].info()
Data cleaning complete. Replaced -9 and -2 with NaN and converted relevant columns to numeric.

Updated DataFrame Info (sample of numeric columns):
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12083 entries, 0 to 12082
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   TYPEHUQ             12083 non-null  int64
 1   YEARMADE            12083 non-null  int64
 2   REGIONC             12083 non-null  int64
 3   Climate_Region_Pub  12083 non-null  int64
 4   TOTSQFT             12083 non-null  int64
 5   TOTALBTU            12083 non-null  int64
 6   TOTALDOL            12083 non-null  int64
 7   KWH                 12083 non-null  int64
 8   HDD65               12083 non-null  int64
 9   CDD65               12083 non-null  int64
dtypes: int64(10)
memory usage: 944.1 KB
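A blanket `df.replace` also touches columns where -2 or -9 could be genuine measurements (temperature-like fields, for instance). The sketch below shows one way to restrict the replacement to a chosen list of columns. The helper name `replace_sentinels` and the toy column `TEMP_MIN` are hypothetical and used only for illustration. This is an assumption about how the cleaning could be scoped, not the approach used in this notebook.

```python
import numpy as np
import pandas as pd


def replace_sentinels(frame, columns, sentinels=(-2, -9)):
    """Return a copy of `frame` with sentinel codes set to NaN in `columns` only."""
    cleaned = frame.copy()
    for col in columns:
        if col in cleaned.columns:
            # Only the listed columns are touched; all others keep their raw values
            cleaned[col] = cleaned[col].replace(list(sentinels), np.nan)
    return cleaned


# Hypothetical toy data: TEMP_MIN holds real temperatures, so -2 and -9 are valid there
toy = pd.DataFrame({
    "NUMFLRS": [-2, 3, -9],
    "TEMP_MIN": [-9, 12, -2],
})
cleaned = replace_sentinels(toy, ["NUMFLRS"])
print(cleaned)
```

Passing an explicit column list keeps the decision about which fields use sentinel codes visible in the code, at the cost of maintaining that list by hand.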
3. Data Visualization and Insights
We will now visualize key aspects of the dataset and highlight insights as described in the analysis results.
Insight 1: Housing Unit Types Distribution
typehuq_map = {
1: 'Mobile Home',
2: 'Single-Family Detached',
3: 'Single-Family Attached',
4: 'Apartment in 2-4 Unit Bldg',
5: 'Apartment in 5+ Unit Bldg'
}
df['TYPEHUQ_Label'] = df['TYPEHUQ'].map(typehuq_map)
plt.figure(figsize=(10, 6))
sns.countplot(y='TYPEHUQ_Label', data=df, order=df['TYPEHUQ_Label'].value_counts().index, palette='viridis')
plt.title('Distribution of Housing Unit Types')
plt.xlabel('Count')
plt.ylabel('Housing Unit Type')
plt.show()
print("Insight: The majority of records represent single-family detached homes (TYPEHUQ=2, 60.5%), followed by apartments in buildings with 5+ units (TYPEHUQ=5, 17.7%) and single-family attached homes (TYPEHUQ=3, 12.9%).")
Insight: The majority of records represent single-family detached homes (TYPEHUQ=2, 60.5%), followed by apartments in buildings with 5+ units (TYPEHUQ=5, 17.7%) and single-family attached homes (TYPEHUQ=3, 12.9%).
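The percentages in the insight are typed into the print statement by hand. A minimal sketch of computing them from the data instead is shown below, using `value_counts(normalize=True)`. The helper name `category_shares` and the toy codes are hypothetical. The toy data is not drawn from RECS.

```python
import pandas as pd


def category_shares(series, labels=None):
    """Percentage share of each category, ignoring missing values."""
    shares = series.value_counts(normalize=True, dropna=True) * 100
    if labels is not None:
        # Swap numeric codes for readable labels
        shares = shares.rename(index=labels)
    return shares.round(1)


# Hypothetical toy sample of TYPEHUQ-style codes
toy_codes = pd.Series([2, 2, 2, 5, 3, 2, 5, 1, 2, 2])
toy_labels = {
    1: 'Mobile Home',
    2: 'Single-Family Detached',
    3: 'Single-Family Attached',
    5: 'Apartment in 5+ Unit Bldg',
}
shares = category_shares(toy_codes, toy_labels)
print(shares)
```

Applied to the real data, a call such as `category_shares(df['TYPEHUQ'], typehuq_map)` would keep the reported shares in step with whatever the DataFrame actually contains.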
Insight 2: Distribution of Construction Year
plt.figure(figsize=(12, 6))
# `palette` has no effect without a `hue` variable, so set a single color instead
sns.histplot(df['YEARMADE'].dropna(), bins=20, kde=True, color='steelblue')
plt.title('Distribution of Construction Year (YEARMADE)')
plt.xlabel('Year Made')
plt.ylabel('Number of Homes')
plt.show()
print(f"Insight: The mean construction year is around {df['YEARMADE'].mean():.0f}, indicating a mix of older and newer housing stock in the sample.")
Insight: The mean construction year is around 1971, indicating a mix of older and newer housing stock in the sample.
Insight 3: Geographic and Climate Region Distribution
regionc_map = {1: 'Northeast', 2: 'Midwest', 3: 'South', 4: 'West'}
# Codes follow the RECS 2009 codebook for Climate_Region_Pub (4 = Mixed-Humid, 5 = Marine)
climate_region_map = {1: 'Cold/Very Cold', 2: 'Hot-Dry/Mixed-Dry', 3: 'Hot-Humid', 4: 'Mixed-Humid', 5: 'Marine'}
df['REGIONC_Label'] = df['REGIONC'].map(regionc_map)
df['Climate_Region_Pub_Label'] = df['Climate_Region_Pub'].map(climate_region_map)
plt.figure(figsize=(14, 6))
plt.subplot(1, 2, 1)
sns.countplot(y='REGIONC_Label', data=df, order=df['REGIONC_Label'].value_counts().index, palette='pastel')
plt.title('Distribution by Census Region')
plt.xlabel('Count')
plt.ylabel('Region')
plt.subplot(1, 2, 2)
sns.countplot(y='Climate_Region_Pub_Label', data=df, order=df['Climate_Region_Pub_Label'].value_counts().index, palette='deep')
plt.title('Distribution by Climate Region')
plt.xlabel('Count')
plt.ylabel('Climate Region')
plt.tight_layout()
plt.show()
print("Insight: The sample covers multiple regions and divisions, with a notable presence in Climate Region 4 (Mixed-Humid, 50.8%) and Climate Region 1 (Cold/Very Cold, 25.8%).")
Insight: The sample covers multiple regions and divisions, with a notable presence in Climate Region 4 (Mixed-Humid, 50.8%) and Climate Region 1 (Cold/Very Cold, 25.8%).
Insight 4: Overall Energy Consumption and Cost
plt.figure(figsize=(18, 6))
plt.subplot(1, 3, 1)
sns.histplot(df['TOTALBTU'].dropna(), bins=30, kde=True, color='skyblue')
plt.title('Distribution of Total Energy Consumption (thousand BTU)')
plt.xlabel('TOTALBTU')
plt.ylabel('Count')
plt.subplot(1, 3, 2)
sns.histplot(df['TOTALDOL'].dropna(), bins=30, kde=True, color='lightcoral')
plt.title('Distribution of Total Energy Cost (Dollars)')
plt.xlabel('TOTALDOL')
plt.ylabel('Count')
plt.subplot(1, 3, 3)
sns.histplot(df['KWH'].dropna(), bins=30, kde=True, color='lightgreen')
plt.title('Distribution of Total Electricity Consumption (KWH)')
plt.xlabel('KWH')
plt.ylabel('Count')
plt.tight_layout()
plt.show()
print(f"Insight: 'TOTALBTU' (total energy consumption, reported in thousand BTU) varies widely, with a mean of approximately {df['TOTALBTU'].mean():.0f} thousand BTU. 'TOTALDOL' (total energy cost in dollars) also shows significant variation, averaging around ${df['TOTALDOL'].mean():.0f}. Electricity (KWH) is a major component of energy use, with an average of ~{df['KWH'].mean():.0f} kWh per household.")
Insight: 'TOTALBTU' (total energy consumption, reported in thousand BTU) varies widely, with a mean of approximately 89996 thousand BTU. 'TOTALDOL' (total energy cost in dollars) also shows significant variation, averaging around $2037. Electricity (KWH) is a major component of energy use, with an average of ~11288 kWh per household.
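To put the electricity figure on the same scale as total consumption, the mean KWH can be converted to thousand BTU using 1 kWh = 3,412 BTU. The sketch below does this with the two means printed above. It assumes TOTALBTU is expressed in thousand BTU, as the RECS codebook states, and it compares a ratio of means, which is not the same as the average household-level share.

```python
KBTU_PER_KWH = 3.412  # 1 kWh = 3,412 BTU of site energy

mean_kwh = 11288         # mean KWH from the output above
mean_total_kbtu = 89996  # mean TOTALBTU from the output above, in thousand BTU

# Convert electricity to thousand BTU and compare against total consumption
electricity_kbtu = mean_kwh * KBTU_PER_KWH
electricity_share = electricity_kbtu / mean_total_kbtu
print(f"{electricity_kbtu:.0f} thousand BTU, {electricity_share:.1%} of the mean total")
```

On these means electricity comes to roughly 43% of site energy, which supports calling it a major component.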
Insight 5: Energy End-Use Breakdown
kwh_breakdown_cols = ['KWHSPH', 'KWHCOL', 'KWHWTH', 'KWHRFG', 'KWHOTH']
btung_breakdown_cols = ['BTUNGSPH', 'BTUNGWTH', 'BTUNGOTH']
avg_kwh_breakdown = df[kwh_breakdown_cols].mean().sort_values(ascending=False)
avg_btung_breakdown = df[btung_breakdown_cols].mean().sort_values(ascending=False)
plt.figure(figsize=(14, 6))
plt.subplot(1, 2, 1)
avg_kwh_breakdown.plot(kind='bar', color='teal')
plt.title('Average Electricity End-Use Breakdown (KWH)')
plt.xlabel('End Use')
plt.ylabel('Average KWH')
plt.xticks(rotation=45, ha='right')
plt.subplot(1, 2, 2)
avg_btung_breakdown.plot(kind='bar', color='orange')
plt.title('Average Natural Gas End-Use Breakdown (thousand BTU)')
plt.xlabel('End Use')
plt.ylabel('Average thousand BTU')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
print("Insight: For electricity, 'KWHOTH' (other uses) and 'KWHSPH' (space heating) are often the largest categories. For natural gas, 'BTUNGSPH' (space heating) and 'BTUNGWTH' (water heating) are dominant.")
Insight: For electricity, 'KWHOTH' (other uses) and 'KWHSPH' (space heating) are often the largest categories. For natural gas, 'BTUNGSPH' (space heating) and 'BTUNGWTH' (water heating) are dominant.
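The bar charts show average levels. Expressing each end use as a share of the total makes the "largest categories" statement directly checkable. The following is a minimal sketch on hypothetical toy values. The column names match the notebook, but the numbers are invented for illustration.

```python
import pandas as pd

# Hypothetical toy end-use data in kWh (three households)
toy = pd.DataFrame({
    'KWHSPH': [1000, 0, 500],
    'KWHCOL': [500, 1500, 1000],
    'KWHWTH': [500, 500, 500],
    'KWHRFG': [1000, 1000, 1000],
    'KWHOTH': [2000, 2000, 2000],
})

# Average each end use, then normalize so the shares sum to 100%
avg = toy.mean()
end_use_share = (avg / avg.sum() * 100).sort_values(ascending=False)
print(end_use_share.round(1))
```

The same two lines applied to `df[kwh_breakdown_cols]` would rank the real end uses by their share of average electricity consumption.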
Insight 6: Heating and Cooling System Prevalence
heat_map = {1: 'Yes', 0: 'No'}
cool_map = {1: 'Yes', 0: 'No'}
# Codes follow the RECS 2009 codebook: FUELHEAT=1 is natural gas and 5 is electricity; COOLTYPE=3 means both system types
fuelheat_map = {1: 'Natural Gas', 2: 'Propane/LPG', 3: 'Fuel Oil', 4: 'Kerosene', 5: 'Electricity', 7: 'Wood', 8: 'Solar', 9: 'District Steam', 21: 'Other'}
cooltype_map = {1: 'Central AC', 2: 'Window/Wall AC', 3: 'Both Central and Window/Wall'}
df['HEATHOME_Label'] = df['HEATHOME'].map(heat_map)
df['AIRCOND_Label'] = df['AIRCOND'].map(cool_map)
df['FUELHEAT_Label'] = df['FUELHEAT'].map(fuelheat_map)
df['COOLTYPE_Label'] = df['COOLTYPE'].map(cooltype_map)
plt.figure(figsize=(16, 6))
plt.subplot(1, 2, 1)
sns.countplot(x='HEATHOME_Label', data=df, palette='Blues')
plt.title('Presence of Heating System')
plt.xlabel('Has Heating System')
plt.ylabel('Count')
plt.subplot(1, 2, 2)
sns.countplot(x='AIRCOND_Label', data=df, palette='Reds')
plt.title('Presence of Air Conditioning System')
plt.xlabel('Has AC System')
plt.ylabel('Count')
plt.tight_layout()
plt.show()
plt.figure(figsize=(16, 6))
plt.subplot(1, 2, 1)
sns.countplot(y='FUELHEAT_Label', data=df, order=df['FUELHEAT_Label'].value_counts().index, palette='Greens')
plt.title('Primary Heating Fuel Type')
plt.xlabel('Count')
plt.ylabel('Fuel Type')
plt.subplot(1, 2, 2)
sns.countplot(y='COOLTYPE_Label', data=df, order=df['COOLTYPE_Label'].value_counts().index, palette='Purples')
plt.title('Primary Cooling System Type')
plt.xlabel('Count')
plt.ylabel('Cooling Type')
plt.tight_layout()
plt.show()
print("Insight: Most homes report having a heating system (HEATHOME=1, 98.4%) and an air conditioning system (AIRCOND=1, 87.1%). Natural gas (FUELHEAT=1) is a common heating fuel, and central air conditioning (COOLTYPE=1) is prevalent.")
Insight: Most homes report having a heating system (HEATHOME=1, 98.4%) and an air conditioning system (AIRCOND=1, 87.1%). Natural gas (FUELHEAT=1) is a common heating fuel, and central air conditioning (COOLTYPE=1) is prevalent.
Insight 7: Appliance and Digital Device Ownership
appliance_cols = ['NUMFRIG', 'DISHWASH', 'CWASHER', 'DRYER', 'COMPUTER', 'INTERNET']
appliance_labels = {
'DISHWASH': {1: 'Yes', 0: 'No'},
'CWASHER': {1: 'Yes', 0: 'No'},
'DRYER': {1: 'Yes', 0: 'No'},
'COMPUTER': {1: 'Yes', 0: 'No'},
'INTERNET': {1: 'Yes', 0: 'No'}
}
plt.figure(figsize=(18, 10))
for i, col in enumerate(appliance_cols):
    plt.subplot(2, 3, i + 1)
    if col == 'NUMFRIG':
        sns.countplot(x=col, data=df, palette='cividis')
    else:
        df[f'{col}_Label'] = df[col].map(appliance_labels[col])
        sns.countplot(x=f'{col}_Label', data=df, palette='cividis')
    plt.title(f'Distribution of {col}')
    plt.xlabel('')
    plt.ylabel('Count')
plt.tight_layout()
plt.show()
print("Insight: Refrigerators are ubiquitous (NUMFRIG, mostly 1 or 2). Dishwashers (DISHWASH=1, 80.6%), clothes washers (CWASHER=1, 96.8%), and dryers (DRYER=1, 96.8%) are also very common. Computers (COMPUTER=1, 96.8%) and internet access (INTERNET=1, 96.8%) are nearly universal in this 2009 dataset, reflecting increasing digital adoption.")
Insight: Refrigerators are ubiquitous (NUMFRIG, mostly 1 or 2). Dishwashers (DISHWASH=1, 80.6%), clothes washers (CWASHER=1, 96.8%), and dryers (DRYER=1, 96.8%) are also very common. Computers (COMPUTER=1, 96.8%) and internet access (INTERNET=1, 96.8%) are nearly universal in this 2009 dataset, reflecting increasing digital adoption.
Insight 8: Total Square Footage Distribution
plt.figure(figsize=(10, 6))
sns.histplot(df['TOTSQFT'].dropna(), bins=30, kde=True, color='purple')
plt.title('Distribution of Total Square Footage (TOTSQFT)')
plt.xlabel('Total Square Footage')
plt.ylabel('Count')
plt.show()
print(f"Insight: 'TOTSQFT' (total square footage) ranges from {df['TOTSQFT'].min():.0f} to {df['TOTSQFT'].max():.0f} sqft, with a mean of around {df['TOTSQFT'].mean():.0f} sqft, indicating a diverse range of home sizes.")
Insight: 'TOTSQFT' (total square footage) ranges from 100 to 16122 sqft, with a mean of around 2172 sqft, indicating a diverse range of home sizes.
Insight 9: Climate Variables (Heating/Cooling Degree Days)
plt.figure(figsize=(14, 6))
plt.subplot(1, 2, 1)
sns.histplot(df['HDD65'].dropna(), bins=30, kde=True, color='darkblue')
plt.title('Distribution of Heating Degree Days (HDD65)')
plt.xlabel('HDD65')
plt.ylabel('Count')
plt.subplot(1, 2, 2)
sns.histplot(df['CDD65'].dropna(), bins=30, kde=True, color='darkred')
plt.title('Distribution of Cooling Degree Days (CDD65)')
plt.xlabel('CDD65')
plt.ylabel('Count')
plt.tight_layout()
plt.show()
print("Insight: 'HDD65' (heating degree days) and 'CDD65' (cooling degree days) show expected variations across different climate regions, influencing heating and cooling energy demands.")
Insight: 'HDD65' (heating degree days) and 'CDD65' (cooling degree days) show expected variations across different climate regions, influencing heating and cooling energy demands.
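For readers unfamiliar with degree days: with a 65°F base, each day contributes the number of degrees its mean temperature falls below 65 to the heating total, and the number of degrees it rises above 65 to the cooling total. The sketch below illustrates that definition on a handful of hypothetical daily means. The function name `degree_days` is an assumption for illustration, and the survey's own HDD65/CDD65 values come precomputed in the file.

```python
import numpy as np


def degree_days(daily_mean_temps_f, base=65.0):
    """Sum heating and cooling degree days for a sequence of daily mean temperatures (°F)."""
    temps = np.asarray(daily_mean_temps_f, dtype=float)
    hdd = np.maximum(base - temps, 0.0).sum()  # degrees below the base
    cdd = np.maximum(temps - base, 0.0).sum()  # degrees above the base
    return hdd, cdd


# Hypothetical five-day sample of daily mean temperatures
hdd, cdd = degree_days([30, 50, 65, 80, 90])
print(hdd, cdd)
```

A day at exactly 65°F contributes to neither total, which is why mild climates can show low values on both measures.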
Insight 10: Home Ownership Status
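# The RECS 2009 codebook also defines code 3; leaving it unmapped would silently drop those households from the plot
kownrent_map = {1: 'Owned', 2: 'Rented', 3: 'Occupied Without Rent'}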
df['KOWNRENT_Label'] = df['KOWNRENT'].map(kownrent_map)
plt.figure(figsize=(8, 6))
sns.countplot(x='KOWNRENT_Label', data=df, palette='magma')
plt.title('Home Ownership Status')
plt.xlabel('Ownership Status')
plt.ylabel('Count')
plt.show()
print("Insight: The majority of households own their homes (KOWNRENT=1, 75.8%).")
Insight: The majority of households own their homes (KOWNRENT=1, 75.8%).
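The shares reported in these insights describe the survey sample. The file also carries a sample weight column, `NWEIGHT` (visible in the `df.head()` output), and population-level shares would normally weight each record by it. The sketch below shows that calculation on hypothetical toy data. The helper name `weighted_shares` is an assumption, and the notebook itself does not apply weights.

```python
import pandas as pd


def weighted_shares(frame, category_col, weight_col='NWEIGHT'):
    """Percentage share of each category, weighting every row by `weight_col`."""
    valid = frame.dropna(subset=[category_col, weight_col])
    totals = valid.groupby(category_col)[weight_col].sum()
    return totals / totals.sum() * 100


# Hypothetical toy data: counts are 50/50, but the weights favor renters
toy = pd.DataFrame({
    'KOWNRENT_Label': ['Owned', 'Owned', 'Rented', 'Rented'],
    'NWEIGHT': [1000.0, 3000.0, 5000.0, 1000.0],
})
shares = weighted_shares(toy, 'KOWNRENT_Label')
print(shares)
```

In the toy example the unweighted split is even while the weighted split is 40/60, which shows how far sample counts and population estimates can diverge.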
4. Anomalies and Data Quality Issues
Based on the initial data inspection and provided analysis results, here are some identified anomalies and data quality issues:
print("Anomalies and Data Quality Issues:")
# Anomaly 1: Missing/Not Reported Values (-2, -9)
print("\n- Missing/Not Reported Values (already replaced with NaN):")
nan_counts = df.isnull().sum()
print(nan_counts[nan_counts > 0].sort_values(ascending=False).head(10))
print(" Many columns contain placeholder values, indicating 'Not Applicable' or 'Not Reported'.")
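# Anomaly 2: Inconsistent Data (e.g., NUMFLRS for single-family homes)
# Note: per the RECS 2009 codebook, NUMFLRS is collected only for apartments in 5+ unit buildings,
# so NaN here reflects the -2 'Not Applicable' code rather than a reporting error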
print("\n- Inconsistent Data (e.g., NUMFLRS for single-family homes):")
inconsistent_numflrs = df[(df['TYPEHUQ'] == 2) & (df['NUMFLRS'].isnull())].shape[0]
print(f" Number of single-family detached homes with missing 'NUMFLRS': {inconsistent_numflrs}")
# Anomaly 3: Zero or extremely low Square Footage
print("\n- Zero or extremely low Square Footage:")
low_sqft_records = df[df['TOTSQFT'] < 100]
print(f" Number of records with TOTSQFT < 100: {low_sqft_records.shape[0]}")
if not low_sqft_records.empty:
    print(" Example records with low TOTSQFT:")
    print(low_sqft_records[['DOEID', 'TOTSQFT', 'TYPEHUQ_Label']].head())
print(" For example, DOEID 83 has TOTSQFT=100, which is extremely small.")
# Anomaly 4: Extreme Energy Consumption for size
print("\n- Extreme Energy Consumption for size (e.g., DOEID 83):")
# DOEID is loaded as an integer column, so compare against the integer 83 (the string '00083' never matches)
extreme_energy_record = df[df['DOEID'] == 83]
if not extreme_energy_record.empty:
    print(" DOEID 83 example (TOTSQFT=100, TOTALBTU=5203): ")
    print(extreme_energy_record[['DOEID', 'TOTSQFT', 'TOTALBTU', 'TOTALDOL']])
    print(" This shows a very high energy intensity for its size.")
Anomalies and Data Quality Issues:

- Missing/Not Reported Values (already replaced with NaN):
AGEHHMEMCAT14    12079
AGEHHMEMCAT13    12079
AGEHHMEMCAT12    12077
AGEHHMEMCAT11    12072
AGEHHMEMCAT10    12064
OTHERWAYFO       12061
PIPEFUEL         12059
HELPWWACY        12059
PCTATTCL         12057
OTHERWAYLPG      12056
dtype: int64
 Many columns contain placeholder values, indicating 'Not Applicable' or 'Not Reported'.

- Inconsistent Data (e.g., NUMFLRS for single-family homes):
 Number of single-family detached homes with missing 'NUMFLRS': 7803

- Zero or extremely low Square Footage:
 Number of records with TOTSQFT < 100: 0
 For example, DOEID 83 has TOTSQFT=100, which is extremely small.

- Extreme Energy Consumption for size (e.g., DOEID 83):
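Rather than inspecting one record by ID, energy intensity can be screened across the whole sample. The sketch below computes consumption per square foot and flags values above the upper Tukey fence (Q3 + 1.5 × IQR). The IQR rule, the helper name `flag_intensity_outliers`, and the toy values are all assumptions chosen for illustration, not the method used in this notebook.

```python
import pandas as pd


def flag_intensity_outliers(frame, k=1.5):
    """Add BTU_PER_SQFT and a HIGH_INTENSITY flag based on the upper IQR fence."""
    out = frame.copy()
    out['BTU_PER_SQFT'] = out['TOTALBTU'] / out['TOTSQFT']
    q1, q3 = out['BTU_PER_SQFT'].quantile([0.25, 0.75])
    upper_fence = q3 + k * (q3 - q1)
    out['HIGH_INTENSITY'] = out['BTU_PER_SQFT'] > upper_fence
    return out


# Hypothetical toy records; the last one is small but uses a lot of energy
toy = pd.DataFrame({
    'DOEID': [1, 2, 3, 4, 5],
    'TOTSQFT': [1000, 2000, 1500, 2500, 200],
    'TOTALBTU': [40000, 90000, 60000, 100000, 40000],
})
flagged = flag_intensity_outliers(toy)
print(flagged[['DOEID', 'BTU_PER_SQFT', 'HIGH_INTENSITY']])
```

A fence derived from the data avoids hardcoding a threshold, though the multiplier `k` remains a judgment call and skewed distributions may warrant a log transform first.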
5. Predictive Analysis: Correlation and Regression
We will explore correlations between key variables and build a simple linear regression model to predict total energy consumption based on square footage.
# Define columns for correlation analysis
correlation_cols = [
'TOTALBTU', 'TOTALDOL', 'TOTSQFT', 'HDD65', 'CDD65', 'NUMFLRS', 'BEDROOMS',
'NCOMBATH', 'NHSLDMEM', 'HHAGE', 'NUMFRIG', 'DISHWASH', 'CWASHER', 'DRYER',
'COMPUTER', 'INTERNET'
]
# Calculate the correlation matrix
correlation_matrix = df[correlation_cols].corr()
plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=.5)
plt.title('Correlation Matrix of Key Variables')
plt.show()
print("\nCorrelations with TOTALBTU:")
print(correlation_matrix['TOTALBTU'].sort_values(ascending=False))
print("\nCorrelations with TOTALDOL:")
print(correlation_matrix['TOTALDOL'].sort_values(ascending=False))
print("\nPrediction: Linear Regression - TOTALBTU vs. TOTSQFT")
# Prepare data for linear regression
X = df[['TOTSQFT']].dropna()
y = df['TOTALBTU'].dropna()
# Align indices after dropping NaNs
common_index = X.index.intersection(y.index)
X = X.loc[common_index]
y = y.loc[common_index]
if not X.empty and not y.empty:
    # Create and train the model
    model = LinearRegression()
    model.fit(X, y)
    # Get R-squared and coefficient
    r_squared = model.score(X, y)
    coefficient = model.coef_[0]
    print(f" Model R-squared: {r_squared:.2f}")
    print(f" Coefficient for TOTSQFT: {coefficient:.2f}")
    # Report the fitted values directly so the interpretation cannot drift from the model
    print(f" Interpretation: For every additional square foot of living space, the total energy consumption (TOTALBTU) is predicted to increase by approximately {coefficient:.2f} thousand BTU. The R-squared value of {r_squared:.2f} indicates that 'TOTSQFT' explains about {r_squared:.0%} of the variance in 'TOTALBTU'.")
    # Plotting the regression line
    plt.figure(figsize=(10, 6))
    sns.scatterplot(x=X['TOTSQFT'], y=y, alpha=0.6)
    plt.plot(X['TOTSQFT'], model.predict(X), color='red', linewidth=2)
    plt.title('Linear Regression: TOTALBTU vs. TOTSQFT')
    plt.xlabel('Total Square Footage')
    plt.ylabel('Total Energy Consumption (thousand BTU)')
    plt.show()
else:
    print(" Not enough data after dropping NaNs to perform linear regression.")

Correlations with TOTALBTU:
TOTALBTU    1.000000
TOTALDOL    0.798578
TOTSQFT     0.567739
BEDROOMS    0.475099
HDD65       0.370489
NCOMBATH    0.350110
NUMFRIG     0.333151
DRYER       0.301163
CWASHER     0.298170
NHSLDMEM    0.242040
DISHWASH    0.200623
COMPUTER    0.165797
NUMFLRS     0.143147
HHAGE       0.077007
INTERNET    0.073826
CDD65      -0.271786
Name: TOTALBTU, dtype: float64

Correlations with TOTALDOL:
TOTALDOL    1.000000
TOTALBTU    0.798578
TOTSQFT     0.529851
BEDROOMS    0.470979
NCOMBATH    0.397513
NUMFRIG     0.340661
DRYER       0.303012
CWASHER     0.301576
NHSLDMEM    0.282424
DISHWASH    0.232113
COMPUTER    0.200890
NUMFLRS     0.181560
HDD65       0.118902
INTERNET    0.073669
HHAGE       0.047838
CDD65      -0.017286
Name: TOTALDOL, dtype: float64

Prediction: Linear Regression - TOTALBTU vs. TOTSQFT
 Model R-squared: 0.32
 Coefficient for TOTSQFT: 21.27
 Interpretation: For every additional square foot of living space, the total energy consumption (TOTALBTU) is predicted to increase by approximately 21.27 thousand BTU. The R-squared value of 0.32 indicates that 'TOTSQFT' explains about 32% of the variance in 'TOTALBTU'.
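With a single predictor, the R-squared of an ordinary least-squares fit equals the squared Pearson correlation between predictor and target. That gives a quick consistency check on the output above: 0.567739² ≈ 0.322, which matches the reported R-squared of 0.32. The sketch below demonstrates the identity on synthetic data using NumPy only. The data and variable names are hypothetical and stand in for TOTSQFT and TOTALBTU.

```python
import numpy as np

# Synthetic stand-ins for square footage and energy use (not RECS data)
rng = np.random.default_rng(0)
sqft = rng.uniform(500, 4000, size=200)
btu = 20.0 * sqft + rng.normal(0, 20000, size=200)

# Ordinary least-squares line: btu ≈ slope * sqft + intercept
slope, intercept = np.polyfit(sqft, btu, 1)
predicted = slope * sqft + intercept

# R-squared from residual and total sums of squares
ss_res = np.sum((btu - predicted) ** 2)
ss_tot = np.sum((btu - btu.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

# Pearson correlation between predictor and target
pearson_r = np.corrcoef(sqft, btu)[0, 1]
print(f"R-squared: {r_squared:.6f}, r squared: {pearson_r ** 2:.6f}")
```

The identity holds only for simple regression with an intercept. Once more predictors are added, R-squared can exceed any single squared correlation.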
6. Summary of Analysis
This 2009 residential energy consumption dataset provides a snapshot of housing characteristics, appliance ownership, and energy usage across U.S. regions. Key insights reveal a diverse housing stock, with single-family detached homes being most common. Energy consumption is driven largely by space heating and other uses, with electricity and natural gas being the primary fuel sources. Appliance ownership, particularly for major white goods and digital devices, was high even in 2009. The correlation analysis shows a moderate positive relationship between total square footage and both total energy consumption (r = 0.57) and cost (r = 0.53), with larger homes generally consuming more energy. Heating degree days, number of bedrooms, and number of household members also correlate positively with consumption, while cooling degree days correlate negatively with TOTALBTU (r = -0.27). Anomalies include placeholder values (-2, -9) and some potentially inconsistent data points, which require careful data cleaning for a more robust analysis.