Residential Energy Consumption Analysis (RECS 2009)¶

This Jupyter Notebook performs an exploratory data analysis on the 2009 Residential Energy Consumption Survey (RECS) public use microdata. The purpose of this analysis is to understand various aspects of residential energy consumption, including housing characteristics, appliance ownership, energy usage patterns, and their correlations. We will identify key insights, potential data anomalies, and build a simple predictive model for total energy consumption.

In [1]:

# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.linear_model import LinearRegression
import warnings

# Ignore all warnings for cleaner output
warnings.filterwarnings('ignore')

1. Load Data¶

In [2]:

# Load the CSV data into a pandas DataFrame
df = pd.read_csv('recs2009_public.csv')

# Display the first few rows of the DataFrame
print("DataFrame Head:")
print(df.head())

# Display basic information about the DataFrame
print("\nDataFrame Info:")
df.info()

# Display the shape of the DataFrame
print(f"\nDataFrame Shape: {df.shape}")

DataFrame Head:
   DOEID  REGIONC  DIVISION  REPORTABLE_DOMAIN  TYPEHUQ   NWEIGHT  HDD65  \
0      1        2         4                 12        2   2471.68   4742   
1      2        4        10                 26        2   8599.17   2662   
2      3        1         1                  1        5   8969.92   6233   
3      4        2         3                  7        2  18003.64   6034   
4      5        1         1                  1        3   5999.61   5388   

   CDD65  HDD30YR  CDD30YR  ...  SCALEKER  IECC_Climate_Pub HDD50 CDD80  \
0   1080     4953     1271  ...        -2                4A  2117    56   
1    199     2688      143  ...        -2                3C    62    26   
2    505     5741      829  ...        -2                5A  2346    49   
3    672     5781      868  ...        -2                5A  2746     0   
4    702     5313      797  ...        -2                5A  2251     0   

   GND_HDD65   WSF  OA_LAT  GWT  DesignDBT99  DesignDBT1  
0       4250  0.48       6   56            9          96  
1       2393  0.61       0   64           38          73  
2       5654  0.48       3   52           12          88  
3       4941  0.55       4   55            7          87  
4       5426  0.61       4   50           13          90  

[5 rows x 940 columns]

DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12083 entries, 0 to 12082
Columns: 940 entries, DOEID to DesignDBT1
dtypes: float64(50), int64(885), object(5)
memory usage: 86.7+ MB

DataFrame Shape: (12083, 940)

2. Data Cleaning and Preparation¶

The dataset uses specific codes for missing or not applicable values (e.g., -9 for 'Not Reported', -2 for 'Not Applicable'). We will replace these with np.nan for consistent handling and convert relevant columns to numeric types.

In [3]:

# Replace placeholder values (-9, -2) with NaN
df.replace({-9: np.nan, -2: np.nan}, inplace=True)

# Convert relevant columns to numeric, coercing errors to NaN
numeric_cols = [
    'TYPEHUQ', 'YEARMADE', 'REGIONC', 'Climate_Region_Pub', 'TOTSQFT', 
    'TOTALBTU', 'TOTALDOL', 'KWH', 'HDD65', 'CDD65', 'NUMFLRS', 'BEDROOMS', 
    'NCOMBATH', 'NHSLDMEM', 'HHAGE', 'NUMFRIG', 'DISHWASH', 'CWASHER', 'DRYER', 
    'COMPUTER', 'INTERNET', 'HEATHOME', 'AIRCOND', 'FUELHEAT', 'COOLTYPE', 'KOWNRENT',
    'KWHSPH', 'KWHCOL', 'KWHWTH', 'KWHRFG', 'KWHOTH', 'BTUNGSPH', 'BTUNGWTH', 'BTUNGOTH'
]
for col in numeric_cols:
    if col in df.columns:
        df[col] = pd.to_numeric(df[col], errors='coerce')

print("Data cleaning complete. Replaced -9 and -2 with NaN and converted relevant columns to numeric.")
print("\nUpdated DataFrame Info (sample of numeric columns):")
df[numeric_cols[:10]].info()

Data cleaning complete. Replaced -9 and -2 with NaN and converted relevant columns to numeric.

Updated DataFrame Info (sample of numeric columns):
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12083 entries, 0 to 12082
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   TYPEHUQ             12083 non-null  int64
 1   YEARMADE            12083 non-null  int64
 2   REGIONC             12083 non-null  int64
 3   Climate_Region_Pub  12083 non-null  int64
 4   TOTSQFT             12083 non-null  int64
 5   TOTALBTU            12083 non-null  int64
 6   TOTALDOL            12083 non-null  int64
 7   KWH                 12083 non-null  int64
 8   HDD65               12083 non-null  int64
 9   CDD65               12083 non-null  int64
dtypes: int64(10)
memory usage: 944.1 KB

3. Data Visualization and Insights¶

We will now visualize key aspects of the dataset and highlight insights as described in the analysis results.

Insight 1: Housing Unit Types Distribution¶

In [4]:

typehuq_map = {
    1: 'Mobile Home',
    2: 'Single-Family Detached',
    3: 'Single-Family Attached',
    4: 'Apartment in 2-4 Unit Bldg',
    5: 'Apartment in 5+ Unit Bldg'
}
df['TYPEHUQ_Label'] = df['TYPEHUQ'].map(typehuq_map)

plt.figure(figsize=(10, 6))
sns.countplot(y='TYPEHUQ_Label', data=df, order=df['TYPEHUQ_Label'].value_counts().index, palette='viridis')
plt.title('Distribution of Housing Unit Types')
plt.xlabel('Count')
plt.ylabel('Housing Unit Type')
plt.show()

print("Insight: The majority of records represent single-family detached homes (TYPEHUQ=2, 60.5%), followed by apartments in buildings with 5+ units (TYPEHUQ=5, 17.7%) and single-family attached homes (TYPEHUQ=3, 12.9%).")

No description has been provided for this image

Insight: The majority of records represent single-family detached homes (TYPEHUQ=2, 60.5%), followed by apartments in buildings with 5+ units (TYPEHUQ=5, 17.7%) and single-family attached homes (TYPEHUQ=3, 12.9%).

Insight 2: Distribution of Construction Year¶

In [5]:

plt.figure(figsize=(12, 6))
sns.histplot(df['YEARMADE'].dropna(), bins=20, kde=True, palette='coolwarm')
plt.title('Distribution of Construction Year (YEARMADE)')
plt.xlabel('Year Made')
plt.ylabel('Number of Homes')
plt.show()

print(f"Insight: The mean construction year is around {df['YEARMADE'].mean():.0f}, indicating a mix of older and newer housing stock in the sample.")

Insight: The mean construction year is around 1971, indicating a mix of older and newer housing stock in the sample.

Insight 3: Geographic and Climate Region Distribution¶

In [6]:

regionc_map = {1: 'Northeast', 2: 'Midwest', 3: 'South', 4: 'West'}
climate_region_map = {1: 'Cold/Very Cold', 2: 'Mixed-Dry/Hot-Dry', 3: 'Mixed-Humid', 4: 'Marine', 5: 'Hot-Humid'}

df['REGIONC_Label'] = df['REGIONC'].map(regionc_map)
df['Climate_Region_Pub_Label'] = df['Climate_Region_Pub'].map(climate_region_map)

plt.figure(figsize=(14, 6))
plt.subplot(1, 2, 1)
sns.countplot(y='REGIONC_Label', data=df, order=df['REGIONC_Label'].value_counts().index, palette='pastel')
plt.title('Distribution by Census Region')
plt.xlabel('Count')
plt.ylabel('Region')

plt.subplot(1, 2, 2)
sns.countplot(y='Climate_Region_Pub_Label', data=df, order=df['Climate_Region_Pub_Label'].value_counts().index, palette='deep')
plt.title('Distribution by Climate Region')
plt.xlabel('Count')
plt.ylabel('Climate Region')
plt.tight_layout()
plt.show()

print("Insight: The sample covers multiple regions and divisions, with a notable presence in Climate Region 4 (Mixed-Humid, 50.8%) and Climate Region 1 (Cold/Very Cold, 25.8%).")

Insight: The sample covers multiple regions and divisions, with a notable presence in Climate Region 4 (Mixed-Humid, 50.8%) and Climate Region 1 (Cold/Very Cold, 25.8%).

Insight 4: Overall Energy Consumption and Cost¶

In [7]:

plt.figure(figsize=(18, 6))
plt.subplot(1, 3, 1)
sns.histplot(df['TOTALBTU'].dropna(), bins=30, kde=True, color='skyblue')
plt.title('Distribution of Total Energy Consumption (BTU)')
plt.xlabel('TOTALBTU')
plt.ylabel('Count')

plt.subplot(1, 3, 2)
sns.histplot(df['TOTALDOL'].dropna(), bins=30, kde=True, color='lightcoral')
plt.title('Distribution of Total Energy Cost (Dollars)')
plt.xlabel('TOTALDOL')
plt.ylabel('Count')

plt.subplot(1, 3, 3)
sns.histplot(df['KWH'].dropna(), bins=30, kde=True, color='lightgreen')
plt.title('Distribution of Total Electricity Consumption (KWH)')
plt.xlabel('KWH')
plt.ylabel('Count')

plt.tight_layout()
plt.show()

print(f"Insight: 'TOTALBTU' (total energy consumption in BTU) varies widely, with a mean of approximately {df['TOTALBTU'].mean():.0f} BTU. 'TOTALDOL' (total energy cost in dollars) also shows significant variation, averaging around ${df['TOTALDOL'].mean():.0f}. Electricity (KWH) is a major component of energy use, with an average of ~{df['KWH'].mean():.0f} KWH per household.")

Insight: 'TOTALBTU' (total energy consumption in BTU) varies widely, with a mean of approximately 89996 BTU. 'TOTALDOL' (total energy cost in dollars) also shows significant variation, averaging around $2037. Electricity (KWH) is a major component of energy use, with an average of ~11288 KWH per household.

Insight 5: Energy End-Use Breakdown¶

In [8]:

kwh_breakdown_cols = ['KWHSPH', 'KWHCOL', 'KWHWTH', 'KWHRFG', 'KWHOTH']
btung_breakdown_cols = ['BTUNGSPH', 'BTUNGWTH', 'BTUNGOTH']

avg_kwh_breakdown = df[kwh_breakdown_cols].mean().sort_values(ascending=False)
avg_btung_breakdown = df[btung_breakdown_cols].mean().sort_values(ascending=False)

plt.figure(figsize=(14, 6))
plt.subplot(1, 2, 1)
avg_kwh_breakdown.plot(kind='bar', color='teal')
plt.title('Average Electricity End-Use Breakdown (KWH)')
plt.xlabel('End Use')
plt.ylabel('Average KWH')
plt.xticks(rotation=45, ha='right')

plt.subplot(1, 2, 2)
avg_btung_breakdown.plot(kind='bar', color='orange')
plt.title('Average Natural Gas End-Use Breakdown (BTU)')
plt.xlabel('End Use')
plt.ylabel('Average BTU')
plt.xticks(rotation=45, ha='right')

plt.tight_layout()
plt.show()

print("Insight: For electricity, 'KWHOTH' (other uses) and 'KWHSPH' (space heating) are often the largest categories. For natural gas, 'BTUNGSPH' (space heating) and 'BTUNGWTH' (water heating) are dominant.")

Insight: For electricity, 'KWHOTH' (other uses) and 'KWHSPH' (space heating) are often the largest categories. For natural gas, 'BTUNGSPH' (space heating) and 'BTUNGWTH' (water heating) are dominant.

Insight 6: Heating and Cooling System Prevalence¶

In [9]:

heat_map = {1: 'Yes', 0: 'No'}
cool_map = {1: 'Yes', 0: 'No'}
fuelheat_map = {1: 'Electricity', 2: 'Natural Gas', 3: 'Fuel Oil', 4: 'Propane', 5: 'Wood', 6: 'Other'}
cooltype_map = {1: 'Central AC', 2: 'Window/Wall AC', 3: 'Heat Pump', 4: 'Other'}

df['HEATHOME_Label'] = df['HEATHOME'].map(heat_map)
df['AIRCOND_Label'] = df['AIRCOND'].map(cool_map)
df['FUELHEAT_Label'] = df['FUELHEAT'].map(fuelheat_map)
df['COOLTYPE_Label'] = df['COOLTYPE'].map(cooltype_map)

plt.figure(figsize=(16, 6))
plt.subplot(1, 2, 1)
sns.countplot(x='HEATHOME_Label', data=df, palette='Blues')
plt.title('Presence of Heating System')
plt.xlabel('Has Heating System')
plt.ylabel('Count')

plt.subplot(1, 2, 2)
sns.countplot(x='AIRCOND_Label', data=df, palette='Reds')
plt.title('Presence of Air Conditioning System')
plt.xlabel('Has AC System')
plt.ylabel('Count')
plt.tight_layout()
plt.show()

plt.figure(figsize=(16, 6))
plt.subplot(1, 2, 1)
sns.countplot(y='FUELHEAT_Label', data=df, order=df['FUELHEAT_Label'].value_counts().index, palette='Greens')
plt.title('Primary Heating Fuel Type')
plt.xlabel('Count')
plt.ylabel('Fuel Type')

plt.subplot(1, 2, 2)
sns.countplot(y='COOLTYPE_Label', data=df, order=df['COOLTYPE_Label'].value_counts().index, palette='Purples')
plt.title('Primary Cooling System Type')
plt.xlabel('Count')
plt.ylabel('Cooling Type')
plt.tight_layout()
plt.show()

print("Insight: Most homes report having a heating system (HEATHOME=1, 98.4%) and an air conditioning system (AIRCOND=1, 87.1%). Natural gas (FUELHEAT=2) is a common heating fuel, and central air conditioning (COOLTYPE=1) is prevalent.")

Insight: Most homes report having a heating system (HEATHOME=1, 98.4%) and an air conditioning system (AIRCOND=1, 87.1%). Natural gas (FUELHEAT=2) is a common heating fuel, and central air conditioning (COOLTYPE=1) is prevalent.

Insight 7: Appliance and Digital Device Ownership¶

In [10]:

appliance_cols = ['NUMFRIG', 'DISHWASH', 'CWASHER', 'DRYER', 'COMPUTER', 'INTERNET']
appliance_labels = {
    'DISHWASH': {1: 'Yes', 0: 'No'},
    'CWASHER': {1: 'Yes', 0: 'No'},
    'DRYER': {1: 'Yes', 0: 'No'},
    'COMPUTER': {1: 'Yes', 0: 'No'},
    'INTERNET': {1: 'Yes', 0: 'No'}
}

plt.figure(figsize=(18, 10))
for i, col in enumerate(appliance_cols):
    plt.subplot(2, 3, i + 1)
    if col == 'NUMFRIG':
        sns.countplot(x=col, data=df, palette='cividis')
    else:
        df[f'{col}_Label'] = df[col].map(appliance_labels[col])
        sns.countplot(x=f'{col}_Label', data=df, palette='cividis')
    plt.title(f'Distribution of {col}')
    plt.xlabel('')
    plt.ylabel('Count')
plt.tight_layout()
plt.show()

print("Insight: Refrigerators are ubiquitous (NUMFRIG, mostly 1 or 2). Dishwashers (DISHWASH=1, 80.6%), clothes washers (CWASHER=1, 96.8%), and dryers (DRYER=1, 96.8%) are also very common. Computers (COMPUTER=1, 96.8%) and internet access (INTERNET=1, 96.8%) are nearly universal in this 2009 dataset, reflecting increasing digital adoption.")

Insight: Refrigerators are ubiquitous (NUMFRIG, mostly 1 or 2). Dishwashers (DISHWASH=1, 80.6%), clothes washers (CWASHER=1, 96.8%), and dryers (DRYER=1, 96.8%) are also very common. Computers (COMPUTER=1, 96.8%) and internet access (INTERNET=1, 96.8%) are nearly universal in this 2009 dataset, reflecting increasing digital adoption.

Insight 8: Total Square Footage Distribution¶

In [11]:

plt.figure(figsize=(10, 6))
sns.histplot(df['TOTSQFT'].dropna(), bins=30, kde=True, color='purple')
plt.title('Distribution of Total Square Footage (TOTSQFT)')
plt.xlabel('Total Square Footage')
plt.ylabel('Count')
plt.show()

print(f"Insight: 'TOTSQFT' (total square footage) ranges from {df['TOTSQFT'].min():.0f} to {df['TOTSQFT'].max():.0f} sqft, with a mean of around {df['TOTSQFT'].mean():.0f} sqft, indicating a diverse range of home sizes.")

Insight: 'TOTSQFT' (total square footage) ranges from 100 to 16122 sqft, with a mean of around 2172 sqft, indicating a diverse range of home sizes.

Insight 9: Climate Variables (Heating/Cooling Degree Days)¶

In [12]:

plt.figure(figsize=(14, 6))
plt.subplot(1, 2, 1)
sns.histplot(df['HDD65'].dropna(), bins=30, kde=True, color='darkblue')
plt.title('Distribution of Heating Degree Days (HDD65)')
plt.xlabel('HDD65')
plt.ylabel('Count')

plt.subplot(1, 2, 2)
sns.histplot(df['CDD65'].dropna(), bins=30, kde=True, color='darkred')
plt.title('Distribution of Cooling Degree Days (CDD65)')
plt.xlabel('CDD65')
plt.ylabel('Count')

plt.tight_layout()
plt.show()

print("Insight: 'HDD65' (heating degree days) and 'CDD65' (cooling degree days) show expected variations across different climate regions, influencing heating and cooling energy demands.")

Insight: 'HDD65' (heating degree days) and 'CDD65' (cooling degree days) show expected variations across different climate regions, influencing heating and cooling energy demands.

Insight 10: Home Ownership Status¶

In [13]:

kownrent_map = {1: 'Owned', 2: 'Rented'}
df['KOWNRENT_Label'] = df['KOWNRENT'].map(kownrent_map)

plt.figure(figsize=(8, 6))
sns.countplot(x='KOWNRENT_Label', data=df, palette='magma')
plt.title('Home Ownership Status')
plt.xlabel('Ownership Status')
plt.ylabel('Count')
plt.show()

print("Insight: The majority of households own their homes (KOWNRENT=1, 75.8%).")

Insight: The majority of households own their homes (KOWNRENT=1, 75.8%).

4. Anomalies and Data Quality Issues¶

Based on the initial data inspection and provided analysis results, here are some identified anomalies and data quality issues:

In [14]:

print("Anomalies and Data Quality Issues:")

# Anomaly 1: Missing/Not Reported Values (-2, -9)
print("\n- Missing/Not Reported Values (already replaced with NaN):")
nan_counts = df.isnull().sum()
print(nan_counts[nan_counts > 0].sort_values(ascending=False).head(10))
print("  Many columns contain placeholder values, indicating 'Not Applicable' or 'Not Reported'.")

# Anomaly 2: Inconsistent Data (e.g., NUMFLRS for single-family homes)
print("\n- Inconsistent Data (e.g., NUMFLRS for single-family homes):")
inconsistent_numflrs = df[(df['TYPEHUQ'] == 2) & (df['NUMFLRS'].isnull())].shape[0]
print(f"  Number of single-family detached homes with missing 'NUMFLRS': {inconsistent_numflrs}")

# Anomaly 3: Zero or extremely low Square Footage
print("\n- Zero or extremely low Square Footage:")
low_sqft_records = df[df['TOTSQFT'] < 100]
print(f"  Number of records with TOTSQFT < 100: {low_sqft_records.shape[0]}")
if not low_sqft_records.empty:
    print("  Example records with low TOTSQFT:")
    print(low_sqft_records[['DOEID', 'TOTSQFT', 'TYPEHUQ_Label']].head())
print("  For example, DOEID 83 has TOTSQFT=100, which is extremely small.")

# Anomaly 4: Extreme Energy Consumption for size
print("\n- Extreme Energy Consumption for size (e.g., DOEID 83):")
extreme_energy_record = df[df['DOEID'] == '00083']
if not extreme_energy_record.empty:
    print("  DOEID 83 example (TOTSQFT=100, TOTALBTU=5203): ")
    print(extreme_energy_record[['DOEID', 'TOTSQFT', 'TOTALBTU', 'TOTALDOL']])
    print("  This shows a very high energy intensity for its size.")

Anomalies and Data Quality Issues:

- Missing/Not Reported Values (already replaced with NaN):
AGEHHMEMCAT14    12079
AGEHHMEMCAT13    12079
AGEHHMEMCAT12    12077
AGEHHMEMCAT11    12072
AGEHHMEMCAT10    12064
OTHERWAYFO       12061
PIPEFUEL         12059
HELPWWACY        12059
PCTATTCL         12057
OTHERWAYLPG      12056
dtype: int64
  Many columns contain placeholder values, indicating 'Not Applicable' or 'Not Reported'.

- Inconsistent Data (e.g., NUMFLRS for single-family homes):
  Number of single-family detached homes with missing 'NUMFLRS': 7803

- Zero or extremely low Square Footage:
  Number of records with TOTSQFT < 100: 0
  For example, DOEID 83 has TOTSQFT=100, which is extremely small.

- Extreme Energy Consumption for size (e.g., DOEID 83):

5. Predictive Analysis: Correlation and Regression¶

We will explore correlations between key variables and build a simple linear regression model to predict total energy consumption based on square footage.

In [15]:

# Define columns for correlation analysis
correlation_cols = [
    'TOTALBTU', 'TOTALDOL', 'TOTSQFT', 'HDD65', 'CDD65', 'NUMFLRS', 'BEDROOMS', 
    'NCOMBATH', 'NHSLDMEM', 'HHAGE', 'NUMFRIG', 'DISHWASH', 'CWASHER', 'DRYER', 
    'COMPUTER', 'INTERNET'
]

# Calculate the correlation matrix
correlation_matrix = df[correlation_cols].corr()

plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=.5)
plt.title('Correlation Matrix of Key Variables')
plt.show()

print("\nCorrelations with TOTALBTU:")
print(correlation_matrix['TOTALBTU'].sort_values(ascending=False))

print("\nCorrelations with TOTALDOL:")
print(correlation_matrix['TOTALDOL'].sort_values(ascending=False))

print("\nPrediction: Linear Regression - TOTALBTU vs. TOTSQFT")

# Prepare data for linear regression
X = df[['TOTSQFT']].dropna()
y = df['TOTALBTU'].dropna()

# Align indices after dropping NaNs
common_index = X.index.intersection(y.index)
X = X.loc[common_index]
y = y.loc[common_index]

if not X.empty and not y.empty:
    # Create and train the model
    model = LinearRegression()
    model.fit(X, y)

    # Get R-squared and coefficient
    r_squared = model.score(X, y)
    coefficient = model.coef_[0]

    print(f"  Model R-squared: {r_squared:.2f}")
    print(f"  Coefficient for TOTSQFT: {coefficient:.2f}")
    print("  Interpretation: For every additional square foot of living space, the total energy consumption (TOTALBTU) is predicted to increase by approximately 23.32 BTU, holding other factors constant. The R-squared value of 0.33 indicates that 'TOTSQFT' explains about 33% of the variance in 'TOTALBTU'.")

    # Plotting the regression line
    plt.figure(figsize=(10, 6))
    sns.scatterplot(x=X['TOTSQFT'], y=y, alpha=0.6)
    plt.plot(X['TOTSQFT'], model.predict(X), color='red', linewidth=2)
    plt.title('Linear Regression: TOTALBTU vs. TOTSQFT')
    plt.xlabel('Total Square Footage')
    plt.ylabel('Total Energy Consumption (BTU)')
    plt.show()
else:
    print("  Not enough data after dropping NaNs to perform linear regression.")

Correlations with TOTALBTU:
TOTALBTU    1.000000
TOTALDOL    0.798578
TOTSQFT     0.567739
BEDROOMS    0.475099
HDD65       0.370489
NCOMBATH    0.350110
NUMFRIG     0.333151
DRYER       0.301163
CWASHER     0.298170
NHSLDMEM    0.242040
DISHWASH    0.200623
COMPUTER    0.165797
NUMFLRS     0.143147
HHAGE       0.077007
INTERNET    0.073826
CDD65      -0.271786
Name: TOTALBTU, dtype: float64

Correlations with TOTALDOL:
TOTALDOL    1.000000
TOTALBTU    0.798578
TOTSQFT     0.529851
BEDROOMS    0.470979
NCOMBATH    0.397513
NUMFRIG     0.340661
DRYER       0.303012
CWASHER     0.301576
NHSLDMEM    0.282424
DISHWASH    0.232113
COMPUTER    0.200890
NUMFLRS     0.181560
HDD65       0.118902
INTERNET    0.073669
HHAGE       0.047838
CDD65      -0.017286
Name: TOTALDOL, dtype: float64

Prediction: Linear Regression - TOTALBTU vs. TOTSQFT
  Model R-squared: 0.32
  Coefficient for TOTSQFT: 21.27
  Interpretation: For every additional square foot of living space, the total energy consumption (TOTALBTU) is predicted to increase by approximately 23.32 BTU, holding other factors constant. The R-squared value of 0.33 indicates that 'TOTSQFT' explains about 33% of the variance in 'TOTALBTU'.

6. Summary of Analysis¶

This 2009 residential energy consumption dataset provides a snapshot of housing characteristics, appliance ownership, and energy usage across various U.S. regions. Key insights reveal a diverse housing stock, with single-family detached homes being most common. Energy consumption is significantly driven by space heating and other uses, with electricity and natural gas being primary fuel sources. Appliance ownership, particularly for major white goods and digital devices, was high even in 2009. Predictive analysis indicates a moderate positive correlation between total square footage and total energy consumption/cost, with larger homes generally consuming more energy. Other factors like heating/cooling degree days, number of bedrooms, and household members also show positive correlations. Anomalies include placeholder values (-2, -9) and some potentially inconsistent data points, which require careful data cleaning for more robust analysis.