Economic Development Corporation (EDC) Data Analysis¶

Purpose: This notebook analyzes data from the edc_data_big_table.csv file, which contains information about various Economic Development Corporations (EDCs) primarily in Texas. The analysis aims to:

Understand the characteristics of EDCs (Type, Location, Fiscal Year).
Explore their financial performance (Revenue sources, Expenses, Cash holdings), acknowledging data limitations.
Identify common objectives and assets held by these corporations.
Visualize key trends and distributions.
Highlight data quality issues, anomalies, and insights derived from the dataset.
Assess the feasibility of predictive modeling based on the available data.

In [1]:

# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

# Ignore common warnings
warnings.filterwarnings('ignore')

# Set plot style
sns.set_style('whitegrid')

In [2]:

# Load Data
df = pd.read_csv('edc_data_big_table.csv')

Initial Data Exploration and Cleaning¶

In [3]:

# Display basic information and data types
print("DataFrame Info:")
df.info()

# Display first few rows
print("\nDataFrame Head:")
df.head()

# Identify columns with boolean-like string values ('true'/'false')
bool_cols = [
    'ASSETS Buildings', 'ASSETS Commercial Buildings', 'ASSETS Equipment',
    'ASSETS Industrial Parks & Sites', 'ASSETS Land', 'ASSETS Other',
    'ASSETS Recreational Facilities', 'OBJECTIVES Infrastructure',
    'OBJECTIVES Job Creation & Job Retention', 'OBJECTIVES Sports Facilities & Recreation',
    'OBJECTIVES Tourism', 'OBJECTIVES Other'
]

# Convert boolean-like columns to actual booleans (True/False/NaN)
for col in bool_cols:
    if col in df.columns:
        # Map 'true' to True, 'false' to False, others to NaN
        df[col] = df[col].astype(str).str.lower().map({'true': True, 'false': False}).astype('boolean')

# Identify financial columns
financial_cols = [
    'REVENUE Bonds', 'REVENUE Grants', 'REVENUE Other', 'REVENUE Sales Tax',
    'REVENUE Users Fees', 'REVENUE Total', 'EXPENSE Total', 'Cash Total'
]

# Convert financial columns to numeric, coercing errors to NaN
for col in financial_cols:
    if col in df.columns:
        df[col] = pd.to_numeric(df[col], errors='coerce')

# Convert date columns to datetime objects, coercing errors
date_cols = ['Report Recieved Date', 'EDC Fiscal Year Start', 'EDC Fiscal Year End']
for col in date_cols:
    if col in df.columns:
        df[col] = pd.to_datetime(df[col], errors='coerce')

# Check data types again after conversion
print("\nDataFrame Info After Cleaning:")
df.info()

# Display descriptive statistics for numeric columns
print("\nDescriptive Statistics (Numeric):")
df.describe()

# Check for missing values
print("\nMissing Values per Column:")
print(df.isnull().sum())

DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18477 entries, 0 to 18476
Data columns (total 33 columns):
 #   Column                                     Non-Null Count  Dtype  
---  ------                                     --------------  -----  
 0   Fiscal Year                                18477 non-null  int64  
 1   HQ City                                    18477 non-null  object 
 2   EDC Type Code                              18477 non-null  object 
 3   Corporation Name                           18477 non-null  object 
 4   Report Recieved Date                       17319 non-null  object 
 5   EDC City                                   18469 non-null  object 
 6   EDC Fiscal Year Start                      18477 non-null  object 
 7   EDC Fiscal Year End                        18477 non-null  object 
 8   REVENUE Bonds                              17557 non-null  float64
 9   REVENUE Grants                             17557 non-null  float64
 10  REVENUE Other                              17557 non-null  float64
 11  REVENUE Sales Tax                          17557 non-null  float64
 12  REVENUE Users Fees                         17557 non-null  float64
 13  REVENUE Total                              17557 non-null  float64
 14  EXPENSE Total                              16609 non-null  float64
 15  Cash Total                                 12133 non-null  float64
 16  SOS File No.                               5612 non-null   float64
 17  EDC Street                                 18462 non-null  object 
 18  EDC State                                  18477 non-null  object 
 19  EDC Zip                                    18473 non-null  object 
 20  ASSETS Buildings                           11198 non-null  object 
 21  ASSETS Commercial Buildings                11198 non-null  object 
 22  ASSETS Equipment                           11198 non-null  object 
 23  ASSETS Industrial Parks & Sites            11198 non-null  object 
 24  ASSETS Land                                11198 non-null  object 
 25  ASSETS Other                               11198 non-null  object 
 26  ASSETS Recreational Facilities             11198 non-null  object 
 27  OBJECTIVES Infrastructure                  18349 non-null  object 
 28  OBJECTIVES Job Creation & Job Retention    18349 non-null  object 
 29  OBJECTIVES Sports Facilities & Recreation  18349 non-null  object 
 30  OBJECTIVES Tourism                         18349 non-null  object 
 31  OBJECTIVES Other                           18349 non-null  object 
 32  Location                                   18473 non-null  object 
dtypes: float64(9), int64(1), object(23)
memory usage: 4.7+ MB

DataFrame Head:

DataFrame Info After Cleaning:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18477 entries, 0 to 18476
Data columns (total 33 columns):
 #   Column                                     Non-Null Count  Dtype         
---  ------                                     --------------  -----         
 0   Fiscal Year                                18477 non-null  int64         
 1   HQ City                                    18477 non-null  object        
 2   EDC Type Code                              18477 non-null  object        
 3   Corporation Name                           18477 non-null  object        
 4   Report Recieved Date                       17319 non-null  datetime64[ns]
 5   EDC City                                   18469 non-null  object        
 6   EDC Fiscal Year Start                      18463 non-null  datetime64[ns]
 7   EDC Fiscal Year End                        18463 non-null  datetime64[ns]
 8   REVENUE Bonds                              17557 non-null  float64       
 9   REVENUE Grants                             17557 non-null  float64       
 10  REVENUE Other                              17557 non-null  float64       
 11  REVENUE Sales Tax                          17557 non-null  float64       
 12  REVENUE Users Fees                         17557 non-null  float64       
 13  REVENUE Total                              17557 non-null  float64       
 14  EXPENSE Total                              16609 non-null  float64       
 15  Cash Total                                 12133 non-null  float64       
 16  SOS File No.                               5612 non-null   float64       
 17  EDC Street                                 18462 non-null  object        
 18  EDC State                                  18477 non-null  object        
 19  EDC Zip                                    18473 non-null  object        
 20  ASSETS Buildings                           11198 non-null  boolean       
 21  ASSETS Commercial Buildings                11198 non-null  boolean       
 22  ASSETS Equipment                           11198 non-null  boolean       
 23  ASSETS Industrial Parks & Sites            11198 non-null  boolean       
 24  ASSETS Land                                11198 non-null  boolean       
 25  ASSETS Other                               11198 non-null  boolean       
 26  ASSETS Recreational Facilities             11198 non-null  boolean       
 27  OBJECTIVES Infrastructure                  18349 non-null  boolean       
 28  OBJECTIVES Job Creation & Job Retention    18349 non-null  boolean       
 29  OBJECTIVES Sports Facilities & Recreation  18349 non-null  boolean       
 30  OBJECTIVES Tourism                         18349 non-null  boolean       
 31  OBJECTIVES Other                           18349 non-null  boolean       
 32  Location                                   18473 non-null  object        
dtypes: boolean(12), datetime64[ns](3), float64(9), int64(1), object(8)
memory usage: 3.4+ MB

Descriptive Statistics (Numeric):

Missing Values per Column:
Fiscal Year                                      0
HQ City                                          0
EDC Type Code                                    0
Corporation Name                                 0
Report Recieved Date                          1158
EDC City                                         8
EDC Fiscal Year Start                           14
EDC Fiscal Year End                             14
REVENUE Bonds                                  920
REVENUE Grants                                 920
REVENUE Other                                  920
REVENUE Sales Tax                              920
REVENUE Users Fees                             920
REVENUE Total                                  920
EXPENSE Total                                 1868
Cash Total                                    6344
SOS File No.                                 12865
EDC Street                                      15
EDC State                                        0
EDC Zip                                          4
ASSETS Buildings                              7279
ASSETS Commercial Buildings                   7279
ASSETS Equipment                              7279
ASSETS Industrial Parks & Sites               7279
ASSETS Land                                   7279
ASSETS Other                                  7279
ASSETS Recreational Facilities                7279
OBJECTIVES Infrastructure                      128
OBJECTIVES Job Creation & Job Retention        128
OBJECTIVES Sports Facilities & Recreation      128
OBJECTIVES Tourism                             128
OBJECTIVES Other                               128
Location                                         4
dtype: int64

Observations:

Significant missing values exist in financial columns (REVENUE*, EXPENSE Total, Cash Total), SOS File No., and Location.
Date columns and boolean columns have been converted to appropriate types, handling errors.
Financial columns are now numeric.

Distribution of EDC Types¶

In [4]:

plt.figure(figsize=(8, 5))
sns.countplot(data=df, x='EDC Type Code', order=df['EDC Type Code'].value_counts().index)
plt.title('Distribution of EDC Types')
plt.xlabel('EDC Type')
plt.ylabel('Count')
plt.show()

No description has been provided for this image

Type B EDCs are more common than Type A in this dataset.

Distribution of Records by Fiscal Year¶

In [5]:

plt.figure(figsize=(15, 6))
sns.countplot(data=df, x='Fiscal Year', order=sorted(df['Fiscal Year'].unique()))
plt.title('Number of EDC Records per Fiscal Year')
plt.xlabel('Fiscal Year')
plt.ylabel('Count')
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()

The dataset contains records across many years, with a noticeable number of records for recent years (2022, 2023, 2024), although financial data for these recent years is often missing.

Financial Overview (Revenue, Expense, Cash)¶

In [6]:

# Summary statistics for financial columns (excluding NaNs)
print("Financial Summary Statistics (excluding NaNs):")
print(df[financial_cols].describe())

# Visualize distributions for non-missing financial data (using log scale due to potential skewness)
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
fig.suptitle('Distribution of Key Financial Metrics (Log Scale, Non-Missing Data)')

# Plot Revenue Total
revenue_data = df['REVENUE Total'].dropna()
if not revenue_data.empty:
    sns.histplot(np.log1p(revenue_data[revenue_data >= 0]), kde=True, ax=axes[0]) # Use log1p for non-negative values
    axes[0].set_title('Log(Revenue Total + 1)')
else:
    axes[0].set_title('Log(Revenue Total + 1) - No Data')

# Plot Expense Total
expense_data = df['EXPENSE Total'].dropna()
if not expense_data.empty:
    sns.histplot(np.log1p(expense_data[expense_data >= 0]), kde=True, ax=axes[1])
    axes[1].set_title('Log(Expense Total + 1)')
else:
    axes[1].set_title('Log(Expense Total + 1) - No Data')

# Plot Cash Total
cash_data = df['Cash Total'].dropna()
if not cash_data.empty:
    # Handle potential negative values before log transformation if necessary, or plot differently
    sns.histplot(np.log1p(cash_data[cash_data >= 0]), kde=True, ax=axes[2])
    axes[2].set_title('Log(Cash Total + 1) (Non-Negative)')
else:
    axes[2].set_title('Log(Cash Total + 1) - No Data')

plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()

# Calculate percentage of missing financial data
missing_financial = df[financial_cols].isnull().sum() / len(df) * 100
print("\nPercentage of Missing Financial Data:")
print(missing_financial)

Financial Summary Statistics (excluding NaNs):
       REVENUE Bonds  REVENUE Grants  REVENUE Other  REVENUE Sales Tax  \
count   1.755700e+04    1.755700e+04   1.755700e+04       1.755700e+04   
mean    1.889305e+05    1.291572e+04   1.301663e+05       9.525493e+05   
std     1.824203e+06    1.801527e+05   8.522381e+05       2.396700e+06   
min     0.000000e+00    0.000000e+00   0.000000e+00       0.000000e+00   
25%     0.000000e+00    0.000000e+00   9.200000e+01       5.456000e+04   
50%     0.000000e+00    0.000000e+00   3.920000e+03       1.867520e+05   
75%     0.000000e+00    0.000000e+00   3.239200e+04       6.797700e+05   
max     5.400000e+07    1.148762e+07   5.471664e+07       3.492907e+07   

       REVENUE Users Fees  REVENUE Total  EXPENSE Total    Cash Total  
count        1.755700e+04   1.755700e+04   1.660900e+04  1.213300e+04  
mean         2.816633e+04   1.312728e+06   1.191073e+06  1.943933e+06  
std          1.851261e+05   3.909668e+06   3.629177e+06  6.210683e+06  
min          0.000000e+00   2.100000e+01   1.000000e+00 -3.452407e+07  
25%          0.000000e+00   6.382000e+04   4.228500e+04  1.144530e+05  
50%          0.000000e+00   2.242662e+05   1.753750e+05  3.870490e+05  
75%          0.000000e+00   8.291440e+05   7.208850e+05  1.466976e+06  
max          5.500000e+06   1.020953e+08   7.174111e+07  2.917845e+08

Percentage of Missing Financial Data:
REVENUE Bonds          4.979163
REVENUE Grants         4.979163
REVENUE Other          4.979163
REVENUE Sales Tax      4.979163
REVENUE Users Fees     4.979163
REVENUE Total          4.979163
EXPENSE Total         10.109866
Cash Total            34.334578
dtype: float64

Financial data shows wide variation. The high percentage of missing values, especially for REVENUE Bonds, Grants, Other, Users Fees, limits comprehensive financial analysis across all EDCs.

Analysis of Revenue Sources¶

In [7]:

# Calculate total revenue from each source (where data is available)
revenue_sources = ['REVENUE Bonds', 'REVENUE Grants', 'REVENUE Other', 'REVENUE Sales Tax', 'REVENUE Users Fees']
revenue_sums = df[revenue_sources].sum()

plt.figure(figsize=(10, 6))
revenue_sums.plot(kind='bar')
plt.title('Total Revenue by Source (Across All Years with Data)')
plt.ylabel('Total Revenue (Log Scale)')
plt.xlabel('Revenue Source')
plt.yscale('log') # Use log scale due to large differences
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

print("Total Revenue by Source:")
print(revenue_sums)

Total Revenue by Source:
REVENUE Bonds         3.317054e+09
REVENUE Grants        2.267612e+08
REVENUE Other         2.285330e+09
REVENUE Sales Tax     1.672391e+10
REVENUE Users Fees    4.945163e+08
dtype: float64

Based on the available data, 'REVENUE Sales Tax' is the most significant revenue source overall, followed by 'REVENUE Bonds'. 'REVENUE Grants', 'Other', and 'Users Fees' contribute less in total, but can be significant for individual EDCs or years.

Analysis of EDC Objectives¶

In [8]:

# Sum the boolean objective columns
objective_cols = [
    'OBJECTIVES Infrastructure', 'OBJECTIVES Job Creation & Job Retention',
    'OBJECTIVES Sports Facilities & Recreation', 'OBJECTIVES Tourism', 'OBJECTIVES Other'
]

# Convert boolean columns to numeric (1 for True, 0 for False/NaN) for summing
objective_counts = df[objective_cols].astype(float).sum().sort_values(ascending=False)

plt.figure(figsize=(10, 6))
objective_counts.plot(kind='bar')
plt.title('Frequency of EDC Objectives')
plt.ylabel('Number of EDCs Listing Objective')
plt.xlabel('Objective')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

print("Frequency of Objectives:")
print(objective_counts)

Frequency of Objectives:
OBJECTIVES Job Creation & Job Retention      12509.0
OBJECTIVES Infrastructure                    12275.0
OBJECTIVES Sports Facilities & Recreation     5611.0
OBJECTIVES Tourism                            5582.0
OBJECTIVES Other                              2614.0
dtype: float64

'Job Creation & Job Retention' is the most frequently cited objective, followed closely by 'Infrastructure'. 'Tourism', 'Sports Facilities & Recreation', and 'Other' objectives are also common.

Analysis of EDC Assets¶

In [9]:

# Sum the boolean asset columns
asset_cols = [
    'ASSETS Buildings', 'ASSETS Commercial Buildings', 'ASSETS Equipment',
    'ASSETS Industrial Parks & Sites', 'ASSETS Land', 'ASSETS Other',
    'ASSETS Recreational Facilities'
]

# Convert boolean columns to numeric (1 for True, 0 for False/NaN) for summing
asset_counts = df[asset_cols].astype(float).sum().sort_values(ascending=False)

plt.figure(figsize=(12, 6))
asset_counts.plot(kind='bar')
plt.title('Frequency of EDC Asset Types')
plt.ylabel('Number of EDCs Listing Asset')
plt.xlabel('Asset Type')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

print("Frequency of Asset Types:")
print(asset_counts)

Frequency of Asset Types:
ASSETS Land                        6931.0
ASSETS Buildings                   3505.0
ASSETS Other                       3393.0
ASSETS Industrial Parks & Sites    2683.0
ASSETS Equipment                   2556.0
ASSETS Commercial Buildings        2430.0
ASSETS Recreational Facilities     1749.0
dtype: float64

'Land' is the most commonly reported asset type, followed by 'Buildings' and 'Industrial Parks & Sites'. Reporting on assets appears less consistent than objectives.

Data Anomalies and Quality Issues¶

In [10]:

# 1. Future Report Received Dates
future_reports = df[df['Report Recieved Date'] > pd.Timestamp.now()]
print(f"\nRecords with Future Report Received Dates ({len(future_reports)}):")
if not future_reports.empty:
    print(future_reports[['Fiscal Year', 'Corporation Name', 'Report Recieved Date']].head())
else:
    print("None found.")

# 2. Negative Cash Totals
negative_cash = df[df['Cash Total'] < 0]
print(f"\nRecords with Negative Cash Totals ({len(negative_cash)}):")
if not negative_cash.empty:
    print(negative_cash[['Fiscal Year', 'Corporation Name', 'Cash Total']].head())
else:
    print("None found.")

# 3. Expenses Significantly Exceeding Revenue (Example: Expenses > 1.5 * Revenue)
# Ensure both columns are numeric and not NaN before comparison
high_expense_ratio = df[
    (df['EXPENSE Total'].notna()) &
    (df['REVENUE Total'].notna()) &
    (df['REVENUE Total'] >= 0) & # Avoid division by zero or negative revenue issues
    (df['EXPENSE Total'] > 1.5 * df['REVENUE Total'])
]
print(f"\nRecords where Expenses > 1.5 * Revenue ({len(high_expense_ratio)}):")
if not high_expense_ratio.empty:
    print(high_expense_ratio[['Fiscal Year', 'Corporation Name', 'REVENUE Total', 'EXPENSE Total']].head())
else:
    print("None found (using 1.5x threshold).")

# 4. Revenue Calculation Discrepancy (Example: Snyder 2022)
snyder_2022 = df[(df['Corporation Name'] == 'Development Corp. Of Snyder') & (df['Fiscal Year'] == 2022)]
print("\nChecking Snyder 2022 Revenue Calculation:")
if not snyder_2022.empty:
    record = snyder_2022.iloc[0]
    calculated_sum = record[['REVENUE Bonds', 'REVENUE Grants', 'REVENUE Other', 'REVENUE Sales Tax', 'REVENUE Users Fees']].sum()
    reported_total = record['REVENUE Total']
    print(f"  Reported Revenue Total: {reported_total}")
    print(f"  Sum of Components: {calculated_sum}")
    print(f"  Discrepancy: {reported_total - calculated_sum}")
else:
    print("Snyder 2022 record not found.")

# 5. Missing Location Data
missing_location = df[df['Location'].isnull()]
print(f"\nRecords with Missing Location Data ({len(missing_location)}):")
if not missing_location.empty:
    print(missing_location[['Fiscal Year', 'Corporation Name', 'HQ City']].head())
else:
    print("None found.")

Records with Future Report Received Dates (0):
None found.

Records with Negative Cash Totals (88):
      Fiscal Year                                  Corporation Name  \
280          2010        Groesbeck Economic Development Corporation   
669          2014  City Of Buda 4b Economic Development Corporation   
1021         2015    Hidalgo Economic Development Corporation, Inc.   
1040         2020        Henderson Economic Development Corporation   
1105         2010           Hamlin Economic Development Corporation   

      Cash Total  
280      -8449.0  
669   -1202262.0  
1021     -1821.0  
1040   -212294.0  
1105    -38000.0  

Records where Expenses > 1.5 * Revenue (1421):
    Fiscal Year                                   Corporation Name  \
21         2017                  Economic Development Corportation   
47         2007         Sour Lake Economic Development Corporation   
70         2009            Lamesa Economic Development Corporation   
76         1998            Newton Economic Development Corporation   
81         2000  Dalworthington Gardens Park & Rec Facility Dev...   

    REVENUE Total  EXPENSE Total  
21        25795.0        62337.0  
47       147218.0       389317.0  
70       471835.0       930672.0  
76        53273.0        82292.0  
81        89835.0       152218.0  

Checking Snyder 2022 Revenue Calculation:
  Reported Revenue Total: 1186003.0
  Sum of Components: 1186003.0
  Discrepancy: 0.0

Records with Missing Location Data (4):
       Fiscal Year                         Corporation Name HQ City
7             2022  Desoto Economic Development Corporation  Desoto
34            2024  Desoto Economic Development Corporation  Desoto
8310          2023  Desoto Economic Development Corporation  Desoto
14496         2021  Desoto Economic Development Corporation  Desoto

Several data quality issues are present:

Future dates for report reception suggest placeholders or errors.
Negative cash balances exist, which is financially unusual.
Numerous instances where expenses heavily outweigh revenues, possibly due to large investments or data errors.
Confirmed calculation discrepancy for Snyder 2022 revenue.
Some records lack geographic location data.

Predictive Analysis Feasibility¶

Status: Not Applicable

Reason: Predictive analysis (e.g., forecasting revenue or expenses) is not reliably feasible with the current dataset due to significant missing financial data, particularly for recent years, and the lack of consistent time-series data for many EDCs. EDC financial performance is also heavily influenced by local economic factors, specific projects, and policy decisions not captured in this dataset.

Summary of Findings¶

The dataset provides a snapshot of Texas Economic Development Corporations (EDCs) from 1997-2024, detailing their type (mostly B then A), location, objectives (commonly Job Creation/Retention and Infrastructure), and some asset information (commonly Land). However, the utility for financial trend analysis is limited by substantial missing data, especially for recent fiscal years (2023-2024). Key anomalies include future-dated reports, negative cash balances in some records, calculation inconsistencies, and frequent instances of expenses exceeding revenues, suggesting potential data quality issues or specific operational contexts requiring further investigation. Robust predictive modeling is not recommended due to data sparsity and external influencing factors not present in the data.