Economic Development Corporation (EDC) Data Analysis¶
Purpose: This notebook analyzes data from the edc_data_big_table.csv
file, which contains information about various Economic Development Corporations (EDCs) primarily in Texas. The analysis aims to:
- Understand the characteristics of EDCs (Type, Location, Fiscal Year).
- Explore their financial performance (Revenue sources, Expenses, Cash holdings), acknowledging data limitations.
- Identify common objectives and assets held by these corporations.
- Visualize key trends and distributions.
- Highlight data quality issues, anomalies, and insights derived from the dataset.
- Assess the feasibility of predictive modeling based on the available data.
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
# Ignore common warnings
warnings.filterwarnings('ignore')
# Set plot style
sns.set_style('whitegrid')
# Load Data
df = pd.read_csv('edc_data_big_table.csv')
Initial Data Exploration and Cleaning¶
# Display basic information and data types
print("DataFrame Info:")
df.info()
# Display first few rows
print("\nDataFrame Head:")
df.head()
# Identify columns with boolean-like string values ('true'/'false')
bool_cols = [
'ASSETS Buildings', 'ASSETS Commercial Buildings', 'ASSETS Equipment',
'ASSETS Industrial Parks & Sites', 'ASSETS Land', 'ASSETS Other',
'ASSETS Recreational Facilities', 'OBJECTIVES Infrastructure',
'OBJECTIVES Job Creation & Job Retention', 'OBJECTIVES Sports Facilities & Recreation',
'OBJECTIVES Tourism', 'OBJECTIVES Other'
]
# Convert boolean-like columns to actual booleans (True/False/NaN)
for col in bool_cols:
if col in df.columns:
# Map 'true' to True, 'false' to False, others to NaN
df[col] = df[col].astype(str).str.lower().map({'true': True, 'false': False}).astype('boolean')
# Identify financial columns
financial_cols = [
'REVENUE Bonds', 'REVENUE Grants', 'REVENUE Other', 'REVENUE Sales Tax',
'REVENUE Users Fees', 'REVENUE Total', 'EXPENSE Total', 'Cash Total'
]
# Convert financial columns to numeric, coercing errors to NaN
for col in financial_cols:
if col in df.columns:
df[col] = pd.to_numeric(df[col], errors='coerce')
# Convert date columns to datetime objects, coercing errors
date_cols = ['Report Recieved Date', 'EDC Fiscal Year Start', 'EDC Fiscal Year End']
for col in date_cols:
if col in df.columns:
df[col] = pd.to_datetime(df[col], errors='coerce')
# Check data types again after conversion
print("\nDataFrame Info After Cleaning:")
df.info()
# Display descriptive statistics for numeric columns
print("\nDescriptive Statistics (Numeric):")
df.describe()
# Check for missing values
print("\nMissing Values per Column:")
print(df.isnull().sum())
DataFrame Info: <class 'pandas.core.frame.DataFrame'> RangeIndex: 18477 entries, 0 to 18476 Data columns (total 33 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Fiscal Year 18477 non-null int64 1 HQ City 18477 non-null object 2 EDC Type Code 18477 non-null object 3 Corporation Name 18477 non-null object 4 Report Recieved Date 17319 non-null object 5 EDC City 18469 non-null object 6 EDC Fiscal Year Start 18477 non-null object 7 EDC Fiscal Year End 18477 non-null object 8 REVENUE Bonds 17557 non-null float64 9 REVENUE Grants 17557 non-null float64 10 REVENUE Other 17557 non-null float64 11 REVENUE Sales Tax 17557 non-null float64 12 REVENUE Users Fees 17557 non-null float64 13 REVENUE Total 17557 non-null float64 14 EXPENSE Total 16609 non-null float64 15 Cash Total 12133 non-null float64 16 SOS File No. 5612 non-null float64 17 EDC Street 18462 non-null object 18 EDC State 18477 non-null object 19 EDC Zip 18473 non-null object 20 ASSETS Buildings 11198 non-null object 21 ASSETS Commercial Buildings 11198 non-null object 22 ASSETS Equipment 11198 non-null object 23 ASSETS Industrial Parks & Sites 11198 non-null object 24 ASSETS Land 11198 non-null object 25 ASSETS Other 11198 non-null object 26 ASSETS Recreational Facilities 11198 non-null object 27 OBJECTIVES Infrastructure 18349 non-null object 28 OBJECTIVES Job Creation & Job Retention 18349 non-null object 29 OBJECTIVES Sports Facilities & Recreation 18349 non-null object 30 OBJECTIVES Tourism 18349 non-null object 31 OBJECTIVES Other 18349 non-null object 32 Location 18473 non-null object dtypes: float64(9), int64(1), object(23) memory usage: 4.7+ MB DataFrame Head:
DataFrame Info After Cleaning: <class 'pandas.core.frame.DataFrame'> RangeIndex: 18477 entries, 0 to 18476 Data columns (total 33 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Fiscal Year 18477 non-null int64 1 HQ City 18477 non-null object 2 EDC Type Code 18477 non-null object 3 Corporation Name 18477 non-null object 4 Report Recieved Date 17319 non-null datetime64[ns] 5 EDC City 18469 non-null object 6 EDC Fiscal Year Start 18463 non-null datetime64[ns] 7 EDC Fiscal Year End 18463 non-null datetime64[ns] 8 REVENUE Bonds 17557 non-null float64 9 REVENUE Grants 17557 non-null float64 10 REVENUE Other 17557 non-null float64 11 REVENUE Sales Tax 17557 non-null float64 12 REVENUE Users Fees 17557 non-null float64 13 REVENUE Total 17557 non-null float64 14 EXPENSE Total 16609 non-null float64 15 Cash Total 12133 non-null float64 16 SOS File No. 5612 non-null float64 17 EDC Street 18462 non-null object 18 EDC State 18477 non-null object 19 EDC Zip 18473 non-null object 20 ASSETS Buildings 11198 non-null boolean 21 ASSETS Commercial Buildings 11198 non-null boolean 22 ASSETS Equipment 11198 non-null boolean 23 ASSETS Industrial Parks & Sites 11198 non-null boolean 24 ASSETS Land 11198 non-null boolean 25 ASSETS Other 11198 non-null boolean 26 ASSETS Recreational Facilities 11198 non-null boolean 27 OBJECTIVES Infrastructure 18349 non-null boolean 28 OBJECTIVES Job Creation & Job Retention 18349 non-null boolean 29 OBJECTIVES Sports Facilities & Recreation 18349 non-null boolean 30 OBJECTIVES Tourism 18349 non-null boolean 31 OBJECTIVES Other 18349 non-null boolean 32 Location 18473 non-null object dtypes: boolean(12), datetime64[ns](3), float64(9), int64(1), object(8) memory usage: 3.4+ MB Descriptive Statistics (Numeric): Missing Values per Column: Fiscal Year 0 HQ City 0 EDC Type Code 0 Corporation Name 0 Report Recieved Date 1158 EDC City 8 EDC Fiscal Year Start 14 EDC Fiscal Year End 14 REVENUE Bonds 920 REVENUE Grants 920 REVENUE Other 920 REVENUE Sales Tax 920 REVENUE Users Fees 920 REVENUE Total 920 EXPENSE Total 1868 Cash Total 6344 SOS File No. 12865 EDC Street 15 EDC State 0 EDC Zip 4 ASSETS Buildings 7279 ASSETS Commercial Buildings 7279 ASSETS Equipment 7279 ASSETS Industrial Parks & Sites 7279 ASSETS Land 7279 ASSETS Other 7279 ASSETS Recreational Facilities 7279 OBJECTIVES Infrastructure 128 OBJECTIVES Job Creation & Job Retention 128 OBJECTIVES Sports Facilities & Recreation 128 OBJECTIVES Tourism 128 OBJECTIVES Other 128 Location 4 dtype: int64
Observations:
- Significant missing values exist in financial columns (
REVENUE*
,EXPENSE Total
,Cash Total
),SOS File No.
, andLocation
. - Date columns and boolean columns have been converted to appropriate types, handling errors.
- Financial columns are now numeric.
Distribution of EDC Types¶
plt.figure(figsize=(8, 5))
sns.countplot(data=df, x='EDC Type Code', order=df['EDC Type Code'].value_counts().index)
plt.title('Distribution of EDC Types')
plt.xlabel('EDC Type')
plt.ylabel('Count')
plt.show()
Type B EDCs are more common than Type A in this dataset.
Distribution of Records by Fiscal Year¶
plt.figure(figsize=(15, 6))
sns.countplot(data=df, x='Fiscal Year', order=sorted(df['Fiscal Year'].unique()))
plt.title('Number of EDC Records per Fiscal Year')
plt.xlabel('Fiscal Year')
plt.ylabel('Count')
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()
The dataset contains records across many years, with a noticeable number of records for recent years (2022, 2023, 2024), although financial data for these recent years is often missing.
Financial Overview (Revenue, Expense, Cash)¶
# Summary statistics for financial columns (excluding NaNs)
print("Financial Summary Statistics (excluding NaNs):")
print(df[financial_cols].describe())
# Visualize distributions for non-missing financial data (using log scale due to potential skewness)
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
fig.suptitle('Distribution of Key Financial Metrics (Log Scale, Non-Missing Data)')
# Plot Revenue Total
revenue_data = df['REVENUE Total'].dropna()
if not revenue_data.empty:
sns.histplot(np.log1p(revenue_data[revenue_data >= 0]), kde=True, ax=axes[0]) # Use log1p for non-negative values
axes[0].set_title('Log(Revenue Total + 1)')
else:
axes[0].set_title('Log(Revenue Total + 1) - No Data')
# Plot Expense Total
expense_data = df['EXPENSE Total'].dropna()
if not expense_data.empty:
sns.histplot(np.log1p(expense_data[expense_data >= 0]), kde=True, ax=axes[1])
axes[1].set_title('Log(Expense Total + 1)')
else:
axes[1].set_title('Log(Expense Total + 1) - No Data')
# Plot Cash Total
cash_data = df['Cash Total'].dropna()
if not cash_data.empty:
# Handle potential negative values before log transformation if necessary, or plot differently
sns.histplot(np.log1p(cash_data[cash_data >= 0]), kde=True, ax=axes[2])
axes[2].set_title('Log(Cash Total + 1) (Non-Negative)')
else:
axes[2].set_title('Log(Cash Total + 1) - No Data')
plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()
# Calculate percentage of missing financial data
missing_financial = df[financial_cols].isnull().sum() / len(df) * 100
print("\nPercentage of Missing Financial Data:")
print(missing_financial)
Financial Summary Statistics (excluding NaNs): REVENUE Bonds REVENUE Grants REVENUE Other REVENUE Sales Tax \ count 1.755700e+04 1.755700e+04 1.755700e+04 1.755700e+04 mean 1.889305e+05 1.291572e+04 1.301663e+05 9.525493e+05 std 1.824203e+06 1.801527e+05 8.522381e+05 2.396700e+06 min 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 25% 0.000000e+00 0.000000e+00 9.200000e+01 5.456000e+04 50% 0.000000e+00 0.000000e+00 3.920000e+03 1.867520e+05 75% 0.000000e+00 0.000000e+00 3.239200e+04 6.797700e+05 max 5.400000e+07 1.148762e+07 5.471664e+07 3.492907e+07 REVENUE Users Fees REVENUE Total EXPENSE Total Cash Total count 1.755700e+04 1.755700e+04 1.660900e+04 1.213300e+04 mean 2.816633e+04 1.312728e+06 1.191073e+06 1.943933e+06 std 1.851261e+05 3.909668e+06 3.629177e+06 6.210683e+06 min 0.000000e+00 2.100000e+01 1.000000e+00 -3.452407e+07 25% 0.000000e+00 6.382000e+04 4.228500e+04 1.144530e+05 50% 0.000000e+00 2.242662e+05 1.753750e+05 3.870490e+05 75% 0.000000e+00 8.291440e+05 7.208850e+05 1.466976e+06 max 5.500000e+06 1.020953e+08 7.174111e+07 2.917845e+08
Percentage of Missing Financial Data: REVENUE Bonds 4.979163 REVENUE Grants 4.979163 REVENUE Other 4.979163 REVENUE Sales Tax 4.979163 REVENUE Users Fees 4.979163 REVENUE Total 4.979163 EXPENSE Total 10.109866 Cash Total 34.334578 dtype: float64
Financial data shows wide variation. The high percentage of missing values, especially for REVENUE Bonds
, Grants
, Other
, Users Fees
, limits comprehensive financial analysis across all EDCs.
Analysis of Revenue Sources¶
# Calculate total revenue from each source (where data is available)
revenue_sources = ['REVENUE Bonds', 'REVENUE Grants', 'REVENUE Other', 'REVENUE Sales Tax', 'REVENUE Users Fees']
revenue_sums = df[revenue_sources].sum()
plt.figure(figsize=(10, 6))
revenue_sums.plot(kind='bar')
plt.title('Total Revenue by Source (Across All Years with Data)')
plt.ylabel('Total Revenue (Log Scale)')
plt.xlabel('Revenue Source')
plt.yscale('log') # Use log scale due to large differences
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
print("Total Revenue by Source:")
print(revenue_sums)
Total Revenue by Source: REVENUE Bonds 3.317054e+09 REVENUE Grants 2.267612e+08 REVENUE Other 2.285330e+09 REVENUE Sales Tax 1.672391e+10 REVENUE Users Fees 4.945163e+08 dtype: float64
Based on the available data, 'REVENUE Sales Tax' is the most significant revenue source overall, followed by 'REVENUE Bonds'. 'REVENUE Grants', 'Other', and 'Users Fees' contribute less in total, but can be significant for individual EDCs or years.
Analysis of EDC Objectives¶
# Sum the boolean objective columns
objective_cols = [
'OBJECTIVES Infrastructure', 'OBJECTIVES Job Creation & Job Retention',
'OBJECTIVES Sports Facilities & Recreation', 'OBJECTIVES Tourism', 'OBJECTIVES Other'
]
# Convert boolean columns to numeric (1 for True, 0 for False/NaN) for summing
objective_counts = df[objective_cols].astype(float).sum().sort_values(ascending=False)
plt.figure(figsize=(10, 6))
objective_counts.plot(kind='bar')
plt.title('Frequency of EDC Objectives')
plt.ylabel('Number of EDCs Listing Objective')
plt.xlabel('Objective')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
print("Frequency of Objectives:")
print(objective_counts)
Frequency of Objectives: OBJECTIVES Job Creation & Job Retention 12509.0 OBJECTIVES Infrastructure 12275.0 OBJECTIVES Sports Facilities & Recreation 5611.0 OBJECTIVES Tourism 5582.0 OBJECTIVES Other 2614.0 dtype: float64
'Job Creation & Job Retention' is the most frequently cited objective, followed closely by 'Infrastructure'. 'Tourism', 'Sports Facilities & Recreation', and 'Other' objectives are also common.
Analysis of EDC Assets¶
# Sum the boolean asset columns
asset_cols = [
'ASSETS Buildings', 'ASSETS Commercial Buildings', 'ASSETS Equipment',
'ASSETS Industrial Parks & Sites', 'ASSETS Land', 'ASSETS Other',
'ASSETS Recreational Facilities'
]
# Convert boolean columns to numeric (1 for True, 0 for False/NaN) for summing
asset_counts = df[asset_cols].astype(float).sum().sort_values(ascending=False)
plt.figure(figsize=(12, 6))
asset_counts.plot(kind='bar')
plt.title('Frequency of EDC Asset Types')
plt.ylabel('Number of EDCs Listing Asset')
plt.xlabel('Asset Type')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
print("Frequency of Asset Types:")
print(asset_counts)
Frequency of Asset Types: ASSETS Land 6931.0 ASSETS Buildings 3505.0 ASSETS Other 3393.0 ASSETS Industrial Parks & Sites 2683.0 ASSETS Equipment 2556.0 ASSETS Commercial Buildings 2430.0 ASSETS Recreational Facilities 1749.0 dtype: float64
'Land' is the most commonly reported asset type, followed by 'Buildings' and 'Industrial Parks & Sites'. Reporting on assets appears less consistent than objectives.
Data Anomalies and Quality Issues¶
# 1. Future Report Received Dates
future_reports = df[df['Report Recieved Date'] > pd.Timestamp.now()]
print(f"\nRecords with Future Report Received Dates ({len(future_reports)}):")
if not future_reports.empty:
print(future_reports[['Fiscal Year', 'Corporation Name', 'Report Recieved Date']].head())
else:
print("None found.")
# 2. Negative Cash Totals
negative_cash = df[df['Cash Total'] < 0]
print(f"\nRecords with Negative Cash Totals ({len(negative_cash)}):")
if not negative_cash.empty:
print(negative_cash[['Fiscal Year', 'Corporation Name', 'Cash Total']].head())
else:
print("None found.")
# 3. Expenses Significantly Exceeding Revenue (Example: Expenses > 1.5 * Revenue)
# Ensure both columns are numeric and not NaN before comparison
high_expense_ratio = df[
(df['EXPENSE Total'].notna()) &
(df['REVENUE Total'].notna()) &
(df['REVENUE Total'] >= 0) & # Avoid division by zero or negative revenue issues
(df['EXPENSE Total'] > 1.5 * df['REVENUE Total'])
]
print(f"\nRecords where Expenses > 1.5 * Revenue ({len(high_expense_ratio)}):")
if not high_expense_ratio.empty:
print(high_expense_ratio[['Fiscal Year', 'Corporation Name', 'REVENUE Total', 'EXPENSE Total']].head())
else:
print("None found (using 1.5x threshold).")
# 4. Revenue Calculation Discrepancy (Example: Snyder 2022)
snyder_2022 = df[(df['Corporation Name'] == 'Development Corp. Of Snyder') & (df['Fiscal Year'] == 2022)]
print("\nChecking Snyder 2022 Revenue Calculation:")
if not snyder_2022.empty:
record = snyder_2022.iloc[0]
calculated_sum = record[['REVENUE Bonds', 'REVENUE Grants', 'REVENUE Other', 'REVENUE Sales Tax', 'REVENUE Users Fees']].sum()
reported_total = record['REVENUE Total']
print(f" Reported Revenue Total: {reported_total}")
print(f" Sum of Components: {calculated_sum}")
print(f" Discrepancy: {reported_total - calculated_sum}")
else:
print("Snyder 2022 record not found.")
# 5. Missing Location Data
missing_location = df[df['Location'].isnull()]
print(f"\nRecords with Missing Location Data ({len(missing_location)}):")
if not missing_location.empty:
print(missing_location[['Fiscal Year', 'Corporation Name', 'HQ City']].head())
else:
print("None found.")
Records with Future Report Received Dates (0): None found. Records with Negative Cash Totals (88): Fiscal Year Corporation Name \ 280 2010 Groesbeck Economic Development Corporation 669 2014 City Of Buda 4b Economic Development Corporation 1021 2015 Hidalgo Economic Development Corporation, Inc. 1040 2020 Henderson Economic Development Corporation 1105 2010 Hamlin Economic Development Corporation Cash Total 280 -8449.0 669 -1202262.0 1021 -1821.0 1040 -212294.0 1105 -38000.0 Records where Expenses > 1.5 * Revenue (1421): Fiscal Year Corporation Name \ 21 2017 Economic Development Corportation 47 2007 Sour Lake Economic Development Corporation 70 2009 Lamesa Economic Development Corporation 76 1998 Newton Economic Development Corporation 81 2000 Dalworthington Gardens Park & Rec Facility Dev... REVENUE Total EXPENSE Total 21 25795.0 62337.0 47 147218.0 389317.0 70 471835.0 930672.0 76 53273.0 82292.0 81 89835.0 152218.0 Checking Snyder 2022 Revenue Calculation: Reported Revenue Total: 1186003.0 Sum of Components: 1186003.0 Discrepancy: 0.0 Records with Missing Location Data (4): Fiscal Year Corporation Name HQ City 7 2022 Desoto Economic Development Corporation Desoto 34 2024 Desoto Economic Development Corporation Desoto 8310 2023 Desoto Economic Development Corporation Desoto 14496 2021 Desoto Economic Development Corporation Desoto
Several data quality issues are present:
- Future dates for report reception suggest placeholders or errors.
- Negative cash balances exist, which is financially unusual.
- Numerous instances where expenses heavily outweigh revenues, possibly due to large investments or data errors.
- Confirmed calculation discrepancy for Snyder 2022 revenue.
- Some records lack geographic location data.
Predictive Analysis Feasibility¶
Status: Not Applicable
Reason: Predictive analysis (e.g., forecasting revenue or expenses) is not reliably feasible with the current dataset due to significant missing financial data, particularly for recent years, and the lack of consistent time-series data for many EDCs. EDC financial performance is also heavily influenced by local economic factors, specific projects, and policy decisions not captured in this dataset.
Summary of Findings¶
The dataset provides a snapshot of Texas Economic Development Corporations (EDCs) from 1997-2024, detailing their type (mostly B then A), location, objectives (commonly Job Creation/Retention and Infrastructure), and some asset information (commonly Land). However, the utility for financial trend analysis is limited by substantial missing data, especially for recent fiscal years (2023-2024). Key anomalies include future-dated reports, negative cash balances in some records, calculation inconsistencies, and frequent instances of expenses exceeding revenues, suggesting potential data quality issues or specific operational contexts requiring further investigation. Robust predictive modeling is not recommended due to data sparsity and external influencing factors not present in the data.