Bus Stop Data Analysis - Královéhradecký kraj
This notebook performs an exploratory data analysis (EDA) of a dataset describing bus stops in the Královéhradecký kraj region of the Czech Republic. The analysis covers data loading, cleaning, visualization, and the identification of key insights and potential anomalies. Because the dataset has no temporal or related data, predictive analysis is not applicable. Further analysis could focus on spatial distribution, clustering of stops, or the relationship between bus stop locations and population density using external data.
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Silence library warnings to keep the notebook output readable
import warnings
warnings.filterwarnings('ignore')
sns.set_style('darkgrid')
# Load the CSV data into a Pandas DataFrame
df = pd.read_csv('autobusové_zastávky_iredo_-1065211480630538261.csv', encoding='utf-8')
Initial Data Inspection
df.head()
 | Název | Označení | Název vyššího územního samosprávného celku | Kód vyššího územního samosprávného celku dle číselníku ČSÚ | Název správního obvodu obce s rozšířenou působností | Kód správního obvodu obce s rozšířenou působností dle číselníku ČSÚ | Název okresu | Kód okresu dle číselníku ČSÚ | Název obce | Kód obce dle číselníku ČSÚ | Zápis vektorové geometrie | Zeměpisná délka v souřadnicovém systému WGS84 | Zeměpisná šířka v souřadnicovém systému WGS84 | Jedinečný identifikátor v katalogu otevřených dat Data KHK | ID | x2 | y2 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Adršpach,Dolní Adršpach,zámek | 38 | Královéhradecký kraj | CZ052 | Broumov | 5201 | Náchod | CZ0523 | Adršpach | 547786 | POINT(50.618616 16.108505) | 16.108505 | 50.618616 | AZI1 | 1 | 1.793191e+06 | 6.554107e+06 |
1 | Adršpach,Dolní Adršpach,zámek | 38 | Královéhradecký kraj | CZ052 | Broumov | 5201 | Náchod | CZ0523 | Adršpach | 547786 | POINT(50.618711 16.108444) | 16.108444 | 50.618711 | AZI2 | 2 | 1.793184e+06 | 6.554123e+06 |
2 | Adršpach,Dolní Adršpach,odb.Zdoňov | 39 | Královéhradecký kraj | CZ052 | Broumov | 5201 | Náchod | CZ0523 | Adršpach | 547786 | POINT(50.616431 16.132747) | 16.132747 | 50.616431 | AZI3 | 3 | 1.795889e+06 | 6.553723e+06 |
3 | Adršpach,Dolní Adršpach,odb.Zdoňov | 39 | Královéhradecký kraj | CZ052 | Broumov | 5201 | Náchod | CZ0523 | Adršpach | 547786 | POINT(50.616562 16.132786) | 16.132786 | 50.616562 | AZI4 | 4 | 1.795894e+06 | 6.553746e+06 |
4 | Adršpach,Horní Adršpach,žel.zast. | 40 | Královéhradecký kraj | CZ052 | Broumov | 5201 | Náchod | CZ0523 | Adršpach | 547786 | POINT(50.624189 16.083921) | 16.083921 | 50.624189 | AZI5 | 5 | 1.790454e+06 | 6.555084e+06 |
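The preview shows a 'Zápis vektorové geometrie' column next to separate longitude and latitude columns, so one quick inspection step is to check which coordinate comes first inside the POINT(...) text. The sketch below uses only the values of the first preview row; the helper name parse_point is hypothetical. Many tools read WKT points as "x y" (longitude first), so the order is worth confirming before the column is handed to a GIS library.

```python
import re

# Values copied from the first row of the preview.
wkt = "POINT(50.618616 16.108505)"
longitude = 16.108505
latitude = 50.618616


def parse_point(text):
    """Return the two numbers inside a POINT(a b) string as floats."""
    pattern = r"POINT\(\s*(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s*\)"
    match = re.fullmatch(pattern, text.strip())
    if match is None:
        raise ValueError(f"not a POINT string: {text!r}")
    return float(match.group(1)), float(match.group(2))


first, second = parse_point(wkt)

# Compare the parsed pair against both possible axis orders.
wkt_is_lat_lon = (first, second) == (latitude, longitude)
wkt_is_lon_lat = (first, second) == (longitude, latitude)
print("latitude first:", wkt_is_lat_lon)
print("longitude first:", wkt_is_lon_lat)
```

For this row the geometry text stores latitude first, which is the reverse of the usual "x y" reading.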
Data Cleaning - Drop Duplicates
# Remove exact duplicate rows (rows that differ in any column, such as ID, are kept)
df = df.drop_duplicates()
df = df.reset_index(drop=True)
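Whether drop_duplicates() removes anything depends on rows being identical in every column. A minimal sketch, using the five preview rows reduced to a few columns, shows the difference between exact duplicates and rows that only share a stop name. The short column names lon and lat are stand-ins chosen here for the long WGS84 column names.

```python
import pandas as pd

# The five preview rows, reduced to the columns needed for the check.
sample = pd.DataFrame({
    "Název": [
        "Adršpach,Dolní Adršpach,zámek",
        "Adršpach,Dolní Adršpach,zámek",
        "Adršpach,Dolní Adršpach,odb.Zdoňov",
        "Adršpach,Dolní Adršpach,odb.Zdoňov",
        "Adršpach,Horní Adršpach,žel.zast.",
    ],
    "Označení": [38, 38, 39, 39, 40],
    "lon": [16.108505, 16.108444, 16.132747, 16.132786, 16.083921],
    "lat": [50.618616, 50.618711, 50.616431, 50.616562, 50.624189],
    "ID": [1, 2, 3, 4, 5],
})

# Rows identical in every column: what drop_duplicates() would remove.
exact_duplicates = int(sample.duplicated().sum())

# Rows that share a name and Označení with at least one other row.
shared_name = sample.duplicated(subset=["Název", "Označení"], keep=False)

# Number of rows recorded under each (name, Označení) pair.
stops_per_name = sample.groupby(["Název", "Označení"]).size()

print("exact duplicates:", exact_duplicates)
print("rows sharing a name:", int(shared_name.sum()))
print(stops_per_name)
```

On these rows nothing is an exact duplicate, while four of the five rows share a name with another row, so collapsing such pairs would need an explicit decision about which columns define "the same stop".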
Data Cleaning - Split the stop name into town, part, and specific location
# Split the 'Název' column into town, part, and specific location.
# n=2 caps the split at three pieces, so a name with extra commas cannot break the assignment.
df[['town', 'part', 'specific_location']] = df['Název'].str.split(',', n=2, expand=True)
# Fill missing 'part' and 'specific_location' values with empty strings
df['part'] = df['part'].fillna('')
df['specific_location'] = df['specific_location'].fillna('')
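To see how the split behaves on names with different numbers of commas, here is a small sketch on a handful of values. The first two names come from this notebook; the last two are hypothetical and only illustrate the edge cases.

```python
import pandas as pd

names = pd.Series([
    "Adršpach,Dolní Adršpach,zámek",  # from the preview
    "Nová Ves,,Kvartýr",              # empty middle field, cited in the findings
    "Town,Part,Place,extra",          # hypothetical: one comma too many
    "Town only",                      # hypothetical: no comma at all
])

# n=2 performs at most two splits, so the result never exceeds three columns.
parts = names.str.split(",", n=2, expand=True)
parts.columns = ["town", "part", "specific_location"]

# Names with fewer than three pieces are padded with missing values.
parts = parts.fillna("")
print(parts)
```

With n=2 any surplus commas stay inside specific_location rather than creating a fourth column, which would make the three-column assignment fail.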
Visualize Bus Stop Distribution by Municipality
# Create a bar plot of the top municipalities with the most bus stops
plt.figure(figsize=(14, 6))
df['Název obce'].value_counts().nlargest(20).plot(kind='bar', color='skyblue')
plt.title('Top 20 Municipalities by Number of Bus Stops')
plt.xlabel('Municipality Name')
plt.ylabel('Number of Bus Stops')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
Bus Stop Density Heatmap
# Plot a 2D kernel density estimate (KDE) of bus stop locations
plt.figure(figsize=(10, 8))
sns.kdeplot(
    x=df['Zeměpisná délka v souřadnicovém systému WGS84'],
    y=df['Zeměpisná šířka v souřadnicovém systému WGS84'],
    fill=True,
    cmap='viridis',
)
plt.title('Bus Stop Density Heatmap')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.show()
Data Insight: Common Location Descriptors
# Display the 5 most common 'specific_location' values (the empty string counts as a value)
print(df['specific_location'].value_counts().nlargest(5).to_frame())
                   count
specific_location
                     768
ObÚ                  156
škola                125
aut.st.              119
odb.                  99
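The most frequent value in the output is the empty string, which comes from names without a third component. A sketch of one way to report that share separately and rank only the real descriptors follows; the sample values are hypothetical stand-ins for df['specific_location'].

```python
import pandas as pd

# Hypothetical values standing in for df['specific_location'].
specific_location = pd.Series(["", "ObÚ", "", "škola", "ObÚ", "", "aut.st."])

# Treat whitespace-only strings as empty too.
is_empty = specific_location.str.strip() == ""

# Share of stops with no specific location in their name.
empty_share = is_empty.mean()

# Rank only the non-empty descriptors.
top = specific_location[~is_empty].value_counts().head(5)

print(f"empty share: {empty_share:.2f}")
print(top)
```

Reporting the empty share on its own keeps the ranking focused on descriptors while still showing how many names lack one.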
Summary of Findings
The analysis of the dataset produced the following findings:
Insights:
- The dataset contains bus stop information primarily from the Královéhradecký kraj region of the Czech Republic.
- Many bus stops are represented by two entries in the dataset, likely representing opposite directions of travel.
- Several locations have multiple bus stops with slightly varying names and locations, indicating possible route variations or multiple platforms.
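- The x2 and y2 columns are consistent with a Web Mercator (EPSG:3857) projection of the WGS84 longitude and latitude. For the first row, longitude 16.108505 corresponds to x2 ≈ 1.793191e+06 and latitude 50.618616 to y2 ≈ 6.554107e+06.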
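- The 'Název' field often contains location descriptors such as 'ObÚ' (obecní úřad, municipal office), 'žel.st.' (železniční stanice, railway station), and 'nám.' (náměstí, square), which indicates that stops are commonly named after nearby landmarks.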
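The relationship between x2/y2 and the WGS84 columns can be tested directly. The sketch below assumes the transformation is the spherical Web Mercator projection (EPSG:3857); that is an assumption, not something the dataset states. The function name to_web_mercator is hypothetical, and the inputs are the first preview row.

```python
import math

# WGS84 semi-major axis in metres, used as the sphere radius by Web Mercator.
EARTH_RADIUS_M = 6378137.0


def to_web_mercator(lon_deg, lat_deg):
    """Project WGS84 degrees to spherical Web Mercator metres."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(
        math.tan(math.pi / 4 + math.radians(lat_deg) / 2)
    )
    return x, y


# First preview row: longitude, latitude, and the x2/y2 values as displayed.
x, y = to_web_mercator(16.108505, 50.618616)
dx = x - 1.793191e6
dy = y - 6.554107e6

print(f"x difference: {dx:.2f} m")
print(f"y difference: {dy:.2f} m")
```

The displayed x2 and y2 values are rounded to seven significant figures, so differences of about a metre are expected even if the assumption holds. Running the same comparison over every row would confirm or refute it for the full dataset.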
Anomalies:
- Inconsistency in naming conventions within the 'Název' field (e.g., "Adršpach,Dolní Adršpach,zámek" vs "Nová Ves,,Kvartýr").
- Some 'Název' values are missing a key descriptor (e.g., "Bašnice,559,..." where the second field is empty).
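- Many stop names appear more than once (e.g. the first two rows). These rows differ in coordinates and identifiers, so they are not exact duplicates and drop_duplicates() does not remove them.
- The identifiers are not in strictly ascending order: an AZI13 entry appears after AZI429.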
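The identifier anomaly in the list above can be located programmatically. The sketch below extracts the numeric suffix and flags entries whose number is lower than the previous one, then checks for repeated identifiers. Only the AZI429 followed by AZI13 sequence is taken from the findings; the surrounding identifiers are hypothetical.

```python
import pandas as pd

# AZI429 followed by AZI13 is the reported case; AZI428 and AZI430 are hypothetical.
ids = pd.Series(["AZI428", "AZI429", "AZI13", "AZI430"])

# Pull out the numeric suffix of each identifier.
numbers = ids.str.extract(r"^AZI(\d+)$", expand=False).astype(int)

# An entry is out of order when its number is lower than the one before it.
out_of_order = ids[numbers.diff() < 0]

# Identifiers that occur more than once.
repeated = ids[ids.duplicated(keep=False)]

print(out_of_order.tolist())
print(repeated.tolist())
```

Checking both conditions distinguishes a row that is merely misplaced from an identifier that is reused.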
Predictions: The dataset primarily describes bus stop locations and characteristics. There are no time-series data or other elements that lend themselves to time-based predictions. Simple predictive location analysis based on density isn't meaningful without external data (e.g. population, time of day).
Summary: This CSV data represents a catalog of bus stops within the Královéhradecký kraj, detailing their names, locations (latitude, longitude, and transformed x2/y2 coordinates), administrative region, and unique identifiers. Data quality issues exist in naming conventions. Due to a lack of temporal or related data, predictive analysis is not applicable. Further analysis could focus on spatial distribution, clustering of stops, or relationships between bus stop locations and population density with external data.