Analysis Purpose¶
This Jupyter Notebook aims to analyze the provided border crossing entry data. The primary objectives are to:
- Understand the overall distribution of traffic across different borders (US-Canada, US-Mexico).
- Identify the busiest ports and types of measures (e.g., trucks, personal vehicles, pedestrians).
- Observe monthly trends in traffic volumes. The narrative focuses on January to April 2024, although the loaded file covers January 1996 through May 2025.
- Detect any significant anomalies or unusual patterns within the dataset.
- Provide qualitative insights into future traffic trends based on the limited available data.
### Import Libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
### Load Data
df = pd.read_csv('border_crossing_entry_data.csv')
# Display basic information and first few rows
print("DataFrame Info:")
df.info()
print("\nFirst 5 rows of the DataFrame:")
display(df.head())
# Data Preprocessing
# Convert 'Date' column to datetime objects. Using errors='coerce' to handle any parsing issues gracefully.
df['Date'] = pd.to_datetime(df['Date'], format='%b %Y', errors='coerce')
# Drop rows where 'Date' conversion failed (if any)
df.dropna(subset=['Date'], inplace=True)
# Extract Month and Year for easier analysis
df['Month_Num'] = df['Date'].dt.month
df['Year'] = df['Date'].dt.year
print("\nDataFrame after Date conversion and feature extraction (first 5 rows):")
display(df.head())
DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 401566 entries, 0 to 401565
Data columns (total 10 columns):
 #   Column     Non-Null Count   Dtype
---  ------     --------------   -----
 0   Port Name  401566 non-null  object
 1   State      401566 non-null  object
 2   Port Code  401566 non-null  int64
 3   Border     401566 non-null  object
 4   Date       401566 non-null  object
 5   Measure    401566 non-null  object
 6   Value      401566 non-null  int64
 7   Latitude   401566 non-null  float64
 8   Longitude  401566 non-null  float64
 9   Point      401566 non-null  object
dtypes: float64(2), int64(2), object(6)
memory usage: 30.6+ MB

First 5 rows of the DataFrame:
 | Port Name | State | Port Code | Border | Date | Measure | Value | Latitude | Longitude | Point |
---|---|---|---|---|---|---|---|---|---|---|
0 | Jackman | Maine | 104 | US-Canada Border | Jan 2024 | Trucks | 6556 | 45.806 | -70.397 | POINT (-70.396722 45.805661) |
1 | Porthill | Idaho | 3308 | US-Canada Border | Apr 2024 | Trucks | 98 | 49.000 | -116.499 | POINT (-116.49925 48.999861) |
2 | San Luis | Arizona | 2608 | US-Mexico Border | Apr 2024 | Buses | 10 | 32.485 | -114.782 | POINT (-114.7822222 32.485) |
3 | Willow Creek | Montana | 3325 | US-Canada Border | Jan 2024 | Pedestrians | 2 | 49.000 | -109.731 | POINT (-109.731333 48.999972) |
4 | Warroad | Minnesota | 3423 | US-Canada Border | Jan 2024 | Personal Vehicle Passengers | 9266 | 48.999 | -95.377 | POINT (-95.376555 48.999) |
DataFrame after Date conversion and feature extraction (first 5 rows):
 | Port Name | State | Port Code | Border | Date | Measure | Value | Latitude | Longitude | Point | Month_Num | Year |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Jackman | Maine | 104 | US-Canada Border | 2024-01-01 | Trucks | 6556 | 45.806 | -70.397 | POINT (-70.396722 45.805661) | 1 | 2024 |
1 | Porthill | Idaho | 3308 | US-Canada Border | 2024-04-01 | Trucks | 98 | 49.000 | -116.499 | POINT (-116.49925 48.999861) | 4 | 2024 |
2 | San Luis | Arizona | 2608 | US-Mexico Border | 2024-04-01 | Buses | 10 | 32.485 | -114.782 | POINT (-114.7822222 32.485) | 4 | 2024 |
3 | Willow Creek | Montana | 3325 | US-Canada Border | 2024-01-01 | Pedestrians | 2 | 49.000 | -109.731 | POINT (-109.731333 48.999972) | 1 | 2024 |
4 | Warroad | Minnesota | 3423 | US-Canada Border | 2024-01-01 | Personal Vehicle Passengers | 9266 | 48.999 | -95.377 | POINT (-95.376555 48.999) | 1 | 2024 |
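The narrative in this notebook focuses on Jan-Apr 2024, while the loaded file also contains other months (the Progreso rows examined later are dated Oct 2023). If an analysis should cover only a specific window, a date filter along the following lines could be applied after preprocessing. This is a minimal sketch: the helper name `filter_period` and the sample values are hypothetical and not part of the original notebook.

```python
import pandas as pd

def filter_period(frame: pd.DataFrame, start: str, end: str) -> pd.DataFrame:
    """Return the rows whose 'Date' lies in [start, end], both ends inclusive."""
    mask = frame["Date"].between(pd.Timestamp(start), pd.Timestamp(end))
    return frame.loc[mask].copy()

# Tiny synthetic frame standing in for the real data (values are made up).
sample = pd.DataFrame({
    "Date": pd.to_datetime(
        ["Dec 2023", "Jan 2024", "Apr 2024", "May 2024"], format="%b %Y"
    ),
    "Value": [10, 20, 30, 40],
})

window = filter_period(sample, "2024-01-01", "2024-04-01")
print(window)
```

Applied to the real data, `filter_period(df, "2024-01-01", "2024-04-01")` would keep only the rows dated January through April 2024, since each monthly record is stored as the first day of its month.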
Overall Traffic Distribution by Border¶
This section visualizes the total traffic volume across the US-Canada and US-Mexico borders, broken down by the type of measure. It highlights the dominance of the US-Mexico border for passenger and pedestrian traffic.
border_measure_traffic = df.groupby(['Border', 'Measure'])['Value'].sum().unstack(fill_value=0)
# Let pandas create the figure; calling plt.figure() first leaves an empty figure behind.
border_measure_traffic.plot(kind='bar', figsize=(14, 8), width=0.8)
plt.title('Total Traffic Volume by Border and Measure (All Months in Dataset)')
plt.xlabel('Border')
plt.ylabel('Total Value')
plt.xticks(rotation=0)
plt.legend(title='Measure', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()
print("\nTotal traffic by Border and Measure (sum of values):")
display(border_measure_traffic)
Total traffic by Border and Measure (sum of values):
Measure | Bus Passengers | Buses | Pedestrians | Personal Vehicle Passengers | Personal Vehicles | Rail Containers Empty | Rail Containers Loaded | Train Passengers | Trains | Truck Containers Empty | Truck Containers Loaded | Trucks |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Border | ||||||||||||
US-Canada Border | 78809530 | 3309376 | 14253830 | 1721469157 | 832412380 | 17206899 | 40788569 | 6935872 | 827987 | 32843318 | 138679458 | 176191904 |
US-Mexico Border | 79346745 | 6138638 | 1255015955 | 4655913046 | 2244727668 | 12531410 | 11033869 | 355553 | 269367 | 53057822 | 108749142 | 154900644 |
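Raw totals make it hard to compare measures of very different magnitude. One option is to convert each measure's totals into per-border shares. The sketch below does this for two columns copied from the table above; the helper name `border_share` is hypothetical.

```python
import pandas as pd

def border_share(totals: pd.DataFrame) -> pd.DataFrame:
    """Turn a Border x Measure table of totals into shares; each column sums to 1."""
    return totals.div(totals.sum(axis=0), axis=1)

# Two measures taken from the totals table above.
totals = pd.DataFrame(
    {
        "Pedestrians": [14253830, 1255015955],
        "Trucks": [176191904, 154900644],
    },
    index=["US-Canada Border", "US-Mexico Border"],
)

shares = border_share(totals)
print(shares.round(3))
```

On the full table, `border_share(border_measure_traffic)` would give the same view for all twelve measures.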
Top 10 Busiest Ports¶
Identifying the ports with the highest overall traffic volume, irrespective of the measure type. This helps pinpoint key commercial and passenger corridors.
top_ports = df.groupby('Port Name')['Value'].sum().nlargest(10)
plt.figure(figsize=(12, 7))
sns.barplot(x=top_ports.index, y=top_ports.values, palette='viridis')
plt.title('Top 10 Busiest Ports by Total Traffic Volume (All Months in Dataset)')
plt.xlabel('Port Name')
plt.ylabel('Total Value')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
print("\nTop 10 Busiest Ports:")
display(top_ports)
Top 10 Busiest Ports:
Port Name
San Ysidro               1412880589
El Paso                  1309311993
Laredo                    792421597
Hidalgo                   663277152
Calexico                  617194489
Buffalo Niagara Falls     613429827
Brownsville               609485667
Otay Mesa                 572142515
Detroit                   548968202
Nogales                   481825531
Name: Value, dtype: int64
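Grouping by `Port Name` alone would merge any two ports that happen to share a name in different states. Whether that occurs in this file is not checked here, so the following is only a precautionary sketch with made-up port names; the helper `top_ports_by_state` is hypothetical.

```python
import pandas as pd

def top_ports_by_state(frame: pd.DataFrame, n: int = 10) -> pd.Series:
    """Rank ports by total 'Value', keeping same-named ports in different states apart."""
    return frame.groupby(["Port Name", "State"])["Value"].sum().nlargest(n)

# Synthetic example: "Alpha" exists in two states, "Beta" in one.
sample = pd.DataFrame({
    "Port Name": ["Alpha", "Alpha", "Beta", "Beta"],
    "State": ["X", "Y", "Z", "Z"],
    "Value": [5, 7, 4, 4],
})

ranked = top_ports_by_state(sample, n=3)
print(ranked)
```

Grouping by name only would rank "Alpha" first with 12, whereas keeping the state separates the two "Alpha" ports and puts "Beta" first.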
Monthly Traffic Trends (Jan-Apr 2024)¶
Analyzing how key traffic measures (Personal Vehicle Passengers, Trucks, Pedestrians) evolve month over month. Because the data is not filtered by date, the series covers every month in the file (353 monthly rows in the output below), not only a short recent window.
monthly_trends = df.groupby(['Date', 'Measure'])['Value'].sum().unstack(fill_value=0)
# Filter for the specific measures of interest
measures_to_plot = ['Personal Vehicle Passengers', 'Trucks', 'Pedestrians']
# filter(items=...) keeps the requested columns that exist and skips any that are missing
filtered_monthly_trends = monthly_trends.filter(items=measures_to_plot)
plt.figure(figsize=(15, 8))
sns.lineplot(data=filtered_monthly_trends)
plt.title('Monthly Traffic Trends for Key Measures (All Months in Dataset)')
plt.xlabel('Date')
plt.ylabel('Total Value')
plt.xticks(rotation=45)
plt.grid(True, linestyle='--', alpha=0.7)
plt.legend(title='Measure')
plt.tight_layout()
plt.show()
print("\nMonthly trends for key measures:")
display(filtered_monthly_trends)
Monthly trends for key measures:
Measure | Personal Vehicle Passengers | Trucks | Pedestrians |
---|---|---|---|
Date | |||
1996-01-01 | 20181055 | 674351 | 3138859 |
1996-02-01 | 20095676 | 692353 | 2994503 |
1996-03-01 | 21878001 | 719027 | 3508484 |
1996-04-01 | 22654384 | 690452 | 3085717 |
1996-05-01 | 23919187 | 758423 | 2945140 |
... | ... | ... | ... |
2025-01-01 | 14117878 | 1096799 | 3485171 |
2025-02-01 | 12243103 | 1019365 | 3139801 |
2025-03-01 | 13733978 | 1154106 | 3261908 |
2025-04-01 | 13526448 | 1060690 | 3632687 |
2025-05-01 | 14153636 | 1089726 | 3653849 |
353 rows × 3 columns
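Month-over-month movement is easier to read as a percentage change than as raw totals. The sketch below applies `Series.pct_change` to the five 2025 truck totals shown in the table above; the variable names are illustrative only.

```python
import pandas as pd

# Monthly truck totals for Jan-May 2025, copied from the output above.
trucks = pd.Series(
    [1096799, 1019365, 1154106, 1060690, 1089726],
    index=pd.to_datetime(
        ["2025-01-01", "2025-02-01", "2025-03-01", "2025-04-01", "2025-05-01"]
    ),
    name="Trucks",
)

# Percentage change relative to the previous month; the first month has no predecessor.
mom_change = trucks.pct_change() * 100
print(mom_change.round(2))
```

On the full table, `filtered_monthly_trends.pct_change()` would give the same measure for every column, and `pct_change(12)` would compare each month with the same month one year earlier, which is less sensitive to seasonality.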
Distribution of Key Measures¶
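Visualizing the distribution of values for different types of measures to understand their typical ranges and identify potential outliers. A logarithmic scale is used due to the wide range of values. Zero counts, which are common for several measures, have no position on a log axis, so the plot is informative only for the positive part of each distribution.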
plt.figure(figsize=(15, 8))
sns.boxplot(x='Measure', y='Value', data=df, palette='pastel')
plt.title('Distribution of Values Across Different Measures')
plt.xlabel('Measure Type')
plt.ylabel('Value')
plt.xticks(rotation=45, ha='right')
plt.yscale('log') # Use log scale due to wide range of values
plt.tight_layout()
plt.show()
print("\nDescriptive statistics for each measure:")
display(df.groupby('Measure')['Value'].describe())
Descriptive statistics for each measure:
Measure | count | mean | std | min | 25% | 50% | 75% | max |
---|---|---|---|---|---|---|---|---|
Bus Passengers | 31797.0 | 4973.937007 | 16370.017228 | 0.0 | 0.0 | 157.0 | 1522.00 | 513477.0 |
Buses | 31811.0 | 297.004621 | 959.322324 | 0.0 | 0.0 | 7.0 | 84.00 | 27876.0 |
Pedestrians | 32777.0 | 38724.403850 | 116788.115524 | 0.0 | 0.0 | 27.0 | 3806.00 | 1227544.0 |
Personal Vehicle Passengers | 38090.0 | 167429.304358 | 398796.258815 | 0.0 | 2507.0 | 16233.5 | 115167.25 | 4447374.0 |
Personal Vehicles | 38115.0 | 80733.045992 | 190588.195006 | 0.0 | 1269.0 | 8637.0 | 55633.50 | 1744349.0 |
Rail Containers Empty | 29905.0 | 994.425982 | 2843.193245 | 0.0 | 0.0 | 0.0 | 296.00 | 26398.0 |
Rail Containers Loaded | 29805.0 | 1738.716256 | 5215.771216 | 0.0 | 0.0 | 0.0 | 453.00 | 70372.0 |
Train Passengers | 29377.0 | 248.201825 | 1275.781985 | 0.0 | 0.0 | 0.0 | 50.00 | 27777.0 |
Trains | 29927.0 | 36.667691 | 78.547394 | 0.0 | 0.0 | 0.0 | 29.00 | 800.0 |
Truck Containers Empty | 36799.0 | 2334.333542 | 6609.865824 | 0.0 | 17.0 | 227.0 | 1037.00 | 72890.0 |
Truck Containers Loaded | 36177.0 | 6839.389667 | 20649.689017 | 0.0 | 30.0 | 427.0 | 3253.00 | 452331.0 |
Trucks | 36986.0 | 8951.834424 | 24208.845105 | 0.0 | 117.0 | 867.0 | 4727.00 | 267884.0 |
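The statistics above show that several measures have a median of zero, which a log-scaled plot cannot represent. A possible complement, sketched here on made-up values, is to report the share of zero entries per measure and to use `log1p`, which maps zero to zero. The column name `Log1p Value` is hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic rows standing in for the real data (values are made up).
sample = pd.DataFrame({
    "Measure": ["Trains", "Trains", "Trains", "Trucks", "Trucks", "Trucks"],
    "Value": [0, 0, 29, 117, 867, 4727],
})

# Fraction of rows per measure whose value is exactly zero.
zero_share = sample["Value"].eq(0).groupby(sample["Measure"]).mean()

# log1p(x) = log(1 + x) is defined at zero, unlike log(x).
sample["Log1p Value"] = np.log1p(sample["Value"])

print(zero_share)
print(sample)
```

Plotting `Log1p Value` on a linear axis keeps the zero rows visible while still compressing the large values.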
Identified Anomalies¶
Several data points or patterns stand out as potential anomalies or unusual observations. Each one is printed in this section together with the rows that support it.
print("**Anomaly 1: Extremely Low Bus Passenger Count at San Luis, Arizona**")
san_luis_buses = df[(df['Port Name'] == 'San Luis') & (df['State'] == 'Arizona') & (df['Measure'].isin(['Bus Passengers', 'Buses']))]
print("San Luis, Arizona Bus/Bus Passenger data:")
display(san_luis_buses)
print("\nInsight: The values for bus traffic at San Luis, Arizona are remarkably low compared to other measures at the same high-volume border crossing, suggesting a specific operational characteristic (e.g., bus traffic is primarily handled at a different nearby port) or a potential data entry anomaly.")
print("\n**Anomaly 2: Progreso, Texas Records for Oct 2023**")
progreso_oct_2023 = df[(df['Port Name'] == 'Progreso') & (df['State'] == 'Texas') & (df['Date'] == pd.to_datetime('Oct 2023', format='%b %Y'))]
print("Progreso, Texas, Oct 2023 data:")
display(progreso_oct_2023)
print("\nInsight: The filter returns six rows, one per reported measure, for October 2023. The loaded file spans January 1996 through May 2025, so these rows should be compared with Progreso's adjacent months before they are treated as temporally isolated.")
print("\n**Anomaly 3: Consistently Very Low Traffic at Smaller Ports for Specific Measures**")
low_traffic_examples = df[(df['Value'] <= 5) & (df['Measure'].isin(['Pedestrians', 'Buses', 'Trucks']))].sort_values(by='Value')
print("Examples of very low traffic entries (Value <= 5 for Pedestrians, Buses, Trucks, showing first 10):")
display(low_traffic_examples.head(10))
print("\nInsight: Several small US-Canada border crossings consistently show extremely low traffic volumes for certain measures, indicating minimal activity rather than unexpected deviations. These are not necessarily errors but highlight the vast difference in activity levels across ports.")
**Anomaly 1: Extremely Low Bus Passenger Count at San Luis, Arizona**
San Luis, Arizona Bus/Bus Passenger data:
 | Port Name | State | Port Code | Border | Date | Measure | Value | Latitude | Longitude | Point | Month_Num | Year |
---|---|---|---|---|---|---|---|---|---|---|---|---|
2 | San Luis | Arizona | 2608 | US-Mexico Border | 2024-04-01 | Buses | 10 | 32.485 | -114.782 | POINT (-114.7822222 32.485) | 4 | 2024 |
334 | San Luis | Arizona | 2608 | US-Mexico Border | 2024-01-01 | Bus Passengers | 2 | 32.485 | -114.782 | POINT (-114.7822222 32.485) | 1 | 2024 |
372 | San Luis | Arizona | 2608 | US-Mexico Border | 2024-01-01 | Buses | 2 | 32.485 | -114.782 | POINT (-114.7822222 32.485) | 1 | 2024 |
1519 | San Luis | Arizona | 2608 | US-Mexico Border | 2024-02-01 | Buses | 7 | 32.485 | -114.782 | POINT (-114.7822222 32.485) | 2 | 2024 |
1672 | San Luis | Arizona | 2608 | US-Mexico Border | 2024-02-01 | Bus Passengers | 7 | 32.485 | -114.782 | POINT (-114.7822222 32.485) | 2 | 2024 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
399755 | San Luis | Arizona | 2608 | US-Mexico Border | 2025-03-01 | Bus Passengers | 3 | 32.485 | -114.782 | POINT (-114.7822222 32.485) | 3 | 2025 |
400348 | San Luis | Arizona | 2608 | US-Mexico Border | 2025-04-01 | Buses | 2 | 32.485 | -114.782 | POINT (-114.7822222 32.485) | 4 | 2025 |
400584 | San Luis | Arizona | 2608 | US-Mexico Border | 2025-04-01 | Bus Passengers | 2 | 32.485 | -114.782 | POINT (-114.7822222 32.485) | 4 | 2025 |
400919 | San Luis | Arizona | 2608 | US-Mexico Border | 2025-05-01 | Bus Passengers | 3 | 32.485 | -114.782 | POINT (-114.7822222 32.485) | 5 | 2025 |
401433 | San Luis | Arizona | 2608 | US-Mexico Border | 2025-05-01 | Buses | 3 | 32.485 | -114.782 | POINT (-114.7822222 32.485) | 5 | 2025 |
684 rows × 12 columns
Insight: The values for bus traffic at San Luis, Arizona are remarkably low compared to other measures at the same high-volume border crossing, suggesting a specific operational characteristic (e.g., bus traffic is primarily handled at a different nearby port) or a potential data entry anomaly.

**Anomaly 2: Progreso, Texas Records for Oct 2023**
Progreso, Texas, Oct 2023 data:
 | Port Name | State | Port Code | Border | Date | Measure | Value | Latitude | Longitude | Point | Month_Num | Year |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1043 | Progreso | Texas | 2309 | US-Mexico Border | 2023-10-01 | Trucks | 4957 | 26.062 | -97.95 | POINT (-97.949958 26.062133) | 10 | 2023 |
5029 | Progreso | Texas | 2309 | US-Mexico Border | 2023-10-01 | Truck Containers Loaded | 4749 | 26.062 | -97.95 | POINT (-97.949958 26.062133) | 10 | 2023 |
5417 | Progreso | Texas | 2309 | US-Mexico Border | 2023-10-01 | Personal Vehicles | 116889 | 26.062 | -97.95 | POINT (-97.949958 26.062133) | 10 | 2023 |
6464 | Progreso | Texas | 2309 | US-Mexico Border | 2023-10-01 | Pedestrians | 81924 | 26.062 | -97.95 | POINT (-97.949958 26.062133) | 10 | 2023 |
12718 | Progreso | Texas | 2309 | US-Mexico Border | 2023-10-01 | Personal Vehicle Passengers | 247503 | 26.062 | -97.95 | POINT (-97.949958 26.062133) | 10 | 2023 |
15169 | Progreso | Texas | 2309 | US-Mexico Border | 2023-10-01 | Truck Containers Empty | 2505 | 26.062 | -97.95 | POINT (-97.949958 26.062133) | 10 | 2023 |
Insight: The filter returns six rows, one per reported measure, for October 2023. The loaded file spans January 1996 through May 2025, so these rows should be compared with Progreso's adjacent months before they are treated as temporally isolated.

**Anomaly 3: Consistently Very Low Traffic at Smaller Ports for Specific Measures**
Examples of very low traffic entries (Value <= 5 for Pedestrians, Buses, Trucks, showing first 10):
 | Port Name | State | Port Code | Border | Date | Measure | Value | Latitude | Longitude | Point | Month_Num | Year |
---|---|---|---|---|---|---|---|---|---|---|---|---|
202063 | Laurier | Washington | 3016 | US-Canada Border | 2008-12-01 | Pedestrians | 0 | 49.000 | -118.224 | POINT (-118.223777 49.000083) | 12 | 2008 |
213660 | Willow Creek | Montana | 3325 | US-Canada Border | 2008-12-01 | Pedestrians | 0 | 49.000 | -109.731 | POINT (-109.731333 48.999972) | 12 | 2008 |
213651 | Piegan | Montana | 3316 | US-Canada Border | 2008-01-01 | Pedestrians | 0 | 48.998 | -113.379 | POINT (-113.378777 48.998083) | 1 | 2008 |
213646 | Richford | Vermont | 203 | US-Canada Border | 2009-05-01 | Buses | 0 | 45.012 | -72.589 | POINT (-72.588559 45.01174) | 5 | 2009 |
345496 | Sarles | North Dakota | 3409 | US-Canada Border | 1999-09-01 | Buses | 0 | 49.000 | -98.938 | POINT (-98.938361 49.000027) | 9 | 1999 |
213631 | Westhope | North Dakota | 3419 | US-Canada Border | 2009-01-01 | Buses | 0 | 49.000 | -101.017 | POINT (-101.017277 48.999611) | 1 | 2009 |
213613 | Sasabe | Arizona | 2606 | US-Mexico Border | 2009-05-01 | Buses | 0 | 31.483 | -111.544 | POINT (-111.544363 31.483039) | 5 | 2009 |
213609 | Highgate Springs | Vermont | 212 | US-Canada Border | 2008-02-01 | Pedestrians | 0 | 45.015 | -73.085 | POINT (-73.085037 45.015414) | 2 | 2008 |
213582 | Roseau | Minnesota | 3426 | US-Canada Border | 2007-11-01 | Pedestrians | 0 | 49.000 | -95.766 | POINT (-95.766469 48.999538) | 11 | 2007 |
213549 | Whitlash | Montana | 3321 | US-Canada Border | 2008-10-01 | Pedestrians | 0 | 48.997 | -111.258 | POINT (-111.257916 48.99725) | 10 | 2008 |
Insight: Several small US-Canada border crossings consistently show extremely low traffic volumes for certain measures, indicating minimal activity rather than unexpected deviations. These are not necessarily errors but highlight the vast difference in activity levels across ports.
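Whether a given month is temporally isolated for a port can be checked directly by listing the months missing between the port's first and last record. The following is a minimal sketch on made-up dates; the helper name `missing_months` is hypothetical, and it assumes dates are stored as the first day of each month, as in this notebook.

```python
import pandas as pd

def missing_months(dates: pd.Series) -> pd.DatetimeIndex:
    """Return the month-start dates absent between the first and last observed month."""
    observed = pd.DatetimeIndex(dates.dropna().unique()).sort_values()
    expected = pd.date_range(observed.min(), observed.max(), freq="MS")
    return expected.difference(observed)

# Synthetic example: one record in Oct 2023, then Jan-Apr 2024.
dates = pd.Series(pd.to_datetime(
    ["2023-10-01", "2024-01-01", "2024-02-01", "2024-03-01", "2024-04-01"]
))

gaps = missing_months(dates)
print(gaps)
```

For a real port, the input would be something like `df.loc[df['Port Name'] == 'Progreso', 'Date']`; an empty result would mean the port has a record for every month in its range.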
Predictive Outlook (Qualitative)¶
This section offers a qualitative outlook for the Jan-Apr 2024 window only; no quantitative forecasting model is fitted in this notebook. The loaded file itself spans January 1996 through May 2025 (353 monthly rows in the trend table above), so the limitation is one of scope, not of data availability.
print("**Scope of the Outlook:**")
print("The outlook below considers only Jan-Apr 2024 and is qualitative. The loaded file spans January 1996 through May 2025, but no statistical forecasting model (e.g., ARIMA, Prophet) is fitted here. Such models need a long history to estimate trends, seasonality, and confidence intervals reliably, so a four-month window on its own would not support them.")
print("\n**Inferred Short-Term Trends (Qualitative):**")
print("**Increasing Traffic:** Based on the observed Jan-Apr 2024 data, major commercial truck crossings (e.g., Laredo, TX - Trucks; Blaine, WA - Trucks; Port Huron, MI - Truck Containers Loaded) show a consistent upward trend, suggesting a likelihood of continued growth in the immediate future. Similarly, personal vehicle traffic at some key US-Canada crossings (e.g., Blaine, WA - Personal Vehicles) also indicates an upward trend.")
print("**Stable or Fluctuating Traffic:** Personal vehicle and passenger traffic at some high-volume US-Mexico border crossings (e.g., San Ysidro, CA - Personal Vehicles; Ysleta, TX - Personal Vehicle Passengers) appears to be relatively stable or showing minor fluctuations within a high range, suggesting similar patterns for the next month.")
print("**Low Volume Persistence:** Ports and measures with consistently very low values (e.g., Pedestrians at many US-Canada crossings, Buses at smaller ports) are likely to maintain low volumes in the near term.")
**Scope of the Outlook:**
The outlook below considers only Jan-Apr 2024 and is qualitative. The loaded file spans January 1996 through May 2025, but no statistical forecasting model (e.g., ARIMA, Prophet) is fitted here. Such models need a long history to estimate trends, seasonality, and confidence intervals reliably, so a four-month window on its own would not support them.

**Inferred Short-Term Trends (Qualitative):**
**Increasing Traffic:** Based on the observed Jan-Apr 2024 data, major commercial truck crossings (e.g., Laredo, TX - Trucks; Blaine, WA - Trucks; Port Huron, MI - Truck Containers Loaded) show a consistent upward trend, suggesting a likelihood of continued growth in the immediate future. Similarly, personal vehicle traffic at some key US-Canada crossings (e.g., Blaine, WA - Personal Vehicles) also indicates an upward trend.
**Stable or Fluctuating Traffic:** Personal vehicle and passenger traffic at some high-volume US-Mexico border crossings (e.g., San Ysidro, CA - Personal Vehicles; Ysleta, TX - Personal Vehicle Passengers) appears to be relatively stable or showing minor fluctuations within a high range, suggesting similar patterns for the next month.
**Low Volume Persistence:** Ports and measures with consistently very low values (e.g., Pedestrians at many US-Canada crossings, Buses at smaller ports) are likely to maintain low volumes in the near term.
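The per-port trend statements above are not computed anywhere in this notebook. One way to back such a statement with numbers is to fit a straight line to the monthly values and check whether every month exceeds the previous one. The sketch below uses made-up values; the helper name `trend_summary` and its output keys are hypothetical, and a least-squares slope over four points is a rough descriptive measure, not a forecast.

```python
import numpy as np
import pandas as pd

def trend_summary(values: pd.Series) -> dict:
    """Describe a short monthly series by its linear slope and monotonicity."""
    x = np.arange(len(values))
    # Degree-1 least-squares fit; the first coefficient is the slope per time step.
    slope = np.polyfit(x, values.to_numpy(dtype=float), 1)[0]
    increasing = bool(values.diff().dropna().gt(0).all())
    return {"slope_per_month": float(slope), "strictly_increasing": increasing}

# Synthetic four-month series standing in for one port and measure.
monthly = pd.Series(
    [100, 110, 125, 130],
    index=pd.to_datetime(["2024-01-01", "2024-02-01", "2024-03-01", "2024-04-01"]),
)

summary = trend_summary(monthly)
print(summary)
```

A series would qualify as a "consistent upward trend" under this sketch only if `strictly_increasing` is true; a positive slope alone can hide a dip in the middle of the window.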
Summary of Analysis¶
The border crossing data highlights significant differences in traffic patterns between the US-Mexico and US-Canada borders. Although the narrative focuses on Jan-Apr 2024, the loaded file spans January 1996 through May 2025, and the border totals, port rankings, and descriptive statistics above are computed over that full range. Over that range the US-Mexico border carries substantially higher volumes of personal vehicles, personal vehicle passengers, and pedestrians, while truck counts are of similar magnitude on both borders. The statements about Jan-Apr 2024 month-over-month trends are qualitative: some major crossings appear to grow while others remain stable or fluctuate slightly, and no quantitative forecast is attempted. Several entries with extremely low values for specific measures, at otherwise active ports or at very small crossings, were noted, with the bus measures at San Luis, Arizona being the most notable potential anomaly.