CSV Data Analysis of Travel Itineraries
This Jupyter Notebook performs exploratory data analysis on a CSV dataset containing information about travel itineraries. The analysis aims to provide insights into travel patterns, popular destinations, budget distributions, and user engagement metrics. We will load the data using pandas, visualize key aspects using matplotlib and seaborn, and highlight potential anomalies and trends within the dataset.
In [1]:
# Import Libraries
In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import json
%matplotlib inline
In [3]:
# Load CSV Data
In [4]:
# Load the itineraries CSV file into a pandas DataFrame
try:
    df = pd.read_csv('itineraries.csv')
    print("CSV data loaded successfully.")
except FileNotFoundError:
    print("Error: itineraries.csv not found. Please ensure the file is in the correct directory.")
    df = None  # Assign None to df in case of error
if df is not None:
    print(df.head())
    print(df.info())
CSV data loaded successfully.
                                  _id  \
0  ObjectId(6713c211cb4b48e644dcbb41)
1  ObjectId(6713f46a91c1ab1760bef904)
2  ObjectId(671426f7b46cd8d269813175)
3  ObjectId(67142c5cb46cd8d2698132bc)
4  ObjectId(671433aab46cd8d269813a6d)

                                           slug user destination  duration  \
0        prague-solo-2days-budget-500dollars-en  NaN      Prague         2
1  amsterdam-friends-3days-budget-500dollars-en  NaN   Amsterdam         3
2       utrecht-solo-2days-budget-500dollars-en  NaN     Utrecht         2
3        canada-solo-2days-budget-500dollars-en  NaN      Canada         2
4     barcelona-solo-2days-budget-500dollars-en  NaN   Barcelona         2

  countryLang countryCurrency                 createdAt  lastModified budget  \
0          cs             CZK  2024-10-19T14:28:33.840Z           NaN   $500
1          nl             EUR  2024-10-19T18:03:22.641Z           NaN   $500
2         NaN             NaN  2024-10-19T21:39:03.026Z           NaN   $500
3         NaN             NaN  2024-10-19T22:02:04.343Z           NaN   $500
4          es             EUR  2024-10-19T22:33:14.353Z           NaN   $500

   ...                                            weather  carbonFootprint  \
0  ...                                    {"forecast":[]}              NaN
1  ...                {"temperature":14.66,"forecast":[]}              NaN
2  ...  {"conditions":"overcast clouds","temperature":...              NaN
3  ...  {"conditions":"overcast clouds","temperature":...              NaN
4  ...  {"conditions":"clear sky","temperature":17.33,...              NaN

  collaborators  ratings comments  socialShares  isPublic comments_1  views  \
0            []      NaN       []           NaN      True         []   1184
1            []      NaN       []           NaN      True         []    132
2            []      NaN       []           NaN      True         []     28
3            []      NaN       []           NaN      True         []     77
4            []      NaN       []           NaN      True         []    116

   likes
0     13
1      0
2      1
3      1
4      1

[5 rows x 24 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1132 entries, 0 to 1131
Data columns (total 24 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   _id              1132 non-null   object
 1   slug             1132 non-null   object
 2   user             145 non-null    object
 3   destination      1132 non-null   object
 4   duration         1132 non-null   int64
 5   countryLang      960 non-null    object
 6   countryCurrency  888 non-null    object
 7   createdAt        1132 non-null   object
 8   lastModified     0 non-null      float64
 9   budget           1132 non-null   object
 10  news             0 non-null      float64
 11  language         1132 non-null   object
 12  preferences      1132 non-null   object
 13  coordinates      0 non-null      float64
 14  weather          982 non-null    object
 15  carbonFootprint  0 non-null      float64
 16  collaborators    1132 non-null   object
 17  ratings          0 non-null      float64
 18  comments         1132 non-null   object
 19  socialShares     0 non-null      float64
 20  isPublic         1132 non-null   bool
 21  comments_1       1132 non-null   object
 22  views            1132 non-null   int64
 23  likes            1132 non-null   int64
dtypes: bool(1), float64(6), int64(3), object(14)
memory usage: 204.6+ KB
None
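The `info()` output above shows six columns with no non-null values (`lastModified`, `news`, `coordinates`, `carbonFootprint`, `ratings`, `socialShares`) and `createdAt` stored as strings. The following is a minimal sketch of tidying this before analysis. The helper name `tidy_itineraries` is hypothetical, and the small frame is a stand-in built from values in the preview rather than the real file.

```python
import numpy as np
import pandas as pd


def tidy_itineraries(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy without all-null columns and with createdAt parsed (hypothetical helper)."""
    out = frame.dropna(axis="columns", how="all").copy()
    if "createdAt" in out.columns:
        # ISO 8601 strings ending in "Z" become timezone-aware UTC timestamps
        out["createdAt"] = pd.to_datetime(out["createdAt"], utc=True, errors="coerce")
    return out


# Stand-in rows copied from the preview; in the notebook you would pass df instead
sample = pd.DataFrame({
    "destination": ["Prague", "Amsterdam"],
    "createdAt": ["2024-10-19T14:28:33.840Z", "2024-10-19T18:03:22.641Z"],
    "lastModified": [np.nan, np.nan],
    "ratings": [np.nan, np.nan],
})
tidy = tidy_itineraries(sample)
print(tidy.dtypes)
```

Returning a copy leaves the original `df` untouched, so later cells that reference the empty columns (such as the Malta check) keep working.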
Traveler Type Analysis
In [5]:
if df is not None:
    # Infer traveler type from 'slug'; fall back to 'preferences' when the slug has no match
    df['traveler_type'] = df['slug'].str.extract(r'-(solo|couple|friends|family)-')
    df['traveler_type'] = df['traveler_type'].fillna(df['preferences'].fillna('unknown'))
    plt.figure(figsize=(8, 6))
    sns.countplot(x='traveler_type', data=df, order=df['traveler_type'].value_counts().index)
    plt.title('Distribution of Traveler Types')
    plt.xlabel('Traveler Type')
    plt.ylabel('Number of Itineraries')
    plt.show()
    print(df['traveler_type'].value_counts())
traveler_type
solo        519
family      261
couple      208
friends     134
business     10
Name: count, dtype: int64
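The slugs shown in this notebook appear to follow one layout: place, traveler type, days, budget, language. The sketch below parses a slug into those parts with named groups. The pattern is an assumption inferred from the visible examples (including a two-letter language code), and `parse_slug` is a hypothetical helper. Slugs that do not fit return `None`, which mirrors the rows that fall back to `preferences` above.

```python
import re
from typing import Optional

# Layout assumed from slugs such as "prague-solo-2days-budget-500dollars-en"
SLUG_PATTERN = re.compile(
    r"^(?P<place>.+)-(?P<traveler>solo|couple|friends|family)"
    r"-(?P<days>\d+)days-budget-(?P<budget>\d+)dollars-(?P<lang>[a-z]{2})$"
)


def parse_slug(slug: str) -> Optional[dict]:
    """Split a slug into its parts, or return None if it does not match the assumed layout."""
    match = SLUG_PATTERN.match(slug)
    if match is None:
        return None
    parts = match.groupdict()
    parts["days"] = int(parts["days"])
    parts["budget"] = int(parts["budget"])
    return parts


print(parse_slug("prague-solo-2days-budget-500dollars-en"))
print(parse_slug("gotham-city-family-3days-budget-1500dollars-en"))
```

Parsing the whole slug makes it possible to cross-check the `duration` and `budget` columns against the values embedded in the slug.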
Itinerary Duration Analysis
In [6]:
if df is not None:
    plt.figure(figsize=(8, 6))
    sns.countplot(x='duration', data=df, order=df['duration'].value_counts().index)
    plt.title('Distribution of Itinerary Durations')
    plt.xlabel('Duration (Days)')
    plt.ylabel('Number of Itineraries')
    plt.show()
    print(df['duration'].value_counts())
duration
3    567
2    565
Name: count, dtype: int64
Budget Distribution Analysis
In [7]:
if df is not None:
    # Clean budget column and convert to numeric.
    # regex=False makes '$' a literal character instead of the regex end-of-string anchor.
    df['budget_numeric'] = df['budget'].str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype(float)
    plt.figure(figsize=(8, 6))
    sns.histplot(df['budget_numeric'].dropna(), bins=30, kde=True)
    plt.title('Distribution of Itinerary Budgets')
    plt.xlabel('Budget (USD)')
    plt.ylabel('Frequency')
    plt.show()
    print(df['budget_numeric'].describe())
count    1132.000000
mean      687.455830
std       484.911026
min       100.000000
25%       500.000000
50%       500.000000
75%       500.000000
max      1500.000000
Name: budget_numeric, dtype: float64
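The 25%, 50% and 75% quantiles above are all 500, so at least half of the itineraries share one budget value, and a histogram with a KDE says little about such a column. A table of shares per distinct budget may be more informative. The sketch below uses a hypothetical helper `budget_to_number` and illustrative strings, not the real column.

```python
import pandas as pd


def budget_to_number(budget: pd.Series) -> pd.Series:
    """Strip '$' and thousands separators, then convert to float (hypothetical helper)."""
    cleaned = (
        budget.str.replace("$", "", regex=False)
              .str.replace(",", "", regex=False)
    )
    # Unparseable entries become NaN instead of raising
    return pd.to_numeric(cleaned, errors="coerce")


# Illustrative values only; the real column has 1132 rows
sample = pd.Series(["$500", "$500", "$1,500", "$100", "$500"], name="budget")
numeric = budget_to_number(sample)

# Share of itineraries at each distinct budget value, lowest budget first
tier_share = numeric.value_counts(normalize=True).sort_index()
print(tier_share)
```

In the notebook, `budget_to_number(df['budget'])` would replace the chained `astype(float)` call, and `tier_share` could be drawn as a bar chart.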
Language Preference Analysis
In [8]:
if df is not None:
    plt.figure(figsize=(10, 6))
    sns.countplot(y='language', data=df, order=df['language'].value_counts().head(10).index)
    plt.title('Top 10 Languages for Itineraries')
    plt.xlabel('Number of Itineraries')
    plt.ylabel('Language')
    plt.show()
    print(df['language'].value_counts().head(10))
language
en    855
pt     59
de     57
it     54
ru     36
es     35
fr     29
nl      4
hu      3
Name: count, dtype: int64
Popular Destinations (Top 10)
In [9]:
if df is not None:
    plt.figure(figsize=(10, 6))
    sns.countplot(y='destination', data=df, order=df['destination'].value_counts().head(10).index)
    plt.title('Top 10 Destinations')
    plt.xlabel('Number of Itineraries')
    plt.ylabel('Destination')
    plt.show()
    print(df['destination'].value_counts().head(10))
destination
Prague       17
London       17
Paris        17
Rome         14
Tokyo        14
Madrid       14
Dubai        12
Berlin       11
Lisbon        9
Barcelona     9
Name: count, dtype: int64
Itinerary Engagement: Views vs. Likes
In [10]:
if df is not None:
    engagement_metrics = df[['views', 'likes']].describe().loc[['mean', '50%']]
    print("Summary Statistics for Views and Likes:\n", engagement_metrics)
    engagement_data = df[['views', 'likes']].melt(var_name='Metric', value_name='Count')
    plt.figure(figsize=(8, 6))
    sns.boxplot(x='Metric', y='Count', data=engagement_data)
    # Symlog handles large view counts and still draws zeros (the median of likes is 0),
    # which a plain log axis cannot represent
    plt.yscale('symlog')
    plt.title('Distribution of Views vs Likes (Symlog Scale)')
    plt.ylabel('Count (Symlog Scale)')
    plt.show()
Summary Statistics for Views and Likes:
          views     likes
mean  7.577739  0.166078
50%   3.000000  0.000000
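Dividing the two means above gives roughly 0.022 likes per view across the dataset (0.166078 / 7.577739). A per-itinerary ratio can show whether that rate is uniform. The sketch below uses a hypothetical helper `like_rate`; the sample values are the `views` and `likes` of the five preview rows printed when the file was loaded.

```python
import pandas as pd


def like_rate(views: pd.Series, likes: pd.Series) -> pd.Series:
    """Likes per view, NaN where an itinerary has no views (hypothetical helper)."""
    safe_views = views.where(views > 0)  # zero views become NaN to avoid division by zero
    return likes / safe_views


# views and likes from the five preview rows
sample = pd.DataFrame({"views": [1184, 132, 28, 77, 116], "likes": [13, 0, 1, 1, 1]})
sample["like_rate"] = like_rate(sample["views"], sample["likes"])

# Pooled rate: total likes over total views, less sensitive to low-view rows
overall = sample["likes"].sum() / sample["views"].sum()
print(sample)
print(f"Overall like rate: {overall:.4f}")
```

The pooled rate and the mean of per-row rates can differ considerably when many itineraries have only a handful of views, so it is worth reporting which one is meant.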
Anomalies: Empty 'user' Column
In [11]:
if df is not None:
    user_value_counts = df['user'].value_counts(dropna=False)
    print("Value Counts for 'user' column:\n", user_value_counts.head())
    print(f"Percentage of missing user values: {df['user'].isnull().mean() * 100:.2f}%\n")
    print("Insight: The 'user' column is mostly empty, as shown above.")
Value Counts for 'user' column:
 user
NaN                                   987
ObjectId(6716bb0a8171bf60de08b8bc)     96
ObjectId(67b5c54b45d3a8c7769e2c73)      7
ObjectId(6734216fba86b0ef3981a9cd)      3
ObjectId(674cf242d819cd616dc67fc1)      3
Name: count, dtype: int64
Percentage of missing user values: 87.19%

Insight: The 'user' column is mostly empty, as shown above.
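Beyond the missing values, the counts above show that 96 of the 145 itineraries with a user ID (about 66%) belong to a single ID, so any per-user statistic would mostly describe one account. The sketch below computes both figures. `user_concentration` is a hypothetical helper and the sample IDs are made up for illustration.

```python
import numpy as np
import pandas as pd


def user_concentration(users: pd.Series) -> dict:
    """Summarise missingness and how dominant the most frequent user is (hypothetical helper)."""
    counts = users.value_counts()  # NaN is excluded by default
    attributed = int(counts.sum())
    top_share = counts.iloc[0] / attributed if attributed else float("nan")
    return {
        "missing_share": float(users.isna().mean()),
        "attributed_rows": attributed,
        "top_user_share": float(top_share),
    }


# Illustrative IDs only, not taken from the dataset
sample = pd.Series(["user_a", "user_a", "user_a", "user_b", np.nan, np.nan, np.nan, np.nan])
stats = user_concentration(sample)
print(stats)
```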
Anomalies: Missing Country Language and Currency
In [12]:
if df is not None:
    missing_country_info = df[df['countryLang'].isnull() | df['countryCurrency'].isnull()]
    print("Number of rows with missing Country Language or Currency:", len(missing_country_info))
    print("\nExamples of rows with missing Country Language or Currency:\n", missing_country_info[['destination', 'countryLang', 'countryCurrency']].head())
Number of rows with missing Country Language or Currency: 244

Examples of rows with missing Country Language or Currency:
    destination countryLang countryCurrency
2      Utrecht         NaN             NaN
3       Canada         NaN             NaN
7       Málaga         NaN             NaN
9       Aomori          ja             NaN
15      London         NaN             NaN
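Combining this count with the `info()` output narrows things down: `countryLang` is missing in 172 rows (1132 - 960) and `countryCurrency` in 244 rows (1132 - 888). Since 244 rows lack at least one of the two, every row without a language also lacks a currency, leaving 172 rows missing both and 72 missing only the currency (as with Aomori). A cross-tabulation of the two null masks shows this split directly. The sketch below uses a small stand-in frame based on the example rows, plus the Prague row from the initial preview.

```python
import numpy as np
import pandas as pd

# Stand-in rows taken from the examples above and the initial preview
sample = pd.DataFrame({
    "destination": ["Utrecht", "Canada", "Aomori", "Prague"],
    "countryLang": [np.nan, np.nan, "ja", "cs"],
    "countryCurrency": [np.nan, np.nan, np.nan, "CZK"],
})

# Rows: is the language missing? Columns: is the currency missing?
breakdown = pd.crosstab(
    sample["countryLang"].isna().rename("lang_missing"),
    sample["countryCurrency"].isna().rename("currency_missing"),
)
print(breakdown)
```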
Anomalies: Fictional Destinations
In [13]:
if df is not None:
    fictional_destinations = df[df['destination'].isin(['Westeros', 'Ringworld', 'Gotham City', 'Hogwarts', 'Agrabah', 'Azkaban', 'Purgatory, Between Heaven And Hell', 'Rosetta Planet, Mario Kart 8', 'Under Your Bed'])]
    print("Examples of Fictional Destinations:\n", fictional_destinations[['destination', 'slug']].to_string())
Examples of Fictional Destinations:
                             destination                                                                slug
30                             Westeros                           westeros-solo-3days-budget-1500dollars-en
31                            Ringworld                          ringworld-solo-3days-budget-1500dollars-en
32                          Gotham City                      gotham-city-family-3days-budget-1500dollars-en
33                             Hogwarts                            hogwarts-solo-2days-budget-500dollars-en
34                              Agrabah                             agrabah-solo-2days-budget-500dollars-en
41                              Azkaban                          azkaban-family-2days-budget-1500dollars-en
42                       Under Your Bed                      under-your-bed-solo-2days-budget-100dollars-en
153  Purgatory, Between Heaven And Hell  purgatory-between-heaven-and-hell-solo-3days-budget-1500dollars-en
154        Rosetta Planet, Mario Kart 8        rosetta-planet-mario-kart-8-solo-3days-budget-1500dollars-en
Anomalies: Potentially Inconsistent Malta Temperature (Example)
In [14]:
if df is not None:
    malta_temp_anomaly = df[df['destination'] == 'Malta']
    print("Itinerary for Malta with potentially anomalous temperature:\n", malta_temp_anomaly[['destination', 'coordinates']])
Itinerary for Malta with potentially anomalous temperature:
     destination  coordinates
263       Malta          NaN
880       Malta          NaN
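The cell above prints `coordinates`, which is null in every row, so it cannot confirm or rule out a temperature anomaly. Judging by the preview, temperatures are stored inside the JSON strings of the `weather` column. The sketch below shows one way to pull them out. `extract_temperature` is a hypothetical helper, the first two weather strings are copied from the preview, and the third row is a made-up missing value.

```python
import json

import numpy as np
import pandas as pd


def extract_temperature(weather) -> float:
    """Return the 'temperature' field of a weather JSON string, or NaN if unavailable."""
    if not isinstance(weather, str):
        return float("nan")  # NaN or any other non-string cell
    try:
        payload = json.loads(weather)
    except json.JSONDecodeError:
        return float("nan")
    if not isinstance(payload, dict):
        return float("nan")
    value = payload.get("temperature")
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return float(value)
    return float("nan")


sample = pd.DataFrame({
    "destination": ["Prague", "Amsterdam", "(hypothetical)"],
    "weather": ['{"forecast":[]}', '{"temperature":14.66,"forecast":[]}', np.nan],
})
sample["temperature"] = sample["weather"].apply(extract_temperature)
print(sample[["destination", "temperature"]])
```

In the notebook, `df.loc[df['destination'] == 'Malta', 'weather'].apply(extract_temperature)` would show the values the Malta check was presumably meant to inspect.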
Predictions: Not Applicable
In [15]:
# Predictions Section
print("Predictions: Predictive analysis is not directly applicable with this dataset in its current form.")
print("Reason: To make meaningful predictions (e.g., itinerary popularity, user preferences), more features and a defined prediction target are needed.")
Predictions: Predictive analysis is not directly applicable with this dataset in its current form.
Reason: To make meaningful predictions (e.g., itinerary popularity, user preferences), more features and a defined prediction target are needed.
Summary of Analysis
In [16]:
print("Summary: The CSV data describes user-generated travel itineraries. Solo travelers form the largest group (519 of 1,132 itineraries), every trip lasts 2 or 3 days, the median budget is $500, and the most frequent destinations are mostly European cities plus Tokyo and Dubai. Anomalies include missing user data (87% of rows), missing country language or currency (244 rows), and fictional destinations. The Malta temperature check was inconclusive because the cell printed only the empty coordinates column. Predictive analysis is not feasible with the current dataset without further refinement and feature engineering.")
Summary: The CSV data describes user-generated travel itineraries. Solo travelers form the largest group (519 of 1,132 itineraries), every trip lasts 2 or 3 days, the median budget is $500, and the most frequent destinations are mostly European cities plus Tokyo and Dubai. Anomalies include missing user data (87% of rows), missing country language or currency (244 rows), and fictional destinations. The Malta temperature check was inconclusive because the cell printed only the empty coordinates column. Predictive analysis is not feasible with the current dataset without further refinement and feature engineering.