CSV Data Analysis of Travel Itineraries

This Jupyter Notebook performs exploratory data analysis on a CSV dataset containing information about travel itineraries. The analysis aims to provide insights into travel patterns, popular destinations, budget distributions, and user engagement metrics. We will load the data using pandas, visualize key aspects using matplotlib and seaborn, and highlight potential anomalies and trends within the dataset.

In [1]:

# Import Libraries

In [2]:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import json

%matplotlib inline

In [3]:

# Load CSV Data

In [4]:

# Load the itineraries CSV file into a pandas DataFrame
try:
    df = pd.read_csv('itineraries.csv')
    print("CSV data loaded successfully.")
except FileNotFoundError:
    print("Error: itineraries.csv not found. Please ensure the file is in the correct directory.")
    df = None # Assign None to df in case of error

if df is not None:
    print(df.head())
    print(df.info())

CSV data loaded successfully.
                                  _id  \
0  ObjectId(6713c211cb4b48e644dcbb41)   
1  ObjectId(6713f46a91c1ab1760bef904)   
2  ObjectId(671426f7b46cd8d269813175)   
3  ObjectId(67142c5cb46cd8d2698132bc)   
4  ObjectId(671433aab46cd8d269813a6d)   

                                           slug user destination  duration  \
0        prague-solo-2days-budget-500dollars-en  NaN      Prague         2   
1  amsterdam-friends-3days-budget-500dollars-en  NaN   Amsterdam         3   
2       utrecht-solo-2days-budget-500dollars-en  NaN     Utrecht         2   
3        canada-solo-2days-budget-500dollars-en  NaN      Canada         2   
4     barcelona-solo-2days-budget-500dollars-en  NaN   Barcelona         2   

  countryLang countryCurrency                 createdAt  lastModified budget  \
0          cs             CZK  2024-10-19T14:28:33.840Z           NaN   $500   
1          nl             EUR  2024-10-19T18:03:22.641Z           NaN   $500   
2         NaN             NaN  2024-10-19T21:39:03.026Z           NaN   $500   
3         NaN             NaN  2024-10-19T22:02:04.343Z           NaN   $500   
4          es             EUR  2024-10-19T22:33:14.353Z           NaN   $500   

   ...                                            weather carbonFootprint  \
0  ...                                    {"forecast":[]}             NaN   
1  ...                {"temperature":14.66,"forecast":[]}             NaN   
2  ...  {"conditions":"overcast clouds","temperature":...             NaN   
3  ...  {"conditions":"overcast clouds","temperature":...             NaN   
4  ...  {"conditions":"clear sky","temperature":17.33,...             NaN   

  collaborators  ratings comments  socialShares isPublic  comments_1 views  \
0            []      NaN       []           NaN     True          []  1184   
1            []      NaN       []           NaN     True          []   132   
2            []      NaN       []           NaN     True          []    28   
3            []      NaN       []           NaN     True          []    77   
4            []      NaN       []           NaN     True          []   116   

   likes  
0     13  
1      0  
2      1  
3      1  
4      1  

[5 rows x 24 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1132 entries, 0 to 1131
Data columns (total 24 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   _id              1132 non-null   object 
 1   slug             1132 non-null   object 
 2   user             145 non-null    object 
 3   destination      1132 non-null   object 
 4   duration         1132 non-null   int64  
 5   countryLang      960 non-null    object 
 6   countryCurrency  888 non-null    object 
 7   createdAt        1132 non-null   object 
 8   lastModified     0 non-null      float64
 9   budget           1132 non-null   object 
 10  news             0 non-null      float64
 11  language         1132 non-null   object 
 12  preferences      1132 non-null   object 
 13  coordinates      0 non-null      float64
 14  weather          982 non-null    object 
 15  carbonFootprint  0 non-null      float64
 16  collaborators    1132 non-null   object 
 17  ratings          0 non-null      float64
 18  comments         1132 non-null   object 
 19  socialShares     0 non-null      float64
 20  isPublic         1132 non-null   bool   
 21  comments_1       1132 non-null   object 
 22  views            1132 non-null   int64  
 23  likes            1132 non-null   int64  
dtypes: bool(1), float64(6), int64(3), object(14)
memory usage: 204.6+ KB
None
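
The `info()` output above lists six columns with 0 non-null values (`lastModified`, `news`, `coordinates`, `carbonFootprint`, `ratings`, `socialShares`). The following is a minimal sketch of how such columns could be found and dropped programmatically. It runs on a small stand-in frame rather than the real file, and the names `sample`, `empty_cols`, and `trimmed` are illustrative, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Stand-in frame (illustrative): two populated columns and two all-null ones.
# The destination and view values are the first three rows of df.head().
sample = pd.DataFrame({
    "destination": ["Prague", "Amsterdam", "Utrecht"],
    "views": [1184, 132, 28],
    "lastModified": [np.nan, np.nan, np.nan],
    "ratings": [np.nan, np.nan, np.nan],
})

# Collect the columns in which every value is missing
empty_cols = [col for col in sample.columns if sample[col].isna().all()]
print(empty_cols)

# Drop them; sample.dropna(axis=1, how="all") would be an equivalent one-liner
trimmed = sample.drop(columns=empty_cols)
print(trimmed.columns.tolist())
```

Applied to the real `df`, the same comprehension should return the six columns named above, assuming the file matches the `info()` listing.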

Traveler Type Analysis

In [5]:

if df is not None:
    # Infer traveler type from 'slug' and 'preferences'
    df['traveler_type'] = df['slug'].str.extract(r'-(solo|couple|friends|family)-')
    df['traveler_type'] = df['traveler_type'].fillna(df['preferences'].fillna('unknown'))

    plt.figure(figsize=(8, 6))
    sns.countplot(x='traveler_type', data=df, order=df['traveler_type'].value_counts().index)
    plt.title('Distribution of Traveler Types')
    plt.xlabel('Traveler Type')
    plt.ylabel('Number of Itineraries')
    plt.show()
    print(df['traveler_type'].value_counts())

traveler_type
solo        519
family      261
couple      208
friends     134
business     10
Name: count, dtype: int64
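
The cell above extracts only the traveler token from the slug. Assuming every slug follows the layout seen in the sample rows (`<place>-<traveler>-<N>days-budget-<amount>dollars-<lang>`), the whole slug could be parsed in one pass. This is a sketch under that assumption; `SLUG_RE` and `parse_slug` are hypothetical helpers, not part of the notebook.

```python
import re

# Assumed slug layout, inferred from the sample rows shown earlier:
#   <place>-<traveler>-<N>days-budget-<amount>dollars-<lang>
SLUG_RE = re.compile(
    r"^(?P<place>.+)-(?P<traveler>[a-z]+)-(?P<days>\d+)days"
    r"-budget-(?P<budget>\d+)dollars-(?P<lang>[a-z]+)$"
)

def parse_slug(slug):
    """Return the slug's fields as a dict, or None if the layout does not match."""
    match = SLUG_RE.match(slug)
    if match is None:
        return None
    fields = match.groupdict()
    fields["days"] = int(fields["days"])      # e.g. "3" -> 3
    fields["budget"] = int(fields["budget"])  # e.g. "500" -> 500
    return fields

print(parse_slug("amsterdam-friends-3days-budget-500dollars-en"))
print(parse_slug("gotham-city-family-3days-budget-1500dollars-en"))
print(parse_slug("not-a-valid-slug"))
```

In the notebook this could be applied with `df['slug'].map(parse_slug)`. Because the traveler token is matched generically, a value outside the four hard-coded in the cell above would be captured from the slug itself instead of falling through to `preferences`.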

Itinerary Duration Analysis

In [6]:

if df is not None:
    plt.figure(figsize=(8, 6))
    sns.countplot(x='duration', data=df, order=df['duration'].value_counts().index)
    plt.title('Distribution of Itinerary Durations')
    plt.xlabel('Duration (Days)')
    plt.ylabel('Number of Itineraries')
    plt.show()
    print(df['duration'].value_counts())

duration
3    567
2    565
Name: count, dtype: int64

Budget Distribution Analysis

In [7]:

if df is not None:
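    # Clean budget column and convert to numeric.
    # regex=False makes '$' a literal character; as a regex it is the end-of-string anchor.
    df['budget_numeric'] = df['budget'].str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype(float)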

    plt.figure(figsize=(8, 6))
    sns.histplot(df['budget_numeric'].dropna(), bins=30, kde=True)
    plt.title('Distribution of Itinerary Budgets')
    plt.xlabel('Budget (USD)')
    plt.ylabel('Frequency')
    plt.show()
    print(df['budget_numeric'].describe())

count    1132.000000
mean      687.455830
std       484.911026
min       100.000000
25%       500.000000
50%       500.000000
75%       500.000000
max      1500.000000
Name: budget_numeric, dtype: float64
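
The conversion above raises an error if any budget string cannot be cast to a float. A more forgiving variant, sketched here under the assumption that unparseable budgets should become `NaN` rather than stop the notebook, uses `pd.to_numeric` with `errors="coerce"`. The helper name `budget_to_float` and the inputs other than `"$500"` are illustrative.

```python
import pandas as pd

def budget_to_float(series):
    """Strip '$' and ',' as literal characters, then convert to float (NaN if unparseable)."""
    cleaned = (
        series.astype(str)
        .str.replace("$", "", regex=False)
        .str.replace(",", "", regex=False)
        .str.strip()
    )
    return pd.to_numeric(cleaned, errors="coerce").astype(float)

# Illustrative inputs; only the "$500" form is confirmed by the output above.
raw = pd.Series(["$500", "$1,500", "100", "n/a"])
parsed = budget_to_float(raw)
print(parsed.tolist())
```

Counting `parsed.isna().sum()` afterwards shows how many rows failed to parse, which is information a hard cast would hide behind an exception.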

Language Preference Analysis

In [8]:

if df is not None:
    plt.figure(figsize=(10, 6))
    sns.countplot(y='language', data=df, order=df['language'].value_counts().head(10).index)
    plt.title('Top 10 Languages for Itineraries')
    plt.xlabel('Number of Itineraries')
    plt.ylabel('Language')
    plt.show()
    print(df['language'].value_counts().head(10))

language
en    855
pt     59
de     57
it     54
ru     36
es     35
fr     29
nl      4
hu      3
Name: count, dtype: int64

Popular Destinations (Top 10)

In [9]:

if df is not None:
    plt.figure(figsize=(10, 6))
    sns.countplot(y='destination', data=df, order=df['destination'].value_counts().head(10).index)
    plt.title('Top 10 Destinations')
    plt.xlabel('Number of Itineraries')
    plt.ylabel('Destination')
    plt.show()
    print(df['destination'].value_counts().head(10))

destination
Prague       17
London       17
Paris        17
Rome         14
Tokyo        14
Madrid       14
Dubai        12
Berlin       11
Lisbon        9
Barcelona     9
Name: count, dtype: int64

Itinerary Engagement: Views vs. Likes

In [10]:

if df is not None:
    engagement_metrics = df[['views', 'likes']].describe().loc[['mean', '50%']]
    print("Summary Statistics for Views and Likes:\n", engagement_metrics)

    engagement_data = df[['views', 'likes']].melt(var_name='Metric', value_name='Count')
    plt.figure(figsize=(8, 6))
    sns.boxplot(x='Metric', y='Count', data=engagement_data)
    plt.yscale('symlog') # Symmetric log scale: compresses large view counts while keeping zero counts (common for likes) drawable
    plt.title('Distribution of Views vs Likes (Symlog Scale)')
    plt.ylabel('Count (Symlog Scale)')
    plt.show()

Summary Statistics for Views and Likes:
          views     likes
mean  7.577739  0.166078
50%   3.000000  0.000000
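
The summary statistics show far fewer likes than views (mean 0.17 versus 7.58). One way to compare the two per itinerary is a like rate, likes divided by views. The sketch below assumes that itineraries with zero views should yield `NaN` rather than a division error. The frame `toy` and the column `like_rate` are illustrative, and the final all-zero row is hypothetical.

```python
import numpy as np
import pandas as pd

# First five view/like pairs are taken from df.head(); the last row is hypothetical.
toy = pd.DataFrame({
    "views": [1184, 132, 28, 77, 116, 0],
    "likes": [13, 0, 1, 1, 1, 0],
})

# Likes per view; zero-view rows become NaN instead of dividing by zero
toy["like_rate"] = toy["likes"] / toy["views"].replace(0, np.nan)
print(toy)
```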

Anomalies: Empty 'user' Column

In [11]:

if df is not None:
    user_value_counts = df['user'].value_counts(dropna=False)
    print("Value Counts for 'user' column:\n", user_value_counts.head())
    print(f"Percentage of missing user values: {df['user'].isnull().mean() * 100:.2f}%\n")
    print("Insight: The 'user' column is mostly empty, as shown above.")

Value Counts for 'user' column:
 user
NaN                                   987
ObjectId(6716bb0a8171bf60de08b8bc)     96
ObjectId(67b5c54b45d3a8c7769e2c73)      7
ObjectId(6734216fba86b0ef3981a9cd)      3
ObjectId(674cf242d819cd616dc67fc1)      3
Name: count, dtype: int64
Percentage of missing user values: 87.19%

Insight: The 'user' column is mostly empty, as shown above.

Anomalies: Missing Country Language and Currency

In [12]:

if df is not None:
    missing_country_info = df[df['countryLang'].isnull() | df['countryCurrency'].isnull()]
    print("Number of rows with missing Country Language or Currency:", len(missing_country_info))
    print("\nExamples of rows with missing Country Language or Currency:\n", missing_country_info[['destination', 'countryLang', 'countryCurrency']].head())

Number of rows with missing Country Language or Currency: 244

Examples of rows with missing Country Language or Currency:
    destination countryLang countryCurrency
2      Utrecht         NaN             NaN
3       Canada         NaN             NaN
7       Málaga         NaN             NaN
9       Aomori          ja             NaN
15      London         NaN             NaN
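
The count above merges two cases: rows missing both fields and rows missing only one (such as Aomori, which has a language but no currency). A cross-tabulation of the two missing-value flags separates them. This is a sketch on rows copied from the outputs above; `toy` and `breakdown` are illustrative names.

```python
import numpy as np
import pandas as pd

# Rows copied from df.head() and from the examples printed above
toy = pd.DataFrame({
    "destination": ["Prague", "Amsterdam", "Utrecht", "Canada", "Aomori"],
    "countryLang": ["cs", "nl", np.nan, np.nan, "ja"],
    "countryCurrency": ["CZK", "EUR", np.nan, np.nan, np.nan],
})

# Rows: is the language missing?  Columns: is the currency missing?
breakdown = pd.crosstab(
    toy["countryLang"].isna(),
    toy["countryCurrency"].isna(),
    rownames=["lang_missing"],
    colnames=["currency_missing"],
)
print(breakdown)
```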

Anomalies: Fictional Destinations

In [13]:

if df is not None:
    fictional_names = ['Westeros', 'Ringworld', 'Gotham City', 'Hogwarts', 'Agrabah', 'Azkaban',
                       'Purgatory, Between Heaven And Hell', 'Rosetta Planet, Mario Kart 8', 'Under Your Bed']
    fictional_destinations = df[df['destination'].isin(fictional_names)]
    print("Examples of Fictional Destinations:\n", fictional_destinations[['destination', 'slug']].to_string())

Examples of Fictional Destinations:
                             destination                                                                slug
30                             Westeros                           westeros-solo-3days-budget-1500dollars-en
31                            Ringworld                          ringworld-solo-3days-budget-1500dollars-en
32                          Gotham City                      gotham-city-family-3days-budget-1500dollars-en
33                             Hogwarts                            hogwarts-solo-2days-budget-500dollars-en
34                              Agrabah                             agrabah-solo-2days-budget-500dollars-en
41                              Azkaban                          azkaban-family-2days-budget-1500dollars-en
42                       Under Your Bed                      under-your-bed-solo-2days-budget-100dollars-en
153  Purgatory, Between Heaven And Hell  purgatory-between-heaven-and-hell-solo-3days-budget-1500dollars-en
154        Rosetta Planet, Mario Kart 8        rosetta-planet-mario-kart-8-solo-3days-budget-1500dollars-en

Anomalies: Potentially Inconsistent Malta Temperature (Example)

In [14]:

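if df is not None:
    # Note: 'coordinates' is entirely null in this dataset (0 non-null in df.info()),
    # so this cell shows no temperature; temperature readings sit in the 'weather' JSON column.
    malta_temp_anomaly = df[df['destination'] == 'Malta']
    print("Itinerary for Malta with potentially anomalous temperature:\n", malta_temp_anomaly[['destination', 'coordinates']])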

Itinerary for Malta with potentially anomalous temperature:
     destination  coordinates
263       Malta          NaN
880       Malta          NaN
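
The output above shows only `coordinates`, which is null for every row, so it cannot confirm or refute a temperature anomaly. Judging by the `weather` samples in `df.head()`, the temperature appears to be stored in that column's JSON. The sketch below shows one way it could be extracted, assuming a top-level `temperature` key that may be absent. `extract_temperature` is a hypothetical helper, and the unit of the value is not stated in the data.

```python
import json
import math

def extract_temperature(weather_json):
    """Return the 'temperature' field of a weather JSON string, or NaN if unavailable."""
    if not isinstance(weather_json, str):
        return math.nan  # missing cell (NaN) rather than a JSON string
    try:
        payload = json.loads(weather_json)
    except json.JSONDecodeError:
        return math.nan  # malformed JSON
    if not isinstance(payload, dict):
        return math.nan
    value = payload.get("temperature")
    return float(value) if isinstance(value, (int, float)) else math.nan

# Both strings are weather values shown in df.head()
print(extract_temperature('{"temperature":14.66,"forecast":[]}'))
print(extract_temperature('{"forecast":[]}'))
```

In the notebook, `df['weather'].map(extract_temperature)` would yield a numeric column that can then be filtered to the Malta rows and compared against other destinations.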

Predictions: Not Applicable

In [15]:

# Predictions Section
print("Predictions: Predictive analysis is not directly applicable with this dataset in its current form.")
print("Reason: To make meaningful predictions (e.g., itinerary popularity, user preferences), more features and a defined prediction target are needed.")

Predictions: Predictive analysis is not directly applicable with this dataset in its current form.
Reason: To make meaningful predictions (e.g., itinerary popularity, user preferences), more features and a defined prediction target are needed.

Summary of Analysis

In [16]:

print("Summary: The CSV data provides insights into user-generated travel itineraries, predominantly for solo travelers on budget-conscious short trips, with Europe and Asia being popular destinations. Anomalies include missing user data, incomplete location details, fictional destinations, and potentially inconsistent temperature data. Predictive analysis is not feasible with the current dataset without further refinement and feature engineering.")

Summary: The CSV data provides insights into user-generated travel itineraries, predominantly for solo travelers on budget-conscious short trips, with Europe and Asia being popular destinations. Anomalies include missing user data, incomplete location details, fictional destinations, and potentially inconsistent temperature data. Predictive analysis is not feasible with the current dataset without further refinement and feature engineering.