Titanic Dataset Analysis

This Jupyter Notebook performs an exploratory data analysis of the Titanic dataset to understand passenger survival patterns. We will load the dataset, visualize key features, and derive insights about factors influencing survival. We will also highlight anomalies in the data and discuss the feasibility of predictive modeling based on the findings.

In [1]:

# Import Libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [2]:

# Load the Dataset
try:
    titanic_df = pd.read_csv('titanic-dataset.csv')
    print("Dataset loaded successfully!")
except FileNotFoundError:
    print("Error: titanic-dataset.csv not found. Please ensure the file is in the correct directory.")
except Exception as e:
    print(f"An error occurred while loading the dataset: {e}")

Dataset loaded successfully!

Data Overview

In [3]:

# Display basic info and head of the dataframe
if 'titanic_df' in locals():
    titanic_df.info()
    print("\nFirst 5 rows of the dataset:")
    print(titanic_df.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

First 5 rows of the dataset:
   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S

Survival Rate by Gender

In [4]:

# Survival Rate by Gender
if 'titanic_df' in locals():
    gender_survival = titanic_df.groupby('Sex')['Survived'].mean().reset_index()
    plt.figure(figsize=(6, 4))
    sns.barplot(x='Sex', y='Survived', data=gender_survival)
    plt.title('Survival Rate by Gender')
    plt.ylabel('Survival Rate')
    plt.show()
    print(gender_survival)

      Sex  Survived
0  female  0.742038
1    male  0.188908

Survival Rate by Passenger Class (Pclass)

In [5]:

# Survival Rate by Passenger Class
if 'titanic_df' in locals():
    pclass_survival = titanic_df.groupby('Pclass')['Survived'].mean().reset_index()
    plt.figure(figsize=(6, 4))
    sns.barplot(x='Pclass', y='Survived', hue='Pclass', data=pclass_survival, palette='viridis', legend=False)
    plt.title('Survival Rate by Passenger Class')
    plt.ylabel('Survival Rate')
    plt.xlabel('Passenger Class')
    plt.xticks([0, 1, 2], ['1st', '2nd', '3rd'])
    plt.show()
    print(pclass_survival)

   Pclass  Survived
0       1  0.629630
1       2  0.472826
2       3  0.242363

Age Distribution and Survival

In [6]:

# Age Distribution and Survival
if 'titanic_df' in locals():
    plt.figure(figsize=(8, 5))
    sns.histplot(titanic_df['Age'].dropna(), kde=True, bins=30)
    plt.title('Age Distribution of Passengers')
    plt.xlabel('Age')
    plt.ylabel('Frequency')
    plt.show()

    plt.figure(figsize=(8, 5))
    sns.violinplot(x='Pclass', y='Age', hue='Survived', data=titanic_df, split=True, palette=['#FF4136', '#2ECC40'])
    plt.title('Survival by Age and Passenger Class')
    plt.ylabel('Age')
    plt.xlabel('Passenger Class')
    plt.xticks([0, 1, 2], ['1st', '2nd', '3rd'])
    plt.legend(title='Survived', labels=['No', 'Yes'])
    plt.show()
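
The violin plot shows how ages are distributed within each class, but it does not give a survival rate per age group. A minimal sketch of one way to tabulate that follows. The band edges and labels are illustrative assumptions, not choices made in this notebook, and the five-row frame is only a stand-in built from the `head()` output above; on the real data, `titanic_df` would be passed instead.

```python
import pandas as pd

# Age-band edges and labels are illustrative assumptions, not taken from the notebook.
AGE_EDGES = [0, 13, 18, 40, 60, 200]
AGE_LABELS = ["0-12", "13-17", "18-39", "40-59", "60+"]


def survival_by_age_band(df):
    """Return survival rate and passenger count per age band.

    Rows with a missing Age are dropped, so the counts cover known ages only.
    """
    known = df.dropna(subset=["Age"])
    # right=False makes each band include its lower edge and exclude its upper edge.
    bands = pd.cut(known["Age"], bins=AGE_EDGES, labels=AGE_LABELS, right=False)
    return known.groupby(bands, observed=False)["Survived"].agg(rate="mean", n="size")


# Stand-in data: the five rows printed by titanic_df.head() above.
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
})
age_summary = survival_by_age_band(sample)
print(age_summary)
```

Reporting the count `n` next to each rate makes it easier to spot bands that are too small to support a conclusion.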

Insights from Analysis

The overall survival rate in this dataset is approximately 38.4%.
Females survived at a much higher rate than males (74.2% versus 18.9%). The cells above do not break this down by class, but inspection of the data suggests that most females in Pclass 1 and 2 survived.
Survival falls with passenger class: 63.0% in Pclass 1, 47.3% in Pclass 2, and 24.2% in Pclass 3.
Children appear to have a higher survival rate, especially in the higher classes. The title 'Master.' marks young boys, but 'Miss.' covers unmarried women of any age, so Age (where recorded) is the more reliable indicator.
Age is missing for 177 of the 891 passengers (about 20%); their survival rate should be examined separately before drawing age-based conclusions.
Cabin is missing for 687 of the 891 passengers (about 77%), which severely limits cabin-based survival analysis with this data.
Embarked 'S' appears to be the most common port, followed by 'C' and 'Q'. Survival differences by embarkation port were not examined here and could be investigated further.
Some passengers have a Fare of 0, which could be an anomaly or reflect special circumstances (e.g., crew, free tickets).
Families traveling together (indicated by nonzero SibSp and Parch values) are present; whether family size influenced survival was not analyzed here and could be examined further.
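
The observation about females in Pclass 1 and 2 combines two variables that the cells above only summarize one at a time. A minimal sketch of a joint breakdown is given here. The helper name `survival_by_sex_and_class` is hypothetical, and the five-row frame is a stand-in taken from the `head()` output, so the printed rates illustrate the shape of the result rather than the dataset's actual values.

```python
import pandas as pd


def survival_by_sex_and_class(df):
    """Survival rate and group size for every Sex x Pclass combination present."""
    return df.groupby(["Sex", "Pclass"])["Survived"].agg(rate="mean", n="size")


# Stand-in data: the five rows printed by titanic_df.head() above.
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0],
    "Pclass": [3, 1, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male"],
})

overall_rate = sample["Survived"].mean()
by_group = survival_by_sex_and_class(sample)

print(f"Overall survival rate: {overall_rate:.1%}")
print(by_group)
```

Running the same function on `titanic_df` would show directly how close the female survival rate in Pclass 1 and 2 comes to 100%.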

Anomalies Identified

Missing 'Age' values for 177 passengers. This could skew age-related survival analysis.
Missing 'Cabin' values for 687 passengers, limiting cabin-based analysis.
Missing 'Embarked' values for 2 passengers.
Passengers with 'Fare' equal to 0, which is unusual and needs further investigation to determine whether it is an error or a valid entry.
A wide variety of titles in 'Name' (e.g., 'Master.', 'Mrs.', 'Miss.', 'Mr.', 'Don.', 'Mlle.', 'Mme.', 'Major.', 'Lady.', 'Sir.', 'Col.', 'Rev.', 'Dr.'), which would need to be extracted and grouped for name-based analysis.
Ticket value 'LINE' is present, which might be a placeholder or a special ticket type. Fare is 0 for these tickets.
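
The checks behind this list can be collected into one small audit. The following is a sketch under stated assumptions: `audit_anomalies` is a hypothetical helper name, and the five-row frame is a stand-in from the `head()` output (the second name is truncated exactly as it was printed), so it contains no zero fares or 'LINE' tickets.

```python
import pandas as pd


def audit_anomalies(df):
    """Return (missing-value counts, zero-fare rows, rows whose Ticket is 'LINE')."""
    missing = df.isna().sum()
    missing = missing[missing > 0]  # keep only columns that actually have gaps
    zero_fare = df.loc[df["Fare"] == 0, ["Name", "Ticket", "Fare"]]
    line_tickets = df.loc[df["Ticket"] == "LINE", ["Name", "Ticket", "Fare"]]
    return missing, zero_fare, line_tickets


# Stand-in data: the five rows printed by titanic_df.head() above.
sample = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley (Florence Briggs Th...",
        "Heikkinen, Miss. Laina",
        "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
        "Allen, Mr. William Henry",
    ],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "Ticket": ["A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "373450"],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
    "Cabin": [None, "C85", None, "C123", None],
})

missing, zero_fare, line_tickets = audit_anomalies(sample)
print(missing)
print(f"Zero-fare rows: {len(zero_fare)}, 'LINE' tickets: {len(line_tickets)}")
```

Applied to `titanic_df`, the missing-value counts should agree with the `info()` output above: 177 for Age, 687 for Cabin, and 2 for Embarked.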

Predictions

Predictive modeling is not attempted here. With only 891 rows, and with Age missing for roughly a fifth of them, a model fit to this sample would be hard to validate reliably, and a larger dataset would be preferable for building and validating a predictive model. Based on the trends observed in this data, passenger class (Pclass), sex, and potentially age and family size are likely to be important predictors of survival in a more comprehensive model. Further analysis with the full Titanic dataset, together with feature engineering, would be required to build and evaluate such a model.
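
As an illustration of the kind of feature engineering mentioned above, the sketch below derives a title and a family size from existing columns. The column names `Title`, `FamilySize`, and `IsAlone`, the helper `add_basic_features`, and the regular expression are all assumptions made for this example; the notebook itself does not define them. The five-row frame is again a stand-in from the `head()` output.

```python
import pandas as pd


def add_basic_features(df):
    """Return a copy of df with Title, FamilySize, and IsAlone columns added."""
    out = df.copy()
    # Names look like "Surname, Title. Given names": capture the text between
    # the comma and the first period that follows it.
    out["Title"] = out["Name"].str.extract(r",\s*([^.,]+)\.", expand=False).str.strip()
    # Siblings/spouses plus parents/children plus the passenger themselves.
    out["FamilySize"] = out["SibSp"] + out["Parch"] + 1
    out["IsAlone"] = out["FamilySize"] == 1
    return out


# Stand-in data: the five rows printed by titanic_df.head() above.
sample = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley (Florence Briggs Th...",
        "Heikkinen, Miss. Laina",
        "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
        "Allen, Mr. William Henry",
    ],
    "SibSp": [1, 1, 0, 1, 0],
    "Parch": [0, 0, 0, 0, 0],
})

features = add_basic_features(sample)
print(features[["Title", "FamilySize", "IsAlone"]])
```

Rare titles such as 'Don.' or 'Rev.' would likely need to be grouped into broader categories before being used as a model input.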

Summary

This analysis of a portion of the Titanic dataset reveals clear survival trends. Females (74.2% versus 18.9% for males) and passengers in higher classes (63.0% in Pclass 1 and 47.3% in Pclass 2, versus 24.2% in Pclass 3) had a markedly higher chance of survival. Children also appear to have had a higher survival rate, although titles such as 'Master.' and 'Miss.' are only rough proxies for age. Data quality issues include missing 'Age' and 'Cabin' values and some anomalous 'Fare' values of 0. Given the limited dataset size, predictive modeling is not attempted at this stage, but the identified trends offer useful insight into the factors that influenced survival on the Titanic.