Titanic Dataset Analysis
This Jupyter Notebook performs an exploratory data analysis of the Titanic dataset to understand passenger survival patterns. We will load the dataset, visualize key features, and derive insights about factors influencing survival. We will also highlight anomalies in the data and discuss the feasibility of predictive modeling based on the findings.
# Import Libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# Load the Dataset
try:
    titanic_df = pd.read_csv('titanic-dataset.csv')
    print("Dataset loaded successfully!")
except FileNotFoundError:
    print("Error: titanic-dataset.csv not found. Please ensure the file is in the correct directory.")
except Exception as e:
    print(f"An error occurred while loading the dataset: {e}")
Dataset loaded successfully!
Data Overview
# Display basic info and head of the dataframe
if 'titanic_df' in locals():
    titanic_df.info()
    print("\nFirst 5 rows of the dataset:")
    print(titanic_df.head())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  891 non-null    int64
 1   Survived     891 non-null    int64
 2   Pclass       891 non-null    int64
 3   Name         891 non-null    object
 4   Sex          891 non-null    object
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64
 7   Parch        891 non-null    int64
 8   Ticket       891 non-null    object
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object
 11  Embarked     889 non-null    object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

First 5 rows of the dataset:
   PassengerId  Survived  Pclass  \
0            1         0       3
1            2         1       1
2            3         1       3
3            4         1       1
4            5         0       3

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1
2                             Heikkinen, Miss. Laina  female  26.0      0
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1
4                           Allen, Mr. William Henry    male  35.0      0

   Parch            Ticket     Fare Cabin Embarked
0      0         A/5 21171   7.2500   NaN        S
1      0          PC 17599  71.2833   C85        C
2      0  STON/O2. 3101282   7.9250   NaN        S
3      0            113803  53.1000  C123        S
4      0            373450   8.0500   NaN        S
Survival Rate by Gender
# Survival Rate by Gender
if 'titanic_df' in locals():
    gender_survival = titanic_df.groupby('Sex')['Survived'].mean().reset_index()
    plt.figure(figsize=(6, 4))
    sns.barplot(x='Sex', y='Survived', data=gender_survival)
    plt.title('Survival Rate by Gender')
    plt.ylabel('Survival Rate')
    plt.show()
    print(gender_survival)
      Sex  Survived
0  female  0.742038
1    male  0.188908
Survival Rate by Passenger Class (Pclass)
# Survival Rate by Passenger Class
if 'titanic_df' in locals():
    pclass_survival = titanic_df.groupby('Pclass')['Survived'].mean().reset_index()
    plt.figure(figsize=(6, 4))
    # Passing `palette` without `hue` is deprecated in seaborn; assign x to hue and hide the redundant legend
    sns.barplot(x='Pclass', y='Survived', hue='Pclass', data=pclass_survival, palette='viridis', legend=False)
    plt.title('Survival Rate by Passenger Class')
    plt.ylabel('Survival Rate')
    plt.xlabel('Passenger Class')
    plt.xticks([0, 1, 2], ['1st', '2nd', '3rd'])
    plt.show()
    print(pclass_survival)
   Pclass  Survived
0       1  0.629630
1       2  0.472826
2       3  0.242363
Age Distribution and Survival
# Age Distribution and Survival
if 'titanic_df' in locals():
    plt.figure(figsize=(8, 5))
    sns.histplot(titanic_df['Age'].dropna(), kde=True, bins=30)
    plt.title('Age Distribution of Passengers')
    plt.xlabel('Age')
    plt.ylabel('Frequency')
    plt.show()
    plt.figure(figsize=(8, 5))
    sns.violinplot(x='Pclass', y='Age', hue='Survived', data=titanic_df, split=True, palette=['#FF4136', '#2ECC40'])
    plt.title('Survival by Age and Passenger Class')
    plt.ylabel('Age')
    plt.xlabel('Passenger Class')
    plt.xticks([0, 1, 2], ['1st', '2nd', '3rd'])
    plt.legend(title='Survived', labels=['No', 'Yes'])
    plt.show()
Insights from Analysis
- Overall survival rate in this dataset is approximately 38.4%.
- Females have a much higher survival rate than males (about 74.2% versus 18.9%). The rows inspected suggest that most females in Pclass 1 and 2 survived, but this notebook does not compute that breakdown.
- Passengers in Pclass 1 have the highest survival rate (about 63.0%), followed by Pclass 2 (47.3%) and Pclass 3 (24.2%).
- Children appear to have a higher survival rate, especially those in higher Pclasses, although this notebook does not compute it directly. The 'Master.' title marks boys, whereas 'Miss.' covers unmarried females of any age, so titles are only a rough proxy for age where 'Age' is missing.
- 'Age' is missing for 177 of the 891 passengers (714 non-null), and the survival rate of those passengers may need to be considered separately in a larger analysis.
- 'Cabin' is missing for 687 of the 891 passengers (only 204 non-null), making cabin-based survival analysis limited with this data.
- Embarked 'S' appears to be the most common port, followed by 'C' and 'Q'. Survival rate differences based on embarkation port could be further investigated with a larger dataset.
- Some passengers have a Fare of '0', which could be an anomaly or represent special circumstances (e.g., crew, free tickets).
- Families traveling together (indicated by SibSp and Parch values) are present, and their survival patterns could be analyzed further in a larger dataset to see if family size influenced survival.
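Several of the points above can be checked with a grouped survival table and a title extracted from 'Name'. The following is a minimal sketch, not part of the original notebook: the helper names `survival_table` and `add_title` are hypothetical, and the `sample` frame holds only the five rows printed in the Data Overview, so its rates are illustrative rather than representative.

```python
import pandas as pd

# Illustrative frame built from the five rows printed in the Data Overview.
# Replace `sample` with titanic_df to run the same checks on all 891 rows.
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0],
    "Pclass": [3, 1, 3, 1, 3],
    "Name": [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley (Florence Briggs Th...",
        "Heikkinen, Miss. Laina",
        "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
        "Allen, Mr. William Henry",
    ],
    "Sex": ["male", "female", "female", "female", "male"],
})


def survival_table(df, by):
    """Hypothetical helper: survival rate and group size per group in `by`."""
    return df.groupby(by)["Survived"].agg(rate="mean", n="size").reset_index()


def add_title(df):
    """Hypothetical helper: copy df and add the title found between ', ' and '.'."""
    out = df.copy()
    out["Title"] = out["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
    return out


overall_rate = sample["Survived"].mean()
by_sex_class = survival_table(sample, ["Sex", "Pclass"])
titled = add_title(sample)
by_title = survival_table(titled, "Title")

print(f"Overall survival rate: {overall_rate:.1%}")
print(by_sex_class)
print(by_title)
```

Reporting the group size `n` next to each rate makes it easier to see which cells are too small to support a conclusion, which matters once Sex, Pclass, and Title are crossed.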
Anomalies Identified
- Missing 'Age' values for 177 of 891 passengers (about 20%). This could skew age-related survival analysis.
- Missing 'Cabin' values for 687 of 891 passengers (about 77%), limiting cabin-based analysis.
- Passengers with 'Fare' equal to 0, which is unusual and requires further investigation to determine if it's an error or a valid entry.
- Varied titles embedded in 'Name' (e.g., 'Master.', 'Mrs.', 'Miss.', 'Mr.', 'Don.', 'Mlle.', 'Mme.', 'Major.', 'Lady.', 'Sir.', 'Col.', 'Rev.', 'Dr.'), which might need grouping or standardization for advanced name-based analysis.
- Ticket value 'LINE' is present, which might be a placeholder or represent special ticket types. Fares are '0' for these tickets.
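The anomalies listed above can be surfaced with a short audit. This is a minimal sketch under stated assumptions: `missing_summary` and `flag_anomalies` are hypothetical helper names, and the last row of `sample` is synthetic, added only so that the zero-fare and 'LINE' checks have something to find.

```python
import pandas as pd

# Five rows copied from the Data Overview output plus one SYNTHETIC row
# (the last entry), made up to exercise the zero-fare and 'LINE' checks.
sample = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, None],
    "Ticket": ["A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "373450", "LINE"],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05, 0.0],
    "Cabin": [None, "C85", None, "C123", None, None],
    "Embarked": ["S", "C", "S", "S", "S", "S"],
})


def missing_summary(df):
    """Hypothetical helper: missing count and percentage per column, worst first."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "pct": (counts / len(df) * 100).round(1),
    })
    return summary.sort_values("missing", ascending=False)


def flag_anomalies(df):
    """Hypothetical helper: boolean flags for zero fares and 'LINE' tickets."""
    return pd.DataFrame({
        "zero_fare": df["Fare"].eq(0),
        "line_ticket": df["Ticket"].str.strip().str.upper().eq("LINE"),
    }, index=df.index)


missing = missing_summary(sample)
flags = flag_anomalies(sample)

print(missing)
print(sample[flags.any(axis=1)])  # rows that trip at least one flag
```

Flagging rows rather than dropping them keeps the decision about whether a zero fare is an error or a valid entry separate from the detection step.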
Predictions
Predictive analysis is not statistically robust with this limited sample of 891 rows from the Titanic dataset. A larger dataset is necessary for building and validating a reliable predictive model. However, based on the observed trends in this data, we can infer that factors like passenger class (Pclass), sex, and potentially age and family size are likely to be important predictors of survival probability in a more comprehensive model. Further analysis with the full Titanic dataset and feature engineering would be required to create and evaluate a predictive model.
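As an illustration of the feature engineering mentioned above, the sketch below turns the candidate predictors into a numeric table. It is one possible encoding, not a method from this notebook: `prepare_features` and the derived column names (`IsFemale`, `AgeMissing`, `FamilySize`) are hypothetical, median imputation is an assumed choice, and the last row of `sample` is synthetic so that the imputation path is exercised.

```python
import pandas as pd


def prepare_features(df):
    """Hypothetical helper: numeric features for class, sex, age, and family size."""
    features = pd.DataFrame(index=df.index)
    features["Pclass"] = df["Pclass"]
    features["IsFemale"] = df["Sex"].eq("female").astype(int)
    # Keep a missingness indicator so imputed ages stay distinguishable.
    features["AgeMissing"] = df["Age"].isna().astype(int)
    # Assumed choice: fill missing ages with the median of the known ages.
    features["Age"] = df["Age"].fillna(df["Age"].median())
    # Passenger plus siblings/spouses and parents/children aboard.
    features["FamilySize"] = df["SibSp"] + df["Parch"] + 1
    return features


# Five rows copied from the Data Overview output plus one SYNTHETIC row
# (the last entry) with a missing age and two relatives aboard.
sample = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3, 2],
    "Sex": ["male", "female", "female", "female", "male", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, None],
    "SibSp": [1, 1, 0, 1, 0, 0],
    "Parch": [0, 0, 0, 0, 0, 2],
})

features = prepare_features(sample)
print(features)
```

If these features were later used to fit a model, the median would need to be computed on the training split only, since taking it over all rows leaks information from the held-out data.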
Summary
This analysis of a portion of the Titanic dataset reveals key survival trends. Females and passengers in higher classes (Pclass 1 and 2) had a substantially higher chance of survival than males and Pclass 3 passengers. Children, loosely indicated by the 'Master.' title (and less reliably by 'Miss.', which also covers adult unmarried women), also appear to have had a higher survival rate. Data quality issues include missing 'Age' and 'Cabin' values and some anomalous 'Fare' values of 0. Due to the limited dataset size, predictive modeling is not statistically advisable at this stage, but the identified trends provide valuable insights into the factors influencing survival on the Titanic.