Titanic Survival Prediction: Machine Learning Analysis

Predicting which passengers survived the sinking of the Titanic is a classic machine-learning problem. In this blog, I'll walk through my approach to building a high-accuracy model for Titanic survival prediction, covering data preprocessing, feature engineering, model building, and evaluation.

1. Project Objective

The objective is to develop a model that predicts whether a passenger survived, based on the features available in the Titanic dataset. The ultimate goal is to achieve the highest possible accuracy on Kaggle's competition leaderboard.

2. Libraries and Tools

I used the following libraries:

  • pandas: Data manipulation
  • seaborn & matplotlib: Data visualization
  • scikit-learn: Model training and evaluation
  • XGBoost & CatBoost: Advanced classifiers
  • Streamlit: Web app deployment

3. Data Exploration and Preprocessing

I began by importing the dataset and performing an Exploratory Data Analysis (EDA) to uncover key insights. Key observations included:

  • Passenger class (Pclass), gender (Sex), and age (Age) were strongly associated with survival rates.
  • Missing values in columns such as Age and Cabin required imputation strategies.
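As a minimal sketch of this kind of check, the snippet below computes survival rates per group and the share of missing values per column with pandas. The six-row DataFrame is made up for illustration; only the column names (Survived, Pclass, Sex) follow the Kaggle dataset, and this is not necessarily the exact analysis I ran.

```python
import pandas as pd

# Toy rows for illustration only; the real analysis would load Kaggle's train.csv.
df = pd.DataFrame({
    "Survived": [1, 0, 1, 0, 0, 1],
    "Pclass":   [1, 3, 1, 3, 2, 2],
    "Sex":      ["female", "male", "female", "male", "male", "female"],
})

# The mean of a 0/1 column is the share of survivors in each group.
rate_by_sex = df.groupby("Sex")["Survived"].mean()
rate_by_class = df.groupby("Pclass")["Survived"].mean()

# Share of missing values per column (zero everywhere in this toy frame).
missing_share = df.isna().mean()

print(rate_by_sex)
print(rate_by_class)
print(missing_share)
```

On the real data, the same two groupby calls show how survival differs by gender and class, and the last line shows which columns need imputation.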

Handling Missing Data

  • Age: Imputed using median values based on gender and passenger class.
  • Cabin: Dropped due to a high percentage of missing values.
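The two steps above can be sketched with pandas as follows. The small DataFrame is invented for illustration, and the fallback to the overall median is an assumption added to cover groups with no known ages.

```python
import numpy as np
import pandas as pd

# Toy data for illustration; column names follow the Kaggle dataset.
df = pd.DataFrame({
    "Sex":    ["male", "male", "male", "female", "female", "female"],
    "Pclass": [3, 3, 3, 1, 1, 1],
    "Age":    [20.0, 30.0, np.nan, 40.0, np.nan, 50.0],
    "Cabin":  [np.nan, np.nan, np.nan, "C85", np.nan, np.nan],
})

# Median Age of the passengers who share the same Sex and Pclass, aligned row by row.
group_median = df.groupby(["Sex", "Pclass"])["Age"].transform("median")
df["Age"] = df["Age"].fillna(group_median)

# Assumed fallback: if an entire group has no known ages, use the overall median.
df["Age"] = df["Age"].fillna(df["Age"].median())

# Cabin is mostly missing, so the column is dropped.
df = df.drop(columns=["Cabin"])

print(df)
```

Here the missing male third-class age becomes 25.0 (the median of 20 and 30) and the missing female first-class age becomes 45.0 (the median of 40 and 50).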

Encoding Categorical Data

Label encoding was applied to categorical features like Sex, Embarked, and Pclass.
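A minimal sketch of this step with scikit-learn's LabelEncoder is shown below. The four-row DataFrame and the `encoders` dictionary are illustrative assumptions, not the exact code from the project.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data for illustration; column names follow the Kaggle dataset.
df = pd.DataFrame({
    "Sex":      ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "Q", "S"],
    "Pclass":   [3, 1, 2, 3],
})

encoders = {}
for col in ["Sex", "Embarked", "Pclass"]:
    enc = LabelEncoder()
    # Classes are sorted, then mapped to 0, 1, 2, ...
    df[col] = enc.fit_transform(df[col])
    # Keep the fitted encoder so the test set can be transformed the same way.
    encoders[col] = enc

print(df)
print(list(encoders["Embarked"].classes_))
```

Label encoding imposes an integer order on the categories. Tree-based models usually tolerate this, but for a linear model such as logistic regression, one-hot encoding Embarked may be a better fit.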

4. Model Selection and Evaluation

I evaluated several models, including:

  1. Logistic Regression
  2. Random Forest Classifier
  3. XGBoost
  4. CatBoost
  5. Neural Networks
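One way to compare candidates like these on equal footing is cross-validated accuracy. The sketch below is an assumption about the setup rather than the exact procedure I followed: it uses synthetic data from make_classification as a stand-in for the preprocessed Titanic features, and it includes only the two scikit-learn models to stay dependency-free. XGBoost's XGBClassifier and CatBoost's CatBoostClassifier follow the same fit/predict interface and could be added to the dictionary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the preprocessed Titanic features and labels.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Mean 5-fold cross-validated accuracy for each candidate.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```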

Evaluation Metrics

  • Accuracy
  • Precision, Recall, and F1-score
  • Confusion Matrix
  • ROC-AUC Curves
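The metrics listed above can all be computed with sklearn.metrics. The labels, predictions, and probabilities below are hand-made for illustration and are not results from the Titanic models.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Hand-made example: true labels, hard predictions, and predicted probabilities.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3]

accuracy = accuracy_score(y_true, y_pred)    # share of correct predictions
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_true, y_pred)

# ROC-AUC is computed from probabilities, not from the hard predictions.
auc = roc_auc_score(y_true, y_prob)

print(accuracy, precision, recall, f1)
print(cm)
print(auc)
```

In this example there are 3 true positives, 3 true negatives, 1 false positive, and 1 false negative, so accuracy, precision, recall, and F1 all equal 0.75, while ROC-AUC is 0.9375 because 15 of the 16 positive/negative pairs are ranked correctly.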

Insights

  • Metrics comparison for different models
    [Figure: metrics table for comparison]
  • Confusion matrices
    [Figure: confusion matrix plot]
  • ROC-AUC
    [Figure: ROC-AUC curve]

Key Results

The highest accuracy was achieved with the following model configuration:

  • Model: Tuned Random Forest
  • Accuracy: 90.9%
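This post does not record the tuning procedure or the final hyperparameters, so the following is only a sketch of one common approach: a grid search with cross-validation over a Random Forest. The parameter grid, the 80/20 split, and the synthetic data are all assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the preprocessed Titanic features and labels.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Illustrative grid; the values are assumptions, not the tuned settings.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, 5, None],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

best_params = search.best_params_          # best combination found in the grid
cv_accuracy = search.best_score_           # mean cross-validated accuracy of that combination
val_accuracy = search.score(X_val, y_val)  # accuracy on the held-out split

print(best_params, cv_accuracy, val_accuracy)
```

Keeping a held-out split that the search never sees gives a less optimistic accuracy estimate than the cross-validation score used to pick the parameters.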

5. Kaggle Submission and Results

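Submitting the predictions to Kaggle yielded a score of 0.77. Kaggle scores this competition by accuracy, so roughly 77% of the test-set passengers were classified correctly. That is a respectable result, but it is noticeably lower than the 90.9% measured during model evaluation, which suggests the local estimate was optimistic.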
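For reference, a Titanic submission is a CSV file with two columns, PassengerId and Survived. The sketch below builds one with pandas; the three IDs and predictions are placeholders rather than actual model output.

```python
import pandas as pd

# Placeholder values; in practice the IDs come from Kaggle's test.csv
# and the predictions from the trained model.
passenger_ids = [892, 893, 894]
predictions = [0, 1, 0]

submission = pd.DataFrame({
    "PassengerId": passenger_ids,
    "Survived": predictions,
})

# Without a path, to_csv returns the CSV text; pass a path such as
# "submission.csv" to write the file that gets uploaded.
csv_text = submission.to_csv(index=False)
print(csv_text)
```

Setting index=False matters here, since an extra index column would not match the expected two-column format.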

6. Visualizations

To better understand the model's performance, confusion matrices, ROC-AUC curves, and feature importances were visualized. Below are key plots:

  • Confusion Matrix
  • ROC-AUC Curve

7. Streamlit App Deployment

The project was further enhanced by deploying a Streamlit app, making it accessible for real-time predictions. Check out the live app here.

8. Conclusion

This project provided valuable insights into the predictive factors for Titanic survival. Building and evaluating various models helped fine-tune the prediction strategy, ultimately leading to an optimized solution.


About the Author

Results-driven Data Analyst with expertise in SQL, Power BI, Tableau, and Excel. Proven track record in data extraction, cleaning, and analysis, driving data-driven decisions. Skilled in collaborating with cross-functional teams to enhance data quality and deliver actionable insights.