In the era of personalization and targeted user experiences, gender prediction from names can play a significant role in various applications. This project focuses on building a machine-learning model to predict gender based on names using multiple classification algorithms.
Project Objectives
The goal was to build a machine learning pipeline that predicts gender from a name supplied as text, and to compare several classification models on that task by evaluating their performance.
Data Loading and Exploration
The dataset, which pairs each name with its corresponding gender label, was loaded using pandas and inspected for its structure.
Key steps in the exploration included:
- Understanding the distribution of gender labels
- Identifying missing or duplicate values, if any
- Analyzing the length and frequency of names for potential feature extraction
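The exploration steps above can be sketched with pandas as follows. This is a minimal illustration, not the project's actual code: the toy rows and the column names `name` and `gender` are assumptions, since the post does not show the dataset's schema.

```python
import pandas as pd

# Hypothetical toy data; the real dataset and its column names are not shown in the post.
df = pd.DataFrame({
    "name": ["Emma", "Liam", "Olivia", "Noah", "Ava", "Liam", None],
    "gender": ["F", "M", "F", "M", "F", "M", "M"],
})

# Distribution of gender labels
label_counts = df["gender"].value_counts()

# Missing values per column and fully duplicated rows
missing_per_column = df.isna().sum()
duplicate_rows = int(df.duplicated().sum())

# Name lengths and how often each name occurs
name_lengths = df["name"].dropna().str.len()
name_frequency = df["name"].value_counts()

print(label_counts)
print(missing_per_column)
print("duplicate rows:", duplicate_rows)
print(name_lengths.describe())
print(name_frequency.head())
```

A strongly skewed label distribution or a large number of duplicates found at this stage would affect how the later train/test split and metrics should be read.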
Data Preprocessing
Preprocessing steps were critical to prepare the dataset for machine learning models:
- Text Vectorization: The project used TF-IDF (Term Frequency-Inverse Document Frequency) to convert text features into numerical format suitable for machine learning.
- Encoding: The target labels were encoded using LabelEncoder.
- Data Splitting: The dataset was split into training and testing sets for robust evaluation.
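A minimal sketch of these three preprocessing steps with scikit-learn is shown below. The toy names, the 80/20 split, and the character n-gram settings (`analyzer="char_wb"`, `ngram_range=(1, 3)`) are assumptions chosen for illustration; the post does not state which TF-IDF configuration was used. The sketch splits before fitting the vectorizer so that the test names do not influence the learned vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data standing in for the real dataset.
names = ["Emma", "Liam", "Olivia", "Noah", "Ava",
         "James", "Sophia", "Lucas", "Mia", "Ethan"]
genders = ["F", "M", "F", "M", "F", "M", "F", "M", "F", "M"]

# Encode string labels as integers (classes are sorted, so F -> 0, M -> 1).
encoder = LabelEncoder()
y = encoder.fit_transform(genders)

# Hold out 20% of the rows, keeping the class balance in both parts.
names_train, names_test, y_train, y_test = train_test_split(
    names, y, test_size=0.2, random_state=42, stratify=y
)

# Character n-grams are an assumption here; they suit short strings such as names.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3))
X_train = vectorizer.fit_transform(names_train)  # learn vocabulary on training data only
X_test = vectorizer.transform(names_test)        # reuse the same vocabulary

print(X_train.shape, X_test.shape)
```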
Model Training and Evaluation
The project experimented with various models, including:
- Logistic Regression: A linear classifier known for its interpretability and its effectiveness on linearly separable data.
- Naive Bayes: A probabilistic classifier well-suited for text classification tasks.
- XGBoost: A gradient boosting algorithm built on ensembles of decision trees, often chosen when higher accuracy is the priority.
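One way to train the three models side by side is sketched below. It is an illustration under assumptions rather than the project's code: the toy data, the helper `char_tfidf`, and all hyperparameters are hypothetical. XGBoost is added only if the `xgboost` package is installed.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data; labels are already encoded (0 = F, 1 = M).
names = ["Emma", "Liam", "Olivia", "Noah", "Ava",
         "James", "Sophia", "Lucas", "Mia", "Ethan"]
labels = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]


def char_tfidf():
    """Fresh character n-gram TF-IDF vectorizer (settings are an assumption)."""
    return TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3))


# Each pipeline vectorizes the raw names and then fits a classifier.
models = {
    "logistic_regression": make_pipeline(char_tfidf(), LogisticRegression(max_iter=1000)),
    "naive_bayes": make_pipeline(char_tfidf(), MultinomialNB()),
}

try:
    from xgboost import XGBClassifier  # optional dependency
    models["xgboost"] = make_pipeline(
        char_tfidf(), XGBClassifier(n_estimators=50, max_depth=3)
    )
except ImportError:
    pass  # skip XGBoost when the package is not available

# Fit every model on the same data and predict two example names.
predictions = {}
for model_name, model in models.items():
    model.fit(names, labels)
    predictions[model_name] = model.predict(["Emma", "Noah"])

for model_name, predicted in predictions.items():
    print(model_name, predicted)
```

Wrapping the vectorizer and classifier in one pipeline keeps the comparison fair, because every model sees features built the same way.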
Evaluation Metrics
The models were evaluated using:
- Accuracy: Percentage of correct predictions
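- Precision and Recall: Precision is the share of predicted positives that are correct (it drops as false positives rise); recall is the share of actual positives that are found (it drops as false negatives rise)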
- Confusion Matrix: Visualization of prediction results
- ROC Curve and AUC: The trade-off between true positive rate and false positive rate across classification thresholds, summarized by the area under the curve
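The metrics above can be computed with `sklearn.metrics`, as in the following sketch. The labels and scores are made-up values for illustration only and are not results from this project; the 0.5 threshold is likewise an assumption.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Made-up ground truth and predicted probabilities for the positive class.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_score = [0.9, 0.2, 0.6, 0.4, 0.1, 0.7, 0.8, 0.3]

# Turn probabilities into hard predictions with an assumed 0.5 threshold.
y_pred = [1 if score >= 0.5 else 0 for score in y_score]

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class
auc = roc_auc_score(y_true, y_score)   # uses scores, not thresholded predictions

print("accuracy:", accuracy)    # 0.75
print("precision:", precision)  # 0.75
print("recall:", recall)        # 0.75
print("confusion matrix:", cm.tolist())  # [[3, 1], [1, 3]]
print("AUC:", auc)              # 0.875
```

Note that AUC is computed from the raw scores, so it does not change when the threshold changes, while accuracy, precision, and recall do.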
Results and Key Insights
The Logistic Regression model stood out, with the highest accuracy among the models tested and balanced precision and recall scores. The confusion matrix and ROC curve provided additional insight into where the model's errors fell and how its performance varied across decision thresholds.
Visuals
Here are the key plots that capture the performance of the model:
1. Accuracy Metrics Plot: