Project Overview
A machine learning pipeline that predicts sales for 1,559 products across 10 stores. The system identifies the key drivers of retail performance and supports inventory optimization, using feature engineering and gradient boosting (XGBoost).
Technical Implementation
- Automated missing value treatment (mean/mode imputation)
- IQR-based outlier detection and capping
- Label encoding for categorical features
- XGBoost regression with hyperparameter tuning
Key Features
Data Cleaning
Handles 17% missing item weight values and 28% missing store size values, using mean imputation for numerical features and mode imputation for categorical features
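The mean/mode imputation described above could look roughly like the following pandas sketch. The column names `Item_Weight` and `Outlet_Size` and the toy values are assumptions for illustration; the project's actual column names are not given here.

```python
import numpy as np
import pandas as pd

# Toy data; column names are hypothetical stand-ins for item weight and store size.
df = pd.DataFrame({
    "Item_Weight": [9.3, np.nan, 17.5, 12.2, np.nan],
    "Outlet_Size": ["Medium", None, "Small", "Medium", None],
})

def impute_mean_mode(frame: pd.DataFrame) -> pd.DataFrame:
    """Fill numeric columns with their mean and all other columns with their mode."""
    out = frame.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].mean())
        else:
            # mode() can return several values on ties; take the first one.
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out

clean = impute_mean_mode(df)
```

In a train/test setting, the fill values would normally be computed on the training rows only and then reused for the test rows, so that no test information leaks into the model.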
Feature Engineering
Transforms 8 categorical variables into numerical features using label encoding
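As an illustration of label encoding, the sketch below maps each category to an integer using pandas category codes, which number the sorted labels from 0 in the same way scikit-learn's `LabelEncoder` does. The column names and values are hypothetical; the 8 categorical variables of the project are not listed here.

```python
import pandas as pd

# Toy frame; column names and values are hypothetical examples.
df = pd.DataFrame({
    "Outlet_Type": ["Grocery Store", "Supermarket Type1", "Grocery Store"],
    "Item_Fat_Content": ["Low Fat", "Regular", "Low Fat"],
    "Item_MRP": [249.8, 48.3, 141.6],
})

def label_encode(frame: pd.DataFrame, columns: list[str]):
    """Replace each listed column with integer codes and return the code-to-label maps."""
    out = frame.copy()
    mappings = {}
    for col in columns:
        cat = out[col].astype("category")
        # Categories are sorted, so code 0 is the alphabetically first label.
        mappings[col] = dict(enumerate(cat.cat.categories))
        out[col] = cat.cat.codes
    return out, mappings

encoded, mappings = label_encode(df, ["Outlet_Type", "Item_Fat_Content"])
```

Label encoding imposes an arbitrary ordering on the categories. Tree-based models such as gradient boosting usually tolerate this, whereas linear models would more often call for one-hot encoding.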
Model Performance
Achieves an R² score of 0.67 on training data and 0.57 on test data
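For reference, R² compares the model's squared error with that of a baseline that always predicts the mean: R² = 1 - SS_res / SS_tot. The NumPy sketch below computes it on made-up numbers; it is not the project's evaluation code.

```python
import numpy as np

def r2_score(y_true, y_pred) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of the mean baseline
    return float(1.0 - ss_res / ss_tot)

y_true = [3.0, 5.0, 7.0, 9.0]
perfect = r2_score(y_true, y_true)        # predictions match exactly
baseline = r2_score(y_true, [6.0] * 4)    # always predicting the mean
```

A score of 1 means perfect predictions and 0 means no better than the mean, so a test score of 0.57 corresponds to explaining a little over half of the variance in sales. The gap to the training score of 0.67 may indicate mild overfitting.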
Insight Generation
Identifies key factors impacting sales through feature importance analysis
Technical Deep Dive
Data Pipeline
- Exploratory data analysis
- Outlier treatment using IQR
- Categorical feature encoding
- Train-test split (80:20)
- Model training & evaluation
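Two of the steps above, IQR-based capping and the 80:20 split, can be sketched with NumPy as follows. The multiplier `k=1.5` is the conventional choice and an assumption here, as are the sample values and the random seed.

```python
import numpy as np

def cap_outliers_iqr(values, k: float = 1.5) -> np.ndarray:
    """Clip values to the range [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return np.clip(values, q1 - k * iqr, q3 + k * iqr)

def split_indices_80_20(n_rows: int, seed: int = 0):
    """Shuffle row indices and return (train, test) index arrays in an 80:20 ratio."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_rows)
    cut = int(0.8 * n_rows)
    return idx[:cut], idx[cut:]

# Made-up sales figures with one extreme value.
sales = [10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 10.0, 13.0, 12.0, 500.0]
capped = cap_outliers_iqr(sales)
train_idx, test_idx = split_indices_80_20(len(sales))
```

Capping keeps every row and only pulls extreme values back to the fence, which preserves the sample size. Dropping outlier rows is the alternative when the extremes are believed to be data errors.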
Performance Metrics
Challenges & Solutions
⚠️ Missing Data
Implemented a hybrid imputation strategy: mean imputation for numerical features and mode imputation for categorical features
⚡ Model Accuracy
Improved performance through feature engineering and hyperparameter tuning of the XGBoost regressor