Project Overview
A machine learning system that predicts individual medical costs based on demographic and lifestyle factors. Using Kaggle's insurance dataset, the model identifies key cost drivers and provides interpretable insights for healthcare planning.
Technical Implementation
- Feature engineering with BMI-age interactions
- Categorical encoding for smoking status and regions
- Multiple linear regression model
- Model interpretation through coefficient analysis and correlation matrices
Prediction Sequence
Core Features
Factor Analysis
Uses coefficient analysis to identify key cost drivers, including smoking status, BMI, and age
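
One way to read cost drivers off a linear model is to standardize the features first, so that coefficient magnitudes can be compared. The sketch below is illustrative only: it uses plain NumPy and synthetic data with made-up effect sizes (not the Kaggle dataset), and the helper names `fit_ols` and `rank_coefficients` are hypothetical.

```python
import numpy as np

def fit_ols(X, y):
    """Ordinary least squares with an intercept; returns (intercept, coefficients)."""
    design = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta[0], beta[1:]

def rank_coefficients(X, y, names):
    """Fit on z-scored features and sort by absolute coefficient size."""
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    _, coefs = fit_ols(Z, y)
    order = np.argsort(-np.abs(coefs))
    return [(names[i], float(coefs[i])) for i in order]

# Synthetic illustration: the effect sizes below are invented for the demo.
rng = np.random.default_rng(0)
n = 500
age = rng.integers(18, 65, n)
bmi = rng.normal(30, 6, n)
smoker = rng.integers(0, 2, n)
charges = 250 * age + 300 * bmi + 24000 * smoker + rng.normal(0, 3000, n)

ranking = rank_coefficients(
    np.column_stack([age, bmi, smoker]), charges, ["age", "bmi", "smoker"]
)
for name, coef in ranking:
    print(f"{name:>8}: {coef:10.1f}")
```

Each standardized coefficient is the change in predicted cost per one standard deviation of that feature, which is what makes the ranking meaningful across features measured in different units.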
Smart Features
Engineered interaction terms (BMI×Age) and regional cost analysis
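
A minimal sketch of the encoding and interaction step, assuming pandas and the column names `age`, `bmi`, `smoker`, and `region`. The helper `add_features` and the three sample rows are hypothetical and exist only to show the shape of the output.

```python
import pandas as pd

def add_features(df):
    """One-hot encode the categorical columns and add a BMI x Age interaction."""
    # drop_first avoids a redundant dummy column per category (the baseline level).
    out = pd.get_dummies(df, columns=["smoker", "region"], drop_first=True, dtype=float)
    out["bmi_x_age"] = out["bmi"] * out["age"]
    return out

# Made-up rows for illustration, not taken from the dataset.
sample = pd.DataFrame({
    "age": [19, 45, 33],
    "bmi": [27.9, 30.0, 22.5],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "southeast", "northwest"],
})
features = add_features(sample)
print(features.columns.tolist())
```

Dropping the first level of each category keeps the design matrix full rank when the model also fits an intercept.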
High Accuracy
Achieves an R² of 0.866 on test data, with an RMSE of $4,567
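
For reference, both metrics can be computed directly from predictions. This is a generic sketch in NumPy; the function names are illustrative and the numbers in the usage lines are made up, not the project's results.

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target (dollars here)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy values for illustration only.
actual = [1000, 2000, 3000, 4000]
predicted = [1100, 1900, 3200, 3800]
print(r2_score(actual, predicted), rmse(actual, predicted))
```

R² is unitless and describes the share of variance explained, while RMSE is in dollars, so the two are usually reported together.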
Prediction API
REST API endpoint that returns cost predictions from demographic inputs
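
The source does not describe the endpoint's framework, route, or payload, so the following is only a sketch of what such an endpoint could look like using Python's standard library. The `/predict` path, the field names, the response key `predicted_cost`, and the `predict_fn` callable are all assumptions.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Assumed request schema: field name -> accepted Python types.
REQUIRED = {"age": (int, float), "bmi": (int, float), "smoker": (str,), "region": (str,)}

def validate(payload):
    """Return a list of problems with the request body (empty if valid)."""
    if not isinstance(payload, dict):
        return ["body must be a JSON object"]
    errors = []
    for field, types in REQUIRED.items():
        if field not in payload:
            errors.append(f"missing field: {field}")
        elif isinstance(payload[field], bool) or not isinstance(payload[field], types):
            errors.append(f"wrong type for field: {field}")
    return errors

def handle_predict(payload, predict_fn):
    """Pure request logic: returns (status_code, response_dict)."""
    errors = validate(payload)
    if errors:
        return 400, {"errors": errors}
    return 200, {"predicted_cost": round(float(predict_fn(payload)), 2)}

def make_handler(predict_fn):
    """Build a handler class bound to a prediction callable (hypothetical)."""
    class PredictHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            if self.path != "/predict":
                self.send_error(404)
                return
            length = int(self.headers.get("Content-Length", 0))
            try:
                payload = json.loads(self.rfile.read(length) or b"{}")
            except json.JSONDecodeError:
                status, body = 400, {"errors": ["body is not valid JSON"]}
            else:
                status, body = handle_predict(payload, predict_fn)
            data = json.dumps(body).encode()
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(data)))
            self.end_headers()
            self.wfile.write(data)
    return PredictHandler

# To serve locally (blocking call, so left commented out):
# HTTPServer(("127.0.0.1", 8000), make_handler(my_predict_fn)).serve_forever()
```

Keeping validation and prediction in `handle_predict`, separate from the HTTP plumbing, lets the request logic be tested without starting a server.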
Technical Deep Dive
Modeling Pipeline
- Data cleaning & missing value imputation
- Categorical feature encoding
- Interaction term creation
- Train-test split (80-20)
- Model training & interpretation
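
The split-and-train steps above can be sketched as follows. This version uses NumPy only and synthetic data so it runs on its own; the function names, the seed, and the generating coefficients are illustrative assumptions, and a library such as scikit-learn would be the more usual choice in practice.

```python
import numpy as np

def train_test_split(X, y, test_fraction=0.2, seed=42):
    """Shuffle row indices and hold out a fraction for testing (80-20 by default)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(round(len(X) * test_fraction))
    test, train = idx[:n_test], idx[n_test:]
    return X[train], X[test], y[train], y[test]

def fit_linear(X, y):
    """Least-squares fit; returns [intercept, coef_1, ..., coef_k]."""
    design = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta

def predict(beta, X):
    """Apply fitted coefficients to new rows."""
    return beta[0] + X @ beta[1:]

# Synthetic data standing in for the encoded feature matrix.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = 5.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.1, 200)

X_train, X_test, y_train, y_test = train_test_split(X, y)
beta = fit_linear(X_train, y_train)
test_predictions = predict(beta, X_test)
```

Fitting only on the training rows and scoring on the held-out rows is what makes the reported test metrics an estimate of performance on unseen data.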
Performance Metrics
Challenges & Solutions
⚠️ Data Quality
Ran exploratory data analysis (EDA) and outlier analysis to check data integrity before modeling
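
The source does not say which outlier method was used. As one common possibility, the sketch below flags values outside 1.5 times the interquartile range; the function name, the `k` threshold, and the sample values are assumptions for illustration.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Boolean mask marking values beyond k * IQR from the quartiles."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Made-up BMI-like values with one extreme entry.
bmi_values = [20, 22, 24, 26, 28, 30, 80]
mask = iqr_outlier_mask(bmi_values)
print([v for v, flagged in zip(bmi_values, mask) if flagged])
```

Flagged rows are candidates for inspection rather than automatic removal, since high-cost cases in medical data are often genuine.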
⚡ Model Interpretability
Used coefficient analysis and correlation matrices to explain feature importance
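
A correlation matrix of the kind mentioned above can be computed with NumPy's `corrcoef`. The helper `correlation_table` and the toy columns are hypothetical; a real analysis would pass the numeric columns of the dataset.

```python
import numpy as np

def correlation_table(columns):
    """Pearson correlations between named numeric columns.

    `columns` maps a name to a sequence of values; returns (names, matrix).
    """
    names = list(columns)
    data = np.vstack([np.asarray(columns[name], dtype=float) for name in names])
    return names, np.corrcoef(data)

# Toy columns for illustration only.
names, corr = correlation_table({
    "age": [1, 2, 3, 4],
    "charges": [2, 4, 6, 8],
    "inverse": [4, 3, 2, 1],
})
for name, row in zip(names, corr):
    print(name, np.round(row, 2))
```

Correlations describe pairwise linear association only, so they complement the regression coefficients, which describe each feature's effect with the others held fixed.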