Credit Card Fraud Detection
A fraud classification pipeline with SMOTE-based imbalance handling, optimized for recall on a highly skewed real-world dataset.
Python, Pandas, Scikit-Learn, Imbalance Handling
Problem
Credit card fraud costs institutions billions annually. Detecting it resembles an anomaly detection problem: legitimate transactions outnumber fraudulent ones by orders of magnitude, so a model that labels every transaction as legitimate still scores near-perfect accuracy. That makes standard accuracy a misleading metric.
Dataset & Class Imbalance
The dataset contained anonymized transaction features with fraud cases constituting less than 1% of records. Three strategies were evaluated to address this:
- SMOTE (Synthetic Minority Over-sampling Technique)
- Majority class undersampling
- Class-weighted loss functions
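The three strategies can be illustrated on toy data. This is a minimal sketch, not the project's code: the data is randomly generated, `smote_like_oversample` is a hypothetical helper that reproduces only the core SMOTE idea (interpolating between a minority sample and one of its nearest minority neighbours), and in practice the `SMOTE` class from the imbalanced-learn package would be the more usual choice.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.utils.class_weight import compute_class_weight

rng = np.random.default_rng(0)

# Hypothetical toy data: 990 legitimate rows (label 0), 10 fraud rows (label 1).
X = np.vstack([rng.normal(0.0, 1.0, size=(990, 4)),
               rng.normal(2.0, 1.0, size=(10, 4))])
y = np.array([0] * 990 + [1] * 10)


def smote_like_oversample(X_min, n_new, k=5, rng=rng):
    """Simplified SMOTE-style sampler (illustrative helper, not a library API).

    Each synthetic row lies on the segment between a minority sample and
    one of its k nearest minority neighbours.
    """
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    # Column 0 is each point itself, so drop it.
    neighbours = nn.kneighbors(X_min, return_distance=False)[:, 1:]
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[picked] - X_min[base])


X_min, X_maj = X[y == 1], X[y == 0]

# 1) Oversampling: add synthetic fraud rows until the classes are balanced.
X_syn = smote_like_oversample(X_min, n_new=len(X_maj) - len(X_min))
X_over = np.vstack([X, X_syn])
y_over = np.concatenate([y, np.ones(len(X_syn), dtype=int)])

# 2) Undersampling: keep only as many legitimate rows as there are fraud rows.
keep = rng.choice(np.flatnonzero(y == 0), size=len(X_min), replace=False)
idx = np.concatenate([keep, np.flatnonzero(y == 1)])
X_under, y_under = X[idx], y[idx]

# 3) Class weights: leave the data alone and reweight the loss instead.
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)

print("oversampled counts: ", np.bincount(y_over))
print("undersampled counts:", np.bincount(y_under))
print("class weights:      ", weights)
```

Whichever resampling strategy is used, it should be applied to the training split only, so that validation and test data keep the original class ratio.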
Approach
- Preprocessing: Scaled numerical features and handled missing values using pandas.
- Feature Engineering: Analyzed correlation matrices to identify the most predictive features and remove noise.
- Model Evaluation: Compared Logistic Regression, Random Forest, and Gradient Boosting classifiers.
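The comparison step might look like the sketch below. Several details are assumptions made for illustration: the data is a synthetic stand-in from `make_classification`, missing values are imputed with scikit-learn's `SimpleImputer` rather than pandas so that imputation sits inside the cross-validated pipeline, and the hyperparameters and fold count are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data: roughly 1% positives (assumed ratio).
X, y = make_classification(n_samples=5000, n_features=10, n_informative=5,
                           weights=[0.99, 0.01], random_state=0)

# The three model families named above; hyperparameters are illustrative.
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
}

# Stratified folds keep the fraud ratio roughly equal in every split.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

scores = {}
for name, model in models.items():
    # Imputation and scaling are fitted on each training fold only.
    pipe = make_pipeline(SimpleImputer(strategy="median"), StandardScaler(), model)
    scores[name] = cross_val_score(pipe, X, y, cv=cv,
                                   scoring="average_precision").mean()

for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: mean PR-AUC = {score:.3f}")
```

Keeping the preprocessing inside the pipeline means no statistics from a validation fold leak into training, which matters when only a handful of fraud rows exist per fold.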
Evaluation Strategy
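Because accuracy is misleading on imbalanced datasets, the pipeline was optimized for:
- Recall: The share of actual fraud cases the model catches. Maximizing it minimizes missed fraud, the highest-cost error in production.
- Precision-Recall AUC: A more reliable performance measure than ROC-AUC for skewed class distributions, because it ignores the large pool of true negatives that keeps the false positive rate low.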
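A small worked example shows why these metrics were preferred. The labels and scores below are hypothetical and chosen so the arithmetic is easy to follow: 1,000 transactions, 5 of them fraudulent, and a model that ranks 4 of the 5 fraud cases at the top while missing the fifth entirely.

```python
import numpy as np
from sklearn.metrics import accuracy_score, average_precision_score, recall_score

# Hypothetical labels: 1,000 transactions, the first 5 are fraud.
y_true = np.zeros(1000, dtype=int)
y_true[:5] = 1

# Baseline that labels every transaction as legitimate.
y_naive = np.zeros(1000, dtype=int)
naive_accuracy = accuracy_score(y_true, y_naive)  # 995 / 1000 = 0.995
naive_recall = recall_score(y_true, y_naive)      # 0 / 5 = 0.0

# Hypothetical model scores: low for almost everything, high for 4 fraud cases.
scores = np.linspace(0.0, 0.4, 1000)
scores[:4] = 0.9  # the fifth fraud case (index 4) keeps a very low score

y_pred = (scores >= 0.5).astype(int)
model_accuracy = accuracy_score(y_true, y_pred)   # 999 / 1000 = 0.999
model_recall = recall_score(y_true, y_pred)       # 4 / 5 = 0.8

# Average precision is scikit-learn's summary of the precision-recall curve.
pr_auc = average_precision_score(y_true, scores)

print(f"naive: accuracy={naive_accuracy:.3f} recall={naive_recall:.1f}")
print(f"model: accuracy={model_accuracy:.3f} recall={model_recall:.1f}")
print(f"model PR-AUC (average precision) = {pr_auc:.3f}")
```

The two accuracy values differ by less than half a percentage point, while recall moves from 0.0 to 0.8, which is the difference that matters for fraud.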
Next Steps
- Explore unsupervised anomaly detection with Isolation Forests or One-Class SVMs.
- Deploy as a microservice for real-time transaction scoring via REST API.
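As a starting point for the unsupervised direction, an Isolation Forest could be tried without using the labels at all. The sketch below is only an illustration under strong assumptions: the data is synthetic, the fraud rows are deliberately placed far from the legitimate ones, and the `contamination` value is a guess at the fraud rate rather than a tuned setting.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Hypothetical data: 990 legitimate rows and 10 clearly separated fraud rows.
X_legit = rng.normal(0.0, 1.0, size=(990, 4))
X_fraud = rng.normal(6.0, 1.0, size=(10, 4))
X = np.vstack([X_legit, X_fraud])

# contamination is the assumed share of anomalies in the data.
forest = IsolationForest(n_estimators=100, contamination=0.01, random_state=0)
forest.fit(X)

labels = forest.predict(X)                # -1 = flagged as anomaly, 1 = inlier
anomaly_score = -forest.score_samples(X)  # higher = more anomalous

print("rows flagged as anomalies:", int((labels == -1).sum()))
print("mean score, legitimate:", round(float(anomaly_score[:990].mean()), 3))
print("mean score, fraud:     ", round(float(anomaly_score[990:].mean()), 3))
```

Real fraud is rarely this well separated, so the anomaly scores would more plausibly serve as an extra input feature or a first-pass filter than as a replacement for the supervised models.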
