Credit Card Fraud Detection

A fraud classification pipeline with SMOTE-based imbalance handling, optimized for recall on a highly skewed real-world dataset.

Python, Pandas, Scikit-Learn, Imbalance Handling


Problem

Credit card fraud costs institutions billions annually. Detecting it is fundamentally an anomaly detection problem — legitimate transactions outnumber fraudulent ones by orders of magnitude, making standard accuracy a misleading metric.

Dataset & Class Imbalance

The dataset contained anonymized transaction features with fraud cases constituting less than 1% of records. Three strategies were evaluated to address this:

SMOTE (Synthetic Minority Over-sampling Technique)
Majority class undersampling
Class-weighted loss functions
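
To make the three strategies concrete, here is a minimal sketch in plain Python. It is not the project's code and not the imbalanced-learn implementation: the function names, the toy 2-D points, and the parameter choices (`k=3`, the sample counts) are assumptions made for illustration. The oversampler shows only the core SMOTE idea, interpolating between a minority sample and one of its nearest minority neighbours. The class weights use the common "balanced" heuristic, n_samples / (n_classes * class_count).

```python
import math
import random
from collections import Counter


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points, each on the line segment between a
    minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


def undersample_majority(majority, n_keep, seed=0):
    """Randomly keep n_keep majority samples, without replacement."""
    return random.Random(seed).sample(majority, n_keep)


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * c) for cls, c in counts.items()}


# Toy data (made up): 4 fraud points and 96 legitimate points in 2-D.
fraud = [(0.9, 0.8), (1.0, 1.1), (1.2, 0.9), (1.1, 1.0)]
rng = random.Random(42)
legit = [(rng.random() * 0.5, rng.random() * 0.5) for _ in range(96)]

new_fraud = smote_like_oversample(fraud, n_new=92)           # oversampling
kept_legit = undersample_majority(legit, n_keep=len(fraud))  # undersampling
weights = balanced_class_weights([1] * len(fraud) + [0] * len(legit))
```

Whichever strategy is used, resampling would normally be applied to the training split only, so that the evaluation data keeps the real class ratio.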

Approach

Preprocessing: Scaled numerical features and handled missing values using pandas.
Feature Engineering: Analyzed correlation matrices to identify the most predictive features and remove noise.
Model Evaluation: Compared Logistic Regression, Random Forest, and Gradient Boosting classifiers.
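
The comparison step might look roughly like the sketch below. It is an illustration under stated assumptions rather than the project's actual code: the data is a synthetic stand-in generated with `make_classification` (about 2% positives), the hyperparameters are arbitrary, and missing-value handling and correlation-based feature selection are left out. Class weighting is used here as the imbalance strategy for the two models that accept `class_weight`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (assumption): roughly 2% positives.
X, y = make_classification(
    n_samples=3000, n_features=10, n_informative=5,
    weights=[0.98, 0.02], random_state=0,
)
# Stratify so both splits keep the same fraud ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

models = {
    # Scaling matters for logistic regression, so it lives in the pipeline
    # and is fitted on the training split only.
    "logistic_regression": make_pipeline(
        StandardScaler(),
        LogisticRegression(class_weight="balanced", max_iter=1000),
    ),
    "random_forest": RandomForestClassifier(
        n_estimators=100, class_weight="balanced", random_state=0
    ),
    # GradientBoostingClassifier has no class_weight parameter.
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    fraud_scores = model.predict_proba(X_test)[:, 1]
    results[name] = {
        "recall": recall_score(y_test, model.predict(X_test)),
        "pr_auc": average_precision_score(y_test, fraud_scores),
    }

for name, metrics in results.items():
    print(f"{name}: recall={metrics['recall']:.3f}  pr_auc={metrics['pr_auc']:.3f}")
```

The numbers printed for this synthetic data say nothing about the real dataset; the sketch only shows the shape of the comparison loop.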

Evaluation Strategy

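Accuracy is misleading on imbalanced datasets: with fraud at under 1% of records, a model that labels every transaction as legitimate scores above 99% accuracy while catching no fraud. The pipeline was optimized for: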

Recall: Minimizes missed fraud cases — the highest-cost error in production.
Precision-Recall AUC: A more informative measure than ROC-AUC for skewed class distributions, because it is not inflated by the large number of true negatives.
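
The two metrics can be written out in a few lines of plain Python. This is a sketch for illustration: the function names and the made-up labels are assumptions, and `average_precision` is the step-wise estimate of the area under the precision-recall curve, assuming no tied scores.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred):
    """Fraction of actual fraud cases (label 1) that were flagged."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if tp + fn else 0.0


def average_precision(y_true, scores):
    """Mean of precision@k over every rank k that holds a positive,
    after sorting by score from highest to lowest (assumes no ties)."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(y_true)
    hits, total = 0, 0.0
    for k, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / k
    return total / n_pos if n_pos else 0.0


# Made-up example: 1,000 transactions, 5 of them fraudulent,
# scored by a "model" that predicts legitimate for everything.
y_true = [1] * 5 + [0] * 995
always_legit = [0] * 1000

print(accuracy(y_true, always_legit))  # 0.995
print(recall(y_true, always_legit))    # 0.0

# Ranking example: positives at ranks 1 and 3 give (1/1 + 2/3) / 2.
ap = average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
```

The first two prints show the gap that motivates the metric choice: the do-nothing model looks excellent on accuracy and useless on recall.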

Next Steps

Explore unsupervised anomaly detection with Isolation Forests or One-Class SVMs.
Deploy as a microservice for real-time transaction scoring via REST API.