SMS Spam Detection
An ML classification system to identify and filter spam SMS messages with 98.2% accuracy and 99.1% precision.
Python, Scikit-Learn, Logistic Regression, NLP
Problem
SMS spam remains a persistent problem. The dataset contains thousands of messages labeled as either spam or ham. The objective was to build a high-precision classifier that protects end-users from unwanted messages without incorrectly filtering legitimate ones.
Dataset
The SMS Spam Collection dataset is a publicly available benchmark of SMS messages, each tagged as spam or ham, collected for spam research.
Approach
- Text Preprocessing: Lowercasing, punctuation removal, and stemming.
- Feature Extraction: TfidfVectorizer converts text into weighted numerical features. TF-IDF down-weights common words and amplifies rare, highly predictive terms.
- Model: LogisticRegression wrapped in a Pipeline for clean, reproducible inference.
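The preprocessing step can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the project's actual code: the names `toy_stem` and `preprocess` are hypothetical, and the suffix stripper is a crude stand-in for a real stemmer such as Porter's.

```python
import string

# Crude suffix list standing in for a real stemmer (e.g. Porter); illustrative only.
_SUFFIXES = ("ing", "ed", "s")


def toy_stem(word: str) -> str:
    """Strip one common suffix, keeping at least three characters of the word."""
    for suffix in _SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(message: str) -> str:
    """Lowercase, remove ASCII punctuation, then stem each whitespace-separated token."""
    lowered = message.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    return " ".join(toy_stem(token) for token in no_punct.split())


print(preprocess("WINNER!! You have been selected, claim prizes now."))
# prints: winner you have been select claim prize now
```

A production stemmer handles many more cases (and avoids over-stripping short words), so treat the output above as a demonstration of the three steps rather than of stemming quality.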
Model
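from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Chain vectorization and classification so raw text goes in and labels come out.
pipeline = Pipeline([
    # Limit the vocabulary to the 5,000 most frequent terms in the corpus.
    ('tfidf', TfidfVectorizer(max_features=5000)),
    ('clf', LogisticRegression(random_state=42)),
])

# X_train holds the message texts and y_train the spam/ham labels (defined elsewhere).
pipeline.fit(X_train, y_train)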
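To show how such a pipeline might be trained and queried end to end, here is a small self-contained sketch. The eight messages and their labels are made up for illustration, and the 75/25 stratified split is an assumption; the project itself trains on the SMS Spam Collection and does not state its split.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical toy messages; the real project uses the SMS Spam Collection.
messages = [
    "Win a free prize now, text WIN to claim",
    "Are we still meeting for lunch today?",
    "URGENT! Your account has won a cash reward",
    "Can you pick up milk on the way home?",
    "Free entry in a weekly competition, reply YES",
    "Running late, see you in ten minutes",
    "Claim your free ringtone now, call this number",
    "Thanks for dinner last night, it was great",
]
labels = ["spam", "ham", "spam", "ham", "spam", "ham", "spam", "ham"]

# Assumed split: hold out 25% of the data, keeping the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    messages, labels, test_size=0.25, stratify=labels, random_state=42
)

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=5000)),
    ("clf", LogisticRegression(random_state=42)),
])
pipeline.fit(X_train, y_train)

# Hard labels for the held-out messages.
predictions = pipeline.predict(X_test)

# Probability of the "spam" class, looked up by name rather than by position.
spam_index = list(pipeline.named_steps["clf"].classes_).index("spam")
spam_probabilities = pipeline.predict_proba(X_test)[:, spam_index]

for text, label, prob in zip(X_test, predictions, spam_probabilities):
    print(f"{label:>4}  p(spam)={prob:.2f}  {text}")
```

Because the vectorizer lives inside the pipeline, it is fitted on the training split only and reapplied unchanged at prediction time, which avoids leaking test vocabulary into training.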
Results
- Accuracy: 98.2%
- Precision (Spam): 99.1%
- Recall (Spam): 89.5%
- F1-Score: 94.0%
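As a quick sanity check, the F1-score is the harmonic mean of precision and recall, so it can be recomputed from the two figures above. The sketch below uses the rounded values as reported; the small gap to 94.0% is consistent with the inputs themselves having been rounded.

```python
# Reported spam-class metrics, as rounded in the results list.
precision = 0.991
recall = 0.895

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"F1 = {f1:.4f}")  # prints: F1 = 0.9406
```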
High precision was the primary objective — a false positive (flagging a real message as spam) is far more damaging to user trust than a missed spam message.
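One common way to favor precision over recall is to raise the probability threshold at which a message is flagged. The write-up does not say whether the threshold was tuned, so the sketch below is purely illustrative: the scores and labels are invented, and `precision_recall` is a hypothetical helper. For reference, scikit-learn's `predict` on a binary logistic regression corresponds to a 0.5 threshold.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall for the spam class when flagging scores >= threshold."""
    flagged = [score >= threshold for score in scores]
    tp = sum(f and y == 1 for f, y in zip(flagged, labels))
    fp = sum(f and y == 0 for f, y in zip(flagged, labels))
    fn = sum((not f) and y == 1 for f, y in zip(flagged, labels))
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical spam probabilities and true labels (1 = spam, 0 = ham).
scores = [0.95, 0.90, 0.70, 0.60, 0.40, 0.20, 0.10, 0.05]
labels = [1, 1, 1, 0, 1, 0, 0, 0]

for threshold in (0.5, 0.8):
    p, r = precision_recall(scores, labels, threshold)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
# prints:
# threshold=0.5: precision=0.75, recall=0.75
# threshold=0.8: precision=1.00, recall=0.50
```

On this toy data, moving the threshold from 0.5 to 0.8 removes the one false positive at the cost of missing an additional spam message, which mirrors the high-precision, lower-recall profile reported above.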
Next Steps
- Upgrade to a transformer-based model (e.g., DistilBERT) for multilingual and context-aware classification.
- Wrap the pipeline in a FastAPI microservice for real-time inference.
