SMS Spam Detection

An ML classification system to identify and filter spam SMS messages with 98.2% accuracy and 99.1% precision.

Python, Scikit-Learn, Logistic Regression, NLP

Problem

SMS spam remains a persistent problem. The dataset contains thousands of messages, each labeled as either spam or ham (legitimate). The objective was to build a high-precision classifier that protects end users from unwanted messages without incorrectly filtering legitimate ones.

Dataset

The project uses the SMS Spam Collection dataset, a publicly available benchmark of labeled SMS messages collected for spam research.

Approach

Text Preprocessing: Lowercasing, punctuation removal, and stemming.
Feature Extraction: TfidfVectorizer converts text into weighted numerical features. TF-IDF down-weights words that appear in most messages and up-weights terms concentrated in a few messages, which tend to be the most discriminative.
Model: LogisticRegression wrapped in a Pipeline for clean, reproducible inference.
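
The preprocessing step listed above can be sketched with the standard library alone. This is a minimal illustration, not the project's actual code: the write-up does not say which stemmer was used, so `crude_stem` is a hypothetical suffix stripper standing in for a real stemmer such as Porter, and the function names are assumptions.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

# Illustrative suffix list only; a real stemmer applies far more careful rules.
_SUFFIXES = ("ing", "ed", "ly", "s")


def crude_stem(word: str) -> str:
    """Strip one common suffix, keeping at least three characters of the stem."""
    for suffix in _SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(message: str) -> str:
    """Lowercase, remove punctuation, then stem each whitespace-separated token."""
    cleaned = message.lower().translate(_PUNCT_TABLE)
    return " ".join(crude_stem(token) for token in cleaned.split())


print(preprocess("WINNER!! You have been selected, claim your prizes now."))
# -> winner you have been select claim your prize now
```

A function like this could be applied to each message before vectorization, or passed to TfidfVectorizer through its `preprocessor` argument.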

Model

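from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# X_train: iterable of message strings; y_train: matching spam/ham labels
# (both assumed to come from an earlier train/test split).
pipeline = Pipeline([
    # Keep only the 5,000 most frequent terms in the vocabulary.
    ('tfidf', TfidfVectorizer(max_features=5000)),
    ('clf', LogisticRegression(random_state=42))
])

# Fit the vectorizer and the classifier together on the training split.
pipeline.fit(X_train, y_train)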
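
For context, here is a hedged end-to-end sketch of where `X_train` and `y_train` might come from and how the fitted pipeline is used for inference. The eight toy messages, the split ratio, and the use of `stratify` are assumptions made for illustration; they are not the project's data or settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical toy messages standing in for the real dataset.
messages = [
    "win a free prize now",
    "claim your free cash reward",
    "urgent you have won a prize",
    "free entry call now to win",
    "are we still meeting for lunch",
    "see you at home tonight",
    "can you call me later",
    "thanks for the lift yesterday",
]
labels = ["spam", "spam", "spam", "spam", "ham", "ham", "ham", "ham"]

# Stratify so both classes appear in the training and test splits.
X_train, X_test, y_train, y_test = train_test_split(
    messages, labels, test_size=0.25, random_state=42, stratify=labels
)

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=5000)),
    ("clf", LogisticRegression(random_state=42)),
])
pipeline.fit(X_train, y_train)

# Raw strings go straight into the pipeline; vectorization happens inside it.
predictions = pipeline.predict(X_test)

# Probability of the "spam" class for each held-out message.
spam_column = list(pipeline.named_steps["clf"].classes_).index("spam")
spam_probability = pipeline.predict_proba(X_test)[:, spam_column]

for text, label, prob in zip(X_test, predictions, spam_probability):
    print(f"{label:>4}  p(spam)={prob:.2f}  {text}")
```

Because the vectorizer lives inside the Pipeline, it is fitted only on the training split, which avoids leaking test vocabulary into training.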

Results

Accuracy: 98.2%
Precision (Spam): 99.1%
Recall (Spam): 89.5%
F1-Score: 94.0%
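
As a quick consistency check, F1 is the harmonic mean of precision and recall, so it can be recomputed from the two figures above. The helper name below is hypothetical, and the inputs are the rounded values reported here.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded precision and recall for the spam class, as reported above.
f1 = f1_from(0.991, 0.895)
print(f"F1 = {f1:.3f}")  # -> F1 = 0.941
```

Recomputing from rounded inputs gives roughly 0.941. The small gap from the reported 94.0% is within what rounding of the precision and recall figures can explain.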

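High precision was the primary objective: a false positive (flagging a real message as spam) is far more damaging to user trust than a missed spam message. The cost of that choice shows up in recall, since at 89.5% roughly one spam message in ten still reaches the inbox.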
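
One common way to favor precision, which the write-up does not claim to have used, is to raise the probability threshold at which a message is flagged. The sketch below uses made-up probabilities and labels to show how a higher threshold trades recall for precision. The function name and data are hypothetical.

```python
def precision_recall_at(threshold, spam_probs, is_spam):
    """Precision and recall for the spam class at a given decision threshold."""
    flagged = [p >= threshold for p in spam_probs]
    tp = sum(f and s for f, s in zip(flagged, is_spam))
    fp = sum(f and not s for f, s in zip(flagged, is_spam))
    fn = sum((not f) and s for f, s in zip(flagged, is_spam))
    # By convention here, precision is 1.0 when nothing is flagged.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up predicted spam probabilities and true labels, for illustration only.
spam_probs = [0.97, 0.91, 0.72, 0.64, 0.55, 0.30, 0.12, 0.05]
is_spam = [True, True, True, False, True, False, False, False]

for threshold in (0.5, 0.7):
    precision, recall = precision_recall_at(threshold, spam_probs, is_spam)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
# -> threshold=0.5: precision=0.80, recall=1.00
# -> threshold=0.7: precision=1.00, recall=0.75
```

With a fitted scikit-learn pipeline, the probabilities would come from `predict_proba` rather than a hard-coded list.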

Next Steps

Upgrade to a transformer-based model (e.g., DistilBERT) for multilingual and context-aware classification.
Wrap the pipeline in a FastAPI microservice for real-time inference.