Understanding Logistic Regression for SMS Spam Detection
Why Logistic Regression Still Wins for Spam Detection
Despite the hype around deep learning, Logistic Regression remains one of the most effective algorithms for binary text classification. It is fast to train, easy to interpret, and performs exceptionally well when paired with strong feature engineering.
The Problem
The SMS Spam Collection dataset contains thousands of messages labeled as either spam or ham. The goal: build a classifier with high precision - meaning very few legitimate messages get incorrectly flagged as spam.
The Approach
- Text Preprocessing: Lowercasing, punctuation removal, and stemming to normalize the vocabulary.
- Feature Extraction: TfidfVectorizer converts raw text into a weighted numerical matrix. TF-IDF penalizes common words and rewards rare, discriminative terms - exactly what separates spam from legitimate messages.
- Model Training: A LogisticRegression classifier wrapped in a Pipeline for clean, reproducible inference.
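The preprocessing step can be sketched in plain Python. This is a minimal illustration: the naive suffix-stripping loop stands in for a real stemmer (e.g. NLTK's PorterStemmer), and the `preprocess` function name is ours, not part of any library.

```python
import string

def preprocess(text: str) -> str:
    """Normalize an SMS message: lowercase, drop punctuation, crude stemming."""
    text = text.lower()
    # Remove all punctuation characters
    text = text.translate(str.maketrans('', '', string.punctuation))
    tokens = []
    for tok in text.split():
        # Naive suffix stripping in place of a full stemmer (illustration only)
        for suffix in ('ing', 'ed', 's'):
            if len(tok) > 4 and tok.endswith(suffix):
                tok = tok[: -len(suffix)]
                break
        tokens.append(tok)
    return ' '.join(tokens)

print(preprocess("WINNER!! Claim your FREE prizes by texting WIN"))
# -> winner claim your free prize by text win
```

The point is that "prizes", "prize", and "Prize!" all collapse to one vocabulary entry, shrinking the feature space the vectorizer has to learn over.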
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
# Chain vectorizer and classifier so training and inference apply the same transform
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=5000)),   # keep the 5,000 highest-scoring terms
    ('clf', LogisticRegression(random_state=42)),
])

pipeline.fit(X_train, y_train)
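Once fitted, the same pipeline object handles inference end to end: raw text goes in, a label comes out. A self-contained sketch, using a tiny hypothetical corpus in place of the real SMS Spam Collection training split:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Tiny illustrative corpus (hypothetical; the real dataset has thousands of messages)
X_train = [
    "WINNER! You have won a free prize, claim now",
    "URGENT! Your mobile number won a cash award",
    "Free entry: text WIN to claim your reward now",
    "Hey, are we still on for lunch tomorrow?",
    "Can you pick up milk on the way home?",
    "Running late, see you at the meeting",
]
y_train = ["spam", "spam", "spam", "ham", "ham", "ham"]

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LogisticRegression(random_state=42)),
])
pipeline.fit(X_train, y_train)

# predict() returns the label; predict_proba() exposes the class probabilities,
# useful if you want to tune the spam threshold for higher precision
msg = ["Claim your free prize now"]
print(pipeline.predict(msg)[0])        # expected: spam
print(pipeline.predict_proba(msg)[0])  # [P(ham), P(spam)]
```

Because the vectorizer lives inside the pipeline, there is no way to accidentally vectorize test messages with different settings than the training data saw.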
Results
The model achieved 98% accuracy on the test set. More importantly, precision for the spam class was exceptionally high - meaning very few legitimate messages were accidentally filtered out, which is the critical requirement for any production spam system.
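Precision for the spam class is what guards against filtering legitimate mail: it is the fraction of messages flagged as spam that really were spam. A minimal sketch with hypothetical labels (`y_true` and `y_pred` here are made up for illustration, not the actual test-set results):

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical ground truth and predictions, for illustration only
y_true = ["spam", "spam", "ham", "ham", "ham", "spam"]
y_pred = ["spam", "ham",  "ham", "ham", "ham", "spam"]

# precision = TP / (TP + FP): 2 true positives, 0 false positives -> 1.0
print(precision_score(y_true, y_pred, pos_label="spam"))
# recall = TP / (TP + FN): one spam slipped through -> 2/3
print(recall_score(y_true, y_pred, pos_label="spam"))
```

Note the trade-off this exposes: a filter can reach perfect precision while letting some spam through (lower recall), which is usually the right bias for a production spam system.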
Takeaway
The best model is not always the most complex one. Logistic Regression with well-engineered TF-IDF features delivers a robust, interpretable, and production-ready baseline that is hard to beat on clean text classification tasks.
