Movie Genre Classification

A multi-label NLP classifier using TF-IDF and supervised learning to predict movie genres from plot synopses.

Python, TF-IDF, NLP, Multi-label Classification

Problem

Classifying movies into genres from textual synopses is a multi-label NLP problem: a single movie can belong to several genres simultaneously (e.g., Action, Sci-Fi, Thriller). The challenge is building a model that predicts every applicable genre for a synopsis rather than forcing a single label.

Approach

Text Preprocessing: Tokenization, stop-word removal, and lemmatization of plot synopses.
Vectorization: TF-IDF converts cleaned text into a weighted feature matrix, scoring terms by their relevance to each document relative to the full corpus.
Multi-Label Classification: MultiOutputClassifier wraps a Support Vector Machine, fitting one binary classifier per genre so that each synopsis can receive multiple genre labels.
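
The three steps above can be sketched with scikit-learn as follows. This is a minimal illustration, not the project's actual code. The toy synopses and genre lists are invented, LinearSVC is an assumed choice (the write-up does not say which SVM variant or kernel was used), and lemmatization is left out because scikit-learn does not provide it (it would need a library such as NLTK or spaCy).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Toy data, invented purely for illustration.
synopses = [
    "A soldier fights alien invaders on a distant planet",
    "Two strangers fall in love during a summer in Paris",
    "A detective hunts a killer through the city at night",
    "A space crew battles a deadly alien aboard their ship",
    "A couple in love is torn apart by a killer on the loose",
    "An android detective investigates a murder on a space station",
]
genres = [
    ["Action", "Sci-Fi"],
    ["Romance"],
    ["Thriller"],
    ["Action", "Sci-Fi"],
    ["Romance", "Thriller"],
    ["Sci-Fi", "Thriller"],
]

# Turn the genre lists into a binary indicator matrix (one column per genre).
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(genres)

# TfidfVectorizer handles tokenization, lowercasing, and stop-word removal;
# MultiOutputClassifier fits one LinearSVC per genre column.
model = make_pipeline(
    TfidfVectorizer(lowercase=True, stop_words="english"),
    MultiOutputClassifier(LinearSVC()),
)
model.fit(synopses, Y)

# Predict an indicator row for a new synopsis and map it back to genre names.
pred = model.predict(["An alien fleet attacks a space colony"])
predicted_genres = mlb.inverse_transform(pred)
print(list(mlb.classes_))
print(predicted_genres)
```

With only six training examples the predicted genres are not meaningful; the point is the shape of the pipeline, in which raw text goes in and one 0/1 decision per genre comes out.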

Evaluation Metrics

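Standard accuracy is insufficient for multi-label tasks: it counts a prediction as correct only when every genre matches exactly, so a nearly correct prediction scores the same as a completely wrong one. The model was evaluated using: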

Hamming Loss: Fraction of individual genre labels predicted incorrectly, counted across all movies and all genres. Lower is better.
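Micro/Macro F1-Score: Micro-averaging pools every label decision, so frequent genres dominate the score; macro-averaging takes the unweighted mean of per-genre F1, so rare genres count as much as common ones.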
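
To make these definitions concrete, the sketch below computes both metrics by hand on a small, made-up indicator matrix (columns standing for Action, Sci-Fi, Thriller). The data and function names are illustrative assumptions, not results from the project. In practice, scikit-learn's `hamming_loss` and `f1_score` (with `average="micro"` or `average="macro"`) cover the same calculations.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of individual label slots that were predicted incorrectly."""
    total = sum(len(row) for row in y_true)
    wrong = sum(
        t != p
        for true_row, pred_row in zip(y_true, y_pred)
        for t, p in zip(true_row, pred_row)
    )
    return wrong / total


def label_counts(y_true, y_pred, j):
    """True positives, false positives, and false negatives for label column j."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
    return tp, fp, fn


def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN), defined as 0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_f1(y_true, y_pred):
    """Pool the counts over all labels, then compute a single F1."""
    n_labels = len(y_true[0])
    counts = [label_counts(y_true, y_pred, j) for j in range(n_labels)]
    tp, fp, fn = (sum(c[i] for c in counts) for i in range(3))
    return f1_from_counts(tp, fp, fn)


def macro_f1(y_true, y_pred):
    """Compute F1 per label, then take the unweighted mean."""
    n_labels = len(y_true[0])
    scores = [f1_from_counts(*label_counts(y_true, y_pred, j)) for j in range(n_labels)]
    return sum(scores) / n_labels


# Hypothetical example: Thriller (last column) is rare and is missed once.
y_true = [[1, 0, 0], [1, 1, 0], [1, 0, 1], [1, 1, 0]]
y_pred = [[1, 0, 0], [1, 1, 0], [1, 0, 0], [1, 1, 0]]

print(round(hamming_loss(y_true, y_pred), 3))  # 0.083
print(round(micro_f1(y_true, y_pred), 3))      # 0.923
print(round(macro_f1(y_true, y_pred), 3))      # 0.667
```

Only one of twelve label slots is wrong, so Hamming loss and micro F1 look strong, while macro F1 drops to about 0.667 because the rare Thriller label has an F1 of 0. That gap is why reporting both averages is useful when genre frequencies are imbalanced.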

Next Steps

Replace TF-IDF with pre-trained word embeddings (Word2Vec, GloVe) for richer semantic representation.
Fine-tune a BERT-based model for deeper contextual understanding of plot descriptions.