Movie Genre Classification
A multi-label NLP classifier using TF-IDF and supervised learning to predict movie genres from plot synopses.
Python, TF-IDF, NLP, Multi-label Classification
Problem
Classifying movies into genres from textual synopses is a multi-label NLP problem — a single movie can belong to several genres simultaneously (e.g., Action, Sci-Fi, Thriller). The challenge is building a model that handles this overlap reliably.
Approach
- Text Preprocessing: Tokenization, stop-word removal, and lemmatization of plot synopses.
- Vectorization: TF-IDF converts cleaned text into a weighted feature matrix, scoring terms by their relevance to each document relative to the full corpus.
- Multi-Label Classification: MultiOutputClassifier wrapping a Support Vector Machine to predict multiple genre labels per synopsis.
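The steps above can be sketched end to end with scikit-learn. This is a minimal illustration under stated assumptions, not the project's actual code: the synopses and genre sets are invented toy data, the SVM is assumed to be a linear one (LinearSVC), and tokenization and stop-word removal are delegated to TfidfVectorizer's built-in English settings. Lemmatization is left out here because it needs an extra library such as NLTK or spaCy.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Toy synopses and genre sets, invented purely for illustration.
synopses = [
    "A soldier fights alien invaders aboard a damaged starship.",
    "Two strangers fall in love during a summer in Paris.",
    "A detective hunts a killer through a rain-soaked city.",
    "Space marines battle a deadly robot uprising on Mars.",
    "A shy baker and a musician find love in a small town.",
    "A hacker uncovers a conspiracy and runs from assassins.",
]
genres = [
    {"Action", "Sci-Fi"},
    {"Romance"},
    {"Thriller"},
    {"Action", "Sci-Fi"},
    {"Romance"},
    {"Action", "Thriller"},
]

# Turn each genre set into a 0/1 row with one column per genre.
binarizer = MultiLabelBinarizer()
Y = binarizer.fit_transform(genres)

# TF-IDF features feed one independent binary SVM per genre column.
# LinearSVC is an assumption; the write-up only says "Support Vector Machine".
model = make_pipeline(
    TfidfVectorizer(lowercase=True, stop_words="english"),
    MultiOutputClassifier(LinearSVC()),
)
model.fit(synopses, Y)

# Predict a 0/1 row for a new synopsis and map it back to genre names.
pred = model.predict(["Alien robots attack a starship crew."])
predicted_genres = binarizer.inverse_transform(pred)
print(predicted_genres)
```

MultiOutputClassifier fits a separate classifier for each genre, so a synopsis can receive any combination of labels, including none. With a dataset this small the predictions are not meaningful; the sketch only shows how the pieces connect.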
Evaluation Metrics
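Standard accuracy is insufficient for multi-label tasks: exact-match accuracy counts a prediction as wrong unless every genre is correct, so it gives no credit for partially correct label sets. The model was evaluated using: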
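- Hamming Loss: Fraction of individual genre labels predicted incorrectly, counted over every movie and genre pair; lower is better.
- Micro/Macro F1-Score: Micro-F1 pools all label decisions, so frequent genres dominate; macro-F1 averages the per-genre F1 scores, so rare genres count equally.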
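To make these metrics concrete, here is a small pure-Python sketch that computes them on a made-up label matrix. The genre columns and values are hypothetical, not results from this project. In practice scikit-learn's hamming_loss and f1_score (with average="micro" or average="macro") compute the same quantities.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of individual label slots that were predicted incorrectly."""
    wrong = sum(
        t != p
        for row_t, row_p in zip(y_true, y_pred)
        for t, p in zip(row_t, row_p)
    )
    return wrong / (len(y_true) * len(y_true[0]))


def f1_from_counts(tp, fp, fn):
    """F1 from true positives, false positives, and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true, y_pred):
    """Return (micro-F1, macro-F1) for 0/1 label matrices."""
    counts = []
    for j in range(len(y_true[0])):
        pairs = [(t[j], p[j]) for t, p in zip(y_true, y_pred)]
        tp = sum(t == 1 and p == 1 for t, p in pairs)
        fp = sum(t == 0 and p == 1 for t, p in pairs)
        fn = sum(t == 1 and p == 0 for t, p in pairs)
        counts.append((tp, fp, fn))
    # Micro: pool the counts over all genres, then compute one F1.
    micro = f1_from_counts(*(sum(col) for col in zip(*counts)))
    # Macro: compute F1 per genre, then take the unweighted mean.
    macro = sum(f1_from_counts(*c) for c in counts) / len(counts)
    return micro, macro


# Hypothetical columns: Action, Drama, Horror (Horror is the rare genre).
y_true = [[1, 0, 0], [1, 1, 0], [1, 0, 1], [0, 1, 0]]
y_pred = [[1, 0, 0], [1, 1, 0], [1, 0, 0], [0, 1, 0]]

loss = hamming_loss(y_true, y_pred)          # 1 wrong slot out of 12
micro, macro = micro_macro_f1(y_true, y_pred)
print(round(loss, 3), round(micro, 3), round(macro, 3))  # 0.083 0.909 0.667
```

Missing the single Horror label barely moves Hamming loss or micro-F1, but it drops macro-F1 to two thirds, which is why macro-F1 is the more sensitive signal for rare genres.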
Next Steps
- Replace TF-IDF with pre-trained word embeddings (Word2Vec, GloVe) for richer semantic representation.
- Fine-tune a BERT-based model for deeper contextual understanding of plot descriptions.
