Movie Genre Classification

In the digital entertainment industry, automated genre classification plays a pivotal role in organizing content libraries, enhancing user experience through recommendation systems, and enabling better content discovery. Traditional genre classification using unimodal approaches, particularly based on text, has had success, especially when reliable metadata is available. In this project, we focused on a text-based multi-label classification system to predict genres of movies based on their textual metadata, particularly the movie overview.Using this approach, the model attempts to assign one or more genres (such as Action, Comedy, Drama, etc.) to each movie. Unlike unimodal fusion systems, this baseline text-only model sets the foundation for future comparisons with multimodal models (e.g., text + image fusion).

Project Goals: The primary objectives of this text-based genre classification project are:

To collect and clean a movie metadata dataset using the TMDb API.
To extract poster images alongside text metadata for extended use.
To perform exploratory data analysis on the overview and genre fields.
To implement NLP preprocessing pipelines for movie overview text.
To train and evaluate a deep learning model (LSTM-based) on the genre prediction task.
To establish baseline performance metrics to be compared with more advanced multimodal models.

Data Collection and Preprocessing

Our dataset was curated from the TMDb (The Movie Database) API using Python scripts. Initially, we collected core movie metadata such as:

Movie Title
Overview (plot summary)
Genres
Poster Path

After dropping empty rows or values, we got:

However, during early training stages, we discovered that overviews alone were not descriptive enough for genre prediction, especially for movies with vague or generic plots. This led us to extract an additional field: Taglines, which are short promotional phrases (e.g., “The world will be watching”).

→

We augmented our dataset by:

Re-querying TMDb API to fetch taglines for all movies

Removing entries with missing taglines
Dropping movies with null or empty overviews, genres, or poster paths

Standardizing and combining overview + tagline for improved input quality

After filtering, the final dataset size was reduced to approximately 21,554 high-quality samples with complete overview, tagline, genre labels, and poster paths.

Genre Label Refinement

The original genre list contained 20+ labels, including many with extremely low frequency (e.g., Western, War, History, TV Movie). To improve model learning and generalization, we:

Analyzed genre frequency distribution

Merged low-frequency or overlapping genres into a new class: "Others"

→

Finalized a cleaned set of 12 genre labels suitable for classification

Data Cleaning Process

Duplicates Removed: Ensured that duplicate movie IDs or titles were dropped
Null Handling: Removed rows with missing genres, overview, tagline, or posters
Poster Verification: Checked if the poster was actually downloaded and readable using PIL

Exploratory Data Analysis (EDA)

Genre Frequency Plot: Highlighted the overrepresentation of Drama, Comedy, and Action

→

Genre co-occurence heatmap Visualized genre co-occurence before and after balancing data

→

Multi-Label Distribution: Found the average number of genres per movie to be 2.36

principal component analysis of genre vectors:

Overview Length: Analyzed token counts and filtered extreme cases

Genre co-occurence network Visualized genre co-occurence network before and after balancing data

→

Handling Class Imbalance

Our genre distribution was highly skewed. To manage this:

Augmentation: Downloaded additional movies for underrepresented genres
Genre Merging: Combined very sparse classes into “Others”
Graph-Based Pruning: Constructed a genre co-occurrence network and removed weakly connected nodes
Label Binarization: Used MultiLabelBinarizer to prepare label matrix (shape: (21554, 18))

Multi-Label Binarization and Final Dataset

In this task, each movie could belong to multiple genres (e.g., "Action", "Comedy", and "Adventure" together). Therefore, we approached this as a multi-label classification problem (as opposed to single-label or multi-class).

To convert genre lists into machine-readable format, we used Scikit-learn’s MultiLabelBinarizer. This created a binary matrix where:

Each column represents a genre label (e.g., “Drama”, “Horror”)
Each row represents a movie
Value = 1 if the movie belongs to that genre, else 0

For example, a movie with genres [“Action”, “Thriller”] would have a vector like:

[0, 1, 0, 1, 0, ..., 0]

After this transformation:

Final label matrix shape: (21554, 18)
Number of genres encoded: 18 (before final genre merging and consolidation into 12 classes)
Final dataset used: tmdb_cleaned_final.csv

This binarized label format enabled training with BCEWithLogitsLoss and allowed per-genre metrics like precision, recall, and F1 score.

Additionally, this structure made it easier to apply:

Stratified batching to balance genre occurrence
Threshold tuning for genre-wise decision boundaries
Multi-label evaluation metrics such as Hamming loss, subset accuracy, and Jaccard index

Final Dataset Overview

Number of Samples: 21,554 movies
Final Genres: 12, including merged “Others”
Metadata: Clean overview + tagline, valid poster, and genre list
Poster Storage: ~21k resized 224x224 .jpg images in posters/ folder

Text-Based Movie Genre Classification

Objective: Build a robust multi-label classifier that predicts movie genres based solely on the overview text field, using advanced NLP methods.

Dataset Used

Initially, the dataset had approximately 30,000 movies, but not all entries were complete or reliable for text-based prediction. During early experiments with models like GRU and LSTM, we observed poor performance and low F1 scores. These models struggled to learn meaningful patterns due to insufficient semantic context from the overview alone.

To improve the results, we augmented each movie’s overview with its tagline — a short descriptive sentence often associated with the movie. We concatenated the overview and tagline fields to enrich the input. However, many entries were missing taglines, so we dropped movies with null taglines to maintain consistency. This reduced the dataset size to 21,554 clean samples.

Each movie had a corresponding list of genres. These were cleaned and consolidated into 18 classes using genre merging and network analysis in the preprocessing phase. We applied MultiLabelBinarizer from scikit-learn to transform the genre list into binary vectors.

Final dataset shape: X_text.shape = (21554,), Y.shape = (21554, 18)

Tokenization

We used Hugging Face’s distilbert-base-uncased tokenizer to convert the combined text (overview + tagline) into token sequences. Steps included:

Lowercasing and normalization
Truncating or padding sequences to max_length = 256
Generating input_ids and attention_masks
Using a custom PyTorch Dataset class to wrap tokenized inputs and labels

Model Architecture

The text classifier was built by fine-tuning a pretrained DistilBERT model. The architecture is as follows:

Input: tokenized text passed into DistilBERT
CLS embedding extracted from final transformer layer
CLS → Linear(768 → 128) → Dropout(0.3) → Linear(128 → 18) → Sigmoid
Output: 18 genre probabilities (one for each label)

Loss Function and Optimization

BCEWithLogitsLoss was used for multi-label classification
Positive weights were calculated per genre to reduce bias toward dominant genres
Optimizer: AdamW with learning rate = 2e-5
Scheduler: CosineAnnealingLR to decay learning rate across epochs

Training and Refinement

We trained the model for 25 epochs using a batch size = 64. During each epoch, we logged training and validation loss and macro F1-score. Model checkpoints were saved based on minimum validation loss.

After initial training, we reloaded the best checkpoint and continued training for 5 more epochs to refine the model. These refinement epochs used a reduced learning rate and helped improve generalization. Prediction comparison

Threshold Optimization

Since this was a multi-label problem, the output layer generated probabilities. We experimented with multiple thresholding strategies: output image before Cleaning

Global Threshold: 0.5 applied to all labels initially
Top-K Strategy: Always selecting top-3 genres with probabilities above 0.3

Evaluation Metrics

Macro F1 Score (Validation): ~0.46
Macro F1 Score (Test): ~0.40
Hamming Accuracy: 0.7136
Jaccard Score: 0.1562
Exact Match Accuracy: 0.0213

Interpretability & Verification

We conducted multiple checks to ensure the model was learning meaningful patterns:

ROC-AUC curves for each genre on validation and test sets

Genre-wise Confusion Matrices

Predicted Positives vs True Positives (PP vs TP) monitoring

Visual inspection of predicted genres for random movie samples

F1 score per genre on test set:

Genre-wise F1 score comparision:

Genre Co-occurence network

Key Takeaways

Using just the overview wasn’t enough — adding taglines significantly improved prediction quality
GRU and LSTM models lacked contextual depth and failed to capture genre-specific semantics
DistilBERT offered much richer text representations with fewer parameters
refinement training improved macro F1 over baseline
Still, there is room to grow — especially in handling low-resource genres

Fusion-Based Movie Genre Classification

Objective: To improve genre prediction performance by integrating both text (overview + tagline) and image (poster) data into a single multimodal model.

Why Fusion?

While text-only models like BiLSTM and DistilBERT were effective in identifying genres from overviews, they often struggled with vague descriptions or when critical genre-specific terms were missing. Similarly, image-only models lacked semantic context. By combining both, we aimed to capture complementary signals—textual narrative and visual cues—thereby improving classification.

Fusion Methods: Concatenation-Based Feature Fusion

Step 1: Preparation for Input

The input image and text undergo separate preprocessing pipelines. Images are passed through a pretrained ResNet-18 model to extract high-level visual features. Simultaneously, the movie overview text is tokenized using a BERT tokenizer to produce input_ids and attention_mask required for the transformer model.

Step 2: Feature Extraction

Visual features are extracted by forwarding the image through ResNet-18, where the final classification layer is replaced with a Linear layer to obtain a 256-dimensional feature vector.

Text features are extracted by passing the tokenized text through the BERT model. The embedding corresponding to the [CLS] token (also called pooler_output) is taken, which is a 768-dimensional representation. This is then passed through another Linear layer to reduce it to a 256-dimensional text feature vector.

Step 3: Feature Fusion

The 256-dimensional image feature and 256-dimensional text feature are concatenated along the feature dimension. This results in a 512-dimensional combined feature vector that holds both visual and semantic information from the poster and overview respectively.

Step 4: Joint Classification

The fused 512-dimensional feature vector is then passed through the classification head for multi-label prediction. The classification head includes:

A Linear(512 → 128) layer
ReLU activation
Dropout to reduce overfitting
Final Linear(128 → 18) output layer followed by a Sigmoid activation

The sigmoid layer allows independent probabilities for each genre, making it suitable for multi-label classification where genres are not mutually exclusive.

Dataset Preparation (from `fusion_data_cleaning.ipynb`)

Started with the cleaned CSV containing overview, tagline, genres, and poster_path.
Dropped rows missing either taglines or posters.
Performed exploratory validation to ensure each row had valid and accessible data for both modalities.
Stratified multi-label splitting was used to preserve genre distribution across train, val, test sets.
Final dataset shape: 21,554 rows, 18 binarized genres.

Preprocessing Steps

Text: Overview and tagline concatenated → tokenized using DistilBERT tokenizer with max_length=128 → padded & truncated → returned as input_ids and attention_mask.
Image: Posters resized to 224x224 → converted to tensor → normalized using ImageNet stats. Augmentations included:
- ColorJitter, RandomAffine, and HorizontalFlip during training only
Custom PyTorch Dataset: Returns a tuple of (tokenized text, image tensor, genre labels).

Model Architecture (from `fusion_code.ipynb`)

Text Path: DistilBERT → Extract [CLS] embedding → Linear(768→256)
Image Path: ResNet18 (pretrained) → AdaptiveAvgPool2d → Flatten → Linear(512→256)
Fusion: Concatenated vector [256+256] → Linear(512→128) → Dropout(0.3) → Linear(128→18) → Sigmoid
Output: Probabilities across all 18 genres (multi-label)

Training Configuration

The model performed movie genre classification by uniting ResNet18 for image feature extraction with BERT for text feature extraction. Both modalities share a classifier to which their respective features have been joined through vector concatenation with dimension 256. During training the model implemented Binary Cross Entropy loss combined with Adam optimizer for 11 epochs. After training the model showed the improvement very progressively when coming to its accuracy , precision and recall metrices through its training epochs which will be overall an accuracy measure around 76.7% adn with the precision of 92% and recall score of 87% and f1 score around 92.47% . which will reflect good multi label classification abilities.

Loss: BCEWithLogitsLoss (without class weights)
Optimizer: Adam (learning rate = 1e-4)
Epochs: 11 total
Batch Size: 32
Validation macro F1 monitored manually to determine best epoch

Results

Validation F1 Score: ~0.92
Test F1 Score: ~0.90
Precision: ~92%, Recall: ~87%
Test Accuracy: ~61%
Average genres per movie: 2.36
Strategy used: Top-3 predicted genres with probability ≥ 0.3

Confusion Matrix & Genre-wise Behavior

Genres such as Comedy, Drama, and Action had high true positive rates. In contrast, classes like TV Movie, Western, and War had lower recall. Fusion helped balance performance across frequent and rare genres.

Evaluation Strategy

The evaluation model achieved reasonable generalization results when tested on the test set. The model demonstrates a moderate average error rate against unseen data because its test loss reaches 0.2338. The model produced 61.00% accuracy which indicates it managed to predict genres effectively for more than fifty percent of testing examples. The model exhibits better precision than recall when handling multi-label classification problems due to its 67% precision and 63% recall scores on the test set. This model shows adequate predictive power nonetheless it needs enhancement particularly targeting improved recall for better genre identification across different test samples.

Macro F1, Hamming accuracy, and Jaccard used for multi-label evaluation and results are:
Subset accuracy used for strict correctness analysis
Threshold tuning and confusion matrix plotting helped refine final model

Inference Script (`fusion_inference__script.ipynb`)

The fusion inference script enables real-time prediction of movie genres using both a poster image and its textual overview. It encapsulates the exact data processing pipeline used during training to ensure consistency and reproducibility.

Core Inference Pipeline

Model Loading:
Loads the best-performing trained model final_model.pt and sets it to evaluation mode using model.eval(). It also loads the MultiLabelBinarizer instance to reverse-map genre predictions.
Text Tokenization:
The movie overview is tokenized using the same distilbert-base-uncased tokenizer from HuggingFace Transformers.
- Max length = 128
- Padding and truncation are applied
- Converted to PyTorch tensors for GPU/CPU inference
Image Preprocessing:
The poster is opened using PIL and transformed using:
- Resize(224, 224)
- ToTensor()
- Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
These are the same transformations applied during training.
Forward Pass:
The script combines text and image embeddings, passes them through the model, and applies a sigmoid activation to get probabilities across 18 genres.
Decoding:
Final genre indices are mapped to human-readable labels using the saved MultiLabelBinarizer.classes_ list.

Sample Usage

python fusion_inference.py \
          --image_path posters/sample.jpg \
          --overview "A young boy discovers he has magical powers and joins a hidden wizarding school."

Output

Returns a list of top 3 predicted genres such as: ['Fantasy', 'Adventure', 'Family']

Genres are printed to console and optionally saved with their probabilities

Design Decisions & Robustness

Handles missing or corrupt poster files with error messages
Validates text length and padding before tokenization
Includes batch dimension reshaping to support both single and batch mode
GPU acceleration used if available, else defaults to CPU

Extensibility Tips

Can be extended into a Flask/FastAPI web server with UI uploads
Post-processing can be added to filter low-confidence predictions
Could return genre-specific confidence bars or tag visual markers on UI

Key Learnings

Fusing images and overview+tagline helped the model infer genre better for edge cases.
Image augmentations improved generalization for posters with minimal visual clues.
Tagline addition gave rich textual signals when overviews were generic.
Fusion was critical for difficult-to-identify genres like Fantasy and Sci-Fi.

Conclusion

The multimodal fusion model demonstrated significant improvement over text-only models, showing how visual and textual information complement each other in genre classification. It closely mimics how human viewers intuitively predict genres using both narrative and visuals.

Comparison & Future Work

Model Comparison

Text-Only Model (DistilBERT):

Validation Macro F1: ~0.46
Test Macro F1: ~0.40
Precision and recall moderate across genres
Used only movie overviews (text)

Fusion Model (DistilBERT + ResNet18):

Validation Macro F1: ~0.92
Test Macro F1: ~0.90
High precision (~92%) and recall (~87%)
Used both movie overviews and posters (text + image)

Conclusion

The text-only model helped us understand how well genres can be predicted from descriptions alone. However, some genres were hard to identify using text. So, we fused image and text features to improve performance. The fusion model performed much better overall, capturing both visual and semantic signals for genre classification.

Future Work

Use better image models like Vision Transformers
Try advanced fusion strategies like attention-based merging
Fine-tune the BERT model for improved text understanding
Build a real-time web app for automatic genre prediction
Explore sub-genre or multi-level genre prediction

Reproducibility & Instructions

GitHub Repository: https://github.com/adullagayathri/multimodal-movie-genre-prediction.git

Setup Instructions

Clone the repository:
git clone https://github.com/adullagayathri/multimodal-movie-genre-prediction.git
Navigate to the project folder:
cd multimodal-movie-genre-prediction
Create a virtual environment (recommended):
python -m venv venv && source venv/bin/activate
Install dependencies:
Download or generate the dataset (instructions in data/README.md) and download tmdb_cleaned_final.csv and posters folder
Train text model: run data_text_model.ipynb
Train fusion model:run fusion_data_cleaning.ipynb then continued by fusion_code.ipynb followed by inference_fusion.ipynb
Evaluate the model and check best models of both in your same folder and verify results
To view website locally (optional):
cd website
Open index.html in browser

References

Devlin et al. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. [Link]
Sanh et al. (2019). DistilBERT: smaller, faster, cheaper and lighter. [Link]
He et al. (2016). Deep residual learning for image recognition. [Link]
Baltrušaitis et al. (2018). Multimodal Machine Learning: A Survey. [Link]
Zhang & Zhou (2014). A review on multi-label learning algorithms. [Link]

Team Contributions

Gayathri Adulla: Led dataset preparation, exploratory analysis, and full development of the text-based model (LSTM, GRU, DistilBERT).
Uma Maheshwar Reddy: Led the fusion model architecture, training, and evaluation combining text and image embeddings.
Both: Collaborated on designing, building, and deploying the project website with all results and explanations integrated.

Data Collection and Preprocessing

Genre Label Refinement

Data Cleaning Process

Exploratory Data Analysis (EDA)

Handling Class Imbalance

Multi-Label Binarization and Final Dataset

Final Dataset Overview

Text-Based Movie Genre Classification

Dataset Used

Tokenization

Model Architecture

Loss Function and Optimization

Training and Refinement

Threshold Optimization

Evaluation Metrics

Interpretability & Verification

Key Takeaways

Fusion-Based Movie Genre Classification

Why Fusion?

Fusion Methods: Concatenation-Based Feature Fusion

Step 1: Preparation for Input

Step 2: Feature Extraction

Step 3: Feature Fusion

Step 4: Joint Classification

Dataset Preparation (from fusion_data_cleaning.ipynb)

Preprocessing Steps

Model Architecture (from fusion_code.ipynb)

Training Configuration

Results

Confusion Matrix & Genre-wise Behavior

Evaluation Strategy

Inference Script (fusion_inference__script.ipynb)

Core Inference Pipeline

Sample Usage

Output

Design Decisions & Robustness

Extensibility Tips

Key Learnings

Conclusion

Comparison & Future Work

Model Comparison

Conclusion

Future Work

Reproducibility & Instructions

Setup Instructions

References

Team Contributions

Dataset Preparation (from `fusion_data_cleaning.ipynb`)

Model Architecture (from `fusion_code.ipynb`)

Inference Script (`fusion_inference__script.ipynb`)