Multimodal Movie Genre Classification

In the digital entertainment industry, automated genre classification plays a pivotal role in organizing content libraries, powering recommendation systems, and enabling better content discovery. Traditional unimodal approaches, particularly text-based ones, have had success, especially when reliable metadata is available. In this project, we built a text-based multi-label classification system that predicts movie genres from textual metadata, chiefly the movie overview. Using this approach, the model assigns one or more genres (such as Action, Comedy, or Drama) to each movie. This baseline text-only model sets the foundation for comparison with multimodal fusion models (e.g., text + image fusion).

Project Goals: The primary objectives of this text-based genre classification project are:

  1. To collect and clean a movie metadata dataset using the TMDb API.
  2. To extract poster images alongside text metadata for extended use.
  3. To perform exploratory data analysis on the overview and genre fields.
  4. To implement NLP preprocessing pipelines for movie overview text.
  5. To train and evaluate a deep learning model (LSTM-based) on the genre prediction task.
  6. To establish baseline performance metrics to be compared with more advanced multimodal models.

Data Collection and Preprocessing

Our dataset was curated from the TMDb (The Movie Database) API using Python scripts. Initially, we collected core movie metadata such as:

Dataset sample before cleaning

After dropping rows with empty values, we got:

Dataset sample after cleaning

However, during early training stages, we discovered that overviews alone were not descriptive enough for genre prediction, especially for movies with vague or generic plots. This led us to extract an additional field: Taglines, which are short promotional phrases (e.g., “The world will be watching”).

Dataset before and after cleaning

We augmented our dataset by:

After filtering, the final dataset was reduced to 21,554 high-quality samples with complete overview, tagline, genre labels, and poster paths.

Final cleaned dataset sample

Genre Label Refinement

The original genre list contained 20+ labels, including many with extremely low frequency (e.g., Western, War, History, TV Movie). To improve model learning and generalization, we:

Data Cleaning Process

Exploratory Data Analysis (EDA)

Handling Class Imbalance

Our genre distribution was highly skewed. To manage this:
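The specific mitigation steps are not listed here; one common option for multi-label training (not necessarily the one used in this project) is to up-weight rare genres in the loss via PyTorch's `pos_weight` mechanism. A minimal sketch with an illustrative label matrix:

```python
import torch

# Toy binary label matrix: rows = movies, columns = genres (illustrative data)
Y = torch.tensor([
    [1., 0., 1.],
    [1., 0., 0.],
    [1., 1., 0.],
    [0., 0., 1.],
])

pos_counts = Y.sum(dim=0)                          # positives per genre
neg_counts = Y.shape[0] - pos_counts               # negatives per genre
pos_weight = neg_counts / pos_counts.clamp(min=1)  # up-weights rare genres

# BCEWithLogitsLoss applies one weight per genre column
loss_fn = torch.nn.BCEWithLogitsLoss(pos_weight=pos_weight)
```

Rare genres (few positives) receive a weight above 1, so missing them costs the model more during training.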

Multi-Label Binarization and Final Dataset

In this task, each movie could belong to multiple genres (e.g., "Action", "Comedy", and "Adventure" together). Therefore, we approached this as a multi-label classification problem (as opposed to single-label or multi-class).

To convert genre lists into machine-readable format, we used Scikit-learn’s MultiLabelBinarizer. This created a binary matrix where:

For example, a movie with genres [“Action”, “Thriller”] would have a vector like:

[0, 1, 0, 1, 0, ..., 0]
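The binarization step can be sketched with scikit-learn's `MultiLabelBinarizer` (the three movies and their genres below are illustrative; classes come out in alphabetical order):

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Example genre lists for three movies (illustrative labels only)
genres = [
    ["Action", "Thriller"],
    ["Comedy"],
    ["Action", "Comedy", "Adventure"],
]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(genres)

print(mlb.classes_)  # ['Action' 'Adventure' 'Comedy' 'Thriller']
print(Y)
# [[1 0 0 1]
#  [0 0 1 0]
#  [1 1 1 0]]
```

Each column corresponds to one genre, and a 1 marks membership, which is exactly the matrix shape needed for per-genre losses and metrics.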

After this transformation:

This binarized label format enabled training with BCEWithLogitsLoss and allowed per-genre metrics like precision, recall, and F1 score.

Additionally, this structure made it easier to apply:

Final Dataset Overview

Text-Based Movie Genre Classification

Objective: Build a robust multi-label classifier that predicts movie genres based solely on the overview text field, using advanced NLP methods.

Dataset Used

Initially, the dataset had approximately 30,000 movies, but not all entries were complete or reliable for text-based prediction. During early experiments with models like GRU and LSTM, we observed poor performance and low F1 scores. These models struggled to learn meaningful patterns due to insufficient semantic context from the overview alone.

To improve the results, we augmented each movie’s overview with its tagline — a short descriptive sentence often associated with the movie. We concatenated the overview and tagline fields to enrich the input. However, many entries were missing taglines, so we dropped movies with null taglines to maintain consistency. This reduced the dataset size to 21,554 clean samples.

Each movie had a corresponding list of genres. These were cleaned and consolidated into 18 classes using genre merging and network analysis in the preprocessing phase. We applied MultiLabelBinarizer from scikit-learn to transform the genre list into binary vectors.

Final dataset shape: X_text.shape = (21554,), Y.shape = (21554, 18)

Tokenization

We used Hugging Face’s distilbert-base-uncased tokenizer to convert the combined text (overview + tagline) into token sequences. Steps included:
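The tokenization can be sketched as follows; the max length of 128 matches the inference section, while the exact padding/truncation settings are assumptions:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Combined input: overview concatenated with tagline (illustrative text)
overview = "A young boy discovers he has magical powers."
tagline = "The world will be watching."
text = overview + " " + tagline

encoded = tokenizer(
    text,
    max_length=128,        # matches the inference setting
    padding="max_length",  # pad every sequence to 128 tokens
    truncation=True,
    return_tensors="pt",
)
print(encoded["input_ids"].shape)       # torch.Size([1, 128])
print(encoded["attention_mask"].shape)  # torch.Size([1, 128])
```

The resulting `input_ids` and `attention_mask` tensors are what DistilBERT consumes during training and inference.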

Model Architecture

The text classifier was built by fine-tuning a pretrained DistilBERT model. The architecture is as follows:

Text Model Architecture

Loss Function and Optimization
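Per the binarized-label section, training used `BCEWithLogitsLoss`; the optimizer and learning rate shown below are assumptions for illustration (AdamW is a common choice for fine-tuning transformers):

```python
import torch

num_genres = 18
logits = torch.randn(4, num_genres)                      # raw model outputs, batch of 4
targets = torch.randint(0, 2, (4, num_genres)).float()   # binarized genre labels

# BCEWithLogitsLoss fuses sigmoid + binary cross-entropy, one term per genre
loss_fn = torch.nn.BCEWithLogitsLoss()
loss = loss_fn(logits, targets)

# Optimizer and learning rate are illustrative, not stated in the source
head = torch.nn.Linear(768, num_genres)
optimizer = torch.optim.AdamW(head.parameters(), lr=2e-5)
```

Because the loss treats each genre as an independent binary decision, it naturally supports movies belonging to several genres at once.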

Training and Refinement

We trained the model for 25 epochs using a batch size = 64. During each epoch, we logged training and validation loss and macro F1-score. Model checkpoints were saved based on minimum validation loss.

After initial training, we reloaded the best checkpoint and continued training for 5 more epochs to refine the model. These refinement epochs used a reduced learning rate and helped improve generalization.

Prediction comparison
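The checkpoint-on-best-validation-loss logic described above can be sketched as a small helper; the function and data-loader names are illustrative, not taken from the project code:

```python
import torch

def train_with_checkpointing(model, loss_fn, optimizer, train_loader, val_loader,
                             epochs, ckpt_path="best_model.pt"):
    """Save the model whenever validation loss reaches a new minimum (sketch)."""
    best_val = float("inf")
    for epoch in range(epochs):
        model.train()
        for x, y in train_loader:
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            optimizer.step()

        model.eval()
        with torch.no_grad():
            val_loss = sum(loss_fn(model(x), y).item()
                           for x, y in val_loader) / len(val_loader)

        if val_loss < best_val:  # checkpoint on minimum validation loss
            best_val = val_loss
            torch.save(model.state_dict(), ckpt_path)
    return best_val
```

Refinement then amounts to reloading `ckpt_path`, lowering the optimizer's learning rate, and calling the same loop for 5 more epochs.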

Threshold Optimization

Since this was a multi-label problem, the output layer generated per-genre probabilities. We experimented with multiple thresholding strategies:
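The source does not enumerate the strategies tried; one common approach is a per-genre threshold search that maximizes F1 on validation data. A minimal sketch (function name and candidate grid are illustrative):

```python
import numpy as np
from sklearn.metrics import f1_score

def best_thresholds(y_true, y_prob, candidates=np.arange(0.1, 0.9, 0.05)):
    """Pick, per genre, the decision threshold that maximizes F1 (sketch)."""
    n_genres = y_true.shape[1]
    thresholds = np.full(n_genres, 0.5)
    for g in range(n_genres):
        scores = [f1_score(y_true[:, g], (y_prob[:, g] >= t).astype(int),
                           zero_division=0)
                  for t in candidates]
        thresholds[g] = candidates[int(np.argmax(scores))]
    return thresholds
```

At inference, predictions then use each genre's tuned threshold instead of a single global 0.5.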

Evaluation Metrics

Text Model Accuracy

Interpretability & Verification

We conducted multiple checks to ensure the model was learning meaningful patterns:

Key Takeaways

Fusion-Based Movie Genre Classification

Objective: To improve genre prediction performance by integrating both text (overview + tagline) and image (poster) data into a single multimodal model.

Why Fusion?

While text-only models like BiLSTM and DistilBERT were effective in identifying genres from overviews, they often struggled with vague descriptions or when critical genre-specific terms were missing. Similarly, image-only models lacked semantic context. By combining both, we aimed to capture complementary signals—textual narrative and visual cues—thereby improving classification.

Fusion Methods: Concatenation-Based Feature Fusion

Step 1: Preparation for Input

The input image and text undergo separate preprocessing pipelines. Images are passed through a pretrained ResNet-18 model to extract high-level visual features. Simultaneously, the movie overview text is tokenized using a BERT tokenizer to produce input_ids and attention_mask required for the transformer model.

Step 2: Feature Extraction

Visual features are extracted by forwarding the image through ResNet-18, where the final classification layer is replaced with a Linear layer to obtain a 256-dimensional feature vector.

Text features are extracted by passing the tokenized text through the BERT model. The pooler_output (the [CLS] token embedding passed through BERT's pooling layer) is taken, giving a 768-dimensional representation. This is then passed through another Linear layer to reduce it to a 256-dimensional text feature vector.

Step 3: Feature Fusion

The 256-dimensional image feature and 256-dimensional text feature are concatenated along the feature dimension. This results in a 512-dimensional combined feature vector that holds both visual and semantic information from the poster and overview respectively.

Step 4: Joint Classification

The fused 512-dimensional feature vector is then passed through the classification head for multi-label prediction. The classification head includes:

The sigmoid layer allows independent probabilities for each genre, making it suitable for multi-label classification where genres are not mutually exclusive.

Dataset Preparation (from fusion_data_cleaning.ipynb)

Preprocessing Steps

Model Architecture (from fusion_code.ipynb)

Architecture diagram

Training Configuration

The model performs genre classification by uniting ResNet-18 for image feature extraction with BERT for text feature extraction. Each modality yields a 256-dimensional feature vector, and the two are concatenated and fed to a shared classifier. Training used Binary Cross-Entropy loss with the Adam optimizer for 11 epochs. Accuracy, precision, and recall improved steadily across the training epochs, reaching approximately 76.7% accuracy, 92% precision, 87% recall, and an F1 score of around 92.47%, reflecting good multi-label classification ability.

Results

Validation graph

Confusion Matrix & Genre-wise Behavior

Genres such as Comedy, Drama, and Action had high true positive rates. In contrast, classes like TV Movie, Western, and War had lower recall. Fusion helped balance performance across frequent and rare genres.

Evaluation Strategy

The model achieved reasonable generalization on the test set. A test loss of 0.2338 indicates a moderate average error on unseen data, and its 61.00% accuracy means it predicted genres correctly for more than half of the test examples. With 67% precision and 63% recall, the model favors precision over recall in this multi-label setting. It shows adequate predictive power but needs further work, particularly on recall, to identify genres more completely across test samples.

Inference Script (fusion_inference__script.ipynb)

The fusion inference script enables real-time prediction of movie genres using both a poster image and its textual overview. It encapsulates the exact data processing pipeline used during training to ensure consistency and reproducibility.

Core Inference Pipeline

  1. Model Loading:
    Loads the best-performing trained model final_model.pt and sets it to evaluation mode using model.eval(). It also loads the MultiLabelBinarizer instance to reverse-map genre predictions.
  2. Text Tokenization:
    The movie overview is tokenized using the same distilbert-base-uncased tokenizer from HuggingFace Transformers.
    • Max length = 128
    • Padding and truncation are applied
    • Converted to PyTorch tensors for GPU/CPU inference
  3. Image Preprocessing:
    The poster is opened using PIL and transformed using:
    • Resize(224, 224)
    • ToTensor()
    • Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
    These are the same transformations applied during training.
  4. Forward Pass:
    The script combines text and image embeddings, passes them through the model, and applies a sigmoid activation to get probabilities across 18 genres.
  5. Decoding:
    Final genre indices are mapped to human-readable labels using the saved MultiLabelBinarizer.classes_ list.

Sample Usage

python fusion_inference.py \
          --image_path posters/sample.jpg \
          --overview "A young boy discovers he has magical powers and joins a hidden wizarding school."

Output

Sample Inference Output

Design Decisions & Robustness

Extensibility Tips

Key Learnings

Conclusion

The multimodal fusion model demonstrated significant improvement over text-only models, showing how visual and textual information complement each other in genre classification. It closely mimics how human viewers intuitively predict genres using both narrative and visuals.

Comparison & Future Work

Model Comparison

Text-Only Model (DistilBERT):

Fusion Model (DistilBERT + ResNet18):

Conclusion

The text-only model helped us understand how well genres can be predicted from descriptions alone. However, some genres were hard to identify using text. So, we fused image and text features to improve performance. The fusion model performed much better overall, capturing both visual and semantic signals for genre classification.

Future Work

Reproducibility & Instructions

GitHub Repository: https://github.com/adullagayathri/multimodal-movie-genre-prediction.git

Setup Instructions

  1. Clone the repository:
    git clone https://github.com/adullagayathri/multimodal-movie-genre-prediction.git
  2. Navigate to the project folder:
    cd multimodal-movie-genre-prediction
  3. Create a virtual environment (recommended):
    python -m venv venv && source venv/bin/activate
  4. Install dependencies:
  5. Download or generate the dataset (instructions in data/README.md), then download tmdb_cleaned_final.csv and the posters folder
  6. Train the text model: run data_text_model.ipynb
  7. Train the fusion model: run fusion_data_cleaning.ipynb, then fusion_code.ipynb, followed by inference_fusion.ipynb
  8. Evaluate both models: the best checkpoints are saved in the same folder; verify the results
  9. To view website locally (optional):
    cd website
    Open index.html in browser

References

Team Contributions