Code by TomMakesThings

About

This code was written by myself, but forms part of a group coursework project. The aim of the project is to create a multi-label classifier that can predict the genres of films from their IMDb descriptions. To find out more, visit https://tommakesthings.github.io/Movie-Genre-Predictor/.

In this notebook, a pipeline is created that, when given training and testing data, will train and test a new classifier. This means that if the dataset is changed, the text processing and label conversion steps, as well as the model architecture, remain the same unless explicitly changed. After running the pipeline, the model state and the text processor from the pipeline are saved to file so that they can be hosted on the group's web application.

Imports

The notebook has been developed using Python 3.8.5 and Anaconda3 with conda 4.10.1. If you would like to recreate the environment, the YAML file environment.yml can be found on GitHub, and the environment can be recreated with the command conda env create -f environment.yml in the Anaconda terminal. For more detail, refer to the conda docs.

IMDb Dataset

Open the dataset, drop irrelevant columns and remove samples with missing information.
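A minimal sketch of this step, assuming the data is a CSV loaded with pandas (the file name and column names are placeholders, not necessarily those used in the project):

    import pandas as pd

    movies = pd.read_csv("IMDb_movies.csv")  # placeholder file name

    # Keep only the columns needed for classification
    movies = movies[["title", "description", "genre"]]

    # Remove samples with missing descriptions or genres
    movies = movies.dropna(subset=["description", "genre"]).reset_index(drop=True)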

Analysing the Dataset

Metrics

Each film has between one and three genres stored in the genre column. However, because they are stored together as a single comma-separated string, some processing is required to separate them before counting the true number of unique genres. For example, "Drama", "Romance" and "Drama, Romance" would otherwise be counted as three different genres, even though only two unique genres are present.
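For instance, the unique genres could be counted by splitting the comma-separated strings first; a minimal sketch, continuing from the columns assumed above:

    # Split each genre string into a list, e.g. "Drama, Romance" -> ["Drama", "Romance"]
    genre_lists = movies["genre"].str.split(", ")

    # Flatten the lists and count the unique genres
    unique_genres = sorted(set(g for genres in genre_lists for g in genres))
    print(f"{len(unique_genres)} unique genres: {unique_genres}")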

Print information about the dataset:

View Label distribution

View the number of films belonging to each genre. Across all samples, the most common genres are Drama (26.7%), Comedy (16.4%) and Romance (8.0%). The least common is News, with only one sample across the whole dataset. Other rare genres include Adult and Documentary, which have two samples each, and Reality-TV, which has three. After this is Film-Noir with 663 samples. The rarest genres are removed later in the notebook as they do not provide enough samples to accurately train a classifier.
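One way to produce such a count, continuing the sketch above (percentages are over all genre labels):

    # Count how many films carry each genre label
    genre_counts = genre_lists.explode().value_counts()

    # Show absolute counts and the percentage share of each genre
    print(genre_counts)
    print((genre_counts / genre_counts.sum() * 100).round(1))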

Sample Selection

Drop Samples

As the dataset is large, samples are dropped so that the notebook can run in a reasonable time. n_samples specifies the maximum number of samples to use. Setting n_samples = 0 will cause all suitable samples to be used, though the notebook will be slow to run.

min_length specifies the minimum number of words in a description; samples with descriptions shorter than this will be removed.

Setting remove_rare_genres = True will remove genres with fewer than rare_count instances from the movies' labels. If a sample does not have a label for any remaining genre, it will be dropped. For example, if rare_count = 10, the genres [News, Adult, Documentary, Reality-TV] are removed.

max_genre_samples is the maximum number of samples allowed for each genre. For example, if max_genre_samples = 100, a maximum of 100 samples are allowed for Comedy, another 100 for Drama, etc. This can be turned off by setting max_genre_samples = 0.
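A simplified sketch of how these parameters might be applied (the values shown are placeholders, the exact filtering in the notebook may differ, and the per-genre cap is omitted for brevity):

    n_samples = 10000         # 0 = use all suitable samples
    min_length = 10           # minimum number of words in a description
    remove_rare_genres = True
    rare_count = 10           # genres with fewer instances than this are removed
    max_genre_samples = 0     # 0 = no per-genre cap

    # Drop samples whose descriptions are too short
    movies = movies[movies["description"].str.split().str.len() >= min_length]

    if remove_rare_genres:
        rare = set(genre_counts[genre_counts < rare_count].index)
        # Strip rare genres from each label string, then drop samples left with no genres
        movies["genre"] = movies["genre"].apply(
            lambda g: ", ".join(x for x in g.split(", ") if x not in rare))
        movies = movies[movies["genre"] != ""]

    if n_samples > 0:
        movies = movies.sample(n=min(n_samples, len(movies)), random_state=0)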

Split into Training and Testing

Split the selected samples into 80% training and 20% testing sets. This has been seeded using random_state=0 so that this split is reproducible.
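For example, using scikit-learn's train_test_split with the selected samples (the movies DataFrame is the one assumed in the sketches above):

    from sklearn.model_selection import train_test_split

    # 80% training, 20% testing, seeded so the split is reproducible
    train_data, test_data = train_test_split(movies, test_size=0.2, random_state=0)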

Classifier Model

Seed PyTorch to make the results semi-reproducible.
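A typical way to do this (the exact calls used in the notebook may differ, and some CUDA operations remain non-deterministic regardless):

    import random
    import numpy as np
    import torch

    seed = 0
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)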

Define a class for the LSTM classifier.
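Purely as a hedged sketch, a multi-label LSTM classifier in PyTorch could look roughly like this (layer sizes, names and structure are illustrative, not necessarily the notebook's actual architecture):

    import torch.nn as nn

    class LSTMClassifier(nn.Module):
        def __init__(self, vocab_size, embedding_dim, hidden_dim, n_genres):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
            self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
            self.fc = nn.Linear(hidden_dim, n_genres)

        def forward(self, token_ids):
            embedded = self.embedding(token_ids)   # (batch, seq_len, embedding_dim)
            _, (hidden, _) = self.lstm(embedded)   # final hidden state: (1, batch, hidden_dim)
            return self.fc(hidden[-1])             # one logit per genre; sigmoid applied outside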

Pipeline

The pipeline is constructed from four custom transformer classes that process the training and testing data and train a new classifier.

Custom Transformer Classes

Description Transformer

Transformer class that will process the film description column of the training and testing data. This includes:

Genre Transformer

Transformer class that will process the film genre column of the training and testing data. Each film has between one and three genres, which will be converted into a multi-hot representation to use as labels for the classifier. The multi-hot encoder will be saved to the file binary_encoder_file so that it can be loaded again to convert the model's predictions back to genres. For example, if the unique genres were Comedy, Drama and Romance:
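As an illustration of the multi-hot representation, a minimal sketch using scikit-learn's MultiLabelBinarizer (which may differ from the encoder actually used; the file path is a placeholder):

    import pickle
    from sklearn.preprocessing import MultiLabelBinarizer

    encoder = MultiLabelBinarizer()
    labels = encoder.fit_transform([["Comedy"], ["Drama", "Romance"], ["Comedy", "Drama"]])
    print(encoder.classes_)  # ['Comedy' 'Drama' 'Romance']
    print(labels)            # [[1 0 0] [0 1 1] [1 1 0]]

    # Save the encoder so predictions can later be converted back to genre names
    binary_encoder_file = "binary_encoder.obj"  # placeholder path
    with open(binary_encoder_file, "wb") as f:
        pickle.dump(encoder, f)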

Post-processor Transformer

Transformer class that will convert the training / testing data to a suitable form for model training or testing. This includes:

Model Transformer

Transformer class that will create, train and test a new model.

For training, this includes:

For testing, this includes:

Model Arguments

Set the model and optimiser parameters:
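For example, the parameters set here could take a form like this (the names and values are placeholders rather than the project's actual settings):

    model_args = {
        "embedding_dim": 100,    # size of the word embeddings
        "hidden_dim": 128,       # LSTM hidden state size
        "epochs": 10,            # number of training epochs
        "batch_size": 64,
        "learning_rate": 1e-3,   # passed to the optimiser, e.g. Adam
        "threshold": 0.5,        # sigmoid cut-off for assigning a genre
    }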

Build a Pipeline

Construct the pipeline.
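A hedged sketch of how the four transformers might be chained with scikit-learn's Pipeline (the step names, constructor arguments and verbose handling are assumptions, not necessarily the notebook's exact interface):

    from sklearn.pipeline import Pipeline

    verbose = 1  # how much of the processing to display (see below)

    pipeline = Pipeline(steps=[
        ("description", DescriptionTransformer(verbose=verbose)),       # process film descriptions
        ("genre", GenreTransformer(verbose=verbose)),                    # convert genres to multi-hot labels
        ("post_processor", PostProcessorTransformer(verbose=verbose)),   # convert data to a model-ready form
        ("model", ModelTransformer(verbose=verbose, **model_args)),      # create, train and test a model
    ])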

Setting verbose determines how much of the processes to display:

Pipeline Training

Run the pipeline to process the descriptions and genres of the training data, then create and train a new model.
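In scikit-learn terms this amounts to a single fit call; a minimal sketch, assuming the pipeline and training split sketched above:

    # Process the training descriptions and genres, then create and train a new model
    pipeline.fit(train_data)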

Pipeline Testing

The model state to test can be either trained_model.pt, or final_model.pt if a new model has just been trained.

Test the model on the testing data.
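One possible shape of this step, purely as an illustration (the model and test_inputs variables are assumptions; the notebook's own transformers encapsulate these details):

    import torch

    # Load the chosen saved state into the LSTM classifier
    model.load_state_dict(torch.load("final_model.pt"))
    model.eval()

    # Predict genres for the processed test descriptions and threshold the sigmoid outputs
    with torch.no_grad():
        logits = model(test_inputs)
        predictions = (torch.sigmoid(logits) > 0.5).int()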

Test Predictions on Custom Descriptions

Enter an IMDb film / series description and see the predicted genre(s) from the newly trained model. For example:
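A rough sketch of such a prediction, reusing the fitted components from earlier (the description text, variable names and calls are invented placeholders rather than the notebook's actual interface):

    import torch

    description = "A detective hunts a serial killer through the streets of 1970s New York."

    # Process the text, predict, then decode the multi-hot output back to genre names
    tokens = text_processor.transform([description])
    with torch.no_grad():
        probabilities = torch.sigmoid(model(tokens))
    predicted = encoder.inverse_transform((probabilities > 0.5).int().numpy())
    print(predicted)  # tuple of predicted genre name(s)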

Deploy Model as a Web Service

Set the state of the model to use for the application. This will overwrite the saved model state and text processor in the web application's directory so that the application is automatically updated.
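For instance, this could be done by copying the files into the Flask app's directory (all paths here are placeholders for the project's actual layout):

    import shutil

    # Overwrite the model state and text processor used by the web application
    shutil.copy("final_model.pt", "Web_App/flaskr/final_model.pt")
    shutil.copy("text_processor.obj", "Web_App/flaskr/text_processor.obj")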

To run the web server, open the command line and type python followed by the full path to the file Web_App/flaskr/main.py, for example python user/NLP_Project/Web_App/flaskr/main.py. Then visit http://127.0.0.1:5000/. Alternatively, if this does not work, the application can be run from a Python IDE such as PyCharm, using the Conda environment as the interpreter.