Code by TomMakesThings

About

In this notebook, Latent Dirichlet allocation (LDiA) and Latent semantic analysis (LSA) topic modelling are performed on film descriptions from IMDb. Although these topic modelling algorithms do not classify descriptions into genres in the same way as supervised algorithms, such as an LSTM classifier, they provide a method of dimensionality reduction and a new way of finding the most representative tokens for each film description. The topics created from LDiA and LSA could potentially be used as labels to train a supervised model through a semi-supervised approach.

Imports

IMDb Dataset

Open the dataset, drop irrelevant columns and remove samples with missing information.

This dataset can be downloaded from Kaggle: https://www.kaggle.com/stefanoleone992/imdb-extensive-dataset
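As a rough sketch of this step (the exact filename and column names are assumptions about the Kaggle file), loading and cleaning might look like this:

```python
import pandas as pd

# Load the Kaggle CSV (assumed filename)
movies = pd.read_csv("IMDb movies.csv", low_memory=False)

# Keep only the columns needed for topic modelling (assumed column names)
movies = movies[["title", "genre", "description"]]

# Remove samples with missing genre or description
movies = movies.dropna(subset=["genre", "description"]).reset_index(drop=True)
print(movies.shape)
```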

Analysing the Dataset

Metrics

Each film contains between 1 and 3 genres stored in genre. However, they are stored in the same column, meaning some processing is required to separate them and count the true number. For example, "Drama", "Romance" and "Drama, Romance" would be counted as three different genre values, even though only two unique genres are present.
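A minimal sketch of this counting step, assuming the genre column holds comma-separated strings:

```python
from collections import Counter

# Count genres by splitting the comma-separated strings, so that "Drama",
# "Romance" and "Drama, Romance" are not treated as three distinct values
genre_counts = Counter()
for genre_string in movies["genre"]:
    for genre in genre_string.split(","):
        genre_counts[genre.strip()] += 1

print(f"{len(genre_counts)} unique genres")
print(genre_counts.most_common(10))
```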

Print information about the dataset:

Graphs

Word frequency distribution of a given genre

Plot the most common words for a given genre. The first graph includes all words, though it is dominated by stop words such as 'a', 'the', 'to'. The second graph has stop words removed. These words are more characteristic of the genre, e.g. "house" and "killer" for horror. The frequency distributions provide an interesting comparison against topics assigned through unsupervised topic modelling algorithms.
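A sketch of how such frequency plots could be produced with NLTK; the genre name and the top-30 cut-off are illustrative:

```python
import nltk
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

genre = "Horror"  # illustrative choice
descriptions = movies[movies["genre"].str.contains(genre)]["description"]

# Lower-case and tokenise every description of the chosen genre
tokens = [token.lower() for text in descriptions
          for token in word_tokenize(text) if token.isalpha()]

# All words: dominated by stop words such as "a", "the", "to"
FreqDist(tokens).plot(30)

# Stop words removed: words more characteristic of the genre
stop_words = set(stopwords.words("english"))
FreqDist([t for t in tokens if t not in stop_words]).plot(30)
```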

Label distribution

View the number of films belonging to each genre. Across all samples, the most common genres are Drama (26.7%), Comedy (16.4%) and Romance (8.0%). The least common is News, with only one sample across the whole dataset. Other rare genres include Adult and Documentary, which have two samples each, and Reality-TV, which has three. After these comes Film-Noir with 663. The rarest genres are removed later in the notebook as they do not provide enough samples to be representative.

Data Processing

Drop samples

As the dataset is large, samples are dropped so that the notebook can run in a reasonable time. n_samples specifies the maximum number of samples to use. Setting n_samples = 0 will cause all suitable samples to be used, though the notebook will be slow to run. For these experiments I have used 5000 samples, as this is enough data to perform topic modelling in a reasonable amount of time.

min_length specifies the minimum number of words in a description; samples with fewer words than this are removed.

Setting remove_rare_genres = True will remove genres with fewer than rare_count instances from the movies' labels. If a sample does not have a label for any remaining genre, it will be dropped. For example, if rare_count = 10, the genres [News, Adult, Documentary, Reality-TV] are removed.

max_genre_samples is the maximum number of samples of each genre allowed. For example if max_genre_samples = 100, a maximum of 100 samples are allowed for Comedy, another 100 for Drama etc. This can be turned off by setting max_genre_samples = 0.
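A minimal sketch of how these filters could be applied with pandas; the parameter names follow the notebook, while the implementation (and the reuse of the genre_counts computed earlier) is an assumption:

```python
n_samples = 5000          # 0 means use all suitable samples
min_length = 10           # illustrative minimum number of words in a description
remove_rare_genres = True
rare_count = 10           # genres with fewer than this many instances are dropped
max_genre_samples = 0     # 0 disables the per-genre cap

# Remove descriptions that are too short
movies = movies[movies["description"].str.split().str.len() >= min_length]

if remove_rare_genres:
    rare = {g for g, c in genre_counts.items() if c < rare_count}
    # Strip rare genres from each label list and drop samples left with no genre
    movies["genre"] = movies["genre"].apply(
        lambda s: ", ".join(g.strip() for g in s.split(",") if g.strip() not in rare))
    movies = movies[movies["genre"] != ""]

if max_genre_samples > 0:
    # Cap the number of samples per (first listed) genre
    movies = movies.groupby(movies["genre"].str.split(",").str[0]).head(max_genre_samples)

if n_samples > 0:
    movies = movies.sample(min(n_samples, len(movies)), random_state=42)
```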

Samples

Stop words

Create a custom stop word list movie_stop_words of the most common words across all movie descriptions.

Select which stop word list to use and print it. Set stop = 0 to use no pre-defined stop word list, stop = 1 to use NLTK, or any other value, e.g. stop = -1, for spaCy. Set add_custom_words = True to add the movie_stop_words to the stop words.

Here I have used NLTK to filter out common words that do not contribute to the meaning of the descriptions. This prevents unrepresentative words being picked up across all topics by the topic modelling algorithms.
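A sketch of how the custom list and the chosen pre-defined list could be combined; the top-50 cut-off and the spaCy model name are assumptions:

```python
from collections import Counter
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Most frequent words across all descriptions form the custom list
word_counts = Counter(token.lower()
                      for text in movies["description"]
                      for token in word_tokenize(text) if token.isalpha())
movie_stop_words = {word for word, _ in word_counts.most_common(50)}

stop = 1                 # 0: no pre-defined list, 1: NLTK, anything else: spaCy
add_custom_words = True

if stop == 0:
    stop_words = set()
elif stop == 1:
    stop_words = set(stopwords.words("english"))
else:
    import spacy
    stop_words = set(spacy.load("en_core_web_sm").Defaults.stop_words)

if add_custom_words:
    stop_words |= movie_stop_words
print(sorted(stop_words)[:20])
```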

Normalisation

Create a function to convert accented characters. For example, è becomes e in "Arsène Baudu and Hyacinthe, a pair of small-time crooks".
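A small sketch of such a helper using Python's built-in Unicode tools:

```python
import unicodedata

def remove_accents(text):
    # Decompose accented characters and drop the combining marks,
    # so "Arsène" becomes "Arsene"
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(remove_accents("Arsène Baudu and Hyacinthe, a pair of small-time crooks"))
```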

Load a model to expand contractions, e.g. can't becomes cannot in "They fall in love, but can't quite seem to get the timing right."
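The notebook loads a model for this step; as a stand-in, a small dictionary-based sketch (covering only a few illustrative contractions) behaves similarly:

```python
import re

# Hypothetical mapping; a real model or library would cover far more cases
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "n't": " not",
    "'re": " are",
    "'ll": " will",
    "'ve": " have",
}

def expand_contractions(text):
    # Replace longer patterns first so "can't" is not caught by the "n't" rule
    for pattern, replacement in CONTRACTIONS.items():
        text = re.sub(re.escape(pattern), replacement, text, flags=re.IGNORECASE)
    return text

print(expand_contractions("They fall in love, but can't quite seem to get the timing right."))
```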

Processing samples

Process the film descriptions and store them as description_docs:

  1. Normalisation by removing accents and contractions
  2. Split descriptions into sentences and perform tokenisation
  3. Remove punctuation
  4. Remove stop words (optional)
  5. Correct spelling (optional)
  6. Apply lemmatisation or stemming (optional)

Again, a filter removes samples containing fewer words than min_words. This is because removing stop words in the code above can leave some descriptions too short to be suitable for model training.
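A condensed sketch of this pipeline with NLTK, reusing the helpers above; the spelling-correction step is omitted and lemmatisation is chosen over stemming:

```python
import string
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download("wordnet", quiet=True)
lemmatizer = WordNetLemmatizer()
min_words = 5  # illustrative threshold

description_docs = []
for text in movies["description"]:
    # 1-2. Normalise, then split into sentences and tokenise
    text = expand_contractions(remove_accents(text))
    tokens = [token.lower()
              for sentence in sent_tokenize(text)
              for token in word_tokenize(sentence)]
    # 3-4. Remove punctuation and stop words
    tokens = [t for t in tokens if t not in string.punctuation and t not in stop_words]
    # 6. Lemmatise each remaining token
    tokens = [lemmatizer.lemmatize(t) for t in tokens]
    # Drop descriptions that have become too short
    if len(tokens) >= min_words:
        description_docs.append(tokens)
```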

Unsupervised Topic Modelling

Convert the list of films into a form suitable for LDiA and LSA. A dictionary is constructed mapping each word to its integer id. This dictionary is used to construct a document-term matrix.
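A sketch of this step with gensim; the filter_extremes thresholds are illustrative:

```python
from gensim.corpora import Dictionary

# Map each word to an integer id
dictionary = Dictionary(description_docs)
dictionary.filter_extremes(no_below=5, no_above=0.5)

# Bag-of-words document-term matrix used by both LDiA and LSA
corpus = [dictionary.doc2bow(doc) for doc in description_docs]
```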

Create a function to plot an interactive t-SNE graph.
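The notebook's plot is interactive; a non-interactive sketch of the same idea with scikit-learn and matplotlib:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def topic_weight_matrix(model, corpus, n_topics):
    # Dense document x topic matrix; model[doc] works for both LdaModel and LsiModel
    weights = np.zeros((len(corpus), n_topics))
    for i, doc in enumerate(corpus):
        for topic_id, weight in model[doc]:
            weights[i, topic_id] = weight
    return weights

def plot_tsne(topic_weights, labels=None):
    # Project the per-document topic weights down to two dimensions
    embedding = TSNE(n_components=2, random_state=42).fit_transform(topic_weights)
    plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, s=5, cmap="tab10")
    plt.xlabel("t-SNE 1")
    plt.ylabel("t-SNE 2")
    plt.show()
```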

Define a function to find the best number of topics, k, for either LDiA or LSA.
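A sketch of such a search, assuming gensim models and c_v coherence; the function name and signature are not necessarily those used in the notebook:

```python
from gensim.models import CoherenceModel, LdaModel, LsiModel

def find_best_k(corpus, dictionary, texts, k_values, model_class=LdaModel):
    scores = {}
    for k in k_values:
        # Train a model with k topics and score it by coherence
        model = model_class(corpus=corpus, id2word=dictionary, num_topics=k)
        scores[k] = CoherenceModel(model=model, texts=texts, dictionary=dictionary,
                                   coherence="c_v").get_coherence()
    best_k = max(scores, key=scores.get)
    return best_k, scores

# Example: best_k, scores = find_best_k(corpus, dictionary, description_docs, range(2, 21))
```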

Latent Dirichlet Allocation (LDiA)

LDiA is a commonly used unsupervised machine learning algorithm for topic modelling. For this project, the aim is to find hidden clusters of similarity between some descriptions, and to find distinguishing features of others.

LDiA on expected number of topics

As I selected the top 7 most common genres, it makes sense to see how the unsupervised algorithms categorise films into this many topics. However, as this technique is unsupervised, topics do not always correspond to a genre. This can be observed by comparing the common words of each topic in the graph below to the frequency distribution of genres in an earlier section. LDiA is therefore picking up on hidden patterns that may not be apparent to a human.

  1. Perform LDiA on all samples, setting the number of topics as the number of unique genres
  2. Calculate the coherence score
  3. Plot the results
  4. Optimise the model parameters
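A sketch of steps 1-3 with gensim and pyLDAvis (step 4 is covered in the next subsection); the training parameters are illustrative:

```python
import pyLDAvis
import pyLDAvis.gensim_models
from gensim.models import CoherenceModel, LdaModel

n_topics = 7  # the number of genres kept after dropping rare ones

ldia_model = LdaModel(corpus=corpus, id2word=dictionary,
                      num_topics=n_topics, passes=10, random_state=42)

coherence = CoherenceModel(model=ldia_model, texts=description_docs,
                           dictionary=dictionary, coherence="c_v").get_coherence()
print(f"Coherence: {coherence:.3f}")

# Interactive topic visualisation with the relevance slider (λ) discussed below
pyLDAvis.enable_notebook()
pyLDAvis.gensim_models.prepare(ldia_model, corpus, dictionary)
```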

The graph above shows the similarities and differences between topics, along with their most relevant terms. While some tokens do overlap, adjusting the λ parameter shows that the topics all have distinct words.

The t-SNE graph below allows each film to be plotted in a lower-dimensional space using its most representative topic weights. The shape of the clusters reflects how the algorithm tries to keep similar films close together in the vector space.

Before optimisation, the majority of films share the same most relevant topic. After optimisation, however, the dominant topics are more varied. This difference can be observed by changing the parameter use_optimised_params from True to False.

Optimise LDiA's alpha and eta hyperparameters

The parameters optimal_alpha and optimal_eta used for the model above were found using the function optimise_LDiA below. When LDiA was run with the default parameters, it had a coherence score of 0.141, while with the best alpha-eta combination it achieved a higher score of 0.192.
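A sketch of what optimise_LDiA might do: a grid search over candidate alpha and eta values, keeping the pair with the best coherence. The candidate values here are illustrative, not those used in the notebook:

```python
from gensim.models import CoherenceModel, LdaModel

def optimise_LDiA(corpus, dictionary, texts, n_topics=7):
    candidates = [0.01, 0.1, 0.5, 1.0, "symmetric", "asymmetric"]
    best_alpha, best_eta, best_score = None, None, -1.0
    for alpha in candidates:
        for eta in candidates:
            if eta == "asymmetric":
                continue  # gensim does not accept an asymmetric eta prior
            model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=n_topics,
                             alpha=alpha, eta=eta, random_state=42)
            score = CoherenceModel(model=model, texts=texts, dictionary=dictionary,
                                   coherence="c_v").get_coherence()
            if score > best_score:
                best_alpha, best_eta, best_score = alpha, eta, score
    return best_alpha, best_eta, best_score
```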

LDiA experiment with different number of topics

The number of topics can vary, as LDiA does not specifically learn to associate a topic with a genre. The quality of topic modelling can be assessed by the coherence score, with higher being better. In this experiment, different numbers of topics are tried to find the best score.

Although there are only 7 genres, a larger number of topics (k > n) generally seems to yield a better score for LDiA.

When comparing the t-SNE graph below to the previous one with fewer topics, films now appear to be more split up and separated. This could be more useful for identifying characteristic features for categorising descriptions.

Latent Semantic Analysis (LSA)

LSA is a dimensionality reduction technique that uses singular value decomposition (SVD). Like LDiA, it can also be used for topic modelling.
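A sketch of fitting LSA with gensim's LsiModel on the same corpus and scoring it in the same way as LDiA:

```python
from gensim.models import CoherenceModel, LsiModel

# Truncated SVD over the bag-of-words corpus
lsa_model = LsiModel(corpus=corpus, id2word=dictionary, num_topics=7)

lsa_coherence = CoherenceModel(model=lsa_model, texts=description_docs,
                               dictionary=dictionary, coherence="c_v").get_coherence()
print(f"LSA coherence: {lsa_coherence:.3f}")
```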

LSA on expected number of topics

When using the desired number of topics, LSA achieves a far higher coherence score, suggesting that it has produced topics that are less semantically similar to one another than those produced by LDiA. This makes sense, as the LSA algorithm tries to separate samples, while LDiA keeps similar samples close together. This is also reflected by the shape of the clusters in the t-SNE plot.

Experiment with different number of topics

The number of topics can vary, as LSA does not specifically learn to associate a topic with a genre. The quality of topic modelling can be assessed by the coherence score, with higher being better. In this experiment, different numbers of topics are tried to find the best score.

The optimal number of topics for LSA is 2, as this achieved the highest coherence score.