Code by TomMakesThings

Imports

Although the notebook was designed to run in Colab, it may also be run in a Jupyter notebook. The code below will set the plotly graphing library to render graphs for Jupyter if detected.
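One common way to handle this is to check whether the google.colab package is loaded and choose a Plotly renderer name accordingly. The sketch below is an assumption about how such detection could look, not the notebook's actual code; choose_plotly_renderer is a hypothetical helper.

```python
import sys

def choose_plotly_renderer(loaded_modules=None):
    """Return a Plotly renderer name for the current environment (hypothetical helper)."""
    modules = sys.modules if loaded_modules is None else loaded_modules
    # The Colab kernel preloads the google.colab package
    if "google.colab" in modules:
        return "colab"
    # Assumption: anything else is treated as a classic Jupyter notebook
    return "notebook"

renderer = choose_plotly_renderer()
try:
    import plotly.io as pio
    pio.renderers.default = renderer
except ImportError:
    # Plotly is not installed, so there is nothing to configure
    pass
```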

Set Up Datasets

Optionally connect to Google Drive. This is not required if you wish to use one of the pre-set datasets: benchmark_dataset, splat_dataset or cortex_dataset. These datasets are opened in the cells below by downloading them from a URL. If you want to use a different dataset, this will have to be set manually.

1. Benchmarking Dataset

Open and view the first dataset, referred to in this project as benchmark_dataset. This is a dataset by Luyi Tian, known as sc_10x, composed of human lung adenocarcinoma cells from three cell lines. Further information about this dataset is available in the paper scPipe: A flexible R/Bioconductor preprocessing pipeline for single-cell RNA-sequencing data.

View the three cell lines of the samples. These are used as labels to evaluate accuracy during clustering.

View the adjusted Rand index (ARI) from experiments using the same dataset. This provides a comparison against experiments in this notebook.

These are from the paper: scRNA-seq mixology: towards better benchmarking of single cell RNA-seq protocols and analysis methods

Note that downloading this data is not essential to run the rest of the notebook.
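For reference, the ARI compares two labelings by counting pairs of samples that are grouped together in both, corrected for chance. The following is a minimal standard-library sketch of the usual formula; adjusted_rand_index is a name chosen for illustration and is not taken from the notebook. It assumes at least two samples.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same samples (n >= 2)."""
    n = len(labels_true)
    # Sizes of each (true label, predicted cluster) cell and of each margin
    pair_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)
    sum_pairs = sum(comb(c, 2) for c in pair_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())
    expected = sum_true * sum_pred / comb(n, 2)  # agreement expected by chance
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. one cluster on both sides
    return (sum_pairs - expected) / (max_index - expected)

# Cluster ids differ from the label names, but the grouping is identical
score = adjusted_rand_index(["a", "a", "b", "b", "c", "c"], [2, 2, 0, 0, 1, 1])
```

Because the ARI only looks at which samples are grouped together, it does not need the cluster ids to match the label names.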

2. Splat Simulated Dataset

Open the second dataset, referred to as splat_dataset. This data was created to mimic the gene expression of the benchmarking dataset by using the Splat simulator, which is part of Splatter. The creation of this dataset is explained in the report Clustering and Topological Data Analysis of Single-Cell RNA Sequencing Data, and the dataset is available to download from my GitHub.

View the four groups of the cell samples. These are used as labels to evaluate accuracy during clustering.

3. Mouse Cortex Dataset

Open the third dataset, referred to as cortex_dataset. This is a dataset by Zeisel et al. containing mRNA reads from mouse cortex and hippocampus cells. It is available to download from Linnarsson Lab and is discussed in the paper Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq.

View the assigned types or tissue of the cell samples. This dataset provides four possible options for a ground truth:

Custom Dataset

If not using one of the three given datasets, open a new dataset here as a DataFrame.

Optionally, also create a DataFrame matching samples to target labels for the clusters. If these are unknown, set custom_metadata = None.

Select Dataset to Use

Select which dataset you want to use by setting dataset as either benchmark_dataset, splat_dataset, cortex_dataset, or custom_dataset if you set a new dataset as a DataFrame above.

Leave metadata = None if using one of the four options above.

Autoencoder

An autoencoder is a neural network that learns a compressed version of the input through encoding and decoding. For scRNA-seq data, this is useful for dimensionality reduction as these datasets contain many features. For example, the benchmarking dataset has 16,468 genes, but the autoencoder can capture the underlying trend of the data in far fewer features, such as 16. Although some of the information is lost, this can be beneficial for removing noise.
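To make the idea concrete, here is a deliberately simplified sketch: a linear autoencoder written in NumPy and trained with plain gradient descent on toy data. It is not the notebook's LitAutoencoder (which is a PyTorch Lightning model); the data sizes, learning rate and variable names are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 "cells" x 20 "genes" generated from 3 hidden factors plus noise
n_cells, n_genes, n_latent = 200, 20, 3
factors = rng.normal(size=(n_cells, n_latent))
mixing = rng.normal(size=(n_latent, n_genes))
X = factors @ mixing + 0.05 * rng.normal(size=(n_cells, n_genes))

# Linear encoder and decoder weights with a small random initialisation
W_enc = 0.1 * rng.normal(size=(n_genes, n_latent))
W_dec = 0.1 * rng.normal(size=(n_latent, n_genes))

def reconstruction_loss(X, W_enc, W_dec):
    """Mean squared error between the input and its reconstruction."""
    return float(np.mean((X @ W_enc @ W_dec - X) ** 2))

initial_loss = reconstruction_loss(X, W_enc, W_dec)

learning_rate = 0.02
for _ in range(2000):
    Z = X @ W_enc                      # encode: 20 features -> 3
    X_hat = Z @ W_dec                  # decode: 3 features -> 20
    grad_out = 2.0 * (X_hat - X) / X.size
    grad_dec = Z.T @ grad_out
    grad_enc = X.T @ (grad_out @ W_dec.T)
    W_dec -= learning_rate * grad_dec
    W_enc -= learning_rate * grad_enc

final_loss = reconstruction_loss(X, W_enc, W_dec)
encoded = X @ W_enc                    # compressed representation of every cell
```

The encoded array plays the role of the low-dimensional encoding that is later passed to clustering; a real model would add non-linear layers and mini-batch training.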

Classes

DatasetRNASeq and LitAutoencoder classes.

Functions

Functions convert_to_weights_only and read_autoencoder_from_url.

The function below is not run anywhere in this notebook, but it can be called to reduce the file size of the autoencoder's .ckpt (checkpoint) file if you trained a new model with weights_only = False.

This function is used to find the file paths required to construct the training, validation and testing dataloaders if using a pre-trained model.

Autoencoder Parameter Setup

To train a new model, set train_new_model = True.

Or alternatively, to load a pre-trained model, set train_new_model = False.

Parameter Override for Given Datasets

Override some parameters if loading a pre-trained model for one of the three given datasets.

Split Data into Training and Testing

If training a new model:

Autoencoder Arguments

If training a new autoencoder:

Load a Pre-Trained Autoencoder From File

Load the weights of a pre-trained autoencoder. This will only run if train_new_model = False.

K-Fold Cross Validation

Split training data into 80% train and 20% validation using k-fold cross validation to find the arrangement that performs best.
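An 80% / 20% split corresponds to five folds, with each fold serving once as the validation set. The sketch below shows only the index bookkeeping using the standard library; k_fold_splits is a hypothetical helper, and the notebook itself may rely on a library implementation instead.

```python
import random

def k_fold_splits(n_samples, n_folds=5, seed=0):
    """Yield (train_indices, validation_indices) pairs for k-fold cross validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every n_folds-th shuffled index, so fold sizes differ by at most one
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

splits = list(k_fold_splits(100))
```

A model would be trained once per split, and the arrangement with the lowest validation loss kept.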

This part does not run if using a pre-trained model.

Train New Autoencoder

Runs if either:

The number of epochs is the number of training iterations.

Setting use_lr_finder = True will use PyTorch Lightning's learning rate finder to set the learning rate automatically. This is switched off by default as I found it did not improve performance.

Optionally train the autoencoder over more epochs by setting continue_training = True and running this cell. This may be used if training is not complete.

Save Autoencoder State to File

For a new autoencoder's state to be saved to file, download_model must be set to True. This downloads:

Autoencoder Testing

Test the trained autoencoder's performance by returning the average loss on unseen test data.

Autoencoder Logger

View graphs of the autoencoder's training and testing performance using TensorBoard.

Encode Test Data

Now that the autoencoder has been trained, it is used to produce an encoding that will reduce the dimensionality of the dataset.

Clustering

Clustering Functions

Custom made functions used during clustering experiments.

Functions elbow_method and silhouette_coefficient.

Function to Perform Biclustering

Function bicluster.

Function to Plot Selected Test Cells

Function plot_cells to plot either a 2D or 3D graph of all the cells and the cells selected for testing.

Functions to Plot Clusters in 2D and 3D

Functions cluster_accuracy, cluster_metrics, reduce_data and plot_clusters.

cluster_accuracy will match the expected labels and predicted clusters using the Hungarian matching algorithm, then calculate accuracy between them. This will be returned, along with the mapping between the labels and clusters. cluster_metrics will return this accuracy, along with ARI and difference between labels and clusters.
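The matching step can be sketched with SciPy's linear_sum_assignment, which solves the same assignment problem as the Hungarian algorithm. This is an illustrative version under the assumption that overlap counts are used as the score to maximise; match_clusters is a hypothetical name and not the notebook's cluster_accuracy.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_clusters(labels_true, labels_pred):
    """Return (accuracy, mapping from predicted cluster to true label)."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    true_ids = np.unique(labels_true)
    pred_ids = np.unique(labels_pred)
    # overlap[i, j] = number of samples in predicted cluster i with true label j
    overlap = np.zeros((len(pred_ids), len(true_ids)), dtype=int)
    for i, p in enumerate(pred_ids):
        for j, t in enumerate(true_ids):
            overlap[i, j] = np.sum((labels_pred == p) & (labels_true == t))
    # Choose the one-to-one pairing that maximises the total overlap
    rows, cols = linear_sum_assignment(overlap, maximize=True)
    mapping = {pred_ids[r].item(): true_ids[c].item() for r, c in zip(rows, cols)}
    accuracy = overlap[rows, cols].sum() / len(labels_true)
    return float(accuracy), mapping

accuracy, mapping = match_clusters(
    [0, 0, 0, 1, 1, 1, 2, 2, 2],
    [2, 2, 2, 0, 0, 1, 1, 1, 1],
)
```

In this toy example one sample of true label 1 falls into the cluster matched to label 2, so eight of the nine samples count as correct.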

reduce_data performs dimensionality reduction and can optionally apply t-SNE and standardization.

plot_clusters performs dimensionality reduction, applies clustering, then plots either a 2D or 3D scatter graph of the clusters.

Functions to Find the Best Number of Components

Functions find_best_components, plot_components and plot_variance.

Setup Data and Labels

Standardize Data

Read the original and encoded data, then create a standardized and a non-standardized version of each. This provides four different variations to test during the following experiments.
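As a rough illustration, standardization here is taken to mean z-scoring each feature (zero mean, unit variance), which is an assumption about the notebook's choice. The arrays below are random stand-ins for the original and encoded data, and standardize is a hypothetical helper.

```python
import numpy as np

def standardize(data):
    """Z-score each column: subtract the mean and divide by the standard deviation."""
    data = np.asarray(data, dtype=float)
    mean = data.mean(axis=0)
    std = data.std(axis=0)
    std[std == 0] = 1.0  # constant features become zero instead of dividing by zero
    return (data - mean) / std

rng = np.random.default_rng(0)
original = rng.normal(loc=5.0, scale=2.0, size=(100, 8))  # stand-in for gene counts
encoded = rng.normal(size=(100, 3))                       # stand-in for the encoding

# The four variations: original or encoded data, with or without standardization
variations = {
    ("original", "raw"): original,
    ("original", "standardized"): standardize(original),
    ("encoded", "raw"): encoded,
    ("encoded", "standardized"): standardize(encoded),
}
```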

Labels

Get the true labels for the cell lines / groups and convert to numerical representations. This will only run if metadata has been set.
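A minimal sketch of the conversion, using made-up label names rather than the real cell line names; encode_labels is a hypothetical helper:

```python
def encode_labels(labels):
    """Map each distinct label to an integer code, in sorted label order."""
    classes = sorted(set(labels))
    lookup = {name: code for code, name in enumerate(classes)}
    codes = [lookup[name] for name in labels]
    return codes, classes

# Placeholder names standing in for cell lines or groups
cell_lines = ["line_B", "line_A", "line_C", "line_A", "line_B"]
codes, classes = encode_labels(cell_lines)
```

Keeping classes alongside the codes allows the numerical predictions to be translated back into label names when plotting.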

Encode All Data

Produce an encoded version of every sample in the dataset.

Cluster Colours

Specify the colours to use for the clusters.

Plot Test Cells

Plot the cells selected for testing against all cells. If labels were given for the cell lines / groups, these are coloured.

Perform Clustering

For clustering algorithms such as k-means and hierarchical clustering, the number of clusters must be specified beforehand. For k-means, the elbow method and the silhouette coefficient are two methods that can be used to find the optimal number, while for agglomerative hierarchical clustering, the best number can be determined by plotting a dendrogram.

Run the elbow method and silhouette coefficient for k-means:
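As a sketch of what both methods compute, the example below fits scikit-learn's KMeans for several values of k on three synthetic, well-separated blobs. The inertia values are what an elbow plot would show, and the silhouette score is highest at the best-separated k. The toy data and the range of k are assumptions, and this is not the notebook's elbow_method or silhouette_coefficient.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
# 50 points scattered tightly around each of the three centres
X = np.vstack([c + 0.5 * rng.normal(size=(50, 2)) for c in centers])

inertias, silhouettes = {}, {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_  # within-cluster sum of squares, plotted for the elbow
    silhouettes[k] = silhouette_score(X, model.labels_)

# The silhouette coefficient picks the k with the highest score
best_k = max(silhouettes, key=silhouettes.get)
```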

Plot a dendrogram for agglomerative hierarchical clustering:
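A dendrogram is usually read by eye, looking for the tallest vertical gap between merges. The sketch below builds the tree with SciPy and approximates that reading with a largest-gap heuristic; the heuristic, the Ward linkage and the toy data are assumptions made for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + 0.5 * rng.normal(size=(50, 2)) for c in centers])

# Each row of Z records one merge; column 2 holds the merge height
Z = linkage(X, method="ward")
tree = dendrogram(Z, no_plot=True)  # remove no_plot=True to draw it with matplotlib

# Cut the tree just below the largest jump between successive merge heights
gaps = np.diff(Z[:, 2])
merges_kept = int(np.argmax(gaps)) + 1
suggested_clusters = len(X) - merges_kept

labels = fcluster(Z, t=suggested_clusters, criterion="maxclust")
```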

Set cluster_number as either:

Apply Clustering

Set algorithm as the clustering algorithm to run. This can be either:

Four variations of the data can be tested by changing these parameters:

A dimensionality reduction technique can be applied to extract features that best represent variation in the data in a lower-dimensional space. Variable reduction_technique can be set as one of the following:

t-distributed Stochastic Neighbor Embedding (t-SNE) can be applied before clustering by setting use_tsne = True. The parameter perplexity can be changed, though ideally this value should be between 5 and 50 and less than the number of cells.
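The perplexity constraint can be sketched with scikit-learn's TSNE, capping the requested value so it stays below the number of samples. The random array stands in for encoded cells, and the variable names are illustrative only.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 16))  # stand-in for 60 encoded cells with 16 features

requested_perplexity = 30
# Keep perplexity strictly below the number of samples
perplexity = min(requested_perplexity, X.shape[0] - 1)

embedding = TSNE(n_components=2, perplexity=perplexity, random_state=0).fit_transform(X)
```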

To perform spectral biclustering on the data before running standard clustering, set run_biclustering = True. This is a form of clustering that will cluster both the genes/encoded features and the cell samples.
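For intuition, the sketch below runs scikit-learn's SpectralBiclustering on a synthetic checkerboard matrix, where rows stand in for cells and columns for genes or encoded features. The matrix size and the numbers of row and column clusters are assumptions, and this is not the notebook's bicluster function.

```python
import numpy as np
from sklearn.cluster import SpectralBiclustering
from sklearn.datasets import make_checkerboard

# Synthetic 60 x 40 matrix with a hidden 3 x 2 checkerboard structure
data, rows, cols = make_checkerboard(
    shape=(60, 40), n_clusters=(3, 2), noise=1.0, random_state=0
)

model = SpectralBiclustering(n_clusters=(3, 2), random_state=0)
model.fit(data)

# Sorting by the assigned labels groups similar rows and columns together
row_order = np.argsort(model.row_labels_)
col_order = np.argsort(model.column_labels_)
reordered = data[row_order][:, col_order]
```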

Now create 2D and 3D interactive graphs plotted against 2 / 3 components extracted by the dimensionality reduction technique.

Click on the legend labels to toggle which clusters to display:

Test Different Component Numbers

Experiment using higher numbers of components for the dimensionality reduction technique specified as reduction_technique to see the effect on either accuracy, ARI or silhouette coefficient.

Plot a 2D and 3D graph to view the clustering predictions when using the best number of components according to the metric set above. Note results may differ between the two plots as k-means is non-deterministic and is run twice.

View Contributions of Principal Components

If using PCA, view how much the top principal components contribute to the variance of the data.
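The share of variance carried by each principal component can be computed from the singular values of the centred data. The NumPy sketch below illustrates this on toy data; explained_variance_ratio is a hypothetical helper rather than the notebook's plot_variance.

```python
import numpy as np

def explained_variance_ratio(data, n_components):
    """Fraction of total variance explained by each of the top principal components."""
    centered = data - data.mean(axis=0)
    # Squared singular values are proportional to the variance along each component
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return variances[:n_components] / variances.sum()

rng = np.random.default_rng(0)
# Toy data whose features have very different spreads
data = rng.normal(size=(100, 5)) * np.array([10.0, 5.0, 1.0, 0.5, 0.1])

ratios = explained_variance_ratio(data, n_components=3)
cumulative = np.cumsum(ratios)  # running total, useful when choosing how many to keep
```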

View Most Variable Genes

If run on the original data, visualise the contribution of genes to each component.

Find the Best Combination

Experiment with six different clustering algorithms and parameter setups to determine the best overall performance.

metric can be set as either accuracy or silhouette coefficient.

Set dimension_reduction as the dimensionality reduction technique to use:

test_iterations is the number of times to test each experiment setup.

run_tsne specifies whether to apply t-SNE after dimensionality reduction and run_tsne_perplexity is the parameter perplexity that can be set if using t-SNE.

use_all_data determines whether to run the experiments on all cells in the dataset.

Run Experiments to Find Optimal Parameters

Run experiments:

Show Results

Topological Data Analysis

Run Kepler Mapper to create a visualisation of the higher-dimensional shape of the dataset.

Apply the Mapper algorithm to build the simplicial complex.

Download the graph as an HTML file so that it can be visualised.