Project Timeline

Gillespie mRNA Simulator (2022)

For this systems biology project, the dynamics of a dual reporter method were modelled over time using the Gillespie algorithm. In the reaction scheme, a modified gene produces two mRNAs, which are stochastically translated into different reporter proteins independently of one another. Two situations are modelled: mRNA is transcribed at a constant rate (function A), or mRNA transcription is self-repressive (function B).
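
As a rough illustration of the constant-rate case (function A), the Gillespie direct method can be sketched in Python for a single mRNA species. This is a minimal sketch, not the project's R code: the degradation reaction, the function name and the parameters k_tx and k_deg are assumptions made for the example.

```python
import random

def gillespie_birth_death(k_tx, k_deg, t_end, m0=0, seed=None):
    """Simulate mRNA copy number with the Gillespie direct method.

    Reactions (assumed for illustration only):
      transcription:  gene -> gene + mRNA   propensity k_tx (constant)
      degradation:    mRNA -> nothing       propensity k_deg * m
    """
    rng = random.Random(seed)
    t, m = 0.0, m0
    times, counts = [t], [m]
    while True:
        a_tx = k_tx
        a_deg = k_deg * m
        a_total = a_tx + a_deg
        if a_total == 0:
            break  # no reaction can fire any more
        # Waiting time to the next reaction is exponential with rate a_total.
        t += rng.expovariate(a_total)
        if t > t_end:
            break
        # Choose which reaction fires, in proportion to its propensity.
        if rng.random() * a_total < a_tx:
            m += 1
        else:
            m -= 1
        times.append(t)
        counts.append(m)
    return times, counts

times, counts = gillespie_birth_death(k_tx=10.0, k_deg=1.0, t_end=50.0, seed=1)
print(f"{len(times) - 1} reactions fired, final mRNA count: {counts[-1]}")
```

Under the same assumptions, a self-repressive variant (function B) would replace the constant a_tx with a propensity that decreases as the repressing species accumulates.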

GitHub Repository:

Languages and Tools:

R

R Markdown

RStudio

Haploid and Diploid Wright-Fisher Simulator (2022)

To test the effects of selection, recombination and linkage disequilibrium (LD) on haploid and diploid versions of the Wright-Fisher model of genetic drift, I designed a simulator in Python.

For the haploid model, a panmictic population of size N is simulated over t generations for a single locus with two alleles. The simulator demonstrates that positive selection can increase the speed and rate of fixation of an allele in a population. For the diploid model, a population is initialised with two alleles in LD. The simulator shows how LD can break down faster with higher recombination or negative selection.
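
The haploid model with selection can be sketched roughly as follows. This is an illustrative sketch rather than the project's own code: the function name, the relative fitness of 1 + s for the favoured allele, and the parameter values are assumptions.

```python
import random

def wright_fisher_haploid(n, p0, s, generations, seed=None):
    """Track the frequency of allele A in a haploid Wright-Fisher population.

    n: population size; p0: starting frequency of A;
    s: selection coefficient (A has relative fitness 1 + s, with s > -1).
    Stops early once A is lost or fixed.
    """
    rng = random.Random(seed)
    count = round(n * p0)
    freqs = [count / n]
    for _ in range(generations):
        p = count / n
        # Selection: reweight the frequency of A by its relative fitness.
        p_sel = p * (1 + s) / (p * (1 + s) + (1 - p))
        # Drift: each of the n offspring draws its allele independently.
        count = sum(1 for _ in range(n) if rng.random() < p_sel)
        freqs.append(count / n)
        if count == 0 or count == n:
            break
    return freqs

neutral = wright_fisher_haploid(n=200, p0=0.1, s=0.0, generations=500, seed=42)
selected = wright_fisher_haploid(n=200, p0=0.1, s=0.1, generations=500, seed=42)
print("neutral: ", len(neutral) - 1, "generations, final frequency", neutral[-1])
print("selected:", len(selected) - 1, "generations, final frequency", selected[-1])
```

Repeating the run over many seeds and comparing how often, and how quickly, the allele fixes with s = 0 and s > 0 is one way to see the effect of selection described above.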

GitHub Repository:

Languages and Tools:

Python

Jupyter Notebook

Phylogenetic Tree Inference (2022)

The goal of this project was to find the optimal mutation rate (μ) or single nucleotide polymorphism (SNP) assignments of an evolutionary tree. In this model, the evolutionary tree is assumed to follow the Jukes-Cantor model of DNA sequence evolution. Felsenstein’s pruning algorithm is applied to compute the likelihood of the tree from the terminal taxa's nucleotides. Maximising the log likelihood by testing different parameters allows the most likely μ or SNPs to be determined.
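
For a single site, the likelihood computation can be sketched as below. This is a minimal sketch under stated assumptions: the nested-list tree representation, the function names, the example tree and the grid of candidate μ values are all hypothetical, and the Jukes-Cantor probabilities use the common parametrisation in which μt is the expected number of substitutions per site.

```python
import math

BASES = "ACGT"

def jc_prob(i, j, mu, t):
    """Jukes-Cantor probability that base i becomes base j along a branch of length t."""
    e = math.exp(-4.0 * mu * t / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e

def partial_likelihoods(node, mu):
    """Felsenstein pruning: P(observed leaves below node | node carries base b).

    A leaf is a one-letter string; an internal node is a list of
    (child, branch_length) pairs.
    """
    if isinstance(node, str):
        return {b: 1.0 if b == node else 0.0 for b in BASES}
    result = {b: 1.0 for b in BASES}
    for child, t in node:
        child_l = partial_likelihoods(child, mu)
        for b in BASES:
            result[b] *= sum(jc_prob(b, c, mu, t) * child_l[c] for c in BASES)
    return result

def log_likelihood(tree, mu):
    """Log-likelihood of one site, with a uniform prior over the root base."""
    root = partial_likelihoods(tree, mu)
    return math.log(sum(0.25 * root[b] for b in BASES))

# Hypothetical three-taxon tree: ((A:0.2, A:0.2):0.1, G:0.3)
tree = [([("A", 0.2), ("A", 0.2)], 0.1), ("G", 0.3)]
candidates = [0.05 * k for k in range(1, 41)]
best_mu = max(candidates, key=lambda mu: log_likelihood(tree, mu))
print("most likely mu on this grid:", round(best_mu, 2))
```

For multiple sites, the per-site log-likelihoods would be summed before maximising, assuming sites evolve independently.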

GitHub Repository:

Languages and Tools:

Python

Jupyter Notebook

Genomic Characterization of Metastatic Patterns (2022)

In an MPhil group project, we used the MSK-MET dataset containing genomic alterations for different types of cancer to investigate associations between genomic mechanisms and metastasis for 50 tumour types. The goal was to recreate and extend the results from the paper: "Genomic Characterization of Metastatic Patterns from Prospective Clinical Sequencing of 25,000 Patients".

Some of my contributions include:

  • Calculation of fraction genome altered (FGA) and whole genome duplication (WGD)
  • Annotating actionability of somatic alterations with OncoKB
  • Quantifying arm-level copy number aberration (CNA) with ASCETS

GitHub Repository:

Languages and Tools:

R

R Markdown

RStudio

OncoKB

cBioPortal

SNV Cancer Evolution Estimation (2022)

In this project, the clonal structure of a neoplastic tumour sample was estimated on the basis that closely related cells share sets of SNVs. The predicted structure allowed a clonal evolutionary tree to be derived and was used to calculate statistics for quantifying tumour heterogeneity and evolution.

GitHub Repository:

Languages and Tools:

Python

Jupyter Notebook

Genome Sequence Analysis with a Hidden Markov Model (2022)

To annotate coding regions of DNA in S. cerevisiae chromosome III, a Hidden Markov model (HMM) was implemented from scratch to model %GC across regions of the genome. Model parameters were estimated via the Baum-Welch algorithm. This allowed non-coding regions (state 0) and coding regions (state 1) to be inferred through estimating the most probable sequence of hidden states via the Viterbi algorithm.

The HMM annotations are fairly accurate, with the model able to identify the high %GC telomeric region TEL03L, and the rough location of some pseudogenes and genes such as YCL068C.
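
The decoding step can be illustrated with a small Viterbi sketch. The assumptions here are loud ones: the observations are discretised into low-GC ("L") and high-GC ("H") windows, and the start, transition and emission probabilities are made-up values, not the parameters estimated by Baum-Welch in the project.

```python
import math

def viterbi(obs, start, trans, emit):
    """Return the most probable hidden-state path for an observation sequence.

    start[s]: initial probability of state s
    trans[r][s]: probability of moving from state r to state s
    emit[s][o]: probability that state s emits symbol o
    Works in log space to avoid underflow on long sequences.
    """
    n_states = len(start)
    score = [math.log(start[s]) + math.log(emit[s][obs[0]]) for s in range(n_states)]
    back = []  # back[k][s]: best predecessor of state s at position k + 1
    for o in obs[1:]:
        new_score, pointers = [], []
        for s in range(n_states):
            best_prev = max(range(n_states),
                            key=lambda r: score[r] + math.log(trans[r][s]))
            pointers.append(best_prev)
            new_score.append(score[best_prev]
                             + math.log(trans[best_prev][s])
                             + math.log(emit[s][o]))
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path

# State 0 = non-coding, state 1 = coding (illustrative parameters only).
start = [0.5, 0.5]
trans = [[0.9, 0.1], [0.1, 0.9]]
emit = [{"L": 0.8, "H": 0.2}, {"L": 0.2, "H": 0.8}]
print(viterbi("LLLLHHHH", start, trans, emit))  # [0, 0, 0, 0, 1, 1, 1, 1]
```

With these sticky transition probabilities, a single high-GC window inside a low-GC stretch is not enough to switch state, which is the smoothing behaviour that makes an HMM useful for this kind of annotation.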

GitHub Repository:

Languages and Tools:

Python

Jupyter Notebook

Hopfield Network (2021)

In this work, a Hopfield network was created for simple pattern recognition. It was then evaluated by its capacity to recall patterns using either the Hebbian or Storkey learning rules, demonstrating that Storkey learning allows the network to retain a higher capacity.
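
A minimal sketch of the Hebbian case is shown below; the Storkey rule is not shown. The function names and the two example patterns are hypothetical, and patterns are assumed to use values of +1 and -1.

```python
def train_hebbian(patterns):
    """Build a Hopfield weight matrix with the Hebbian rule (no self-connections)."""
    n = len(patterns[0])
    w = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    w[i][j] += p[i] * p[j] / n
    return w

def recall(w, state, max_sweeps=10):
    """Asynchronously update units until the state stops changing."""
    state = list(state)
    n = len(state)
    for _ in range(max_sweeps):
        changed = False
        for i in range(n):
            h = sum(w[i][j] * state[j] for j in range(n))
            new = 1 if h >= 0 else -1
            if new != state[i]:
                state[i] = new
                changed = True
        if not changed:
            break
    return state

p1 = [1, 1, 1, 1, -1, -1, -1, -1]
p2 = [1, -1, 1, -1, 1, -1, 1, -1]
w = train_hebbian([p1, p2])

noisy = list(p1)
noisy[0] = -noisy[0]  # corrupt one unit of the first pattern
print(recall(w, noisy) == p1)  # True
```

Capacity could then be estimated by storing more and more random patterns and counting how many are still recalled correctly from corrupted starting states.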

GitHub Repository:

Languages and Tools:

Python

Jupyter Notebook

Solving the Magic 19 Puzzle through Simulated Annealing (2021)

Given 19 dots arranged in a hexagon, the aim of the magic 19 puzzle is to label the dots with the numbers 1 to 19 so that each set of three dots that lie along a straight-line segment add up to 22.

Simulated annealing is a heuristic method for solving optimisation problems. Applying simulated annealing to the puzzle, I made an R script that finds all four solutions.

GitHub Repository:

Languages and Tools:

R

RStudio

Joe's Pyramid Solver (2021)

Joe’s pyramid is a puzzle in which every stone is marked with a different one or two digit positive number. Where a stone rests on two others, its number is the sum of the numbers marked on the two stones on which it rests. The challenge is to find the value of the top stone X.

The puzzle is impractical to solve through brute force, with 806 billion possible permutations for the lowest level alone. However, marking each stone on this level with a, b, c, d, e, f, and using the constraint that X = a + 5b + 10c + 10d + 5e + f, I made a script to solve the puzzle.
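
The constraint follows from the fact that each base stone contributes to the top once for every path leading up to it, which gives the binomial coefficients 1, 5, 10, 10, 5, 1 for a six-stone base. The short Python sketch below checks this numerically; the function names and the base values are arbitrary choices for illustration and are not a solution to the puzzle.

```python
from math import comb, perm

def pyramid_top(base):
    """Build the pyramid row by row and return the value of the top stone."""
    row = list(base)
    while len(row) > 1:
        row = [row[i] + row[i + 1] for i in range(len(row) - 1)]
    return row[0]

def top_from_coefficients(base):
    """Compute the top stone directly from binomial coefficients."""
    n = len(base) - 1
    return sum(comb(n, k) * value for k, value in enumerate(base))

base = [3, 1, 4, 2, 5, 9]  # arbitrary values for a, b, c, d, e, f
print(pyramid_top(base), top_from_coefficients(base))  # 102 102

# Ordered choices of 6 distinct numbers from 1..99 for the lowest level.
print(perm(99, 6))  # 806781064320
```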

GitHub Repository:

Languages and Tools:

R

RStudio

Stan Differential Expression Mixture Model (2021)

During a research internship, I made a prototype for a new mixture model to detect differential expression in scRNA-seq data. For this I used Stan's automatic variational inference algorithm to model the posterior distribution of inter-individual differences from the grand means of genes.

GitHub Repository (Private):

Languages and Tools:

R

Stan

RStudio

PseudobuLk and Single cell Mixture models for Investigating Differential expression (PLASMID) (2021)

During an internship, I worked in collaboration with the UK DRI at Imperial College London on an ongoing project to robustly detect irregularities in microglia gene expression in Alzheimer’s patients. It is hoped this work can provide guidance for other researchers to reduce FDR in scRNA-seq differential expression analysis.

My contribution was creating PLASMID, a parallelizable Snakemake pipeline to benchmark pseudobulk, pseudoreplicate and mixture model differential expression methods. Supported methods include DESeq2, edgeR, limma, Welch's t-test, the Wilcoxon test, MAST and NEBULA. Testing different parameter combinations gives 41 variants in total.

After running each method, log fold-changes and adjusted p-values are output and the results are visualised through custom-built interactive graphs. These allow a user to visually identify up- and down-regulated genes and compare the methods against an expected ground truth.

Publication:

Not yet available

GitHub Repository (Private):

Languages and Tools:

R

Python

YAML

Plotly

RStudio

Snakemake

Anaconda

Bayesian Mixture Model For scRNA-seq Clustering (2021)

During a research internship, I helped test a novel Bayesian mixture model for clustering scRNA-seq data from mouse cortex and hippocampus, with the aim of identifying cell type markers in an interpretable way. This involved:

  • Creating interactive graphs to detect patterns of mean gene expression in clusters
  • Assisting development of the Snakemake pipeline
  • Code documentation

Publication:

Identifying sub-populations of cells in single cell transcriptomic data – a Bayesian mixture model approach to zero-inflation of counts

GitHub Repository:

Languages and Tools:

R

Python

Plotly

RStudio

Snakemake

Anaconda

Singularity

Clustering and Topological Data Analysis of Single-Cell RNA Sequencing Data (2020 - 2021)

For my Computer Science undergraduate final year project, I identified sub-populations of cells in scRNA-seq data derived from lung adenocarcinoma, mouse cortex/hippocampus and simulated data. To do this, I experimented with various dimensionality reduction techniques, including autoencoders, and clustering methods. As a further exploration tool, I performed topological data analysis using the Mapper algorithm, which revealed a detectable difference between real and simulated data.

Project Report:

Clustering and Topological Data Analysis of Single-Cell RNA Sequencing Data

Website:

GitHub Repository:

Languages and Tools:

Python

PyTorch

PyTorch Lightning

scikit-learn

R

Plotly

Jupyter Notebook

Google Colab

IMDb Movie Genre Predictor (2021)

In a natural language processing (NLP) group project, I tested different NLP techniques and model architectures and built a CI/CD pipeline to train and deploy a multi-label classifier. The classifier was trained on a dataset of movie descriptions to predict the best-fitting genre(s) from 12 options. The state of the best model was saved to file and deployed on a custom-built web server.

Website:

GitHub Repository:

Languages and Tools:

Python

PyTorch

scikit-learn

Plotly

Jupyter Notebook

Anaconda

Semi-Supervised Learning with TCNs for ECG Classification (2021)

I managed a degree project in which we experimented with a new semi-supervised learning approach to identify arrhythmia (a type of heart condition) from ECG time-series data.

We applied k-means clustering to create new labels for each ECG and used them to train a temporal convolutional network (TCN). Then we employed transfer learning to create a new network for classifying the original arrhythmia labels.

While transferring network weights did not make a significant difference compared to training the network without them, our final model achieved 97% accuracy. This demonstrates that a simple TCN architecture is at least as suitable for time-series ECGs as many highly optimized CNN or LSTM architectures published in the literature at the time.

Website:

GitHub Repository:

Languages and Tools:

Python

TensorFlow

scikit-learn

Jupyter Notebook

Google Colab

Genetic Algorithm for Neural Network Weight Optimisation (2020)

The aim was to train a feed-forward multi-layer perceptron to approximate the function y = sin(2x₁ + 2) + cos(0.5x₂) + 0.5, given that x₁, x₂ ∈ [0, π]. A genetic algorithm was implemented to optimize the network's weights. To aid convergence, local learning via Rprop was applied using approaches inspired by Lamarckian and Baldwinian models of evolution.

GitHub Repository:

Languages and Tools:

Python

PyTorch

Jupyter Notebook

Google Colab

Non-Dominated Sorting Genetic Algorithm (NSGA-II) (2020)

Given two functions f₁ and f₂ and the task of finding min{f₁, f₂}, an elitist NSGA-II was implemented to find the optimal candidates for x₁, x₂, x₃ that satisfy this multi-objective optimisation problem.
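
The core ranking step of NSGA-II, non-dominated sorting, can be sketched as follows for a minimisation problem. The function names and the example objective values are hypothetical, and the crowding-distance and selection steps are omitted.

```python
def dominates(a, b):
    """True if objective vector a is no worse than b everywhere and better somewhere."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def non_dominated_sort(objectives):
    """Group solution indices into Pareto fronts (front 0 is the best)."""
    n = len(objectives)
    dominated_by = [[] for _ in range(n)]  # indices that solution i dominates
    dom_count = [0] * n                    # how many solutions dominate i
    fronts = [[]]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            if dominates(objectives[i], objectives[j]):
                dominated_by[i].append(j)
            elif dominates(objectives[j], objectives[i]):
                dom_count[i] += 1
        if dom_count[i] == 0:
            fronts[0].append(i)
    k = 0
    while fronts[k]:
        next_front = []
        for i in fronts[k]:
            for j in dominated_by[i]:
                dom_count[j] -= 1
                if dom_count[j] == 0:
                    next_front.append(j)
        fronts.append(next_front)
        k += 1
    return fronts[:-1]  # drop the trailing empty front

# Each tuple holds (f1, f2) for one candidate solution.
points = [(1, 5), (2, 3), (4, 1), (3, 4), (5, 5)]
print(non_dominated_sort(points))  # [[0, 1, 2], [3], [4]]
```

In an elitist scheme, parents and offspring are pooled and the next generation is filled front by front, so the best non-dominated candidates are never lost.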

GitHub Repository:

Languages and Tools:

Python

Jupyter Notebook

Google Colab

Vigenère Cipher Cracker (2020)

The Vigenère cipher is a type of substitution cipher used to encrypt plaintext. Normally it is hard to crack. However, given two ciphertexts encrypted under the same key and a list of potential plaintext solutions, it is possible to obtain the key through brute force. In this project, I created a Java class VigenereCracker that can break the cipher.
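
One possible approach is sketched below in Python rather than Java. It is not the VigenereCracker class itself: the function names, the restriction to uppercase A-Z text and the example messages are assumptions. Each candidate plaintext is tried against the first ciphertext to derive a key, and the key is accepted only if it also decrypts the second ciphertext to a candidate.

```python
def vigenere(text, key, decrypt=False):
    """Encrypt or decrypt uppercase A-Z text with a repeating key."""
    sign = -1 if decrypt else 1
    out = []
    for i, ch in enumerate(text):
        shift = sign * (ord(key[i % len(key)]) - 65)
        out.append(chr((ord(ch) - 65 + shift) % 26 + 65))
    return "".join(out)

def recover_key_stream(ciphertext, plaintext):
    """Derive the key letters implied by a ciphertext and a guessed plaintext."""
    return "".join(chr((ord(c) - ord(p)) % 26 + 65)
                   for c, p in zip(ciphertext, plaintext))

def crack(cipher1, cipher2, candidates):
    """Return the shortest key consistent with both ciphertexts, or None."""
    for guess in candidates:
        if len(guess) != len(cipher1):
            continue
        stream = recover_key_stream(cipher1, guess)
        for key_len in range(1, len(stream) + 1):
            key = stream[:key_len]
            if (vigenere(guess, key) == cipher1
                    and vigenere(cipher2, key, decrypt=True) in candidates):
                return key
    return None

candidates = ["DEFENDTHEEAST", "ATTACKATDAWN", "RETREATATONCE"]
c1 = vigenere("ATTACKATDAWN", "LEMON")
c2 = vigenere("RETREATATONCE", "LEMON")
print(c1)                         # LXFOPVEFRNHR
print(crack(c1, c2, candidates))  # LEMON
```

The second ciphertext is what rules out false positives: a wrong guess always yields some key that reproduces the first ciphertext, but that key is unlikely to decrypt the second message to another candidate.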

GitHub Repository:

Languages and Tools:

Java

Eclipse

Sinhala Script Optical Character Recognition (2020)

In this project, I experimented with creating a basic optical character recognition (OCR) system that takes images of printed Sinhalese characters and converts them to machine-readable text using a KNN classifier.

GitHub Repository:

Languages and Tools:

Python

scikit-learn

Jupyter Notebook

Google Colab

SUBS web application (2020)

Surrey University Buy & Sell (SUBS) is a web application that allows Surrey University students to sell their unwanted goods. It was developed as part of a seven-person undergraduate group project using Ruby on Rails (a web framework with an MVC architecture). Although it was not officially launched, a prototype is hosted on Heroku.

My role for this project included:

  • Project management (arranging team meetings, coordination with the project sponsor, contacting members individually to solve issues and tracking the overall progress of the project to ensure deadlines were met)
  • Front end and back end development
  • Graphics design

Website:

GitHub Repository:

Languages and Tools:

Ruby

Rails

HTML

JavaScript

SCSS

Atom

Bluetooth Controlled Wrench Mask (2018 - 2019)

As a fun project, I created a unique, real-life version of Wrench's light-up mask (from Watch Dogs 2) using an Arduino UNO. The LED matrices are controllable over Bluetooth by a custom-made Android application.

Website:

GitHub Repository:

Languages and Tools:

Arduino

Java

Android Studio

Genetic Algorithm Stick Figure Obstacle Course Solver (2017 - 2018)

For my A-level Computer Science project, I designed a genetic algorithm (a heuristic search algorithm inspired by natural selection) that over time allows a population of stick figures to learn to solve an obstacle course.

GitHub Repository:

Languages and Tools:

Python

Wing IDE