Data-Centric Engineering

Biological & Behavioural Sciences

Below you will find Data-Centric Engineering projects offered by supervisors within the School of Biological & Behavioural Sciences.

This is not an exhaustive list. If you have your own research idea, or if you are a prospective PDS candidate, please return to the main DCE Research page for further guidance, or contact us at dce-cdt@qmul.ac.uk  

Developing automated microscopy image analysis pipelines for life science applications

Microscopy can easily generate terabytes of data from time-lapse movies, 3D image stacks and high-throughput drug screens. Analysing these data automatically and quantitatively requires sophisticated image processing tools and machine-learning algorithms. The APEER platform, founded by Zeiss, is one online platform that puts such automated data analysis to work for new discoveries. In this project, the Scholar will set up a sophisticated image analysis pipeline to track chromosome movements in dividing human cells. Combining computational strengths in this way to uncover quantitative cell biological information can support drug screening assays in the pharma industry, and is thus of direct relevance to the engineering and healthcare sectors.

Supervisor: Prof Viji Draviam 

Reconstructing evolutionary histories from big genomic data

Large-scale projects are sequencing genomes for thousands of organisms. For example, the Darwin Tree of Life project is currently sequencing >60K animal and plant genomes, while >700K SARS-CoV-2 genomes have been sequenced to track the COVID-19 pandemic. Such staggering amounts of data are hard to analyse with current computational methods. The dos Reis and Nichols labs develop Bayesian methods for analysis of Big Genomic Data.

In this project, the Scholar will extend a prototype Hamiltonian MCMC sampler developed in the dos Reis lab to reconstruct evolutionary histories from genomic data. This sampler can be used to track the evolution of organisms ranging from viruses and other human pathogens to plants and animals.
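
For readers unfamiliar with Hamiltonian MCMC, the general idea can be illustrated with a minimal sketch. This is not the dos Reis lab prototype: it is a toy, one-dimensional sampler targeting a standard normal distribution, and every name in it (log_density, grad_log_density, leapfrog, hmc_sample) as well as the step size and trajectory length are assumptions chosen for illustration only. A sampler for evolutionary histories would instead target a posterior over quantities such as divergence times and rates.

```python
import math
import random


def log_density(x):
    """Log-density (up to a constant) of the toy target: a standard normal."""
    return -0.5 * x * x


def grad_log_density(x):
    """Gradient of the toy log-density with respect to x."""
    return -x


def leapfrog(x, p, step_size, n_steps):
    """Simulate Hamiltonian dynamics with the leapfrog integrator."""
    p = p + 0.5 * step_size * grad_log_density(x)  # initial half step for momentum
    for i in range(n_steps):
        x = x + step_size * p  # full step for position
        if i < n_steps - 1:
            p = p + step_size * grad_log_density(x)  # full step for momentum
    p = p + 0.5 * step_size * grad_log_density(x)  # final half step for momentum
    return x, p


def hmc_sample(n_samples, step_size=0.2, n_steps=10, x0=0.0, seed=1):
    """Draw samples from the toy target using Hamiltonian Monte Carlo."""
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n_samples):
        p = rng.gauss(0.0, 1.0)  # resample the auxiliary momentum
        current_h = -log_density(x) + 0.5 * p * p
        x_new, p_new = leapfrog(x, p, step_size, n_steps)
        proposed_h = -log_density(x_new) + 0.5 * p_new * p_new
        # Metropolis correction for the integrator's discretisation error
        accept_prob = math.exp(min(0.0, current_h - proposed_h))
        if rng.random() < accept_prob:
            x = x_new
        samples.append(x)
    return samples


if __name__ == "__main__":
    draws = hmc_sample(5000)
    mean = sum(draws) / len(draws)
    var = sum((d - mean) ** 2 for d in draws) / len(draws)
    print(f"sample mean = {mean:.3f}, sample variance = {var:.3f}")
```

The gradient of the log-density steers each proposal along a trajectory, which is what lets Hamiltonian samplers take large moves that are still accepted with high probability. In this sketch the sample mean and variance should come out close to 0 and 1, the values for the toy target.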

This project is suitable for a student with a background in maths, physics, computer science or similar and with programming experience in a major language (such as C++ or Java).

Experience of Bayesian statistics and R programming is desirable but not essential.

Keywords: Hamiltonian dynamics, Bayesian statistics, MCMC, big data, genetics, genomics, evolution

Supervisor: Dr Mario dos Reis Barros

A toolkit for pragmatic interrogation, exploration & hypothesis testing of disconnected genomic data

The 50,000-fold drop in DNA sequencing costs over 10 years creates tremendous opportunities for research and applications. Biology is transforming into a data science. However, obtaining biological insight from the newly available data is challenging because:
1. data types and analysis algorithms are regularly superseded;
2. data is fragmented and unconnected;
3. concepts to connect data sources are fuzzy – including “gene” and “species”;
4. PIs and most biologists have limited analysis skills.

Biologists now often face the challenge of answering the following types of questions: “My experiment finds 100 genes that are differently represented between the invasive and the benign form of this pest. Do such differences also exist between invasive and benign forms of another one of the most closely related species? In which tissues are these genes active in the most closely related species? Is there evidence of recent changes in the active sites of these genes?” Answering such questions is crucial for interpreting results and prioritising follow-up experiments. However, even Unix-savvy biologists need months of work to do so, because it requires multi-level queries across disparate sources (“structured data lakes”), which in many cases have unknown connections that must first be computed.

This EngD project aims to produce a pragmatic tool that automates such interrogation by following a reasonable preconfigured decision tree. We will demonstrate its relevance by applying it to the study systems of two external stakeholders and providing new biological insight. We will package the results of our work in a manner that makes them accessible to biologists. The work will build on our extensive success with http://sequenceserver.com (>10,000 users in private & public research & development) and our track record of obtaining insight on biological processes (https://wurmlab.github.io/publications). Overall, our approach will substantially increase the efficiency and accuracy of genomics researchers.

Supervisor: Dr Yannick Wurm

Generation of synthetic high-throughput genomic data using deep learning

Recent technological advances in DNA and RNA sequencing have allowed the generation of large-scale genomic data from model and non-model species. Inference of population parameters is now hampered by the scale of available data, as commonly used statistical approaches are not suitable or efficient under such conditions. The application of machine learning, and specifically deep learning, in population genomics has been pioneered by Fumagalli and collaborators (www.evogenomics.ai) and has been proposed as a tool to overcome some of the ongoing challenges in this discipline.

One of the grand challenges in population genomics is how to efficiently generate reliable synthetic data sets which can be used for benchmarking new bioinformatics technologies or for training machine learning algorithms. The intrinsic uncertainty of genomic sequencing experiments, associated with a plethora of platforms, protocols and sampling conditions, prevents model-based approaches from providing sufficiently realistic realisations of data sets. Recently, generative adversarial networks have been proposed to generate synthetic genomes under arbitrary population models, but their applications are limited to model organisms and do not account for the data uncertainty of sequencing experiments.

This project will uncover the potential of deep learning to generate high-throughput population genomic data sets efficiently under a wide range of experimental scenarios (e.g., DNA short/long reads, RNA-seq, single-cell sequencing, …) for non-model organisms. The student will then use such synthetic data sets to benchmark a series of commonly used bioinformatics pipelines for assembly, variant calling and population genetic analyses to assess their performance.

The project will also exploit the trained algorithms to gain novel biological insights by applying them to large-scale population genomic data sets. Specifically, the student will design, implement, train and deploy a generative adversarial network for genomic data from multiple Anopheles gambiae mosquito populations in Africa to infer past and recent population size changes, as well as estimate migration rates between geographical locations. Understanding the geographical movement of malaria mosquito vectors is essential to monitor the spread of insecticide-resistance mutations and plan control strategies.

Keywords: genomics, bioinformatics, deep learning, adversarial networks

Supervisors: Dr Matteo Fumagalli, Dr Yannick Wurm 
