Data-Centric Engineering

Biological & Chemical Sciences

Below you will find Data-Centric Engineering projects offered by supervisors within the School of Biological & Chemical Sciences.

This is not an exhaustive list. If you have your own research idea, or if you are a prospective PDS candidate, please return to the main DCE Research page for further guidance, or contact us.

Developing automated microscopy image analysis pipelines for life science applications

Microscopy can easily generate terabytes of data from time-lapse movies, 3D image stacks and high-throughput drug screens. Quantitative, automated analysis of these data requires sophisticated image processing tools and machine-learning algorithms. The APEER platform founded by Zeiss is one such online platform, where new discoveries can be made by utilising the power of automated data analysis. In this project, the Scholar will set up a sophisticated image analysis pipeline to track chromosome movements in dividing human cells. This effort to combine computational strengths to uncover quantitative cell biological information can support drug screening assays in the pharma industry and is thus of direct relevance to the engineering and healthcare sectors.

Supervisor: Prof Viji Draviam 

Machine Learning and its application in the development of novel cannabinoids

A key step in the established drug discovery process is to generate a pool of suitable candidates for synthesis and subsequent biological screening, based on small structural modifications of an existing molecule of known biological activity, with the intent of improving potency and selectivity and minimising toxicity. Unfortunately, this requires substantial time, labour and financial resources. Molecular dynamics and Monte Carlo computer simulation techniques have greatly assisted the selection process owing to their ability to predict the binding affinity of a molecule with a given protein. In many cases, however, there is a ‘computational cost’ owing to the processing time and computational power required to obtain accurate calculations. An alternative approach is the use of knowledge-based and machine learning applications. Machine learning has been shown to be capable of yielding rapid and accurate predictions and can be orders of magnitude faster than traditional computational chemistry methods. Herein we propose to apply this emerging and potentially transformative method to the identification, targeting and subsequent synthesis of viable cannabimimetics, a class of compounds that have widespread health applications but have yet to be explored by such techniques.

Supervisor: Dr Christopher Jones

Reconstructing evolutionary histories from big genomic data

Large-scale projects are sequencing genomes for thousands of organisms. For example, the Darwin Tree of Life project is currently sequencing >60K animal and plant genomes, while >700K SARS-CoV-2 genomes have been sequenced to track the COVID-19 pandemic. Such staggering amounts of data are hard to analyse with current computational methods. The dos Reis and Nichols labs develop Bayesian methods for the analysis of big genomic data.

In this project, the Scholar will extend a prototype Hamiltonian MCMC sampler developed in the dos Reis lab to reconstruct evolutionary histories from genomic data. This sampler can be used to track evolution from viruses and other human pathogens to plants and animals.
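For readers unfamiliar with the technique, the general idea of Hamiltonian MCMC can be sketched on a toy problem. The example below is only an illustration under stated assumptions: it samples a one-dimensional standard normal distribution, not a phylogenetic posterior, and it is not the dos Reis lab prototype. The function names (`leapfrog`, `hmc_sample`) and the step-size settings are hypothetical choices made for this sketch.

```python
import math
import random


def leapfrog(q, p, grad_U, step, n_steps):
    """Simulate Hamiltonian dynamics with the leapfrog integrator."""
    p = p - 0.5 * step * grad_U(q)          # initial half step for momentum
    for i in range(n_steps):
        q = q + step * p                    # full step for position
        if i < n_steps - 1:
            p = p - step * grad_U(q)        # full step for momentum
    p = p - 0.5 * step * grad_U(q)          # final half step for momentum
    return q, p


def hmc_sample(U, grad_U, q0, n_samples, step=0.2, n_steps=20, seed=1):
    """Draw samples from the density proportional to exp(-U(q))."""
    rng = random.Random(seed)
    q = q0
    samples = []
    for _ in range(n_samples):
        p0 = rng.gauss(0.0, 1.0)            # resample momentum each iteration
        q_new, p_new = leapfrog(q, p0, grad_U, step, n_steps)
        h_old = U(q) + 0.5 * p0 * p0        # total energy before the move
        h_new = U(q_new) + 0.5 * p_new * p_new
        # Metropolis correction for the integrator's discretisation error
        if rng.random() < math.exp(min(0.0, h_old - h_new)):
            q = q_new
        samples.append(q)
    return samples


# Toy target (an assumption for this sketch): standard normal, U(q) = q^2 / 2
def U(q):
    return 0.5 * q * q


def grad_U(q):
    return q


samples = hmc_sample(U, grad_U, q0=0.0, n_samples=5000)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```

In this sketch `mean` and `var` should come out close to 0 and 1 respectively. In a real application the potential `U` would be the negative log posterior over the quantities of interest, and the gradient would have to be derived or computed for that model.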

This project is suitable for a student with a background in maths, physics, computer science or similar and with programming experience in a major language (such as C++ or Java).

Experience of Bayesian statistics and R programming is desirable but not essential.

Keywords: Hamiltonian dynamics, Bayesian statistics, MCMC, big data, genetics, genomics, evolution

Supervisor: Dr Mario dos Reis Barros

A toolkit for pragmatic interrogation, exploration & hypothesis testing of disconnected genomic data

The 50,000-fold drop in DNA sequencing costs over 10 years creates tremendous opportunities for research and applications. Biology is transforming into a data science. However, obtaining biological insight from the newly available data is challenging because:
1. data types and analysis algorithms are regularly superseded;
2. data is fragmented and unconnected;
3. concepts to connect data sources are fuzzy – including “gene” and “species”;
4. PIs and most biologists have limited analysis skills.

Biologists now often face the challenge of answering the following types of questions: “My experiment finds 100 genes that are differently represented between the invasive and the benign form of this pest. Do such differences also exist between invasive and benign forms of one of the most closely related species? In which tissues are these genes active in the most closely related species? Is there evidence of recent changes in the active sites of these genes?” Answering such questions is crucial for interpreting results and prioritising follow-up experiments. However, even Unix-savvy biologists need months of work to do so, because it requires multi-level queries across disparate sources (“structured data lakes”) which in many cases have unknown connections that must first be computed.

This EngD project aims to produce a pragmatic tool to facilitate such interrogation.
We will develop a practical approach to automate it, following a reasonable preconfigured decision tree. We will demonstrate its relevance by applying it to the study systems of two external stakeholders and providing new biological insight. We will package the results of our work in a manner that makes them accessible to biologists. The project will build on our extensive success with (>10,000 users in private & public research & development) and our track record of obtaining insight on biological processes. Overall, our approach will substantially increase the efficiency and accuracy of genomic researchers.

Supervisor: Dr Yannick Wurm