The sequencing of the first human genomes over the past twenty years has laid the foundation for rapid advances in our understanding of human genetics, the intricacies of cell regulation and development, and the forces that drive evolution itself \cite{schatz}.

The initial protocols for sequencing DNA, developed in the 1970s, were slow: a researcher could reliably sequence only a few thousand base pairs per week \cite{sanger}. However, the maturation of high-throughput biotechnological platforms, such as massively parallel sequencing, has brought an unprecedented volume of data to biomedical research. The scale of the improvement in throughput has been matched by a similar reduction in the cost of sequencing.

Furthermore, the technologies used for sequencing DNA have been repurposed for several other applications, for example protocols for measuring the levels of mRNA transcription (RNA-seq) or translation (Ribo-seq) inside cells, giving us powerful methods for investigating regulatory relationships between different molecular layers.

As we look ahead, the adoption of high-throughput sequencing technologies will extend biology into the realm of big data science, comparable with astronomy and particle physics, opening up the potential of using multiomic data as a clinical tool \cite{schatz,stephens}. The volume and heterogeneity of large biological datasets will pose unique computational challenges, and will require the development of novel statistical and machine learning tools to answer key biological and medical questions.

Historically, expression experiments were limited to measurements averaged over hundreds or thousands of cells. This averaging can mask or even misrepresent signals of interest, limiting our ability to understand and model heterogeneous cell populations. Fortunately, recent technological advances allow us to resolve the molecular structure of a population at the level of a single cell. Single-cell assays give us an exciting opportunity to explore heterogeneity, for example in cancerous tumors, whose structure bulk sequencing could never hope to recover \cite{tsoucas}.
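As a toy illustration of how averaging can misrepresent a heterogeneous population, consider a single gene that is switched off in half of the cells and strongly expressed in the other half. The numbers below are invented for illustration only and do not come from any dataset.

```python
from statistics import mean

# Hypothetical expression of one gene in a mixed population (arbitrary units):
# half the cells have the gene switched off, half express it strongly.
off_cells = [0.0] * 50
on_cells = [10.0] * 50
population = off_cells + on_cells

# A bulk assay reports roughly the population average.
bulk_signal = mean(population)

# Count the cells whose own expression is within 1 unit of the bulk value.
cells_near_bulk = sum(abs(x - bulk_signal) < 1.0 for x in population)

print(bulk_signal, cells_near_bulk)
```

The bulk value sits midway between the two subpopulations, and no individual cell has an expression level close to it. A single-cell measurement would instead recover the two groups directly.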

An understanding of cells at this resolution will help to address problems such as the development of therapy resistance in cancer, or the origin of autoimmune diseases and how they could be treated \cite{navin}.

More recently, single-cell RNA-sequencing (scRNA-seq) has become an efficient technique for the rapid and cheap profiling of the transcriptome at the level of a single cell. Using this technique, mRNA molecules can be extracted from single cells and amplified to the abundance required for sequencing. Cell differentiation, or other dynamic processes, can be reconstructed algorithmically from scRNA-seq data taken from a population of unsynchronised cells. This approach rests on the assumption that an unsynchronised population, if it is large enough, contains cells at all stages of a given dynamic process. This is known as pseudo-temporal ordering: expression dynamics are resolved by reordering the cells according to their position along a differentiation path. High-throughput longitudinal studies remain impractical, so learning the dynamics of a cell will rely on such pseudo-time approaches, based on time series that are cross-sectional in nature.

The first robust and efficient method for ordering cells according to their position along a differentiation process, Monocle, is illustrative of the general approach taken by pseudo-temporal ordering algorithms. There is an initial dimensionality reduction step in which the high-dimensional cellular space, with dozens to thousands of genes or protein markers, is converted to a simplified representation using independent component analysis (ICA). After the reduction in dimensionality, a minimum spanning tree (MST) is calculated, and the longest connected path (or paths) through it is determined. Each individual cell in the sample is then assigned to the nearest point on this inferred trajectory \cite{trapnell}.

Pseudo-temporal ordering assumes that genes do not change direction often, so that cells with similar transcriptional profiles should be placed close to each other in the ordering. This assumption exposes a weakness of the approach: it will not work for oscillatory processes, for example two genes that oscillate with an identical frequency but are phase shifted. Algorithms such as Oscope have been developed to reconstruct such processes \cite{oscope}. Nevertheless, this shows that there are likely many processes which are not well described by pseudo-temporal ordering as it stands.
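The MST-based ordering described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Monocle's implementation: the dimensionality reduction step is skipped, and the cells are assumed to be already projected to hypothetical 2D coordinates. All function names (`minimum_spanning_tree`, `longest_path`, `pseudotime`) are made up for this sketch. The direction of the resulting ordering is arbitrary, since nothing in the data marks which end of the path is the start of the process.

```python
import math
from collections import defaultdict


def minimum_spanning_tree(points):
    """Prim's algorithm on the complete Euclidean graph; returns an adjacency map."""
    n = len(points)
    in_tree = [False] * n
    best_dist = [math.inf] * n   # cheapest known edge linking each cell to the tree
    best_from = [-1] * n         # tree node that edge comes from
    best_dist[0] = 0.0
    adjacency = defaultdict(list)
    for _ in range(n):
        # Pick the cheapest cell not yet in the tree.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best_dist[i])
        in_tree[u] = True
        if best_from[u] >= 0:
            w = best_dist[u]
            adjacency[u].append((best_from[u], w))
            adjacency[best_from[u]].append((u, w))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best_dist[v]:
                    best_dist[v], best_from[v] = d, u
    return adjacency


def farthest_from(adjacency, start):
    """Walk the tree from `start`; return the farthest node and the parent map."""
    dist = {start: 0.0}
    parent = {start: None}
    stack = [start]
    while stack:
        u = stack.pop()
        for v, w in adjacency[u]:
            if v not in dist:
                dist[v] = dist[u] + w
                parent[v] = u
                stack.append(v)
    return max(dist, key=dist.get), parent


def longest_path(adjacency):
    """Longest path through the tree (its diameter), found with two traversals."""
    end_a, _ = farthest_from(adjacency, 0)
    end_b, parent = farthest_from(adjacency, end_a)
    path = []
    node = end_b
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]   # runs from end_a to end_b


def pseudotime(points):
    """Give every cell the rank of its nearest cell on the backbone path."""
    backbone = longest_path(minimum_spanning_tree(points))
    rank = {cell: i for i, cell in enumerate(backbone)}
    return [
        rank[min(backbone, key=lambda b: math.dist(p, points[b]))]
        for p in points
    ]


# Hypothetical cells, listed in shuffled order, already reduced to 2D.
cells = [(0.0, 0.0), (3.0, 0.4), (1.0, 0.1), (4.0, 0.9), (2.0, 0.2), (2.1, 1.0)]
order = pseudotime(cells)
print(order)
```

In this toy input the first five cells lie along a gently rising curve and form the backbone, while the last cell branches off it and receives the same rank as its nearest backbone neighbour. Cells with similar coordinates end up adjacent in the ordering, which is exactly the assumption that fails for oscillatory processes.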

Molecular layers are not independent of each other. Recently, Clark et al. demonstrated a parallel single-cell sequencing protocol for the joint profiling of chromatin accessibility, DNA methylation and transcription in single cells (scNMT-seq) \cite{clark}. A potential avenue of research would be to incorporate our ability to resolve multiple epigenetic features, in conjunction with gene expression, to develop superior pseudo-temporal ordering algorithms. At the leading edge of research, attempts to extend the pseudo-temporal model further still fall far short of learning a true model of cell dynamics or of understanding the regulatory relationships between molecular layers. One example is the pseudo-dynamics of Fischer et al., which accounts for population growth dynamics such as population bursts and selection, and approximates the developmental potential function, including stability information of cell states \cite{fischer}. Therefore, as it stands, algorithmic approaches are fundamentally limited in their ability to guide medical intervention.

However, this field is still in its infancy: high-resolution data are a relatively recent innovation and will inevitably improve with iteration. A key question I hope to contribute to through my PhD is whether it is possible to learn a complete model of cellular dynamics. The resolution of this question would have direct and far-reaching biological and medical implications, such as revealing the fundamental relationship between cell development and disease \cite{cannoodt}.

