Predictive genomics: The Sangam of statistics and science

By Aakash Binu
February 15, 2022

Genetic data is unique in its nature, being a portal to not just the past, but also the future; it does not abide by the arrow of time and gives us insights into ourselves as much it does to our environment. While genetics remains a hotbed for scientific inquiry as a theoretical discipline, research in genomics has opened thoroughfares to better medicine and therapy, becoming to engineering what genetics is to the physical sciences.

All great cultures are empowered and metamorphosed by the wisdom of generational planning. Native Americans, for instance, believed that one must act keeping in mind the welfare of the next seven generations following their time: a tenet of homeopathic practice and avowed even today by modern science. These cultures understood the pitfalls of analytical myopia and the need to think ahead of time to realize subsistence and more.

To dilettantes with only a passable knowledge of this subject, the term ‘predictive genomics’ may seem obsolete since all endeavors in the field of genetics are predictive in their intent. However, it is a discipline at the crossroads of various fields like personal genomics, phenology, and bioinformatics alongside many more. Its relevance lies in the understanding of the human phenotype (expressed traits in an individual) as a function of the person’s genotype and his/her environment and being able to integrate the two variables of the function in a population for generations using advanced computational modalities like AI and Machine Learning.

Much of today’s predictive genomics is a result of Genome wide Association studies (these studies were possible due to the rise of biobanks, safekeeps of data sets demonstrating great genomic and trait variation) done over the course of the last two decades, occluding further doubts on the inheritance of complex traits, narrowing them down to the thousands of genetic mutations that are known as SNPs (commonly pronounced ‘snips’). These SNPs are the fundamental unit of biological evolution of all organisms, according to neo-Darwinian theory; mutations in their myriad iterations can make or break a species in the long run as they can be at the core of the inheritance of new survival traits and speciation or the accumulation of traits that lead to extinction. They are random and continually occur around us.

Predating these discoveries, the human genome project was completed, about 20 years ago, bequeathing scientists with a huge amount of data; experts estimate that research in genomics could yield up to 40 exabytes of data. With such a staggering volume of complex data to navigate, AI and ML techniques become obvious candidates for enhancing the efficiency and accuracy of predictive genomics.

Today, pharmacogenetics powered by deep learning algorithms make for an excellent use-case of digitally powered predictive genomics. A study that seeks the correlation between drug response and the individual’s genotype, its research labs sequence and genotype millions of people to understand drug delivery and inhibition across different populations. Such data is groundbreaking in demonstrating the relevance of individual gene expression in the efficacy of various medicines.

Consider the gene CYP2D6, expressed primarily in the liver and is crucial for the metabolization of codeine, one of America’s most popular pain relievers. The gene codes for the synthesis of proteins that converts xenobiotics like codeine to water-soluble morphine through demethylation; it is worthy of note that codeine has no analgesic action by itself. Now, we know through statistical analysis that 1-5% of Americans is poor metabolizers of codeine to the point it elicits no response, while another 1-21% are ultra-metabolizers, leading to huge spikes in blood morphine levels but effective for a very short interval of time. Similar trends can be noted for hundreds of drugs as they are all metabolized by CYP2D6, with its 161+ recognized haplotypes(alleles) dictating the effectiveness of these drugs.

Today, we are struggling to fully comprehend the effects of these known gene combinations, let alone those of the numerous rare haplotypes formed through mutations (SNPs) found across the world further convoluted by, ironically, the uniformity of variation across ethnicities. it is however important as predicting the functions of these novel alleles is key to improving the drug responses of these patients.

It is at this juncture that deep learning becomes relevant as the next machine learning paradigm. By implementing techniques that combine CNN image analysis and transfer learning, we can build deep neural networks that can generate voluminous sequences with known variations spiked into these sequences. Once we generate scores for each allele, we can train a model to assign these scores, forcing the CNN to ‘learn’ key sequence features. The experimental data on rare variations can be used to refine the final layers of the network, which can then predict and assess the outcomes of using codeine in individuals with rare alleles of CYP2D6.

Looking ahead

One must not overlook the fact that ML in genomics is in its infancy; a field that is less than a decade old. The reason is that the 3-D relationships shared by genes are much more complex than pixels and their interactions; as mentioned earlier, image recognition and analysis have great potential in the ML world. Today, breakthroughs in research have led to the union of both these techniques: ML devices like Deep Gestalt can accurately diagnose( up to 91%) over 200 disorders through image analyses by observing facial phenotypic manifestations of genetic aberration(note that the utilization AI/ML in genomics corresponds to deep learning and that remains so for the scope of this discussion, as well).

Deep learning neural networks are of 4 types broadly and they use different inputs; while fully connected networks use k-mer match metrics, convolutional and recurrent networks can use DNA sequence data, as well as image and time-series measurements respectively. The fourth type, graph convolutional, utilizes protein-to-protein interaction and structure. These are often manifest as modalities that can identify sequence context features capable of predicting transcription factor binding, decoded networks revealing differential gene expression, and more.

Applying Deep Learning to arrive at this profound understanding of gene mutations can be pivotal in understanding the origins of tumors and the development of gene therapy for cancer prevention and the prevention and treatment of many genetic disorders. It can cut down on years of painstakingly slow research and help us arrive at possible solutions much faster, or it can help us choose the most promising solutions – potentially saving us millions of dollars that could be wasted by going down the wrong road.

Yet, these technologies have miles to go before achieving mainstream acclaim and widespread professional implementation; for now, scientific incubation in research labs under the eyes of experts is the best mode of development for predictive genomics.