Gregg Thomas - Research

My main research interests center around how changes in DNA sequences play a role in adaptation. Molecular evolution — the changes that accrue in the DNA sequence of a population or species over time — can be driven by many factors, including evolution of the mutation rate, natural selection, and life history variation between species. Additionally, false inferences of molecular change can be caused by systematic error in the genome sequencing and comparative pipelines, and unaccounted phylogenetic discordance. Using phylogenetics and comparative genomics, I am interested in studying adaptation by identifying general patterns of molecular evolution and specific links between molecular changes and phenotypes. In the process, I develop software to solve specific problems and to help account for systematic errors through the genomic workflow. Below I outline some of the specific ways I have pursued these goals up to this point!

As I strive for high quality and reproducible research on my own, I also want to pass on what I learn by teaching best practices in bioinformatics, encompassing data science skills, programming and software development, and computational scientific communication. I develop courses and workshops to try to convey these concepts to biologists at any computational skill level, with an emphasis on access and equity. It is my hope that in teaching these skills, students and those that I mentor can make the most of them in their own research. Many of the underlying skills in bioinformatics are widely applicable to data science, giving students the ability to translate these skills into many possible domains and career paths.

Comparing genomes between related species in the context of their phylogeny provides us the opportunity to ask and answer questions about how these species evolved at the molecular level. Using comparative genomics we can study patterns such as substitution rate variation, convergent molecular evolution (right, bottom), gene family evolution (see below), gene expression and much more. I aim to uncover these patterns across the tree of life. I have done large-scale comparative studies in primates and insects, and I am now working on similar studies in rodents and turtles. And I am working on software to account for reference bias in read mapping.

Convergent evolution

Convergent evolution occurs when distantly related lineages evolve to share the same trait. A general assumption is that for a trait to converge in multiple lineages, it must be adaptive. That means convergent evolution provides a great opportunity to study adaptation. Until recently, convergent evolution was only observed at the phenotypic level (i.e. the defensive quills on the porcupine, echidna, and hedgehog). But with whole genome sequences available for many species we can now study convergent evolution at the molecular level and ask whether it can be linked to convergence at the phenotypic level. However, we've discovered that convergent amino acid substitutions occur all the time by chance in nature. This makes it difficult to pick out which convergent changes are actually adaptive. I've done work in this area to develop methods to detect molecular convergence while avoiding many pitfalls that come from high levels of background convergence.

Gene family evolution

Gene duplications and losses can open up new avenues for adaptation by changing the selective constraints and expression levels of genes. I have helped develop the latest version of CAFE by devising a method to estimate annotation and assembly error for gene count data. I have also delved into the study of whole genome duplications by developing a method to properly identify and place polyploidization events on phylogeny. This method has been implemented in the software GRAMPA. Finally, I led the phylogenetic and gene family analysis for the i5K pilot project (right, top). This is the largest gene family analysis to date and yielded a wealth of information as a resource for Arthropod researchers.

Phylogenetic discordance

A large part of comparative genomics is phylogenetics. Besides uncovering the relationships among species, the phylogeny provides the framework for us to ask and answer many of the interesting biological questions in comparative datasets. I am interested in studying patterns of phylogenetic discordance over time and across the genome and developing methods to account for it in comparative analyses. Specifically, in recent work we have found that phylogenies inferred from windows along the genome are more similar the closer they are. This discordance poses problems when inferring changes on a single species tree: if the underlying data do not follow that tree, evolutionary events will be mismapped. To account for this in comparative studies, I am developing methods to prune large phylogenies to maxmize their concordance with underlying gene trees, estimate substitution rates more accurately on a species tree, and estimating substitution rates in a Bayesian framework while accounting for discordance.