From Best Matches to Gene Families: How to use paralogs in phylogenomics

Best match graphs (BMGs) arise naturally as the first processing intermediate in algorithms for orthology detection. Let T be a phylogenetic (gene) tree T and sigma an assignment of leaves of T to species. The best match graph (G,sigma) is a digraph that contains an arc from x to y if the genes x and y reside in different species and y is one of possibly many (evolutionary) closest relatives of x compared to all other genes contained in the species sigma(y). I will give two alternative characterizations of BMGs and show that a minimally resolved tree that explains a BMG can be reconstructed in cubic time. The symmetric part of a BMGs represents the empirical estimate for the orthology relation on the gene set as inferred from a reciprocal best match heuristic. BMGs are therefore close relatives of co-graphs, which describe perfect duplication/speciation scenarios. Whenever a BMG deviates from a cograph structure, this implies that the reciprocal best match heuristic has produced incorrect orthology assignments. A reasonable approach therefore it to correct the data by editing the BMG into its nearest co-graph. Cographs, in turn, are equivalent to event-labeled gene trees that identify duplication and speciation events. These trees also impose constraints on the species tree and the possible reconciliation maps. Taken together, therefore, it is possible to start from reciprocal best matches of the proteoms of a set of species and eventually arrive at the phylogenetic tree of these taxa without the use of a conventional tree reconstruction method. In fact, an analysis of the workflow show that it only makes use of gene duplication events, while sets of 1-1 orthologs do not contribute at all. In this sense the approach is orthogonal to classical phylogenetic methods.

Genome Matrices and the Median Problem

The study of genome rearrangements dates back to the beginning of the twentieth century, with the seminal work by Dobzhansky, Sturtevant, and others. More recently, near the end of the century, computers entered the scene, and the first algorithms to deal with various aspectes of genome rearangements appeard. In particular, the genome median problem, which is an important tool in phylogenetic reconstruction, gained attention. It can be stated as follows: Given three genomes, find a fourth that minimizes the sum of the pairwise rearrangement distances between it and the three input genomes. Among the several distances that have been proposed, the Double-Cut and Join (DCJ) distance stands out as the most widely used. In this talk, we briefly review the historic development of DCJ, and of a related distance, called rank distance, proposed by our group. We model genomes as matrices and study the matrix median problem using the rank distance. It is an open problem to find a polynomial algorithm for genomic medians using this distance. We present the state-of-the-art in this problem, with partial results, conjectures, and experimental results.

A birth-and-death account of the similarity distribution of homologous gene pairs in plants resulting from recurrent whole genome doubling, fractionation and speciation

An important type of evidence in analyzing ancient polyploidization events is the distribution of coding sequence similarities between two paralogous genes in a genome. For flowering plants with one, two, or more whole genome doubling (WGD) events in their history, the distribution of similarities is a mixture of distributions, each component of which is centered at a similarity value indicative of the age of one of the events. We present a birth-and-death model for predicting the shape of these distributions based on the event times, the ploidy multiplicities of the events, rates of loss of duplicate genes from the genome (fractionation), and rates of sequence divergence. The process has one biologically-motivated constraint, which is mathematically tractable and whose parameters are well suited to statistical inference.  We then provide a treatment of the transition from one genome to two daughter genomes that smoothly continues the fractionation regime in place before speciation, and extends it independently in each of these species until a new WGD in that species or until the present time of observation. With some additional details, this solution enables a complete model encompassing all the WGD in the ancestral genome, the speciation event, and all the independent WGDs in the daughter genomes. It defines all the homolog pairs at the time of observation in terms of the event at which they originated.