Referee: Genome assembly quality scores

Referee is a program that calculates a quality score for every position in a genome assembly. These scores make it easy to filter out low-quality sites before any downstream analysis.

Thomas GWC and Hahn MW. 2019. Referee: reference assembly quality scores. Genome Biology and Evolution 10.1093/gbe/evz088

Update: v1.2 — 08.25.2020

Referee now implements a more streamlined multi-processing scheme that does not require duplications of large input files or multiple read-throughs of input files!

About

Modern genome sequencing technologies provide a succinct measure of quality at each position in every read. However, all of this information is lost in the assembly process. Referee summarizes the quality information from the reads that map to a site in an assembled genome to calculate a quality score for each position in the genome assembly.

We accomplish this by first calculating genotype likelihoods for every site. For a given site in a diploid genome, there are 10 possible genotypes (AA, AC, AG, AT, CC, CG, CT, GG, GT, TT). Referee takes as input the genotype likelihoods calculated for all 10 genotypes given the called reference base at each position. For haploid genomes, the likelihood calculations are limited to the four bases with the --haploid option.
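As a quick illustration of where the 10 diploid genotypes come from, the sketch below enumerates the unordered pairs of the four bases. It is not taken from Referee's source; the names `BASES`, `DIPLOID_GENOTYPES`, and `HAPLOID_GENOTYPES` are hypothetical and used only for this example.

```python
from itertools import combinations_with_replacement

# The four nucleotide bases (hypothetical constant name, for illustration only).
BASES = "ACGT"

# Unordered pairs of bases, with repetition allowed: AA, AC, ..., TT.
DIPLOID_GENOTYPES = ["".join(pair) for pair in combinations_with_replacement(BASES, 2)]

# In the haploid case the candidate set is just the four bases.
HAPLOID_GENOTYPES = list(BASES)

if __name__ == "__main__":
    print(len(DIPLOID_GENOTYPES), DIPLOID_GENOTYPES)
    print(len(HAPLOID_GENOTYPES), HAPLOID_GENOTYPES)
```

Because order within a genotype does not matter (AC and CA are the same genotype), there are 4 homozygous and 6 heterozygous combinations.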

To obtain these likelihoods, first map the reads used to make the assembly back onto the finished assembly. These mapped reads can then be used to calculate genotype likelihoods with any method or program. Referee can calculate the likelihoods itself from a pileup file, or it can use pre-calculated log likelihoods such as those output by ANGSD. Referee then takes the log of the ratio of the summed likelihoods of genotypes that contain the reference base to the summed likelihoods of genotypes that do not. Positive scores indicate support for the called reference base, while negative scores indicate support for some other base. Scores close to 0 indicate less confidence, while higher scores indicate more confidence in the reference base. Scores range from 0 to 91, with some special cases (see README). With the --correct option specified, Referee will also output the highest scoring base for sites with negative scores.
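The log-ratio described above can be sketched in a few lines of Python. This is a minimal illustration, not Referee's implementation: the function name `reference_log_ratio` is hypothetical, the use of a base-10 logarithm is an assumption, and the sketch does not reproduce the scaling to the 0 to 91 range or the special cases covered in the README. It expects plain (not log-transformed) likelihoods for the 10 diploid genotypes.

```python
import math
from itertools import combinations_with_replacement

# The 10 unordered diploid genotypes: AA, AC, ..., TT.
GENOTYPES = ["".join(pair) for pair in combinations_with_replacement("ACGT", 2)]


def reference_log_ratio(likelihoods, ref_base):
    """Log ratio of summed genotype likelihoods with vs. without the reference base.

    `likelihoods` maps each genotype string (e.g. "AC") to a plain likelihood.
    Hypothetical helper: log base 10 is an assumption, and sums of zero
    (which would make the ratio undefined) are not handled here.
    """
    with_ref = sum(l for g, l in likelihoods.items() if ref_base in g)
    without_ref = sum(l for g, l in likelihoods.items() if ref_base not in g)
    return math.log10(with_ref / without_ref)


if __name__ == "__main__":
    # Toy likelihoods: most of the weight sits on the homozygous AA genotype.
    toy = {g: 0.01 for g in GENOTYPES}
    toy["AA"] = 0.91
    print(reference_log_ratio(toy, "A"))  # positive: the reads support A
    print(reference_log_ratio(toy, "T"))  # negative: the reads support another base
```

With the toy values, a called reference base of A gets a positive value because the genotypes containing A carry most of the likelihood, while a called base of T gets a negative value.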

For more information on usage and inputs, see the README.