Referee: Genome assembly quality scores
Referee is a program that calculates a quality score for every position in a genome assembly. This allows for easy filtering of low-quality sites in any downstream analysis.
Thomas GWC and Hahn MW. 2019. Referee: reference assembly quality scores. Genome Biology and Evolution 10.1093/gbe/evz088
Update: v1.2 — 08.25.2020
Referee now implements a more streamlined multi-processing scheme that does not require duplicating large input files or reading through input files multiple times!
About
Modern genome sequencing technologies provide a succinct measure of quality at each position in every read. However, all of this information is lost in the assembly process. Referee summarizes the quality information from the reads that map to a site in an assembled genome to calculate a quality score for each position in the genome assembly.
We accomplish this by first calculating genotype likelihoods for every site. For a given site in a diploid genome, there
are 10 possible genotypes (AA, AC, AG, AT, CC, CG, CT, GG, GT, TT). Referee takes as input the genotype likelihoods
calculated for all 10 genotypes given the called reference base at each position. For haploid genomes, the likelihood
calculations are limited to the four bases with the --haploid option.
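The genotype sets described above can be enumerated in a few lines of Python. This is only an illustrative sketch and not Referee's own code: the function name possible_genotypes and its haploid argument are hypothetical, chosen to mirror the --haploid option.

```python
from itertools import combinations_with_replacement

BASES = "ACGT"


def possible_genotypes(haploid=False):
    """Return the candidate genotypes for a single site.

    Hypothetical helper for illustration only. Diploid sites get every
    unordered pair of bases (10 genotypes); haploid sites get the four
    single bases.
    """
    if haploid:
        return list(BASES)
    # Unordered pairs with repetition: AA, AC, AG, AT, CC, CG, CT, GG, GT, TT
    return ["".join(pair) for pair in combinations_with_replacement(BASES, 2)]


print(possible_genotypes())              # the 10 diploid genotypes
print(possible_genotypes(haploid=True))  # the 4 haploid "genotypes"
```

Because the pairs are unordered, AC and CA count as the same genotype, which is why a diploid site has 10 candidates rather than 16.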
To obtain these likelihoods, one must first map the reads used to make the assembly back onto the finished assembly.
Then these reads can be used to calculate genotype likelihoods using any method/program. Referee can calculate
the likelihoods from a pileup file as input or use pre-calculated log likelihoods, such as those output by
ANGSD. Referee then calculates the log of the ratio of the sum of
genotype likelihoods for genotypes that contain the reference base to the sum of those that do not contain the reference base.
Positive scores indicate support for the called reference base while negative scores indicate support for some other base. Scores close
to 0 indicate less confidence while higher scores indicate more confidence in the reference base. Scores range from 0 to 91, with some
special cases (see README). With the --correct option specified, Referee will
also output the highest-scoring base for sites with negative scores.
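The log-ratio described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not Referee's implementation: the function names are hypothetical, the input is assumed to be a dictionary of log10 genotype likelihoods, base 10 is an assumed choice for the logarithm, and the scaling to the 0 to 91 range and the special cases mentioned in the README are not reproduced. The best_base helper assumes that "highest scoring base" means recomputing the same ratio with each base treated as the reference.

```python
import math


def log10_sum(log10_values):
    """Return log10 of the sum of values that are given on a log10 scale."""
    if not log10_values:
        return -math.inf
    largest = max(log10_values)
    if largest == -math.inf:
        return -math.inf
    # Factor out the largest term so the exponentials do not underflow.
    return largest + math.log10(sum(10 ** (v - largest) for v in log10_values))


def reference_score(ref_base, log10_likelihoods):
    """Log ratio of summed likelihoods: genotypes with ref_base vs. without.

    log10_likelihoods maps genotype strings (e.g. "AC", or "A" for haploid
    data) to log10 likelihoods. Hypothetical helper; unscaled and uncapped.
    """
    with_ref = [ll for gt, ll in log10_likelihoods.items() if ref_base in gt]
    without_ref = [ll for gt, ll in log10_likelihoods.items() if ref_base not in gt]
    return log10_sum(with_ref) - log10_sum(without_ref)


def best_base(log10_likelihoods):
    """Base with the highest score (assumed reading of the --correct output)."""
    return max("ACGT", key=lambda base: reference_score(base, log10_likelihoods))


# Made-up likelihoods for one diploid site where the reads favor genotype AA.
site = {gt: -5.0 for gt in ["AC", "AG", "AT", "CC", "CG", "CT", "GG", "GT", "TT"]}
site["AA"] = -1.0

print(reference_score("A", site))  # positive: reads support reference base A
print(reference_score("C", site))  # negative: reads support some other base
print(best_base(site))             # A
```

Working with log likelihoods and subtracting the two log sums avoids underflow when individual likelihoods are very small, which is common at high read depth.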