This page contains all info about the Referee program including its inputs, options, and outputs.
Clone or download the GitHub repo: Referee GitHub
The only dependency is Python 3 or higher. You may want to add the Referee folder to your $PATH variable for ease of use!
These are the general steps for scoring your genome:
Using any applicable software, map the reads from which you constructed your genome back to the finished assembly. (A BAM file is usable by ANGSD for calculating genotype likelihoods in the next step)
Compile a pileup file for Referee to calculate genotype likelihoods OR pre-calculate genotype log-likelihoods for all 10 genotypes at every position in the genome (we recommend ANGSD for this).
Score your genome with one of the following Referee commands:
python referee.py -gl [genotype likelihood file] -ref [reference genome FASTA file] --pileup
If you have pre-calculated genotype likelihoods as input, exclude the --pileup flag.
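As a rough illustration of the two ways to call Referee described above, the sketch below assembles the command line in Python. The helper name build_referee_command and the file names are hypothetical; only the -gl, -ref, and --pileup arguments come from this page.

```python
def build_referee_command(gl_file, ref_fasta, pileup=False, script="referee.py"):
    """Return the Referee command line as a list of arguments.

    gl_file   -- pileup file or pre-calculated genotype log-likelihood file
    ref_fasta -- reference genome FASTA file
    pileup    -- True if gl_file is a pileup (adds the --pileup flag)
    """
    cmd = ["python", script, "-gl", gl_file, "-ref", ref_fasta]
    if pileup:
        # With a pileup as input, Referee calculates the likelihoods itself.
        cmd.append("--pileup")
    return cmd


# Hypothetical file names, for illustration only.
print(" ".join(build_referee_command("reads.pileup", "assembly.fa", pileup=True)))
print(" ".join(build_referee_command("genotype-likelihoods.txt", "assembly.fa")))
```

The returned list can be passed directly to subprocess.run if you want to launch Referee from a script.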
There are two main inputs for the program:
A genotype log-likelihood file or files (-gl). File(s) can be either pre-calculated genotype log-likelihoods in a certain format (see below), or pileups from which Referee will calculate genotype likelihoods. See the walkthrough for more info.
If you use a pileup file as input, be sure to use the --pileup flag.
If you have pre-calculated genotype log-likelihoods, they must be formatted in a tab delimited file with the following columns and no column headers:
If your input has a different column ordering, you will not get accurate scores!
Scaffold ID Position AA AC AG AT CC CG CT GG GT TT
For example, the following output snippet from ANGSD is acceptable:
Note that ANGSD also scales the log likelihoods by subtracting the highest likelihood from each likelihood. This has no effect on Referee's scoring.
scaffold_0	5	0.000000	-0.693147	-0.693147	-0.693147	-15.374639	-15.374639	-15.374639	-15.374639	-15.374639	-15.374639
scaffold_0	6	0.000000	-1.386294	-1.386294	-1.386294	-30.519020	-30.519020	-30.519020	-30.519020	-30.519020	-30.519020
scaffold_0	7	-30.288761	-30.288761	-1.386294	-30.288761	-30.288761	-1.386294	-30.288761	0.000000	-1.386294	-30.288761
scaffold_0	8	-27.986172	-1.386293	-27.986172	-27.986172	0.000000	-1.386293	-1.386293	-27.986172	-27.986172	-27.986172
scaffold_0	9	-27.755912	-1.386292	-27.755912	-27.755912	0.000000	-1.386292	-1.386292	-27.755912	-27.755912	-27.755912
scaffold_0	10	-8.689986	0.000000	-10.076280	-10.076280	-29.821151	-30.514277	-30.514277	-40.590558	-40.590558	-40.590558
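Because the column order matters so much, it can help to sanity-check a pre-calculated file before scoring. The following is a minimal sketch, not Referee's own parser: the function names are made up for illustration, and it assumes whitespace-separated fields in the order listed above. It also shows the ANGSD-style rescaling (subtracting the highest log-likelihood), which shifts every value by the same amount and so leaves the best genotype unchanged.

```python
# Genotype order expected in columns 3-12 of a pre-calculated input file.
GENOTYPES = ["AA", "AC", "AG", "AT", "CC", "CG", "CT", "GG", "GT", "TT"]


def parse_gl_line(line):
    """Return (scaffold, position, {genotype: log-likelihood}) for one line."""
    fields = line.split()
    if len(fields) != 2 + len(GENOTYPES):
        raise ValueError(f"expected 12 columns, found {len(fields)}")
    scaffold, position = fields[0], int(fields[1])
    logliks = dict(zip(GENOTYPES, (float(x) for x in fields[2:])))
    return scaffold, position, logliks


def rescale(logliks):
    """Subtract the highest log-likelihood from every value."""
    best = max(logliks.values())
    return {genotype: value - best for genotype, value in logliks.items()}


# Position 10 from the ANGSD snippet above.
example_line = "\t".join([
    "scaffold_0", "10", "-8.689986", "0.000000", "-10.076280", "-10.076280",
    "-29.821151", "-30.514277", "-30.514277", "-40.590558", "-40.590558",
    "-40.590558",
])
scaffold, position, logliks = parse_gl_line(example_line)
print(scaffold, position, max(logliks, key=logliks.get))
```

A line with a missing or extra column raises a ValueError, but a file with the right number of columns in the wrong order cannot be detected this way, so check the order against the list above by eye.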
- A reference FASTA file containing the sequences used to calculate genotype log-likelihoods (-ref). The FASTA headers in this file must match those in the first column of the ANGSD or pileup file(s) specified with -gl. By default, Referee will trim the FASTA headers at the first occurrence of a space character, so be sure to account for this. I admit this is a shaky workaround, but given the non-standard nature of FASTA files it's what I came up with. Please contact me if you would like some other header format implemented.
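To see what the header trimming means for your own files, here is a small sketch of the rule as described above (cut the header at the first space). The function names are hypothetical and this is not Referee's code; it only lets you check ahead of time whether the IDs in your -gl input will match the trimmed FASTA headers.

```python
def trim_header(header_line):
    """Return the sequence ID: the header without '>' and cut at the first space."""
    header = header_line.rstrip("\n")
    if header.startswith(">"):
        header = header[1:]
    return header.split(" ", 1)[0]


def unmatched_ids(fasta_headers, gl_ids):
    """Return IDs from the -gl input that match no trimmed FASTA header."""
    fasta_ids = {trim_header(h) for h in fasta_headers}
    return sorted(set(gl_ids) - fasta_ids)


print(trim_header(">scaffold_0 1:40 length=40"))
print(unmatched_ids([">scaffold_0 length=40", ">scaffold_1"], ["scaffold_0", "scaffold_2"]))
```

An empty list from unmatched_ids means every scaffold ID in the likelihood or pileup input has a matching FASTA record after trimming.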
With no specification, Referee will create one output file and one log file. A FASTQ output file can be created with the --fastq flag and a Bed file can be created with the --bed flag.
By default, all outputs created by Referee will be files beginning with referee-out-[start date]-[start time]-[random 6 char string]. To change this, use the -o option.
For the following examples we assume -o ref-out has been specified.
This is a log file containing information about the Referee run, including the options used and specified inputs and outputs. It also contains runtime and memory usage info for each step (memory usage only if the Python module psutil is available). If you have many inputs or use many processors these runtime statistics can be distracting, so you can disable them with the corresponding option listed in the options table.
This is a tab delimited output file containing the Referee scores for every position in the input reference genome. This file has the following columns:
Scaffold ID Position Referee score
scaffold_0	5	0
scaffold_0	6	13
scaffold_0	7	13
scaffold_0	8	12
scaffold_0	9	12
scaffold_0	10	13
If you specify the --correct option, then Referee will also output higher scoring bases for positions that score 0. This file will have two extra columns:
Scaffold ID Position Referee score Corrected base Referee score for corrected base
Example output with --correct and one position with a better scoring base (position 5):
scaffold_0	5	0	A	6
scaffold_0	6	13
scaffold_0	7	13
scaffold_0	8	12
scaffold_0	9	12
scaffold_0	10	13
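If you want to post-process the score file yourself, the sketch below reads one line of the tab delimited output, with or without the two extra --correct columns. It is an illustration only: the ScoreRow type and parse_score_line function are hypothetical names, and it assumes scores are written as integers, as in the examples above.

```python
from typing import NamedTuple, Optional


class ScoreRow(NamedTuple):
    scaffold: str
    position: int
    score: int
    corrected_base: Optional[str] = None
    corrected_score: Optional[int] = None


def parse_score_line(line):
    """Parse one line of the score file (3 columns, or 5 with a correction)."""
    fields = line.rstrip("\n").split("\t")
    # Ignore empty trailing fields in case uncorrected rows are padded with tabs.
    while fields and fields[-1] == "":
        fields.pop()
    if len(fields) == 3:
        return ScoreRow(fields[0], int(fields[1]), int(fields[2]))
    if len(fields) == 5:
        return ScoreRow(fields[0], int(fields[1]), int(fields[2]),
                        fields[3], int(fields[4]))
    raise ValueError(f"expected 3 or 5 columns, found {len(fields)}")


row = parse_score_line("scaffold_0\t5\t0\tA\t6")
print(row.position, row.score, row.corrected_base, row.corrected_score)
```

Rows without a suggested correction simply have corrected_base and corrected_score set to None.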
If --fastq is specified, Referee will create a FASTQ file with the reference genome annotated with Referee's scores. Referee scores are encoded as ASCII characters with the following method:
FASTQ score = ascii(Integer score + 35)
For example, the ASCII character S corresponds to the decimal 83. That means the score at this position was 83 - 35 = 48.
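The encoding above is easy to reproduce with Python's built-in chr and ord. This is a minimal sketch of the arithmetic only, with hypothetical function names; it is not taken from Referee's source.

```python
OFFSET = 35  # Referee adds 35 to the integer score before converting to ASCII


def encode_score(score):
    """Convert an integer Referee score to its FASTQ character."""
    return chr(score + OFFSET)


def decode_char(char):
    """Convert a FASTQ character back to the integer Referee score."""
    return ord(char) - OFFSET


print(decode_char("S"))   # 83 - 35 = 48
print(encode_score(48))   # S
```

Note that this offset differs from the offset of 33 used for standard Phred-scaled FASTQ quality strings, so generic FASTQ tools will report different values for the same characters.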
Example FASTQ output:
@scaffold_0 1:40 length=40
GGTGTAGCCAGAGAGTAAANAATATGGTGAAGCCAGAGAG
+
!!!!#00//0442.45=CK"CKKLLKLKLKSRRRSSRSSS
If the --correct option is specified, corrected bases will be lower case. All others should be upper case.
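Since corrected bases are the only lower case characters in the FASTQ sequence, they can be located with a simple scan. The sketch below is an illustration with a hypothetical function name and a made-up sequence; positions are counted from 1 within the sequence line, which is an assumption about how you want to report them.

```python
def corrected_positions(sequence):
    """Return 1-based positions of lower case (corrected) bases in a sequence."""
    return [i for i, base in enumerate(sequence, start=1) if base.islower()]


# Made-up sequence with one corrected base at position 5.
print(corrected_positions("GGTGaAGCCA"))
```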
If --bed is specified, Referee will create Bed files to visualize Referee scores in genome browsers. Referee will create one Bed file for every scaffold present in the input genotype likelihood file(s) and place them in the directory
||A single pileup file or a single file containing log genotype likelihoods for every site in your genome with reads mapped to it. Can be gzip compressed or not. If using pre-calculated log likelihoods, see the important information below regarding the order of the columns in the file.|
||If this option is set, Referee will read the input file(s) in pileup format and use this info to calculate genotype likelihoods prior to the reference quality score.|
||A FASTA formatted file containing the genome you wish to score. Can be gzip compressed or not. FASTA headers must match the sequence IDs in column one of the pileup or genotype log likelihood file.|
||The desired output directory. Default: |
||Referee will create at least 2 output files: a tab delimited score file and a log file. Use this option to specify
a prefix for these file names. Otherwise, they will default to |
||By default, if the specified output directory already exists, Referee will exit with a warning. Set this option to bypass this warning and allow Referee to overwrite the files in this directory.|
||If pileup file(s) are given as input, set this to incorporate mapping quality into Referee's quality score calculation.
Mapping quality can be output by samtools mpileup with the |
||With this option, Referee scores will also be output in FASTQ format. Scores will be converted to
ASCII characters: score + 35 = ASCII char. Note 1: If
||Referee can output scores in binned BED format for visualizing tracks of scores in most genome browsers. One |
||Set this option if your input sequencing data comes from a haploid species. Referee will limit its likelihood calculations
to single base states. Note: This option can only be used with an input |
||With this option, sites where reads do not support the called reference base (score <= 0) will have a higher scoring base suggested. In the tab delimited output, the corrected base and score are reported in additional columns. In FASTQ output, the corrected positions are indicated by lower case bases.|
||Only report scores for sites with reads mapped to them. Note: This option cannot be set with |
||Set this option to prevent Referee from printing out runtime statistics for each step.|
||The number of processes Referee can use.|
||The number of input lines read per process. Default: 100000. Decreasing this number may improve memory usage at the cost of slightly higher run times.|
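To give a feel for the trade-off behind the lines-per-process setting, here is a generic sketch of reading an input file in fixed-size chunks, with optional gzip support. It is not Referee's implementation, and the names open_text and read_chunks are hypothetical: smaller chunks keep fewer lines in memory at once, at the cost of handing out more, smaller pieces of work.

```python
import gzip
from itertools import islice


def open_text(path):
    """Open a plain or gzip-compressed file for reading as text."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "r")


def read_chunks(lines, chunk_size=100000):
    """Yield successive lists of at most chunk_size lines from an iterable."""
    iterator = iter(lines)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


# Small demonstration on an in-memory list instead of a real input file.
demo_lines = [f"scaffold_0\t{i}\n" for i in range(1, 251)]
print([len(chunk) for chunk in read_chunks(demo_lines, chunk_size=100)])
```

With a real file, the same generator can be used as read_chunks(open_text(path)), and each chunk could then be passed to a worker process.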