Referee README
This page contains all info about the Referee program including its inputs, options, and outputs.
Installation
Clone or download the GitHub repo: Referee GitHub
The only dependency is Python 3 or higher. You may want to add the Referee folder to your $PATH variable for ease of use!
Usage
These are the general steps for scoring your genome:
1. Using any applicable software, map the reads from which you constructed your genome back to the finished assembly. (ANGSD can use the resulting BAM file to calculate genotype likelihoods in the next step.)
2. Compile a pileup file from which Referee will calculate genotype likelihoods, OR pre-calculate genotype log-likelihoods for all 10 genotypes at every position in the genome (we recommend ANGSD for this).
3. Score your genome with the following Referee command:
python referee.py -gl [genotype likelihood file] -ref [reference genome FASTA file] --pileup
If you have pre-calculated genotype likelihoods as input, exclude the --pileup flag.
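The two ways of running Referee described above differ only in the --pileup flag. As a minimal sketch, assuming you launch Referee from a Python script, the argument list could be assembled as below. The helper name build_referee_command and the example file names are hypothetical; only -gl, -ref, and --pileup come from the usage above.

```python
import shlex


def build_referee_command(gl_file, ref_fasta, pileup=False, script="referee.py"):
    """Assemble the Referee command line as an argument list (hypothetical helper)."""
    cmd = ["python", script, "-gl", gl_file, "-ref", ref_fasta]
    if pileup:
        # Add --pileup only when the -gl input is a pileup file.
        cmd.append("--pileup")
    return cmd


# Hypothetical file names, for illustration only.
pileup_cmd = build_referee_command("reads.pileup", "assembly.fa", pileup=True)
gl_cmd = build_referee_command("likelihoods.txt", "assembly.fa")

for cmd in (pileup_cmd, gl_cmd):
    # Print a shell-safe version of each command for inspection.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Either list can then be passed to subprocess.run(cmd, check=True) to start the run.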
Input
There are two main inputs for the program:
- A genotype log-likelihood file or files (-gl). The file(s) can contain either pre-calculated genotype log-likelihoods in a specific format (see below) or pileups from which Referee will calculate genotype likelihoods. See the walkthrough for more info. If you use a pileup file as input, be sure to use the --pileup flag. If you have pre-calculated genotype log-likelihoods, they must be formatted in a tab delimited file with the following columns and no column headers:
Scaffold ID | Position | AA | AC | AG | AT | CC | CG | CT | GG | GT | TT |
---|---|---|---|---|---|---|---|---|---|---|---|
If your input has a different column ordering, you will not get accurate scores!
For example, the following output snippet from ANGSD is acceptable:
scaffold_0 5 0.000000 -0.693147 -0.693147 -0.693147 -15.374639 -15.374639 -15.374639 -15.374639 -15.374639 -15.374639
scaffold_0 6 0.000000 -1.386294 -1.386294 -1.386294 -30.519020 -30.519020 -30.519020 -30.519020 -30.519020 -30.519020
scaffold_0 7 -30.288761 -30.288761 -1.386294 -30.288761 -30.288761 -1.386294 -30.288761 0.000000 -1.386294 -30.288761
scaffold_0 8 -27.986172 -1.386293 -27.986172 -27.986172 0.000000 -1.386293 -1.386293 -27.986172 -27.986172 -27.986172
scaffold_0 9 -27.755912 -1.386292 -27.755912 -27.755912 0.000000 -1.386292 -1.386292 -27.755912 -27.755912 -27.755912
scaffold_0 10 -8.689986 0.000000 -10.076280 -10.076280 -29.821151 -30.514277 -30.514277 -40.590558 -40.590558 -40.590558
Note that ANGSD also scales the log likelihoods by subtracting the highest likelihood from each likelihood. This has no effect on Referee's scoring.
- A reference FASTA file containing the sequences used to calculate genotype log-likelihoods (-ref). The FASTA headers in this file must match those in the first column of the ANGSD or pileup file(s) specified with -gl. By default, Referee will trim the FASTA headers at the first occurrence of a space character, so be sure to account for this. I admit this is a shaky workaround, but given the non-standard nature of FASTA files, it's what I came up with. Please contact me if you would like some other header format implemented.
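Both inputs can be sanity-checked before a run. The sketch below is not part of Referee; it assumes a pre-calculated log-likelihood line with the 12 tab delimited columns listed above, and it assumes that trimming a header means keeping the text after ">" up to the first space. The function names and the example header are hypothetical.

```python
# Genotype column order required for pre-calculated log-likelihood input.
GENOTYPES = ["AA", "AC", "AG", "AT", "CC", "CG", "CT", "GG", "GT", "TT"]


def parse_gl_line(line):
    """Split one tab delimited line into (scaffold, position, {genotype: log-likelihood})."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 2 + len(GENOTYPES):
        raise ValueError(f"expected 12 tab delimited columns, found {len(fields)}")
    scaffold, position = fields[0], int(fields[1])
    likelihoods = dict(zip(GENOTYPES, map(float, fields[2:])))
    return scaffold, position, likelihoods


def trim_fasta_header(header_line):
    """Drop the leading '>' and keep everything before the first space."""
    text = header_line.rstrip("\n")
    if text.startswith(">"):
        text = text[1:]
    return text.split(" ", 1)[0]


# First row of the ANGSD snippet above, rebuilt with tab separators.
example = "\t".join(["scaffold_0", "5", "0.000000"] + ["-0.693147"] * 3 + ["-15.374639"] * 6)
scaffold, position, likelihoods = parse_gl_line(example)

# Hypothetical FASTA header with extra text after the first space.
header = ">scaffold_0 length=40"
print(scaffold == trim_fasta_header(header))
```

If the comparison prints False for any scaffold, the FASTA headers and the first column of the -gl file do not line up after trimming.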
Output
With no specification, Referee will create one output file and one log file. A FASTQ output file can be created with the --fastq flag and a BED file can be created with the --bed flag.
By default, all outputs created by Referee will be files beginning with referee-out-[start date]-[start time]-[random 6 char string]. To change this, use the -o option.
For the following examples we assume -o ref-out has been specified.
ref-out.log
This is a log file containing information about the Referee run, including the options used and the specified inputs and outputs. It also reports runtime and memory usage (if the Python module psutil is available) for each step. If you have many inputs or use many processors, these runtime statistics can be distracting, so you can disable them with --quiet.
ref-out.txt
This is a tab delimited output file containing the Referee scores for every position in the input reference genome. This file has the following columns:
Scaffold ID | Position | Referee score |
---|---|---|
Example:
scaffold_0 5 0
scaffold_0 6 13
scaffold_0 7 13
scaffold_0 8 12
scaffold_0 9 12
scaffold_0 10 13
If you specify the --correct option, then Referee will also output higher scoring bases for positions that score 0. This file will have two extra columns:
Scaffold ID | Position | Referee score | Corrected base | Referee score for corrected base |
---|---|---|---|---|
Example with --correct and one position with a better scoring base (position 5):
scaffold_0 5 0 A 6
scaffold_0 6 13
scaffold_0 7 13
scaffold_0 8 12
scaffold_0 9 12
scaffold_0 10 13
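For downstream analysis, the tab delimited score file can be read line by line. The following is a rough sketch, not code from Referee: it assumes scores are integers (as in the examples above) and that rows without a correction either stop after the third column or leave the last two columns empty. ScoreRecord and parse_score_line are hypothetical names.

```python
from typing import NamedTuple, Optional


class ScoreRecord(NamedTuple):
    scaffold: str
    position: int
    score: int
    corrected_base: Optional[str] = None
    corrected_score: Optional[int] = None


def parse_score_line(line):
    """Parse one line of the score file, with or without --correct columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError(f"expected at least 3 tab delimited columns, found {len(fields)}")
    scaffold, position, score = fields[0], int(fields[1]), int(fields[2])
    corrected = fields[3:5]
    if len(corrected) == 2 and all(corrected):
        # Both extra columns are present and non-empty: a corrected base was suggested.
        return ScoreRecord(scaffold, position, score, corrected[0], int(corrected[1]))
    return ScoreRecord(scaffold, position, score)


# The first two rows of the --correct example above, with tab separators.
records = [parse_score_line(row) for row in ("scaffold_0\t5\t0\tA\t6", "scaffold_0\t6\t13")]
for record in records:
    print(record)
```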
ref-out.fq
If --fastq is specified, Referee will create a FASTQ file with the reference genome annotated with Referee's scores. Referee scores are encoded as ASCII characters with the following method:
FASTQ score = ascii(Integer score + 35)
For example, the ASCII character S corresponds to the decimal 83. That means the score at this position was 83 - 35 = 48.
Example FASTQ output:
@scaffold_0 1:40 length=40
GGTGTAGCCAGAGAGTAAANAATATGGTGAAGCCAGAGAG
+
!!!!#00//0442.45=CK"CKKLLKLKLKSRRRSSRSSS
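The encoding rule above can be applied in both directions with a few lines of Python. This is a small sketch of the stated formula (score + 35), not code taken from Referee; the function names are hypothetical.

```python
OFFSET = 35  # offset stated in the encoding rule above


def encode_scores(scores):
    """Turn a list of integer scores into a FASTQ quality string."""
    return "".join(chr(score + OFFSET) for score in scores)


def decode_quality(quality):
    """Turn a FASTQ quality string back into integer scores."""
    return [ord(char) - OFFSET for char in quality]


print(decode_quality("S"))           # [48]
print(decode_quality("!!!!#00//0"))  # [-2, -2, -2, -2, 0, 13, 13, 12, 12, 13]
```

The second call decodes the first ten characters of the example quality line. Positions 5 through 10 come out as 0, 13, 13, 12, 12, 13, which matches the tab delimited example for ref-out.txt. Characters below decimal 35, such as !, decode to negative values under this formula.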
If the --correct option is specified, corrected bases will be lower case. All others should be upper case.
ref-out-bed-files/[scaffold ID].bed
If --bed is specified, Referee will create BED files to visualize Referee scores in genome browsers. Referee will create one BED file for every scaffold present in the input genotype likelihood file(s) and place them in the directory ref-out-bed-files/.
Options Table
Option | Description |
---|---|
-gl | A single pileup file or a single file containing log genotype likelihoods for every site in your genome with reads mapped to it. Can be gzip compressed or not. If using pre-calculated log likelihoods, see the important information in the Input section regarding the order of the columns in the file. |
--pileup | If this option is set, Referee will read the input file(s) in pileup format and use this info to calculate genotype likelihoods prior to the reference quality score. |
-ref | A FASTA formatted file containing the genome you wish to score. Can be gzip compressed or not. FASTA headers must match the sequence IDs in column one of the pileup or genotype log likelihood file. |
-o | The desired output directory. Default: referee-[date]-[time]. |
-prefix | Referee will create at least 2 output files: a tab delimited score file and a log file. Use this option to specify a prefix for these file names. Otherwise, they will default to referee-[date]-[time]. |
--overwrite | By default, if the specified output directory already exists, Referee will exit with a warning. Set this option to bypass this warning and allow Referee to overwrite the files in this directory. |
--mapq | If pileup file(s) are given as input, set this to incorporate mapping quality into Referee's quality score calculation. Mapping quality can be output by samtools mpileup with the -s option, and will appear in the 7th column of the file. If --mapq is not set, mapping qualities will be ignored even if they are present. |
--fastq | With this option, Referee scores will also be output in FASTQ format. Scores will be converted to ASCII characters: score + 35 = ASCII char. Note 1: If --correct is set, corrected bases will appear as lower case. Note 2: This option cannot be set with --mapped. |
--bed | Referee can output scores in binned BED format for visualizing tracks of scores in most genome browsers. One .bed file will be created for each scaffold scored and these will be placed in a directory ending with -bed-files. Note: This option cannot be set with --mapped. |
--haploid | Set this option if your input sequencing data comes from a haploid species. Referee will limit its likelihood calculations to single base states. Note: This option can only be used with an input --pileup file. |
--correct | With this option, sites where reads do not support the called reference base (score <= 0) will have a higher scoring base suggested. In the tab delimited output, the corrected base and score are reported in additional columns. In FASTQ output, the corrected positions are indicated by lower case bases. |
--mapped | Only report scores for sites with reads mapped to them. Note: This option cannot be set with --fastq or --bed. |
--quiet | Set this option to prevent Referee from printing out runtime statistics for each step. |
-p | The number of processes Referee can use. |
-l | The number of input lines read per process. Default: 100000. Decreasing this number may improve memory usage at the cost of slightly higher run times. |
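Several options in the table carry compatibility notes. As an illustrative sketch only, the check below encodes just the rules stated explicitly in the table (--mapped cannot be combined with --fastq or --bed, and --haploid needs --pileup input). The helper check_option_conflicts is hypothetical and does not reflect how Referee itself validates its arguments.

```python
def check_option_conflicts(options):
    """Return a list of problems for a collection of Referee flags (hypothetical helper)."""
    selected = set(options)
    problems = []
    for flag in ("--fastq", "--bed"):
        # --mapped is documented as incompatible with both FASTQ and BED output.
        if "--mapped" in selected and flag in selected:
            problems.append(f"--mapped cannot be set with {flag}")
    if "--haploid" in selected and "--pileup" not in selected:
        # --haploid is documented as usable only with pileup input.
        problems.append("--haploid can only be used with --pileup input")
    return problems


print(check_option_conflicts(["--pileup", "--haploid", "--fastq"]))
print(check_option_conflicts(["--mapped", "--bed"]))
```

Running a check like this before launching a long job can catch a conflicting combination early.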