Referee README

This page contains all info about the Referee program including its inputs, options, and outputs.

Installation

Clone or download the GitHub repo: Referee GitHub

The only dependency is Python 3 or higher. You may want to add the Referee folder to your $PATH variable for ease of use!

Usage

These are the general steps for scoring your genome:

  1. Using any applicable software, map the reads from which you constructed your genome back to the finished assembly. (ANGSD can use the resulting BAM file to calculate genotype likelihoods in the next step.)

  2. Compile a pileup file for Referee to calculate genotype likelihoods OR pre-calculate genotype log-likelihoods for all 10 genotypes at every position in the genome (we recommend ANGSD for this).

  3. Score your genome with one of the following Referee commands:

    python referee.py -gl [genotype likelihood file] -ref [reference genome FASTA file] --pileup

    If you have pre-calculated genotype likelihoods as input, exclude the --pileup flag.
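
If you script your runs, the command in step 3 can be assembled programmatically. The following is a minimal sketch and not part of Referee itself: the helper name build_referee_command and the example file names are hypothetical, and it uses only the -gl, -ref, and --pileup options described on this page.

```python
import sys


def build_referee_command(gl_file, ref_file, pileup=False, script="referee.py"):
    """Return the argument list for a Referee run (hypothetical helper)."""
    cmd = [sys.executable, script, "-gl", gl_file, "-ref", ref_file]
    if pileup:
        # Only add --pileup when the -gl input is a pileup file.
        cmd.append("--pileup")
    return cmd


if __name__ == "__main__":
    # The file names below are placeholders for illustration only.
    print(" ".join(build_referee_command("reads.pileup", "genome.fa", pileup=True)))
```

The returned list can be passed to subprocess.run(cmd, check=True) to launch the run.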

Input

There are two main inputs for the program:

  1. A genotype log-likelihood file or files (-gl). The file(s) can contain either pre-calculated genotype log-likelihoods in a specific format (see below) or pileups from which Referee will calculate genotype likelihoods. See the walkthrough for more info.

    If you use a pileup file as input, be sure to use the --pileup flag.

    If you have pre-calculated genotype log-likelihoods, they must be formatted in a tab delimited file with the following columns and no column headers:

    Scaffold ID  Position    AA  AC  AG  AT  CC  CG  CT  GG  GT  TT
    If your input has a different column ordering, you will not get accurate scores!

    For example, the following output snippet from ANGSD is acceptable:

    scaffold_0	5	0.000000	-0.693147	-0.693147	-0.693147	-15.374639	-15.374639	-15.374639	-15.374639	-15.374639	-15.374639
    scaffold_0	6	0.000000	-1.386294	-1.386294	-1.386294	-30.519020	-30.519020	-30.519020	-30.519020	-30.519020	-30.519020
    scaffold_0	7	-30.288761	-30.288761	-1.386294	-30.288761	-30.288761	-1.386294	-30.288761	0.000000	-1.386294	-30.288761
    scaffold_0	8	-27.986172	-1.386293	-27.986172	-27.986172	0.000000	-1.386293	-1.386293	-27.986172	-27.986172	-27.986172
    scaffold_0	9	-27.755912	-1.386292	-27.755912	-27.755912	0.000000	-1.386292	-1.386292	-27.755912	-27.755912	-27.755912
    scaffold_0	10	-8.689986	0.000000	-10.076280	-10.076280	-29.821151	-30.514277	-30.514277	-40.590558	-40.590558	-40.590558
    Note that ANGSD also scales the log likelihoods by subtracting the highest likelihood from each likelihood. This has no effect on Referee's scoring.

  2. A reference FASTA file containing the sequences used to calculate genotype log-likelihoods (-ref). The FASTA headers in this file must match those in the first column of the ANGSD or pileup file(s) specified with -gl. By default, Referee will trim the FASTA headers at the first occurrence of a space character, so be sure to account for this. I admit this is a shaky workaround, but given the non-standard nature of FASTA files it's what I came up with. Please contact me if you would like some other header format implemented.
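
Before a long run, it can help to check that a pre-calculated genotype log-likelihood file has the expected 12 columns and that its scaffold IDs match the trimmed FASTA headers. The sketch below illustrates both checks. It is not Referee's own parsing code: the function names are hypothetical, and the header trimming only mimics the documented default of cutting at the first space.

```python
# Column order required for pre-calculated genotype log-likelihoods.
GENOTYPES = ["AA", "AC", "AG", "AT", "CC", "CG", "CT", "GG", "GT", "TT"]


def parse_gl_line(line):
    """Split one tab-delimited line into (scaffold, position, {genotype: log-likelihood})."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 2 + len(GENOTYPES):
        raise ValueError("expected 12 tab-delimited columns, got %d" % len(fields))
    scaffold, position = fields[0], int(fields[1])
    log_likelihoods = dict(zip(GENOTYPES, map(float, fields[2:])))
    return scaffold, position, log_likelihoods


def trim_fasta_header(header):
    """Keep the header text up to the first space (assumed to mirror the default)."""
    return header.lstrip(">").split(" ", 1)[0]


# First line of the ANGSD example above.
line = "scaffold_0\t5\t0.000000\t-0.693147\t-0.693147\t-0.693147" + "\t-15.374639" * 6
scaffold, position, gls = parse_gl_line(line)
best = max(gls, key=gls.get)  # genotype with the highest log-likelihood
print(scaffold, position, best)  # scaffold_0 5 AA
print(trim_fasta_header(">scaffold_0 1:40 length=40") == scaffold)  # True
```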

Output

With no specification, Referee will create one output file and one log file. A FASTQ output file can be created with the --fastq flag and a Bed file can be created with the --bed flag.

By default, all outputs created by Referee will be files beginning with referee-out-[start date]-[start time]-[random 6 char string]. To change this, use the -o option. For the following examples we assume -o ref-out has been specified.

  1. ref-out.log

    This is a log file containing information about the Referee run, including the options used and the specified inputs and outputs. It also contains runtime and memory usage info for each step (memory usage only if the Python module psutil is available). If you have many inputs or use many processors, these runtime statistics can be distracting, so you can disable them with --quiet.

  2. ref-out.txt

    This is a tab delimited output file containing the Referee scores for every position in the input reference genome. This file has the following columns:

    Scaffold ID  Position    Referee score

    Example:

    scaffold_0	5	0
    scaffold_0	6	13
    scaffold_0	7	13
    scaffold_0	8	12
    scaffold_0	9	12
    scaffold_0	10	13

    If you specify the --correct option, then Referee will also output higher scoring bases for positions that score 0. This file will have two extra columns:

    Scaffold ID  Position    Referee score   Corrected base    Referee score for corrected base

    Example with --correct and one position with a better scoring base (position 5):

    scaffold_0	5	0	A	6
    scaffold_0	6	13		
    scaffold_0	7	13		
    scaffold_0	8	12		
    scaffold_0	9	12		
    scaffold_0	10	13

  3. ref-out.fq

    If --fastq is specified, Referee will create a FASTQ file with the reference genome annotated with Referee's scores. Scores are encoded as ASCII characters with the following method:

    FASTQ score = ascii(Integer score + 35)

    For example, the ASCII character S corresponds to the decimal 83. That means the score at this position was 83 - 35 = 48.

    Example FASTQ output:

    @scaffold_0 1:40 length=40
    GGTGTAGCCAGAGAGTAAANAATATGGTGAAGCCAGAGAG
    +
    !!!!#00//0442.45=CK"CKKLLKLKLKSRRRSSRSSS

    If the --correct option is specified, corrected bases will be lower case. All others should be upper case.

  4. ref-out-bed-files/[scaffold ID].bed

    If --bed is specified Referee will create Bed files to visualize Referee scores in genome browsers. Referee will create one Bed file for every scaffold present in the input genotype likelihood file(s) and place them in the directory ref-out-bed-files/.
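
The FASTQ score encoding described under ref-out.fq (FASTQ score = ascii(Integer score + 35)) can be reversed with a few lines of Python. This is a minimal sketch for reading scores back out of the quality line. The function names are hypothetical and are not part of Referee.

```python
ASCII_OFFSET = 35  # FASTQ score = ascii(integer score + 35)


def encode_score(score):
    """Convert an integer Referee score to its FASTQ character."""
    return chr(score + ASCII_OFFSET)


def decode_quality_string(quality):
    """Convert a FASTQ quality line back to a list of integer scores."""
    return [ord(char) - ASCII_OFFSET for char in quality]


print(encode_score(48))               # S
print(decode_quality_string("SRRR"))  # [48, 47, 47, 47]
```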

Options Table

Option  Description
-gl A single pileup file or a single file containing log genotype likelihoods for every site in your genome with reads mapped to it. Can be gzip compressed or not. If using pre-calculated log likelihoods, see the important information in the Input section regarding the order of the columns in the file.
--pileup If this option is set, Referee will read the input file(s) in pileup format and use this info to calculate genotype likelihoods prior to the reference quality score.
-ref A FASTA formatted file containing the genome you wish to score. Can be gzip compressed or not. FASTA headers must match the sequence IDs in column one of the pileup or genotype log likelihood file.
-o The desired output directory. Default: referee-[date]-[time].
-prefix Referee will create at least 2 output files: a tab delimited score file and a log file. Use this option to specify a prefix for these file names. Otherwise, they will default to referee-[date]-[time].
--overwrite By default, if the specified output directory already exists, Referee will exit with a warning. Set this option to bypass this warning and allow Referee to overwrite the files in this directory.
--mapq If pileup file(s) are given as input, set this to incorporate mapping quality into Referee's quality score calculation. Mapping quality can be output by samtools mpileup with the -s option, and will appear in the 7th column of the file. If --mapq is not set, mapping qualities will be ignored even if they are present.
--fastq With this option, Referee scores will also be output in FASTQ format. Scores will be converted to ASCII characters: score + 35 = ASCII char. Note 1: If --correct is set, corrected bases will appear as lower case. Note 2: This option cannot be set with --mapped.
--bed Referee can output scores in binned BED format for visualizing tracks of scores in most genome browsers. One .bed file will be created for each scaffold scored and these will be placed in a directory ending with -bed-files. Note: This option cannot be set with --mapped.
--haploid Set this option if your input sequencing data comes from a haploid species. Referee will limit its likelihood calculations to single base states. Note: This option can only be used with an input --pileup file.
--correct With this option, sites where reads do not support the called reference base (score <= 0) will have a higher scoring base suggested. In the tab delimited output, the corrected base and score are reported in additional columns. In FASTQ output, the corrected positions are indicated by lower case bases.
--mapped Only report scores for sites with reads mapped to them. Note: This option cannot be set with --fastq or --bed.
--quiet Set this option to prevent Referee from printing out runtime statistics for each step.
-p The number of processes Referee can use.
-l The number of input lines read per process. Default: 100000. Decreasing this number may improve memory usage at the cost of slightly higher run times.
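
Several options in the table cannot be combined: --mapped excludes --fastq and --bed, and --haploid requires --pileup. If you generate Referee commands from a pipeline, a pre-flight check like the sketch below can catch these combinations early. The function name and error messages are hypothetical, and Referee's own error handling may differ.

```python
def check_option_conflicts(options):
    """Return error messages for incompatible option combinations.

    `options` is a set of flag strings, e.g. {"--pileup", "--fastq"}.
    """
    errors = []
    if "--mapped" in options:
        # --mapped cannot be combined with either alternative output format.
        for flag in ("--fastq", "--bed"):
            if flag in options:
                errors.append("--mapped cannot be set with %s" % flag)
    if "--haploid" in options and "--pileup" not in options:
        errors.append("--haploid can only be used with --pileup input")
    return errors


print(check_option_conflicts({"--mapped", "--fastq"}))
print(check_option_conflicts({"--pileup", "--haploid", "--bed"}))
```

An empty list means none of the documented conflicts were found.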