Referee walkthrough

This page is a brief guide with command examples to go from reads and assembly to an assembly annotated with Referee's quality scores. We will first map all the reads used to make the assembly back to the assembly to obtain a single sorted BAM file. Then we will show how to use the BAM file to calculate genotype likelihoods with ANGSD or mpileup and Referee and produce a FASTQ reference file.

Important note: The following example is for Illumina reads. If you have reads from other technologies, the tools and steps may be different.

Step 0: Software used

The following walkthrough assumes you have these programs available to use:

BWA for read mapping.

Samtools and Picard for manipulation of SAM/BAM files and creating a pileup file (if necessary).

If you want to pre-calculate genotype log-likelihoods instead of creating a pileup and having Referee calculate them, we use ANGSD.

Referee for quality score calculation.

Step 1: Preparing the reference FASTA file.

We'll start from the point of having a completed genome/transcriptome assembly in fasta format: reference.fa

First we'll need to do the following 3 things to prepare the reference fasta file.

  1. Index the reference for mapping.
    bwa index reference.fa

    This command will produce 5 files (i.) reference.fa.amb, (ii.) reference.fa.ann, (iii.) reference.fa.bwt, (iv.) reference.fa.pac, (v.) reference.fa.sa. You won't need to look at these files, but they are used by BWA later on.

  2. Index the reference for samtools. Though not necessary for mapping, it will be necessary to have this index later on:
    samtools faidx reference.fa

    This command will produce the file reference.fai. You won't need to look at this file, but it will be used by samtools later on.

  3. We can also create the sequence dictionary with Picard. This is necessary if BWA doesn't create a proper header in the SAM file:
    java -jar picard.jar CreateSequenceDictionary R=reference.fa O=reference.dict

    This command will produce the file reference.dict.

In all, these commands simply create files that make the reference easier to read for these programs.

Step 2: Mapping the reads back to your assembly.

In addition to the assembled fasta file, we should also have all reads that were used to make this assembly in fastq format. These can be paired end or not, but for this example we will assume paired end reads: reads_1.fq and reads_2.fq

If there were multiple sequencing runs there will be multiple fastq files that were used to make the assembly. This just means you'll have to run some of the following commands on each fastq file set.

To map the reads back to the reference assembly, we use BWA, but any mapping software can be used as long as it makes sense the subsequent steps (i.e. produces SAM files as output).

bwa mem reference.fa reads_1.fq reads_2.fq > reads-mapped.sam

Remember, if you have multiple sequencing runs, you'll want to run BWA on each one, which will result in multiple SAM files.

Step 3: Preparing the BAM file.

If the header of the SAM file is improperly formatted from BWA (i.e. the first line does not start with @HD), you can use the reference dictionary created with Picard in Step 1 to fix it:

java -jar picard.jar ReplaceSamHeader I=reads-mapped.sam HEADER=reference.dict O=reads-mapped-hd.sam

Next, convert the SAM file to a BAM file with samtools:

samtools view -b reads-mapped-hd.sam > reads-mapped-hd.bam

If you had multiple sequencing runs you should have run bwa mem and samtools view (and possibly picard.jar ReplaceSamHeader) on each read set, resulting multiple BAM files. If this is the case, we can merge the BAM files now:

samtools merge reads-mapped-merged.bam reads-mapped-hd-1.bam reads-mapped-hd-2.bam ... etc.

Then we sort the single BAM file:

samtools sort reads-mapped-hd.bam -o reads-mapped-hd-sorted.bam

As a result of these steps, you should have a single .bam file, regardless of how many sequencing runs you had as input (thanks to the merge step).

Step 4: Preparing inputs for Referee

At this point, you could either pre-calculate genotype log-likelihoods with whatever method you choose (see README for proper file format), or make a pileup file so Referee can calulate genotype likelihoods. You only need to do one of Steps 4a or 4b.

  • Step 4a: Calculating genotype log-likelihoods with ANGSD

    If you want to pre-calculate genotype log-likelihoods for Referee, you can use any program you want. However, ANGSD is easy and provides the likelihoods in a format that Referee can read. Here is how to use ANGSD for genotype log-likelihood calculation:

    angsd -GL 2 -i reads-mapped-hd-sorted.bam -ref reference.fa -minQ 0 -doGlf 4 -o reads-gl

    This will produce 2 files: reads-gl.glf.gz which contains the genotype log-likelihoods and reads-gl.arg which is a log file produced by ANGSD.

    A brief explanation of the options used in the above command:

    -GL 2 Indicates we want to calculate genotype likelihoods
    -minQ 0 Sets the minimum mapping quality of a site to 0 to ensure all sites are used
    -doGlf 4 Use the original GATK method for calculating genotype likelihoods (same as Referee)
  • Step 4b: Generate a pileup file to calculate genotype likelihoods with Referee

    To have Referee calculate the genotype likelihoods, we'll need the mapped reads in pileup format:

    samtools mpileup -d 999999999 -f reference.fa -Q 0 -s -B -o reads.pileup reads-mapped-hd-sorted.bam

    This will produce a single file, reads.pileup

    A brief explanation of the options used in the above command:

    -d 999999999 Sets the maximum read depth to count a site to a value high enough such that all sites will be counted.
    -Q 0 Sets the minumum quality score for a site to be counted to 0. Again this ensures all sites are counted.
    -s Include mapping quality in the pileup output (optional).
    -B Disables mpileup's BAQ adjustment to the base quality scores.

Important note: Referee may give different scores depending on how you calculate genotype likelihoods. This is because different programs may filter or trim different reads when calculating the likelihoods. Just be aware of this moving forward. To guarantee that mpileup and ANGSD produce the same scores, make sure they use ALL reads by using the -x option in mpileup and the -remove_bads 0 option in ANGSD, though this is not the best practice.

Step 5: Calculate reference quality scores with Referee

If you used pre-calculated genotype log-likelihoods with ANGSD:

referee.py -gl reads-gl.glf.gz -ref reference.fa --fastq --correct -o reference

Or if you have a pileup for Referee to use:

referee.py -gl reads.pileup -ref reference.fa --pileup --fastq --correct -o reference

A brief explanation of the options used in the above command:

-gl This is the input file for Referee. It is the output file from step 4, either the .glf.gz file from ANGSD or the pileup file.
-ref The reference fasta file.
--pileup This flag indicates that the input specified with -gl is in pileup format. If you have a .glf.gz file from ANGSD, leave this off.
--fastq This tells Referee to produce the FASTQ output file.
--correct This tells Referee to correct sites that have more support for an alternate base than the one called in the assembly.
-o This is the prefix for all output created by Referee.

This will produce the following files:

reference.log A log containing info for the run.
reference.txt Tab delimited scores (columns: scaffold, position, Referee score, corrected base, corrected base score).
reference.fq FASTQ formatted reference genome.

See the README for more information on these file formats.