ConGen2021 - Assembly Workshop

Page contents

Now that we're familiar with de novo assembly, let's look at another strategy for genome assembly: reference-guided assembly by read mapping. Using this strategy, the raw reads aren't put back together, but rather aligned to a reference genome:

Figure 4.1: Reference guided assembly via read-mapping.

Reference-guided assembly with read mapping has several advantages, including being fast and obviating the need for de novo genome annotation (as long as the reference genome is annotated). On the other hand, read mapping limits the type of mutations that can be inferred. For instance, while structural variation can be indirectly inferred based on relative read-depth, large genome rearrangements between the reference and the sample cannot be captured simply by read mapping. In other words, there's a lot about genome evolution that cannot be answered by mapping reads. Nevertheless, its utility and ease makes it a go-to strategy for systems with good reference genomes.

We'll be using BWA to map some short reads from D. pseudoobscura chromosome 2 to the homologous chromosome 3 in D. melanogaster and then assessing the read mapping.

Mapping reads with BWA and samtools

BWA is a short-read alignment program and is very straightforward. Simply tell it the reads you want to map and the reference genome and it will take care of the rest:

bwa mem -t 4 dmel-3R-reference/Drosophila_melanogaster.BDGP6.28.dna.chromosome.3R.fa dpse-chr2-reads/illumina/pse-chr2_1.fastq.gz dpse-chr2-reads/illumina/pse-chr2_2.fastq.gz | samtools sort > mapped-reads/dpse2-to-dmel3R.bam

You'll notice that in this command we're actually taking the output from BWA and piping it directly into samtools to sort the resulting read mappings by coordinate. This sorting is required by many programs that process mapped reads.

Here's a table that breaks down this command:

Command line parameter	Description
bwa	Call the BWA program
mem	Use the `mem` alignment algorithm implemented in BWA
-t 4	Tells BWA to use 4 threads to map reads
dmel-3R-reference/Drosophila_melanogaster.BDGP6.28.dna.chromosome.3R.fa	The path to the reference genome file
dpse-chr2-reads/illumina/pse-chr2_1.fastq.gz	The path to the FASTQ file containin the first pairs of reads
dpse-chr2-reads/illumina/pse-chr2_2.fastq.gz	The path to the FASTQ file containin the second pairs of reads
\|	This pipe means that the output from the command specified before it will be used as input to the command specified after it
samtools	Call the `samtools` program
sort	Call the sorting algorithm implemented `samtools`
>	This redirects the output from the command specified before it into the file specified after it
mapped-reads/dpse2-to-dmel3R.bam	The output file for the mapped and sorted reads

Mapped reads in SAM/BAM format

After this runs, we should get the file mapped-reads/dpse2-to-dmel3R.bam which contains the mapped reads in BAM format. SAM format is a text format that contains one mapped read per line, along with information about the mapping and BAM is just the compressed version of SAM format.

samtools view is a helpful tool that let's us retrieve lines from these files and convert between formats. SAM files also contain headers that tell us about how the file was generated and the reads that make up the file. Let's use samtools view to look at the header of our newly mapped reads.

First, let's change directories to the folder that contains the BAM file so we don't have to keep typing that in our commands:

cd mapped-reads

Command line parameter	Description
cd	The Linux change directory command
mapped-reads	The path to the directory we want to change to

Now let's look at the header of our BAM file:

samtools view -H dpse2-to-dmel3R.bam

Command line parameter	Description
samtools	Call the `samtools` program
view	Use the `view` sub-program implemented within samtools
-H	Only output the header information of the input file
dpse2-to-dmel3R.bam	The input file
\|	This pipe means that the output from the command specified before it will be used as input to the command specified after it
`head`	The Linux `head` program which returns only the first few lines of an input file
-n 3	Tells `head` to return only 3 lines of output

The output should look something like this:

@HD     VN:1.6  SO:coordinate
@SQ     SN:3R   LN:32079331
@PG     ID:bwa  PN:bwa  VN:0.7.17-r1198-dirty   CL:bwa mem -t 4 dmel-3R-reference/Drosophila_melanogaster.BDGP6.28.dna.chromosome.3R.fa dpse-2-reads/illumina/pse-chr2_1.fastq.gz dpse-2-reads/illumina/pse-chr2_2.fastq.gz
@PG     ID:samtools     PN:samtools     PP:bwa  VN:1.10 CL:samtools sort
@PG     ID:samtools.1   PN:samtools     PP:samtools     VN:1.10 CL:samtools view -H dpse2-to-dmel3R.bam

The first line starting with @HD tells us the version of the SAM format that this file follows along with the sort order (SO of the file. In this case, since we ran samtools sort, the sort order is coordinate.

Lines starting with @SQ tell us about the sequence names and lengths of the reference genome. These would be the sequence headers from the reference FASTA file.

Finally, @PG tells us about the programs that were used to generate this file. These should look familiar!

Without the -H option, we can view the actual entries in the BAM file:

samtools view dpse2-to-dmel3R.bam | head -n 3

Command line parameter	Description
samtools	Call the `samtools` program
`view`	Use the `view` sub-program implemented within samtools
dpse2-to-dmel3R.bam	The input file

Now you should see some actual mapped reads!

SRR11230564.10809749    163     3R      33475   0       128S23M =       33475   23      GTAGACCCGACCGTTTTCCGGCCCAGATGACCTCCCCGATGCATAATATGTGCACCTGTCAACCAGAAGGTTCGTTTCTTAGGATTTCTATACTACAGTGCTGAGTAATTTTTCTCGCACTGTTTTTGGTAAAATAAAAAAAAAAAAAAAA    <AAAFJFFJJJFJFFJJJJJJJJAFJJFFJFJFJJJJJJJJJJFFJJJJJJJJJJJAJFAJFJJJJFJJJFFFF7-7--7AA7JA<<AFJJJFJJJJFAFFFJJFAJJJAFFFJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ-    NM:i:0  MD:Z:23 MC:Z:128S23M    AS:i:23 XS:i:23
SRR11230564.10809749    83      3R      33475   0       128S23M =       33475   -23     GTAGACCCGACCGTTTTCCGGCCCAGATGACCTCCCCGATGCATAATATGTGCACCTGTCAACCAGAAGGTTCGTTTCTTAGGATTTCTATACTACAGTGCTGAGTAATTTTTCTCGCACTGTTTTTGGTAAAATAAAAAAAAAAAAAAAC    AF77)JJJJJJJJFJJAJJJJJJJJJFJJFJF<FJJJJJJJJJJJJFJJFJJFFJJJFJJJJJJJJFJJJJJJJJJJJJJJJFJJJJFJJJJJJJFJFJJJJJJJJJJJJJJJ<JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFFFAA    NM:i:1  MD:Z:22A0       MC:Z:128S23M    AS:i:22 XS:i:22
SRR11230564.16812502    99      3R      35215   0       55S20M76S       =       35233   38      GAGAGACCAGGCCCATAAAAACCTAAAGACGAAACTCGGCCAGTACGGGCGGTAAGAGTACGAGTACGGGACCGATTATTACACTTAAATTGTAACCAATTTAACGGCACGTAAAAACTAAAAGCAGTCTCGGTGAGATCGGAAGAGCGTC    AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFFJJJJJAA    NM:i:0  MD:Z:20 MC:Z:71S20M60S  AS:i:20 XS:i:20

This is a tab delimited output with one read mapping per line. Some of the columns should look familiar, like the read name (column 1), sequence (column 10), and quality scores (column 11) from the FASTQ file. The other columns are described on the BWA website or in the table in this helpful walkthrough that I've copied below:

Figure 4.2: The column definitions of SAM/BAM formatted files. Source.

Marking duplicate reads in SAM/BAM format

Duplicate reads can lead to spurious results for analyses that rely on depth of coverage, such as inferences of copy number variation. Depending on the sequencing machine used, the cause of duplicate reads can vary. For Illumina machines, duplicates can be caused by uneven PCR amplification (PCR duplicates) and a single spot on the flow cell being mistakenly identified as two separate spots (optical duplicates).

The best practice is to mark these duplicate reads in your final SAM/BAM file so that programs in later analyses can deal with them appropriately. To do this, we use the program Picard.

picard MarkDuplicates I=dpse2-to-dmel3R.bam O=dpse2-to-dmel3R.mkdup.bam VALIDATION_STRINGENCY=LENIENT M=dpse2-to-dmel3R-dupmets.txt CREATE_INDEX=true

Command line parameter	Description
picard	Call the picard program
MarkDuplicates	Use the MarkDuplicates sub-program implemented within picard
I=dpse2-to-dmel3R.bam	The input file
O=dpse2-to-dmel3R.mkdup.bam	The desired output file
VALIDATION_STRINGENCY=LENIENT	Tells picard not to exit if it encounters slightly different SAM formatting than expected
M=dpse2-to-dmel3R-dupmets.txt	The name for a file to output some metrics
CREATE_INDEX=true	Tells picard to also create and index file for the output BAM file. These indices are necessary for other programs to use the file.

This should create a new BAM file, dpse2-to-dmel3R.mkdup.bam, as well as an index for this BAM file and a metrics file. In this file, reads that have been determined to be duplicates will have 1024 added to their FLAG (column 2) in the output BAM/SAM file. See this page for more info about SAM FLAGS.

We can get a summary of how many reads were marked as duplicates from the metrics file:

grep -a2 "## METRICS CLASS" dpse2-to-dmel3R-dupmets.txt

Command line parameter	Description
grep	The Linux pattern search command. Given a string and a file, `grep` searches the file for the string and prints lines that contain matches to the screen. Can be used with regular expression pattern matching.
-a2	Tells `grep` to also print the two lines after the match is found.
"## METRICS CLASS"	The pattern we want to search for.
dpse2-to-dmel3R-dupmets.txt	The file in which we want to search for the specified pattern.

If the metrics file has been created succesfully, this should produce the following output:

## METRICS CLASS        picard.sam.DuplicationMetrics
LIBRARY UNPAIRED_READS_EXAMINED READ_PAIRS_EXAMINED     SECONDARY_OR_SUPPLEMENTARY_RDS  UNMAPPED_READS  UNPAIRED_READ_DUPLICATES        READ_PAIR_DUPLICATES    READ_PAIR_OPTICAL_DUPLICATES       PERCENT_DUPLICATION     ESTIMATED_LIBRARY_SIZE
Unknown Library 844     200412  2766    673346  134     621     0       0.003426        32272106

Based on the column headers, columns 6, 7, and 8 tell us how many duplicate reads were identified. In this case, 134 unpaired read duplicates, 621 paired read duplicats, and 0 optical duplicates were found, for a total of 134 + 621x2 = 1376 duplicate reads.

If we wanted to, we could look at these reads using samtools view and the flag 1024:

samtools view -f 1024 dpse2-to-dmel3R.mkdup.bam

Command line parameter	Description
samtools	Call the `samtools` program.
view	Use the `view` sub-program within `samtools`.
-f 1024	Only look for lines that have this included in their `FLAG` column (column 2). 1024 is the MarkDuplicates flag.
dpse2-to-dmel3R.mkdup.bam	The input BAM/SAM file to view.

This should print all 1376 duplicate reads to the screen.

From here, the mapped reads can be used to calculate genotype likelihoods at each position to infer where the reference genome and the mapped genome differ. These variants can be used in downstream analyses to assess divergence, phylogenetic relationships, selection, or any other comparative analysis you can think of.

Figure 4.3: Variant calling from mapped reads. Green sites are true variants while dark red sites are sequencing errors.

We won't cover this in detail today, but we've provided a list of several variant calling programs on the Programs page.

Assessing mapped reads with read depth

One of the most common questions you'll hear when mapping reads to a reference (or even during de novo assembly) is what is the average read depth? Read depth is simply defined as how many reads align to a given position in the reference genome and is related to the term coverage. However, when referring to coverage one is often asking how many reads it would take to cover X number of positions in the genome. With so many reads being produced by modern sequencing machines, the two terms have become interchangeable because it is assumed all positions in the genome will be covered by reads.

One way for estimating coverage is to use the Lander/Waterman (1988) equation:

\[ Coverage = \frac{LN}{G} \]

Term	Definition
`Coverage`	The number of times a position in the genome has been sequenced. The number of reads "covering" a position.
`L`	Read length.
`N`	Number of reads.
`G`	Haploid genome length

We can use this equation in two ways:

Estimate read depth/coverage by solving as is.
Estimate the number of reads needed to obtain a certain amount of coverage by solving for N.

In our case, for D. pseudoobscura chromosome 2, we know L, N, and G from our fastqc analysis and genome assembly:

`L`	150
`N`	537507 * 2
`G`	32000000

\[ Coverage = \frac{150 * 537507 * 2}{32000000} = 5.04 \]

So we estimate our coverage at 5.04, meaning that we would expect every position to be covered by 5 reads.

Let's see if this matches up to the observed read depth. First, lets use samtools depth to get the number of reads that map to each position.

samtools depth dpse2-to-dmel3R.mkdup.bam > dpse2-to-dmel3R-depth.tab

Command line parameter	Description
samtools	Call the `samtools` program.
depth	Use the `depth` sub-program within `samtools`.
dpse2-to-dmel3R.mkdup.bam	The input BAM/SAM file.
>	This redirects the output that would have been printed to the screen to a file instead.
dpse2-to-dmel3R-depth.tab	The desired output file

The dpse2-to-dmel3R-depth.tab file simply contains the position in the reference genome and the number of reads mapped to this position. These files can get very large for large genomes, so its generally not adviseable to plot read depth per-base (as we're about to do). Instead, for larger genomes, windows of 10kb to 100kb can be averaged and analyzed together. Another good program for calculating per-base read depth with some clever space-saving output options is mosdepth.

But since we're looking at a small genome, let's just analyze this file! One of the pre-baked scripts provided is an R script that takes a samtools depth file as input and calculates averge read depth and generates some figures. Let's run it on the file we just created:

Rscript ../scripts/depth_plot.r dpse2-to-dmel3R-depth.tab

Command line parameter	Description
Rscript	Invokes R to run the script.
../scripts/depth_plot.r	The path to the script.
dpse2-to-dmel3R-depth.tab	The input file.

This script will take a couple of minutes to run, but should print the average read depth to the screen a histogram of read depth: depth-hist.png, which you can look at by clicking on it in the file browser if you'd like. I see that average read depth is about 4:

# Avg. read depth: 3.948774

This is lower than expected based on our calculation with the Lander/Waterman equation. Why? One cause would be many reads having sequencing errors leading to unmapped reads. This doesn't seem to be the case for our data, based on the FastQC results we obtained earlier. Another cause could be that the reads that contain the most variation between the two species were not able to be mapped. This is referred to as reference bias: reads that contain too much variation cannot be mapped, so the final mapped reads end up looking more like the reference genome than they actually are. Iterative mapping attempts to address this issue.

This is the 2021 version of this workshop! For the most up to date version click here. To view the archive of all previous versions click here.