This is the 2021 version of this workshop! For the most up to date version click here. To view the archive of all previous versions click here.

de novo Genome assembly is the process by which overlapping sequence reads are pieced back together to form longer, contiguous sequences for analysis:

Figure 3.1: Genome assembly usually proceeds in steps that connect progressively longer segments of sequence.

Many factors can play a role in the ultimate success of a genome assembly, including sequencing error rate, heterozygosity of the target genome, length of the target genome, and number and length of reads sequenced.

Today, we'll be using both short reads to assemble the D. pseudoobscura genome and then using several metrics to compare several assemblies.

Assembly short reads with SPAdes

SPAdes is a de Bruijn graph based assembler that can handle many types of data, both short and long reads, and can scaffold contigs together using read pairs or pre-assembled contigs. It is highly automated, and combines results from graphs built from multiple values of k. SPAdes, while relatively fast among genome assemblers, still takes a while to run, so today we'll be running one assembly of D. pseudoobscura on short reads only and compare it to a pre-made assembly that uses both long and short reads.

Let's start by generating the short read assembly:

spades.py -1 dpse-chr2-reads/illumina-subsample/chr2-1mil_1.fastq.gz -2 dpse-chr2-reads/illumina-subsample/chr2-1mil_2.fastq.gz -k 77 -o assemblies/spades-illumina-chr2-k77 -t 4 --isolate
Command line parameterDescription
spades.pyCall the main SPAdes interface script.
-1 dpse-chr2-reads/illumina/chr2-1mil_1.fastq.gzThe path to the first read-pair file.
-2 dpse-chr2-reads/illumina/chr2-1mil_2.fastq.gzThe path to the second read-pair file.
-k 77The k-mer value. The length of sub-string overlaps used to construct the de Bruijn graph.
-o assemblies/spades-illumina-chr2-k77The path to the desired output directory.
-t 4The number of threads SPAdes can use. The default value is 16, but since there are 4 of us to a server we'll limit it to 4.
--isolate--isolate

This will build the SPAdes assembly from the Illumina short reads. It should take about 4 minutes to run.

SPAdes creates many output files within the specified output directory. Normally, this would include assemblies for multiple k-values. The final assembly would be constructed by comparing portions of each of the k-mer graphs and selecting the most resolved regions. However, due to time constraints, we have limited spades to one value of k=77.

Recall that k is the length of the sub-strings used to construct the de Bruijn graph. Reads a broken up into all possible strings of length k. Counterintuitively, this actually leads to smaller graphs than the overlap graphs, which use full reads, because in the de Bruijn graph, identical overlaps (repeats) are collapsed into a single node:

Figure 3.2: Figure 2 from Pevzner 2001 showing how repeats are compacted in a de Bruijn graph.

Selection of k is critical. A k that is too low will mean your graph will have too many collapsed repeats and be unresolvable. Imagine a k=2, which would essentially be a graph made up of the 4 nucleotides looping back on themselves. On the other hand, a k that is too high could lead to graphs with too few overlaps, making the assembly of longer contigs impossible. Fortunately, most modern assembly algorithms (including SPAdes) optimize over many values of k, obviating the need for k-mer selection by the user.

Take a look at this nice visualization of how k affects assembly graph structure. Visualizations done using Bandage.

The final assembly for this run will be in the file assemblies/spades-illumina-chr2-k77/scaffolds.fasta

Assessing assembly quality

One of the main goals for a genome assembly program is to produce assembly scaffolds approaching chromosome length. This is difficult because repeats can often form unresolvable portions of the underlying assembly graph that prevent the extension of a sequence. This means genome assemblies can be highly fragmented.

Here are some important statistics to consider when assessing a fragmented assembly:

StatisticDescription
Total assembly sizeThe sum of the length of all contigs. If you have an expected genome size from a closely related organism or other analysis, this can help determine how much of the genome you have assembled.
N50The length, N, of the scaffold/contig such that 50% of the assembly is contained in scaffolds of length N and longer. In other words, at least half of the nucleotides in the assembly belongs to contigs with the N50 length or longer.
L90The number of scaffolds that make up 90% of the assembly. Ideally close to the number of chromosomes in the sequenced organism.
Number of NsBases that can't be confidently called, especially near contig junctions within scaffolds, are represented by Ns in the FASTA assembly file. Counting the total number of Ns can give you an idea of how well contigs were joined together into scaffolds.

N50 can be a little tricky to understand, luckily the Molecular Ecologist blog has a simple explainer post, that has the following figure:

Figure 3.2: Scaffold N50. Source: Molecular Ecologist

L90 is similar, but using number of scaffolds instead of scaffold length.

These statistics are relatively easy to calculate, but there are also tools that can summarize these and more for a given assembly. Assemblathon was a competition to assess genome assemblers in 2013. They have helpfully made available some of the scripts they use to quantify final genome assemblies in their github repository. We will use one of their scripts to gather some basic statistics about a genome assembly.

Instead of using the assembly we just produced, which is only of a single chromosome, we've provided several D. pseudoobscura genomes to compare in the expected-outputs/assemblies/ folder. Let's run the assemblathon script on one of them:

perl /usr/bin/assemblathon_stats.pl expected-outputs/assemblies/spades-illumina-only.fasta > assemblies/spades-illumina-only-assemblathon.txt
Command line parameterDescription
perlInvoke perl to run a script.
/usr/bin/assemblathon_stats.plThe path to the perl script to run.
expected-outputs/assemblies/illumina-only-spades.fastaThe input FASTA file containing a genome assembly.
>This redirects the output from the command specified before it into the file specified after it.
assemblies/spades-illumina-only-assemblathon.txtThe desired output file.

This should produce the file assemblies/spades-illumina-only-assemblathon.txt which contains several statistics about the assembly

Let's take a look at this file:

less -S assemblies/spades-illumina-only-assemblathon.txt
Command line parameterDescription
lessA Linux text file viewer (use <up arrow> and <down arrow> to scroll up and down by one line or <space> and b to scroll up and down by one page, resepectively. Type q to exit)
-STurn off line-wrapping within less (use the <left arrow> and <right arrow> to scroll left and right).
assemblies/spades-illumina-only-assemblathon.txtThe path to the file you want to view

This should display a nice table of statstics:

---------------- Information for assembly 'assemblies/illumina-only-spades.fasta' ----------------


                                         Number of scaffolds     133829
                                     Total size of scaffolds  159563605
                                            Longest scaffold     656252
                                           Shortest scaffold         78
                                 Number of scaffolds > 1K nt       7924   5.9%
                                Number of scaffolds > 10K nt       2178   1.6%
                               Number of scaffolds > 100K nt        357   0.3%
                                 Number of scaffolds > 1M nt          0   0.0%
                                Number of scaffolds > 10M nt          0   0.0%
                                          Mean scaffold size       1192
                                        Median scaffold size        109
                                         N50 scaffold length      65232
                                          L50 scaffold count        626
                                                 scaffold %A      27.22
                                                 scaffold %C      22.57
                                                 scaffold %G      22.66
                                                 scaffold %T      27.43
                                                 scaffold %N       0.12
                                         scaffold %non-ACGTN       0.00
                             Number of scaffold non-ACGTN nt          0

We know that D. pseudoobscura has a genome size of roughly 150Mb in 6 chromosomes. So the assembly size of 159563605 is about right, but we've only been able to resolve 133829 scaffolds. No where near the 6 chromosomes we would ideally want. Such a fragmented assembly could make annotation difficult with genes likely being split between scaffolds, and analysis of structural variation impossible.

What about some other assemblies? We've provided several more assemblies of D. pseudoobscura in the expected-outputs/assemblies/ folder, including a SPAdes assembly with long reads (PacBio + Nanopore), a Flye assembly with just PacBio reads and a Flye assembly with just Nanopore reads. The assemblathon output files are there as well, but the statistics we've talked about are summarized in the following table, as well as the latest published D. pseudoobscura genome on the SRA:

StatisticSRA assembly (PacBio/Canu) SPAdes (Illumina only) SPAdes (PacBio+Nanopore) Flye (PacBio) Flye (Nanopore)
Number of scaffolds 70 133829 122980 239 1321
Assembly size 163282969 159563605 159513481 158355727 151695649
Scaffold N50 32422566 65232 110786 17300572 5886457
L90 NA 12954 7018 15 70
% Ns NA 0.12 0.05 0.00 0.00

Clearly the long read assemblies with the long read assembler (Flye) have higher N50s and fewer scaffolds than the SPAdes assemblies. Note that this doesn't mean SPAdes is "worse" than Flye, it just means that for this type of data Flye is producing more contiguous assemblies. Also note that none of these assemblies alone produce scaffolds approaching the SRA assembly for D. pseudoobscura (PacBio/Canu), which has only 70 scaffolds, most assigned to one of the 6 chromosomes.

Annotation quality with BUSCO

BUSCO, short for Benchmarking sets of Universal Single-Copy Orthologs, can help to assess the completeness of your genome with respect to gene content. It does so by comparing gene sequences of well annotated species closely related to your species of interest to determine the number of Complete, Duplicated, Fragmented, and Missing genes present in an assembly among a set of universal orthologs.

You'll learn more about BUSCO in the next session, but for now we'll just take a look at a sample output from our SPAdes assemblies.

Figure 3.3: BUSCO results for the short read SPAdes assembly.

Figure 3.4: BUSCO results for the long read SPAdes assembly.

What we can see here is that, despite the short read assembly being much more fragmented, we still recover 98% of the universal copy orthologs, identical to the long read alignment. Depending on the questions you are asking, short read assemblies may be sufficient.

Preparing your assembly for analysis

Now that we have a FASTA file that contains our genome assembly, and we know roughly how complete it is from a technical (N50) and biological (BUSCO) perspective, we're ready to use it for analysis. Many downstream analyses require an assembly FASTA to be indexed. There are several types of indices and I usually make all of them for any assembly I work with. These indices are made with bwa index, and picard CreateSequenceDictionary.

All of these are simple to generate. Today we'll just generate the samtools faidx for our pre-generated SPAdes Illumina assembly:

samtools faidx expected-outputs/assemblies/spades-illumina-only.fasta
Command line parameterDescription
samtoolsCall the samtools program
faidxCall the faidx sub-program within samtools
expected-outputs/assemblies/spades-illumina-only.fastaThe path to the FASTA file to index

This should run fairly quickly and will produce the file expected-outputs/assemblies/spades-illumina-only.fasta.fai file that we can look at

less -S expected-outputs/assemblies/spades-illumina-only.fasta.fai
Command line parameterDescription
lessA Linux text file viewer (use <up arrow> and <down arrow> to scroll up and down by one line or <space> and b to scroll up and down by one page, resepectively. Type q to exit)
-STurn off line-wrapping within less (use the <left arrow> and <right arrow> to scroll left and right).
expected-outputs/assemblies/spades-illumina-only.fasta.faiThe path to the file you want to view
NODE_1_length_656252_cov_29.014074      656252  36      60      61
NODE_2_length_575306_cov_27.441711      575306  667262  60      61
NODE_3_length_562891_cov_29.873761      562891  1252193 60      61
NODE_4_length_438072_cov_30.397365      438072  1824502 60      61
NODE_5_length_388373_cov_29.564742      388373  2269912 60      61
NODE_6_length_381978_cov_29.826266      381978  2664794 60      61
NODE_7_length_364204_cov_28.969093      364204  3053175 60      61
NODE_8_length_351331_cov_28.605180      351331  3423486 60      61
NODE_9_length_348425_cov_29.467863      348425  3780709 60      61
NODE_10_length_333664_cov_29.009230     333664  4134979 60      61
NODE_11_length_325795_cov_29.005299     325795  4474242 60      61
NODE_12_length_315498_cov_28.812061     315498  4805504 60      61
NODE_13_length_304746_cov_29.743095     304746  5126298 60      61
NODE_14_length_301147_cov_28.087172     301147  5436161 60      61
NODE_15_length_300141_cov_28.878596     300141  5742365 60      61
NODE_16_length_291038_cov_28.065040     291038  6047546 60      61
NODE_17_length_279739_cov_29.975989     279739  6343472 60      61
NODE_18_length_267631_cov_28.541939     267631  6627911 60      61
NODE_19_length_262557_cov_28.861460     262557  6900040 60      61

.fai files are pretty straightforward. They contain tab-delimited columns with information for one scaffold on each row. Column 1 indicates the scaffold name, Column 2 the length, Column 3 a byte index, Column 4 the number of bases per line, and Column 5 the number of bytes per line

Other indices that we won't generate directly today include the BWA index files, which make it easier to map reads to an assembly:

bwa index assemblies/spades-illumina-only.fasta
Command line parameterDescription
bwaCall the bwa program
indexCall the index sub-program within bwa
assemblies/spades-illumina-only.fastaThe path to the FASTA file to index

And the Picard Dictionary file, which used by variant callers such as GATK and other programs:

Picard CreateSequenceDictionary I=assemblies/spades-illumina-only.fasta O=assemblies/spades-illumina-only.dict
Command line parameterDescription
picardCall the picard program
CreateSequenceDictionaryCall the CreateSequenceDictionary sub-program within picard
I=assemblies/spades-illumina-only.fastaThe path to the FASTA file
O=assemblies/spades-illumina-only.dictThe path to the desired output file

After you've generated and assessed the quality of your assembly, the next step is usually annotation. This usually means figuring out where repeats are in the assembly and classifying their types, and figuring out where the genes in the assembly are. There are many ways to do this, enough to fill up a separate workshop, so we won't be covering annotation today.

Next we will talk about an alternate to de novo genome assembly: reference guided assembly with read mapping...