ConGen2024 - Assembly Workshop

Page contents

de novo Genome assembly is the process by which overlapping sequence reads are pieced back together to form longer, contiguous sequences for analysis:

Figure 3.1: Genome assembly usually proceeds in steps that connect progressively longer segments of sequence.

Many factors can play a role in the ultimate success of a genome assembly, including sequencing error rate, heterozygosity of the target genome, length of the target genome, and number and length of reads sequenced.

Today, we'll be using both short reads to assemble the D. pseudoobscura genome and then using several metrics to compare several assemblies.

Assembly short reads with SPAdes

SPAdes is a de Bruijn graph based assembler that can handle many types of data, both short and long reads, and can scaffold contigs together using read pairs or pre-assembled contigs. It is highly automated, and combines results from graphs built from multiple values of k. SPAdes, while relatively fast among genome assemblers, still takes a while to run, so today we'll be running one assembly of D. pseudoobscura on short reads only and compare it to a pre-made assembly that uses both long and short reads.

Let's start by generating the short read assembly:

spades.py -1 dpse-chr2-reads/illumina-subsample/chr2-1mil_1.fastq.gz -2 dpse-chr2-reads/illumina-subsample/chr2-1mil_2.fastq.gz -k 77 -o assemblies/spades-illumina-chr2-k77 -t 4 --isolate

Command line parameter	Description
spades.py	Call the main SPAdes interface script.
-1 dpse-chr2-reads/illumina/chr2-1mil_1.fastq.gz	The path to the first read-pair file.
-2 dpse-chr2-reads/illumina/chr2-1mil_2.fastq.gz	The path to the second read-pair file.
-k 77	The k-mer value. The length of sub-string overlaps used to construct the de Bruijn graph.
-o assemblies/spades-illumina-chr2-k77	The path to the desired output directory.
-t 4	The number of threads SPAdes can use. The default value is 16, but since there are 4 of us to a server we'll limit it to 4.
--isolate	--isolate

This will build the SPAdes assembly from the Illumina short reads. It should take about 4 minutes to run.

SPAdes creates many output files within the specified output directory. Normally, this would include assemblies for multiple k-values. The final assembly would be constructed by comparing portions of each of the k-mer graphs and selecting the most resolved regions. However, due to time constraints, we have limited spades to one value of k=77.

Recall that k is the length of the sub-strings used to construct the de Bruijn graph. Reads a broken up into all possible strings of length k. Counterintuitively, this actually leads to smaller graphs than the overlap graphs, which use full reads, because in the de Bruijn graph, identical overlaps (repeats) are collapsed into a single node:

Figure 3.2: Figure 2 from Pevzner 2001 showing how repeats are compacted in a de Bruijn graph.

Selection of k is critical. A k that is too low will mean your graph will have too many collapsed repeats and be unresolvable. Imagine a k=2, which would essentially be a graph made up of the 4 nucleotides looping back on themselves. On the other hand, a k that is too high could lead to graphs with too few overlaps, making the assembly of longer contigs impossible. Fortunately, most modern assembly algorithms (including SPAdes) optimize over many values of k, obviating the need for k-mer selection by the user.

Take a look at this nice visualization of how k affects assembly graph structure. Visualizations done using Bandage.

The final assembly for this run will be in the file assemblies/spades-illumina-chr2-k77/scaffolds.fasta

Assessing assembly quality

One of the main goals for a genome assembly program is to produce assembly scaffolds approaching chromosome length. This is difficult because repeats can often form unresolvable portions of the underlying assembly graph that prevent the extension of a sequence. This means genome assemblies can be highly fragmented.

Here are some important statistics to consider when assessing a fragmented assembly:

Statistic	Description
`Total assembly size`	The sum of the length of all contigs. If you have an expected genome size from a closely related organism or other analysis, this can help determine how much of the genome you have assembled.
`N50`	The length, N, of the scaffold/contig such that 50% of the assembly is contained in scaffolds of length N and longer. In other words, at least half of the nucleotides in the assembly belongs to contigs with the N50 length or longer.
`L90`	The number of scaffolds that make up 90% of the assembly. Ideally close to the number of chromosomes in the sequenced organism.
`Number of Ns`	Bases that can't be confidently called, especially near contig junctions within scaffolds, are represented by Ns in the FASTA assembly file. Counting the total number of Ns can give you an idea of how well contigs were joined together into scaffolds.

N50 can be a little tricky to understand, luckily the Molecular Ecologist blog has a simple explainer post, that has the following figure:

Figure 3.2: Scaffold N50. Source: Molecular Ecologist

L90 is similar, but using number of scaffolds instead of scaffold length.

These statistics are relatively easy to calculate, but there are also tools that can summarize these and more for a given assembly. Assemblathon was a competition to assess genome assemblers in 2013. They have helpfully made available some of the scripts they use to quantify final genome assemblies in their github repository. We will use one of their scripts to gather some basic statistics about a genome assembly.

Instead of using the assembly we just produced, which is only of a single chromosome, we've provided several D. pseudoobscura genomes to compare in the expected-outputs/assemblies/ folder. Let's run the assemblathon script on one of them:

perl /usr/bin/assemblathon_stats.pl expected-outputs/assemblies/spades-illumina-only.fasta > assemblies/spades-illumina-only-assemblathon.txt

Command line parameter	Description
perl	Invoke `perl` to run a script.
/usr/bin/assemblathon_stats.pl	The path to the `perl` script to run.
expected-outputs/assemblies/illumina-only-spades.fasta	The input FASTA file containing a genome assembly.
>	This redirects the output from the command specified before it into the file specified after it.
assemblies/spades-illumina-only-assemblathon.txt	The desired output file.

This should produce the file assemblies/spades-illumina-only-assemblathon.txt which contains several statistics about the assembly

Let's take a look at this file:

less -S assemblies/spades-illumina-only-assemblathon.txt

Command line parameter	Description
less	A Linux text file viewer (use `<up arrow>` and `<down arrow>` to scroll up and down by one line or `<space>` and `b` to scroll up and down by one page, resepectively. Type `q` to exit)
-S	Turn off line-wrapping within `less` (use the `<left arrow>` and `<right arrow>` to scroll left and right).
assemblies/spades-illumina-only-assemblathon.txt	The path to the file you want to view

This should display a nice table of statstics:

---------------- Information for assembly 'assemblies/illumina-only-spades.fasta' ----------------


                                         Number of scaffolds     133829
                                     Total size of scaffolds  159563605
                                            Longest scaffold     656252
                                           Shortest scaffold         78
                                 Number of scaffolds > 1K nt       7924   5.9%
                                Number of scaffolds > 10K nt       2178   1.6%
                               Number of scaffolds > 100K nt        357   0.3%
                                 Number of scaffolds > 1M nt          0   0.0%
                                Number of scaffolds > 10M nt          0   0.0%
                                          Mean scaffold size       1192
                                        Median scaffold size        109
                                         N50 scaffold length      65232
                                          L50 scaffold count        626
                                                 scaffold %A      27.22
                                                 scaffold %C      22.57
                                                 scaffold %G      22.66
                                                 scaffold %T      27.43
                                                 scaffold %N       0.12
                                         scaffold %non-ACGTN       0.00
                             Number of scaffold non-ACGTN nt          0

We know that D. pseudoobscura has a genome size of roughly 150Mb in 6 chromosomes. So the assembly size of 159563605 is about right, but we've only been able to resolve 133829 scaffolds. No where near the 6 chromosomes we would ideally want. Such a fragmented assembly could make annotation difficult with genes likely being split between scaffolds, and analysis of structural variation impossible.

What about some other assemblies? We've provided several more assemblies of D. pseudoobscura in the expected-outputs/assemblies/ folder, including a SPAdes assembly with long reads (PacBio + Nanopore), a Flye assembly with just PacBio reads and a Flye assembly with just Nanopore reads. The assemblathon output files are there as well, but the statistics we've talked about are summarized in the following table, as well as the latest published D. pseudoobscura genome on the SRA:

Statistic	SRA assembly (PacBio/Canu)	SPAdes (Illumina only)	SPAdes (PacBio+Nanopore)	Flye (PacBio)	Flye (Nanopore)
Number of scaffolds	70	133829	122980	239	1321
Assembly size	163282969	159563605	159513481	158355727	151695649
Scaffold N50	32422566	65232	110786	17300572	5886457
L90	NA	12954	7018	15	70
% Ns	NA	0.12	0.05	0.00	0.00

Clearly the long read assemblies with the long read assembler (Flye) have higher N50s and fewer scaffolds than the SPAdes assemblies. Note that this doesn't mean SPAdes is "worse" than Flye, it just means that for this type of data Flye is producing more contiguous assemblies. Also note that none of these assemblies alone produce scaffolds approaching the SRA assembly for D. pseudoobscura (PacBio/Canu), which has only 70 scaffolds, most assigned to one of the 6 chromosomes.

Annotation quality with BUSCO

BUSCO, short for Benchmarking sets of Universal Single-Copy Orthologs, can help to assess the completeness of your genome with respect to gene content. It does so by comparing gene sequences of well annotated species closely related to your species of interest to determine the number of Complete, Duplicated, Fragmented, and Missing genes present in an assembly among a set of universal orthologs.

You'll learn more about BUSCO in the next session, but for now we'll just take a look at a sample output from our SPAdes assemblies.

Figure 3.3: BUSCO results for the short read SPAdes assembly.

Figure 3.4: BUSCO results for the long read SPAdes assembly.

What we can see here is that, despite the short read assembly being much more fragmented, we still recover 98% of the universal copy orthologs, identical to the long read alignment. Depending on the questions you are asking, short read assemblies may be sufficient.

Preparing your assembly for analysis

Now that we have a FASTA file that contains our genome assembly, and we know roughly how complete it is from a technical (N50) and biological (BUSCO) perspective, we're ready to use it for analysis. Many downstream analyses require an assembly FASTA to be indexed. There are several types of indices and I usually make all of them for any assembly I work with. These indices are made with bwa index, and picard CreateSequenceDictionary.

All of these are simple to generate. Today we'll just generate the samtools faidx for our pre-generated SPAdes Illumina assembly:

samtools faidx expected-outputs/assemblies/spades-illumina-only.fasta

Command line parameter	Description
samtools	Call the `samtools` program
faidx	Call the `faidx` sub-program within `samtools`
expected-outputs/assemblies/spades-illumina-only.fasta	The path to the FASTA file to index

This should run fairly quickly and will produce the file expected-outputs/assemblies/spades-illumina-only.fasta.fai file that we can look at

less -S expected-outputs/assemblies/spades-illumina-only.fasta.fai

Command line parameter	Description
less	A Linux text file viewer (use `<up arrow>` and `<down arrow>` to scroll up and down by one line or `<space>` and `b` to scroll up and down by one page, resepectively. Type `q` to exit)
-S	Turn off line-wrapping within `less` (use the `<left arrow>` and `<right arrow>` to scroll left and right).
expected-outputs/assemblies/spades-illumina-only.fasta.fai	The path to the file you want to view

NODE_1_length_656252_cov_29.014074      656252  36      60      61
NODE_2_length_575306_cov_27.441711      575306  667262  60      61
NODE_3_length_562891_cov_29.873761      562891  1252193 60      61
NODE_4_length_438072_cov_30.397365      438072  1824502 60      61
NODE_5_length_388373_cov_29.564742      388373  2269912 60      61
NODE_6_length_381978_cov_29.826266      381978  2664794 60      61
NODE_7_length_364204_cov_28.969093      364204  3053175 60      61
NODE_8_length_351331_cov_28.605180      351331  3423486 60      61
NODE_9_length_348425_cov_29.467863      348425  3780709 60      61
NODE_10_length_333664_cov_29.009230     333664  4134979 60      61
NODE_11_length_325795_cov_29.005299     325795  4474242 60      61
NODE_12_length_315498_cov_28.812061     315498  4805504 60      61
NODE_13_length_304746_cov_29.743095     304746  5126298 60      61
NODE_14_length_301147_cov_28.087172     301147  5436161 60      61
NODE_15_length_300141_cov_28.878596     300141  5742365 60      61
NODE_16_length_291038_cov_28.065040     291038  6047546 60      61
NODE_17_length_279739_cov_29.975989     279739  6343472 60      61
NODE_18_length_267631_cov_28.541939     267631  6627911 60      61
NODE_19_length_262557_cov_28.861460     262557  6900040 60      61

.fai files are pretty straightforward. They contain tab-delimited columns with information for one scaffold on each row. Column 1 indicates the scaffold name, Column 2 the length, Column 3 a byte index, Column 4 the number of bases per line, and Column 5 the number of bytes per line

Other indices that we won't generate directly today include the BWA index files, which make it easier to map reads to an assembly:

bwa index assemblies/spades-illumina-only.fasta

Command line parameter	Description
bwa	Call the `bwa` program
index	Call the `index` sub-program within `bwa`
assemblies/spades-illumina-only.fasta	The path to the FASTA file to index

And the Picard Dictionary file, which used by variant callers such as GATK and other programs:

Picard CreateSequenceDictionary I=assemblies/spades-illumina-only.fasta O=assemblies/spades-illumina-only.dict

Command line parameter	Description
picard	Call the `picard` program
CreateSequenceDictionary	Call the `CreateSequenceDictionary` sub-program within `picard`
I=assemblies/spades-illumina-only.fasta	The path to the FASTA file
O=assemblies/spades-illumina-only.dict	The path to the desired output file

After you've generated and assessed the quality of your assembly, the next step is usually annotation. This usually means figuring out where repeats are in the assembly and classifying their types, and figuring out where the genes in the assembly are. There are many ways to do this, enough to fill up a separate workshop, so we won't be covering annotation today.

This brings us to the end of our workshop. Thanks for attending!

In previous years, we covered read mapping as well, but were unable to this year due to time constraints. But feel free to view last year's workshop if you're interested!