This is the 2021 version of this workshop! For the most up to date version click here. To view the archive of all previous versions click here.

Definitions of frequently used terms
| Term | Definition |
|---|---|
| Sequence libraries | A sample of DNA that has been processed to be sequenced. |
| Reads | Fragmented and overlapping pieces of a DNA strand that are sequenced. |
| Phred quality scores | A scaled probability that a given inference (usually a called base) is incorrect. The probability of error, P(e), is scaled as Q = -10 log10 P(e). |
| Short reads | Reads from first and second generation sequencing technologies such as Sanger, Illumina, and IonTorrent. Short reads can range from 30 to 1,000 bp long. |
| Read pair | Many short read sequencing technologies sequence from both ends of a DNA fragment, resulting in a pair of sequenced reads that come from that fragment. |
| Adapter | A short piece of DNA that is ligated to the short fragment to be sequenced. The adapter allows the fragment to be affixed to a physical medium (such as a flow cell) to facilitate amplification and sequencing. |
| Insert size | The size of the DNA fragment between the adapter sequences. |
| Mate pairs | Long-insert paired end reads prepared by circularizing longer DNA fragments. |
| Jumping libraries | Junction-fragment libraries. Mate pair libraries. |
| Long reads | Reads from single-molecule sequencing technologies such as PacBio SMRT and Oxford Nanopore. Long reads can range from 1,000 to 100,000+ bp long. |
| Genome assembly, Assembly, de novo Assembly | 1. The process by which small overlapping parts of the genome are reconstructed into longer contiguous sequences. 2. A sequence that has undergone the assembly process. |
| Contigs | Assembled reads. Contig assembly is usually done with a graph-based representation (e.g. de Bruijn graphs) of overlapping sequence reads. |
| Scaffolds | Contigs that have been joined together to form longer sequences. Scaffolding is usually done using read pair information or long reads. |
| Reference genome | An already assembled genome to which you can compare newly sequenced reads or genomes. |
| Read mapping | The process of aligning reads from a newly sequenced genome to a reference genome. |
| Mapping quality | A probability, usually Phred-scaled, that a given read has mapped incorrectly. |
| Reference-guided assembly | 1. The process of using read mapping to reconstruct the genome from a set of reads. 2. A sequence that has undergone the reference-guided assembly process. |
| Reference bias | The phenomenon of a set of mapped reads appearing to resemble the reference genome more closely (through lower divergence) than they actually do, because the reads containing the most variation were not mapped. |
| Iterative mapping | The process of mapping reads to a reference genome, generating a reference-guided assembly, and then repeating the process, this time mapping to the new reference-guided assembly. Done to reduce reference bias. |
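
The Phred scaling used for base quality and mapping quality can be sketched in a few lines of Python. This is only an illustration of the formula Q = -10 log10 P(e) and its inverse; the function names `phred_from_error` and `error_from_phred` are made up for this example and do not come from any particular library.

```python
import math


def phred_from_error(p_error: float) -> float:
    """Convert an error probability P(e) into a Phred-scaled score Q."""
    if not 0.0 < p_error <= 1.0:
        raise ValueError("error probability must be in the interval (0, 1]")
    return -10.0 * math.log10(p_error)


def error_from_phred(q: float) -> float:
    """Convert a Phred-scaled score Q back into an error probability P(e)."""
    return 10.0 ** (-q / 10.0)


if __name__ == "__main__":
    # Each step of 10 in Q corresponds to a tenfold drop in error probability.
    for q in (10, 20, 30, 40):
        print(q, error_from_phred(q))
```

Under this scaling, a score of 20 corresponds to a 1 in 100 chance that the call is wrong, and a score of 30 to a 1 in 1,000 chance.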
File formats
| Format | Use | Link | Specs |
|---|---|---|---|
| FASTA | Stores sequence data. | Wikipedia | NA |
| FASTQ | Stores sequence data and quality scores. | Wikipedia | Link |
| SAM | Sequence Alignment Map format. Stores information about reads mapped to a reference genome. | Wikipedia | Link |
| BAM | Binary Alignment Map format. The compressed binary version of SAM format. | Wikipedia | Link |
| CRAM | Another compressed format to store read mapping information. | Wikipedia | Link |
| VCF | Variant Call Format. Used to store information about variants inferred for one or more samples. | Wikipedia | Link |
| BCF | Binary variant Call Format. The compressed binary version of VCF. | Wikipedia | Link |
| BED | Stores coordinates of regions of interest. | Wikipedia | Link |
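
To make the FASTQ entry more concrete, here is a minimal sketch of reading FASTQ records and decoding their quality strings into Phred scores. It assumes the common four-line record layout (header, sequence, `+` separator, quality string) and the Phred+33 ASCII encoding; files with line-wrapped sequences or an older Phred+64 encoding would need different handling. The names `FastqRecord` and `parse_fastq` are hypothetical and chosen only for this example. For real analyses an established parser is a better choice.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    qualities: List[int]


def parse_fastq(handle: TextIO, offset: int = 33) -> Iterator[FastqRecord]:
    """Yield records from a four-line-per-record FASTQ stream.

    Assumption: sequences are not wrapped across lines and quality
    characters are encoded as chr(Q + offset), with offset 33 by default.
    """
    while True:
        header = handle.readline()
        if not header:  # end of file
            return
        header = header.rstrip("\n")
        if not header:  # tolerate blank lines between records
            continue
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(
            name=header[1:],
            sequence=sequence,
            qualities=[ord(char) - offset for char in quality],
        )


if __name__ == "__main__":
    example = "@read1\nACGT\n+\nII5!\n"
    for record in parse_fastq(io.StringIO(example)):
        print(record.name, record.sequence, record.qualities)
```

Each quality character maps to one base in the sequence, which is why the two strings must have the same length.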