This is the 2021 version of this workshop! For the most up to date version click here. To view the archive of all previous versions click here.

Getting started

Hello! Today we'll be going through some hands-on activities to help you get familiar with the steps involved in genome assembly and quality assessment.

The first thing you should do if you haven't done so is connect to the ConGen server. We'll be working exclusively in the RStudio browser interface that you should be familiar with by now, but if you have questions or problems at any point please feel free to ask! Just in case, here's a annotated picture of roughly what you should be seeing right now. If you are seeing something drastically different or something that you don't understand, let us know.

Figure 1.1: The RStudio interface for running commands and browsing files.

Most of our work will be done as bash commands typed in the Terminal provided by RStudio. Throughout this walkthrough, commands will be presented as follows:

this is an example command

Following each command will be a table that goes through and explains each part of the command explicitly:

Command line parameterDescription
thisAn example command
isAn example option used in the example command
anAn example option used in the example command
exampleAn example option used in the example command
commandAn example option used in the example command

The goal of providing these tables is to break-down some of the 'black box' that command line tools can sometimes feel like. Hopefully this is helpful. If not, feel free to skip over these tables when you see them!

Tip - Help menus

A general convention among command-line software is to provide a help menu for programs that lists common options. These can generally viewed from the command line with the -h option as follows:

<program> -h  -or-  <program> <sub-program> -h

For Linux commands, documentation is generally available with the man command (man is short for manual):

man <command>

man opens a text viewer that can be navigated with the arrow keys and exited simply by typing q. If you're ever stuck or want to know more about a program's options, try these!

Commands that you should run will have a green background. We will also provide some commands that are beneficial to see, but do not necessarily need to be run using a red background, like so:

this is an example command that won't be run

Additionally, one of the most important and often overlooked parts of bioinformatics analyses is to simply look at ones data. There will be several points where we stop to look at the output of a given program or command. When we do, a snippet of the output will be presented in the walkthrough as follows:

Here is some made up output.
Looking at your data is very important!
You can catch problems before you use the data in later analyses.
Drosophila data

Today we'll be talking about genome assembly. While the programs that perform the various steps of assembly have greatly improved over the past few years, genome assembly is still generally a slow, multi-step process. Given that, we'll be working with a smaller 150Mb genome from Drosophila pseudoobscura. Many times we will limit our data to just chromosome 2 (32Mb) of D. pseudoobscura to speed up run times even more.

D. pseudoobscura is a species of fruit fly that diverged from the well known D. melanogaster model species roughly 50 million years ago. D. pseudoobscura chromosome 2 is homologous to D. melanogaster chromosome arm 3R.

Figure 1.2: Drosophila phylogeny and homology between chromosome arms. From Schaeffer et al. 2008

Using these data, we'll be performing the following tasks today:

  1. Assembling the D. pseudoobscura genome and assessing the quality of said assembly.
  2. Mapping reads from D. pseudoobscura chromosome 2 to D. melanogaster chromosome 3R.
  3. Assessing how iterative mapping of reads from D. pseudoobscura chromosome 2 to D. melanogaster affects divergence estimates.
Input files & output prep

Copying input data

The data we'll be working with today are mainly sequences in FASTA and FASTQ format (more on those in a moment). The input files are located on the server at ~/instructor_materials/Gregg_Thomas/congen-assembly/. Let's make a copy of this directory in our home directory so we don't have to worry about that path anymore. First, make sure you are in your home directory:

cd ~
Command line parameterDescription
cdThe Linux change directory command
~The path to the directory you want to change to. In bash, ~ is a shortcut meaning "the current user's home directory."

Now let's make a copy of the data directory:

cp -r instructor_materials/Gregg_Thomas/congen-assembly/ .
Command line parameterDescription
cpThe Linux copy command
-rRecursively copy all files in a directory.
instructor_materials/Gregg_Thomas/congen-assembly/The path to the directory you want to copy.
.The path to the new copy. In bash, . is a shortcut meaning "the same name." So this will copy the directory to our current location with the name congen-assembly

Let's move into this folder with cd again since we'll spend the rest of the workshop here:

cd congen-assembly
Command line parameterDescription
cdThe Linux change directory command
congen-assemblyThe path to the directory you want to change to.

And now we can look at what is in that folder with the ls command. Make sure you've selected your Terminal window and type the following:

ls
Command line parameterDescription
lsThe Linux list directory command to view the files in a folder. Shows files in current folder by default.

You should see the following folders listed

FolderDescription
dmel-3R-reference/A folder containing the D. melanogaster chromosome 3R sequence file and its indices.
dpse-chr2-reads/A folder containing Illumina short reads for D. pseudoobscura.
expected-outputs/Pre-run outputs for all the programs we run today. If you get stuck or something takes too long, look for the expected output here
scripts/A few supplementary scripts for data analysis.
Tip - Pre-generated outputs

We have tried to anticipate the expected outputs from the commands we run today. If you get behind or stuck on something, try moving on to the next step by adding expected-outputs/ to the beginning of the path for input you were expected to generate for the next command. Feel free to ask us for help for any specific command.

Preparing output directories

I like to try and think ahead about what outputs my project will produce and make those directories early on, which helps me plan out my workflows. Today we'll be generating assemblies, read mappings, and alignments, so let's prepare an output directory for each of those tasks.

mkdir alignments
Command line parameterDescription
mkdirThe Linux make directory command
alignmentsThe name of the directory where we will generate and store our alignments for comparing assemblies and read mappings
mkdir assemblies
Command line parameterDescription
mkdirThe Linux make directory command
assembliesThe name of the directory where we will generate and store our assemblies
mkdir mapped-reads
Command line parameterDescription
mkdirThe Linux make directory command
mapped-readsThe name of the directory where we will generate and store our read mappings

We'll also be running the program FastQC, which requires a pre-made output directory:

mkdir fastqc-output
Command line parameterDescription
mkdirThe Linux make directory command
fastqc-outputThe desired name of the new directory.

Some other programs we run will create their own output directories. We should now be ready to run the commands to generate assemblies and read mappings. But first, let's get familiar with our input data, sequences and reads...