ConGen2021 - Intro to Bioinformatics

Getting started

Hello! Today we'll be going through some hands-on activities to help you get familiar with how many bioinformatics tasks can be done directly from the command line.

The first thing you should do if you haven't done so is connect to the ConGen server. We'll be working exclusively in the RStudio browser interface that you should be familiar with by now, but if you have questions or problems at any point please feel free to ask! Just in case, here's an annotated picture of roughly what you should be seeing right now. If you are seeing something drastically different or something that you don't understand, let us know.

Figure 1.1: The RStudio interface for running commands and browsing files.

Most of our work will be done as bash commands typed in the Terminal provided by RStudio. Throughout this walkthrough, commands will be presented as follows:

this is an example command

Following each command will be a table that goes through and explains each part of the command explicitly:

Command line parameter	Description
this	An example command
is	An example option used in the example command
an	An example option used in the example command
example	An example option used in the example command
command	An example option used in the example command

The goal of providing these tables is to break down some of the 'black box' that command line tools can sometimes feel like. Hopefully this is helpful. If not, feel free to skip over these tables when you see them!

Tip - Help menus

A general convention among command-line software is to provide a help menu that lists a program's common options. These can generally be viewed from the command line with the -h option as follows:

<program> -h
-or-
<program> <sub-program> -h

For Linux commands, documentation is generally available with the man command (man is short for manual):

man <command>

man opens a text viewer that can be navigated with the arrow keys and exited simply by typing q. If you're ever stuck or want to know more about a program's options, try these!
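As a small illustration of this tip, here is a sketch that assumes the GNU version of ls found on most Linux servers. Note that not every program uses -h for help: for ls, -h means "human-readable sizes", so its help menu is printed with --help instead. Piping the result to head keeps only the first few lines.

```shell
# Show the first 5 lines of the ls help menu.
# 2>&1 also captures help text that some programs print to the error stream.
ls --help 2>&1 | head -n 5
```

If -h does not print a help menu for a program, --help is usually the next thing to try.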

Commands that you should run will have a green background. We will also show some commands that are useful to see but do not necessarily need to be run. These will have a red background, like so:

this is an example command that won't be run

Additionally, one of the most important and often overlooked parts of bioinformatics analyses is to simply look at one's data. There will be several points where we stop to look at the output of a given program or command. When we do, a snippet of the output will be presented in the walkthrough as follows:

Here is some made up output.
Looking at your data is very important!
You can catch problems before you use the data in later analyses.
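Here is a minimal sketch of what "looking at your data" can mean in the Terminal. It uses a small made-up file (the file and its contents are hypothetical and only for illustration): head prints the first lines of a file, and wc -l counts its lines.

```shell
# Create a tiny made-up file to practice on (hypothetical contents)
example_file=$(mktemp)
printf 'sample1\t12\nsample2\t7\nsample3\t30\n' > "$example_file"

# Print the first 2 lines of the file
head -n 2 "$example_file"

# Count the lines; reading from standard input makes wc print only the number
line_count=$(wc -l < "$example_file")
echo "Lines in file: $line_count"
```

Checking the first few lines and the line count is often enough to spot an empty file or an unexpected format.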

Downloading the project

Today we'll be doing some basic bioinformatics tasks from the command line. We'll get to the specifics of the data later, but for now please download the project template we've provided on GitHub.

First, make sure you're in your home directory. If you're not, or you're not sure if you are, run the command:

cd ~

Command line parameter	Description
cd	The Linux change directory command
~	The path to the directory you want to change to. ~ is a shortcut for "the current user's home directory."
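If you would like to confirm where you ended up, pwd (print working directory) is a standard Linux command that shows the current directory. A quick sketch:

```shell
# Move to the home directory, then print the current location
cd ~
pwd
```

The path that pwd prints should be your home directory.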

Next, download the project repository using git:

git clone https://github.com/gwct/congen-bioinformatics.git

Command line parameter	Description
git	A cross-platform program for version control and syncing of software and data projects.
clone	The git sub-program to download an exact copy of a repository.
https://github.com/gwct/congen-bioinformatics.git	The URL of the project repository. This can be found on the webpage of the repository.

Git is very powerful software for sharing your projects and is commonly used to share code and data from scientific papers, but we won't talk about it much today other than using the clone command to download the project. You don't need a GitHub account to clone a repository, but you do need git installed on your computer to do so.
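Since cloning requires git, it can be worth checking that it is available before you start. This sketch uses command -v, a standard shell way to test whether a program is on your PATH, together with git --version. The git_found variable name is just for illustration.

```shell
# Check whether git is installed and, if so, report its version
if command -v git >/dev/null 2>&1; then
    git_found=yes
    git --version
else
    git_found=no
    echo "git was not found on your PATH"
fi
```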

Tip - More info about git

If you're interested in learning more about git there are a ton of guides and docs out there for you to search for. To get started, we've put together a couple of how-tos for understanding git basics here:

git how-tos

After the clone command completes, you should have a folder in your home directory called congen-bioinformatics (git names the folder after the repository in the URL). Make sure it's there with ls:

ls

Command line parameter	Description
ls	The Linux list directory contents command. With no other options given, this lists the contents of the current directory.

Next, change into that directory:

cd congen-bioinformatics

Command line parameter	Description
cd	The Linux change directory command
congen-bioinformatics	The path to the directory you want to change to.
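The same ls-then-cd pattern works for any folder. Here is a self-contained sketch that practices it in a temporary scratch directory and then returns to where it started. The folder name example-project is made up for illustration and is not part of the workshop repository.

```shell
# Remember the starting directory so we can come back to it
start_dir=$(pwd)

# Make a scratch directory containing one made-up project folder
scratch_dir=$(mktemp -d)
mkdir "$scratch_dir/example-project"

# List the contents of the scratch directory to confirm the folder is there
cd "$scratch_dir"
ls

# Change into the folder and confirm the new location
cd example-project
pwd

# Return to the directory we started in
cd "$start_dir"
```

Because the last line returns to the starting directory, trying this out should not change where you are in the walkthrough.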

Using this project template data, we'll be performing the following tasks today:

Talking about project organization, common commands, and text editors and work setups.
Introducing common bioinformatics file formats.
Using the command line to do a basic analysis of structural variation in a sample of 32 Rhesus macaques and of SNPs in 35 gray wolves.
Time permitting, briefly touching on some next steps in developing more advanced bioinformatics skills.

Now, let's move on to Project Organization

This is the 2021 version of this workshop! For the most up to date version click here. To view the archive of all previous versions click here.