PhyloAcc OEB275R - Fall 2022

Getting started

Hello! Today we'll be going through some hands-on activities to help you get familiar with how PhyloAcc is run and how it can be used to identify genomic elements that have experienced accelerated evolution.

This course has two parts: in the first we'll work on the server and run commands, and in the second we'll download some pre-run data to analyze with R.

Most of our work in the first part of the course will be done as bash commands typed in the Terminal. Throughout this walkthrough, commands will be presented as follows:

this is an example command

Following each command will be a table that goes through and explains each part of the command explicitly:

Command line parameter	Description
this	An example command
is	An example option used in the example command
an	An example option used in the example command
example	An example option used in the example command
command	An example option used in the example command

The goal of these tables is to break down some of the 'black box' feeling that command line tools can have. Hopefully this is helpful. If not, feel free to skip over the tables when you see them!

Tip - Help menus

A general convention among command-line software is to provide a help menu that lists a program's common options. These can generally be viewed from the command line with the -h option as follows:

<program> -h

-or-

<program> <sub-program> -h

For Linux commands, documentation is generally available with the man command (man is short for manual):

man <command>

man opens a text viewer that can be navigated with the arrow keys and exited simply by typing q. If you're ever stuck or want to know more about a program's options, try these!
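
As a small illustration of the tip above, the sketch below captures the first few lines of a help menu so it doesn't flood the terminal. mkdir is only a stand-in program here. Note also that many GNU tools spell the option --help rather than -h, so it is worth trying both.

```shell
# mkdir is a stand-in; swap in any program you are curious about.
prog=mkdir

# Capture both stdout and stderr (some programs print usage to stderr)
# and keep only the first 5 lines.
help_text=$("$prog" --help 2>&1 | head -n 5)

printf '%s\n' "$help_text"
```

If the output looks like an error rather than a help menu, the program probably uses a different help option, and man is the next thing to try.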

Here is some made up output.
Looking at your data is very important!
You can catch problems before you use the data in later analyses.

Connecting to the cluster

If you want to follow along by running the commands, the first thing to do, if you haven't already, is connect to Cannon, our cluster, so that you can run commands from a terminal. There are different ways to do this, but the easiest is to open Terminal (on Mac) or PowerShell (on Windows) and run the following command:

ssh [your user name]@login.rc.fas.harvard.edu

This should prompt you for your password and 2-factor authentication code, at which point you should see something like this:

Figure 1.1: Cannon right after logging in.

In addition to logging on to the server as above, we're also going to start an interactive session on one of the compute nodes so that we don't bog down any of the login nodes trying to run PhyloAcc:

salloc -p test --mem 12g -c 8 -t 0-02:00

Command line parameter	Description
salloc	The job scheduling command to allocate an interactive session.
-p test	The option to specify which partition we want our job to run on, in this case the test partition.
--mem 12g	The option to specify how much memory to allocate to our job, in this case 12 gigabytes.
-c 8	The option to specify how many CPU cores to allocate to our job, in this case 8.
-t 0-02:00	The option to specify how much time to allocate to our job, in this case 2 hours.
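
Once the interactive session starts, you can confirm what was granted. The sketch below assumes Slurm exports its usual SLURM_* environment variables into the session, which may depend on how the cluster is configured. show_allocation is a hypothetical helper name, not a Slurm command.

```shell
# Hypothetical helper: print what Slurm granted to the current session.
# Each value falls back to "not set" when run outside an allocation.
show_allocation() {
    echo "Job ID:    ${SLURM_JOB_ID:-not set}"
    echo "Partition: ${SLURM_JOB_PARTITION:-not set}"
    echo "CPUs:      ${SLURM_CPUS_PER_TASK:-not set}"
    # Slurm typically reports this value in megabytes.
    echo "Memory:    ${SLURM_MEM_PER_NODE:-not set}"
}

show_allocation
```

If every line reads "not set", you are most likely still on a login node rather than inside the allocation.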

Loading the PhyloAcc environment

Once logged in, we'll load the PhyloAcc package. I've pre-made a conda environment with PhyloAcc installed in it. To load it, first load Anaconda:

module load Anaconda3

Command line parameter	Description
module	The cluster's module system that contains pre-installed software.
load	The module sub-command telling it we want to load a package.
Anaconda3	The name of the package we want to load.

Next, load my pre-made environment:

source activate /n/holylfs05/LABS/informatics/Everyone/phyloacc-data/workshop-20221027/env/phyloacc-workshop

Command line parameter	Description
source	The shell command that runs a script in your current shell session.
activate	The conda script to run, which activates an environment.
/n/holylfs05/LABS/informatics/Everyone/phyloacc-data/workshop-20221027/env/phyloacc-workshop	The path to the environment we want to load.
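
A quick way to see whether the activation took effect is sketched below. It relies on conda setting the CONDA_PREFIX variable when an environment is activated. show_env is a hypothetical helper name used only for this illustration.

```shell
# Hypothetical helper: report which conda environment is active, if any.
# conda sets CONDA_PREFIX to the environment's path on activation.
show_env() {
    echo "Active environment: ${CONDA_PREFIX:-none}"
    # command -v prints the full path of a program if it is on your PATH.
    command -v phyloacc.py || echo "phyloacc.py not found on PATH"
}

show_env
```

With the workshop environment loaded, the first line should show the environment path from the command above.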

Then, let's make sure everything loaded correctly by running a check:

phyloacc.py --depcheck

Command line parameter	Description
phyloacc.py	The main interface for PhyloAcc.
--depcheck	An option that tells PhyloAcc to check dependency paths.

When you do this, you should hopefully see something like this, with both binaries reporting PASSED statuses:

# --depcheck set: CHECKING DEPENDENCY PATHS AND EXITING.

   PROGRAM          PATH                STATUS
   -------------------------------------------
   phyloacc         PhyloAcc-ST         PASSED
   phyloacc-gt      PhyloAcc-GT         PASSED

# All dependencies PASSED.

If you don't see this, or one or both of the checks failed, please let me know.
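
If you would rather not read the table by eye, the check can be scripted. This is only a sketch: it assumes the report is printed to standard output in the layout shown above, and count_passed is a hypothetical helper, not part of PhyloAcc.

```shell
# Hypothetical helper: count program rows that report PASSED.
# Lines starting with '#' are comments in the report and are skipped,
# so the final "# All dependencies PASSED." line is not counted.
count_passed() {
    grep -v '^#' | grep -c 'PASSED'
}

# Usage, assuming the PhyloAcc environment is active:
#   phyloacc.py --depcheck | count_passed
```

For the example output above, this would print 2, one for each binary.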

Creating a project directory

To keep things organized, let's make a new folder specifically for this workshop. First let's make sure you're in your home directory:

cd ~

Command line parameter	Description
cd	The Linux change directory command
~	The path to the directory you want to change to. In this case, ~ is a shortcut meaning "your home directory".

Next, create the workshop folder. The following command creates it in your home directory, but feel free to put it anywhere you like:

mkdir phyloacc-workshop

Command line parameter	Description
mkdir	The Linux create directory command
phyloacc-workshop	The name of the directory you want to create

Finally let's enter our new directory so any files we create will be put in it:

cd phyloacc-workshop

Command line parameter	Description
cd	The Linux change directory command
phyloacc-workshop	The path to the directory you want to change to.
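
The create-and-enter steps above can also be combined. The sketch below uses mkdir -p, which does not complain if the directory already exists, so it is safe to re-run. enter_dir is a hypothetical helper name chosen for this example.

```shell
# Hypothetical helper: create a directory if needed, then move into it.
# The && means cd only runs if mkdir succeeded.
enter_dir() {
    mkdir -p "$1" && cd "$1"
}

# Usage (commented out so nothing is created until you choose to run it):
#   enter_dir ~/phyloacc-workshop
```

After running it, pwd should print the path of the new directory.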

Now, let's move on to an intro to our data.