
What we've covered is a brief introduction to basic command line skills for getting started in data science and bioinformatics. Beyond practicing the things we've covered, such as navigating your file system, organizing your projects, interacting with text files, and learning command and scripting syntax, there are many other skills that can further enhance your data science workflows. These skills and tools all build on what we've introduced today and could each fill an entire workshop themselves, so for now we will only provide a brief overview of them, but we are happy to answer questions about them later!

Profiles

As you spend more time working on remote servers, you may find yourself repeating a lot of tasks at startup, like loading the programs you need for your work or setting up shortcuts. Luckily, there is a way to automate this with profiles.

Profiles are text files that live in your home directory. They have specific names that the shell recognizes and interprets as scripts, meaning they contain a list of commands. These scripts are run automatically when a user logs in, so any tasks you find yourself doing often at start-up (or otherwise) can go in these files so they are done automatically.

In bash (the common Unix shell program we are using), these files are .bashrc, .bash_profile, and .profile. Note that all of them have a period (.) at the start of their name, which means they are hidden files and won't show up in directory listings unless you explicitly ask for them. You can see what hidden files you have in your home directory by typing:

ls -a ~
Command line parameter    Description
ls                        The Linux list directory contents command
-a                        This option tells ls to list ALL files
~                         A preset shortcut to your home directory path

When I do this on our server, I see both a .bashrc and a .profile file. If these files don't exist on the server you work on, you can create them! If they do exist, don't be surprised to see commands already in them, as the system may put some there automatically. As you learn more about profiles, you can decide whether to keep those commands in yours or not.

Once you've located or created your profile, you can add any commands to it you want to be executed at login!
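For example, a few lines you might add to your .bashrc could look like this (a minimal sketch; the module name and paths below are just placeholders for whatever you actually use on your server):

# Load software you use every session (if your cluster uses a module system)
module load samtools
# A shortcut for a detailed, human-readable directory listing
alias ll="ls -lh"
# Make programs in your own ~/bin directory available everywhere
export PATH="$HOME/bin:$PATH"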


Terminal multiplexers

The phrase terminal multiplexer definitely sounds advanced and cool, but these programs are actually really simple and can help streamline your bioinformatics work.

A terminal multiplexer is simply a program that allows you to open another terminal within your currently running terminal, kind of like having multiple tabs open in a browser. What's even better, though, is that the second terminal will continue to run even if you disconnect. This is great because it allows you to start a command that may take a while and come back to it later without having to worry about losing your work. You could even start a command on your lab computer at work, detach the multiplexed terminal, disconnect, walk home, reconnect, and re-attach the terminal to check on your progress! In essence, it allows you to run commands in the background (similar to nohup &, which we're not even covering because multiplexers are so much better), but also allows you to easily resume and interact with the running process.

There are several terminal multiplexers available for Linux and Mac, but the two most-used ones are:

  1. screen
  2. tmux

I personally use screen, and haven't found the need to mess around with any others. And I've put together a super brief guide to screen here.
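As a quick sketch, a typical screen session might look like this (the session name is arbitrary):

screen -S align       # start a new session named "align" and kick off your long-running command
# ... press Ctrl-a then d to detach, and log out if you like ...
screen -ls            # later, list your running sessions
screen -r align       # re-attach to the "align" session to check on it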

Installing software with conda & conda environments

If you have any prior experience with data science in the command line you know that installing new software is often the most painful part of a project. The dreaded command not found and library path not found errors can stop you in your tracks. Some clusters have module systems that can make life a lot easier, but they often have only the most popular software, and you would have to wait for an administrator to install what you need.

Luckily, there is a program called Anaconda, along with its package manager, conda, that can help. Anaconda is a distribution of the Python programming language that has grown into a useful tool for data science more generally. Importantly, it allows users to install software in their local account with conda, no administrator needed.

conda also has the capability to run software in different environments. An environment is simply an isolated, self-contained set of installed software that behaves like a fresh installation. Since the software you want to use often depends on other software, conda handles the installation of these dependencies, and using separate environments for separate projects or programs can help keep those dependencies from conflicting with each other.

bioconda is a conda channel specifically tailored to bioinformatics software.
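As a minimal sketch of a typical conda workflow (the environment and package names here are just examples):

conda create -n my-project                           # create a new, empty environment for a project
conda activate my-project                            # switch into that environment
conda install -c conda-forge -c bioconda samtools    # install software (and its dependencies) into it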

I have another brief guide to installing and using conda here.

Job scheduling with high performance computing (HPC) clusters

As you scale up your workflows, you may find that the resources on the server you log in to are no longer sufficient to run the programs you want. This is where scheduling jobs on HPC clusters comes in handy.

Basically, when you log in to an HPC cluster, you log in to one particular node (computer) of that cluster. This node is designed to handle user interactions and other low-resource activities. On this node, or on another node that it talks to (the head node), job scheduling software is usually installed. This software coordinates the requests of all users who want to run jobs on the cluster and passes those jobs to other, high-resource compute nodes, where the jobs are actually run. There is also often another node, or set of nodes, dedicated to large-scale data storage.

Figure 6.1: A simple diagram of an HPC cluster (Source).

HPC clusters differ between institutions, so be sure to check how yours is set up; hopefully the administrators provide some helpful documentation to get started. But generally, you work out your workflow on some test data, put the commands you want to run into a script along with the resources needed to run them, and then submit this script through the job scheduler.

There are many job schedulers out there, but a popular one is SLURM. I have a brief guide to the SLURM job scheduler implemented on the University of Montana's cluster, Griz, here.
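As a rough sketch, a SLURM submission script might look something like the following (the partition name and the command itself are placeholders; check your cluster's documentation for the correct values):

#!/bin/bash
#SBATCH --job-name=align          # a name for the job
#SBATCH --cpus-per-task=8         # the number of CPU cores to request
#SBATCH --mem=16G                 # the amount of memory to request
#SBATCH --time=12:00:00           # the maximum run time (hh:mm:ss)
#SBATCH --partition=standard      # the partition (queue) to submit to; this is cluster-specific

# The commands to run go below, just like a normal shell script
my-alignment-command input.fastq > output.bam

You would then submit this script with something like sbatch my-job.sh and let the scheduler take it from there.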

Scaling up with GNU parallel

Again, as you increase your throughput you're going to start to wonder how you can run commands in parallel. A really versatile and powerful way to do this directly in the command line is with GNU parallel.

parallel can easily take a list of commands and a number of jobs and run that many jobs in parallel:

parallel -j 20 < list-of-cmds.txt
Command line parameter    Description
parallel                  The GNU parallel program that allows users to run commands in parallel across multiple cores
-j 20                     This tells parallel to try and run 20 commands (jobs) at once
<                         Similar to the redirect shortcut (>) that saves output from a command to a file, this symbol (<) tells the command to read input from a file
list-of-cmds.txt          A file with multiple commands in it to pass to parallel, with one command per line

This can really speed up your work. parallel may or may not be installed on your cluster, but is available as a conda package. I have a brief guide to parallel and how to use Python scripts to generate commands for it here.
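For example, one simple way to generate such a command file is with a shell loop (the gzip command here is just a stand-in for whatever you actually want to run on each file):

# Write one command per line to a file, then hand the file to parallel
for f in *.fastq; do
    echo "gzip $f"
done > list-of-cmds.txt

parallel -j 20 < list-of-cmds.txt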

Complex workflows with Snakemake

Snakemake is a Python-based scripting language specifically designed to help scale up bioinformatics workflows and make them reproducible. The basic idea behind snakemake is that we often run the same commands over many different but similar files (e.g., VCF files from different samples, alignments from different genes, etc.). A snakemake file then consists of rules that depend on each other based on the specified output files of previous rules. Rules can be run over a range of files (e.g., samples), allowing one to compactly represent and run an entire workflow. Snakemake also integrates well with HPC clusters to run these rules in parallel.
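To give a sense of how it is typically run (a sketch that assumes a Snakefile already exists in your working directory):

snakemake -n           # a "dry run": print the rules and jobs that would be executed, without running anything
snakemake --cores 8    # actually run the workflow, using up to 8 cores to run rules in parallel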

While I've found snakemake to be extremely powerful, it is also somewhat difficult to learn. Importantly, it is based on Python syntax, which allows for great integration with Python code but also means familiarity with Python is a must. I have also found the documentation for snakemake to be complex, though there may be some great tutorials out there that I haven't found yet.

All of this means that snakemake could take up a whole workshop, but I'm happy to answer any questions I can about it!

Searching for answers

I think I've saved the most important skill for last: how to search for answers on the internet. Many times you'll come across an error or want to figure out a solution to an analysis and not know where to start. Luckily, we have the vast collective knowledge of the internet to help us out!

Oftentimes simply searching for the error message you receive is enough to bring up some discussion about it, but that may not answer your question exactly. So I'd say that simply "knowing how to search" for something, or knowing what to type in to get the answers you want, is one of the greatest skills of a data scientist. In that vein, it really helps to know more about the underlying program, command, or data structure that the error refers to so you can get the answers you need.

That being said, there are some great resources out there besides search engines where people post their questions and others respond:

  1. Stack Overflow
  2. Stack Exchange for bioinformatics
  3. SeqAnswers
  4. BioStars

The other stuff

Of course, this workshop focuses on text-based data science from the command line. But there is another side to data science and bioinformatics that we haven't touched on. I recently read a take about this that I agree with: data science can be broken down into two parts: (1) the Engineering Model (or the text model), where most work is done with text files, the command line, and version control, and which is what we've covered here today; and (2) the Office Model, where work is done on more centralized files that are saved and passed between collaborators (think of a Word document for a manuscript).

I think this is a good distinction, but I would expand the Office Model to include data visualization as well, which can be done with things like Jupyter notebooks or R Markdown files.

All of this is to say that there is much more out there that we haven't covered.

Fin

This brings us to the end of our Introduction to Bioinformatics workshop. We'll be available to answer questions throughout the rest of the symposium. Thanks for attending!