What we've covered is a brief introduction to the basic command line skills needed to get started in data science and bioinformatics. Beyond practicing what we've covered (memorizing your file system, figuring out your project organization, interacting with text files, and learning command and scripting syntax), there are many other skills that can further enhance your data science workflows. These skills and tools all build on what we've introduced today and could fill entire workshops themselves, so for now we'll only provide a brief overview of them, but we're happy to answer questions about them later!
As you start connecting to remote servers to work more, you may find yourself doing a lot of tasks at startup, like loading the programs you need to do your work or setting up shortcuts. Luckily, there is a way to automate this with profiles.
Profiles are text files that exist in your home directory. They have specific names so that the operating system knows they should be interpreted as scripts, meaning they contain a list of commands. These scripts are run automatically when a user logs in, so any tasks you find yourself doing often at start-up (or otherwise) can go in these files to be done automatically.
In bash (the common Unix shell program we are using), these files are .bashrc, .bash_profile, and .profile. Note that all of them have a period (.) preceding their name, which means that they are hidden files and won't show up in the file system unless explicitly asked for. You can see what hidden files you have in your home directory by typing:
ls -a ~
| Command line parameter | Description |
|---|---|
| ls | The Linux list directory contents command |
| -a | This option tells ls to list ALL files |
| ~ | A preset shortcut to your home directory path |
When I do this on our server, I see both a .bashrc and a .profile file. If these files don't exist on the server you use for work, you can create them! If they do exist, don't be surprised to see commands already in them, as the OS may automatically add some. As you learn more about profiles, you can decide whether to keep these commands in yours or not.
Once you've located or created your profile, you can add any commands to it you want to be executed at login!
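As a sketch, here are a few lines you might add to a profile. The paths and shortcut names are hypothetical examples, not required settings:

```shell
# Hypothetical additions to ~/.bashrc; adjust names and paths to your setup.

# Add a personal bin directory to the front of the search path
export PATH="$HOME/bin:$PATH"

# Shortcuts for commands you type often
alias ll='ls -lah'
alias proj='cd ~/projects/current-analysis'   # hypothetical project path
```

The next time you log in, these commands run automatically, so `ll` and `proj` will be available right away.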
Read more about profiles:
The phrase terminal multiplexer definitely sounds advanced and cool, but multiplexers are actually really simple and can help streamline your bioinformatics work.
A terminal multiplexer is simply a program that allows you to open another terminal within your currently running terminal, kind of like having multiple tabs open in a browser. Even better, the second terminal will continue to run even if you disconnect. This is great because it allows you to start a command that may take a while and come back to it later without having to worry about losing your work. You could even start a command on your lab computer at work, detach the multiplexed terminal, disconnect, walk home, reconnect, and reattach the terminal to check on your progress! In essence, a multiplexer lets you run commands in the background (similar to nohup with &, which we're not even covering because multiplexers are so much better), while also letting you easily resume and interact with the running process.
There are several terminal multiplexers available for Linux and Mac, but the two most-used ones are screen and tmux.
I personally use screen, and haven't found the need to mess around with any others. I've put together a super brief guide to screen here.
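As a quick sketch of the screen workflow (assuming GNU screen is installed; the session name and the sleep command are just placeholders for a real long-running job):

```shell
# Start a named session running a long command, detached from your terminal
screen -dmS myjob sleep 300     # -dmS: start a detached session named "myjob"

# List running sessions; "myjob" should appear
screen -ls

# To reattach interactively: screen -r myjob
# While attached, press Ctrl-a then d to detach again without stopping the job

# Kill the session when you're finished with it
screen -S myjob -X quit
```

The key idea is that a detached session keeps running even after you log out of the server.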
If you have any prior experience with data science in the command line, you know that installing new software is often the most painful part of a project. The dreaded "command not found" and "library path not found" errors can stop you in your tracks.
Some clusters have module systems that can make life a lot easier, but they often have only the most popular software, and you would have to wait for an administrator to install what you need.
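If your cluster does have a module system, the basic commands look like this (the samtools module name is a hypothetical example; what's actually available depends on your administrators):

```shell
module avail            # list the software modules installed on the cluster
module load samtools    # hypothetical module name; adds it to your session
module list             # show which modules are currently loaded
module unload samtools  # remove it from your session again
```

Modules only change your current session, so anything you want loaded every time can go in your profile.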
Luckily, there is a program called Anaconda, along with its software manager, conda, that makes installing software much easier.
Anaconda is a distribution of the Python programming language that has grown into a useful tool for data science. Importantly, it allows users to install software on their local account with conda.
conda also has the capability to run software in different environments. An environment is essentially a self-contained installation area, isolated from the rest of the system, that behaves like a fresh install. Since the software you want to use often depends on other software, conda handles the installation of these dependencies, and using separate environments for separate projects or software can reduce dependency conflicts even further.
bioconda is a channel of conda specifically tailored to bioinformatics software.
I have another brief guide to installing and using conda here.
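As a minimal sketch of the conda workflow (the environment name and package are just examples):

```shell
# Create an environment for a project and install software from bioconda
conda create -n my-project -c bioconda -c conda-forge samtools -y

# Switch into the environment; its software is now on your PATH
conda activate my-project
samtools --version

# Switch back out when you're done
conda deactivate
```

Keeping one environment per project means that updating software for one analysis can't break another.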
As you scale up your workflows, you may find that the resources on the server you log in to are no longer sufficient to run the programs you want. This is where scheduling jobs on HPC clusters comes in handy.
Basically, when you log in to an HPC cluster, you log in to one particular node (computer) of that cluster. This node is designed to handle user interactions and other low-resource activities. On this node, or on another node that it talks to (the head node), job scheduling software is usually installed. This software coordinates the resource requests of all users who want to run jobs on the cluster and passes the jobs to other, high-resource compute nodes, on which the jobs are actually run. There is also often another node or set of nodes dedicated to big data storage.
HPC clusters differ between institutions, so be sure to check how yours is set up. Hopefully the administrators provide some helpful documents to get you started. Generally, though, you work out your workflow on some test data, put the commands you want to run into a script along with the resources needed to run them, and then submit this script through the job scheduler.
There are many job schedulers out there, but a popular one is SLURM.
I have a brief guide to the SLURM job scheduler as implemented on the University of Montana's cluster, Griz, here.
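As a sketch, a minimal SLURM submission script might look like this. The resource values are hypothetical, and the echo stands in for your actual workflow commands; check your cluster's documentation for the partitions and limits that apply to you:

```shell
#!/bin/bash
#SBATCH --job-name=my-analysis        # name shown in the queue
#SBATCH --cpus-per-task=4             # number of CPU cores to request
#SBATCH --mem=16G                     # amount of memory to request
#SBATCH --time=02:00:00               # wall-time limit (HH:MM:SS)
#SBATCH --output=my-analysis-%j.log   # log file (%j is replaced by the job ID)

# Replace this line with your actual workflow commands
echo "Running on $(hostname) with ${SLURM_CPUS_PER_TASK:-?} cores"
```

You would then submit the script with sbatch my-analysis.sh and check on it with squeue.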
GNU parallel
Again, as you increase your throughput you're going to start to wonder how you can run commands in parallel. A really versatile and powerful way to do this directly in the command line is with GNU parallel.
parallel can easily take a list of commands and a number of jobs and run that many jobs in parallel:
parallel -j 20 < list-of-cmds.txt
| Command line parameter | Description |
|---|---|
| parallel | The GNU parallel program that allows users to run commands in parallel across multiple cores |
| -j 20 | This tells parallel to try and run 20 commands (jobs) at once |
| < | Similar to the redirect shortcut (>) that saves output from a command to a file, this symbol (<) tells the command to read input from a file |
| list-of-cmds.txt | A file with multiple commands in it to pass to parallel, with one command per line |
This can really speed up your work. parallel may or may not be installed on your cluster, but it is available as a conda package. I have a brief guide to parallel and how to use Python scripts to generate commands for it here.
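As a sketch, one common pattern is to generate the commands file with a loop and then hand it to parallel. The sample names are hypothetical, and echo stands in for a real per-sample command:

```shell
# Write one command per line (echo is a placeholder for a real command)
for sample in sampleA sampleB sampleC sampleD; do
    echo "echo processing ${sample}"
done > list-of-cmds.txt

# Run up to 2 of those commands at a time
parallel -j 2 < list-of-cmds.txt
```

Because the commands live in a plain text file, you can inspect them before running anything, which is a nice safety check for big batches.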
Snakemake is a Python-based scripting language specifically designed to help scale up bioinformatics workflows and make them reproducible. The basic idea behind snakemake is that we often run the same commands over many different but similar files (e.g., VCF files from different samples, alignments from different genes, etc.). A snakemake file then consists of rules that depend on each other based on the specified output files of previous rules. Rules can be run over a range of files (e.g., samples), allowing one to compactly represent and run an entire workflow. Snakemake also integrates well with HPC clusters to run these rules in parallel.
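As a tiny sketch of what a snakemake workflow looks like (the sample names, file paths, and bcftools command are hypothetical, not a real pipeline):

```snakemake
# Hypothetical two-rule workflow: call variants from each sample's alignment
SAMPLES = ["sampleA", "sampleB"]

# The first rule lists the final files the workflow should produce
rule all:
    input:
        expand("calls/{sample}.vcf", sample=SAMPLES)

# This rule runs once per sample, matched via the {sample} wildcard
rule call_variants:
    input:
        "aligned/{sample}.bam"
    output:
        "calls/{sample}.vcf"
    shell:
        "bcftools mpileup -f genome.fa {input} | bcftools call -mv > {output}"
```

Running snakemake --cores 4 in the same directory would then work out which rules need to run to produce the files listed in rule all, and run them in parallel where possible.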
While I've found snakemake to be extremely powerful, it is also somewhat difficult to learn. Importantly, it is based on Python syntax, which allows for great integration with Python code but also means familiarity with Python is a must. I have also found the documentation for snakemake to be complex, though there may be some great tutorials out there that I haven't found yet.
All of this means that snakemake could take up a whole workshop of its own, but I'm happy to answer any questions I can about it!
I think I've saved the most important skill for last: how to search for answers on the internet. Many times you'll come across an error or want to figure out a solution to an analysis and not know where to start. Luckily, we have the vast collective knowledge of the internet to help us out!
Oftentimes simply searching for the error message you receive is enough to bring up some discussion about it, but that may not answer your question exactly. So I'd say that simply "knowing how to search" for something, or knowing what to type in to get the answers you want, is one of the greatest skills of a data scientist. In that vein, it really helps to know more about the underlying program/command/data structure that the error is about in order to get the answers you need.
That being said, there are some great resources out there besides search engines where people post their questions and others respond:
Of course, this workshop focuses on text-based data science from the command line. But there is another side to data science and bioinformatics that we haven't touched on. I recently read a take on this that I agree with: data science can be broken down into two parts: (1) the Engineering Model (or text model), where most work is done with text files, the command line, and version control, which is what we've covered here today; and (2) the Office Model, where work is done on more centralized files that can be saved and passed between collaborators (think a Word document for a manuscript).
I think this is a good distinction, but I would expand the Office Model to include data visualization as well, which can be done with things like Jupyter notebooks or R Markdown scripts.
All of this is to say that there is much more out there that we haven't covered.