ConGen2021 - Intro to Bioinformatics

One of the most important and under-taught aspects of data science is project organization. Part of the reason it is rarely discussed is that there are so many good ways to do it. What follows is an explanation of how we set up this project, with justifications for why we think this is a good setup. Hopefully, you can use this as a basis for working out a file organization system that works for you!

File system refresher

Any project organization will ultimately come back to understanding how file systems work. Whether you're on Linux, Mac, Windows, or something else, they generally have a lot in common:

Files are organized inside folders. Folders can contain other folders, which leads to a nesting, tree-like structure.
Files are referred to by paths, which list all the folders and nested sub-folders in which the file lives, ending with the file name itself. Folders and files in a path are separated by a slash (/) character (Windows traditionally uses a backslash, \, instead). In graphical file explorers, these paths are often hidden because they are not needed to access files, but in text-based interactions with the file system, it is important to know your paths.
Command lines often recognize both absolute and relative paths. An absolute path is one that specifies all folders starting from the root of the file system, while a relative path is one that specifies all folders starting from the current working directory.
For example, if I am in folder A, which contains folder B, and want to list the contents of folder B, I could type:
```
ls /root/users/username/A/B/
```
OR
```
ls B
```
These are equivalent commands: the first provides the absolute path, and the second provides only the relative path. Since I'm already in folder A, the path to folder B is just B.
Usually, it isn't necessary to know the whole file system down to the root -- paths relative to one's Home or User (or data/scratch) folder are sufficient. In terminals of Unix-based systems (Mac and Linux) and modern Windows versions, the path to a user's home directory is abbreviated as the tilde character (~).
Other paths also have shortcuts: (.) means "the current directory" and (..) means "the directory directly above the current directory."
So, given the example earlier, if I'm now in folder B and want to list the contents of folder A, I can just type:
```
ls ..
```
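To try the path shortcuts above without touching real data, here is a minimal sketch that builds a throwaway A/B folder pair in a temporary location. The variable names and the file example.txt are made up for illustration, and the sketch assumes the common mktemp utility is available.

```shell
# Build a scratch A/B folder pair so these commands are safe to run anywhere.
# demo_root and example.txt are hypothetical names used only for this sketch.
demo_root="$(mktemp -d)"
mkdir -p "$demo_root/A/B"
touch "$demo_root/A/B/example.txt"

# Move into folder A, then list folder B both ways.
cd "$demo_root/A"
abs_listing="$(ls "$demo_root/A/B")"   # absolute path, starting from the root
rel_listing="$(ls B)"                  # relative path, starting from folder A
echo "absolute: $abs_listing"
echo "relative: $rel_listing"

# Move into folder B and resolve the . and .. shortcuts to full paths.
cd B
here="$(cd . && pwd)"      # . is the current directory (B)
parent="$(cd .. && pwd)"   # .. is the directory directly above (A)
echo "here:   $here"
echo "parent: $parent"

# The tilde expands to the path of your home directory.
echo ~
```

Both listings should show the same single file, since the two paths point at the same folder.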

Organizing a project

So what does all of that have to do with starting a bioinformatics project? Well, the first thing I usually do to get started is to create a project folder and populate it with the other folders I think I'll need: data, scripts, results, and a catch-all etc folder. Then, knowing how the underlying file system works, I can easily add files, install software, and run commands on relative paths within the project folder.
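As a rough sketch of that first step, the commands below create an empty project skeleton. The project name my-project is hypothetical, and the skeleton is built inside a temporary directory (via mktemp) so the example does not clutter your real files; in practice you would run the mkdir line wherever your projects live.

```shell
# Work in a scratch location; workdir and my-project are hypothetical names.
workdir="$(mktemp -d)"
cd "$workdir"

# Create the project folder and its starting sub-folders in one go.
# -p creates missing parent folders and does not complain if a folder exists.
project="my-project"
mkdir -p "$project/data" "$project/scripts" "$project/results" "$project/etc"
touch "$project/README.md"

# From inside the project, everything is reachable by a short relative path.
cd "$project"
ls -1
ls data
```

Because every path is relative to the project folder, the whole project can be moved or copied to another machine without breaking the commands that refer to it.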

Let's see how the current project folder's contents look using the tree command:

```
tree
```

Each command is followed by a table that explains each part of the command explicitly:

| Command line parameter | Description |
| --- | --- |
| `tree` | A command that lists the contents of the current directory and all sub-directories and displays them in a tree-like format. |

This should show the full file tree descending from our current working directory:

```
.
├── data
│   ├── macaque-svs
│   │   ├── annotation-files
│   │   │   ├── Macaca_mulatta.Mmul_8.0.1.97.chromes.gtf.gz
│   │   │   └── macaque-genes.bed
│   │   └── macaque-svs-filtered.bed
│   └── wolf-snps
│       ├── Filtered_NAwolf_n35_variableSites_GenicRegions.recode.vcf
│       ├── pop_borealForest.txt
│       ├── pop_coastal.txt
│       └── pop_highArctic.txt
├── etc
├── README.md
├── results
└── scripts

7 directories, 8 files
```
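The tree command is not installed on every system. As a rough substitute, and assuming the find and sort utilities found on most Linux and Mac systems, a depth-limited find gives a flat listing of the same information. The demo folder below is a made-up, cut-down copy of the project layout, created in a temporary directory.

```shell
# Build a small stand-in for the project folder (hypothetical scratch copy).
demo="$(mktemp -d)"
mkdir -p "$demo/data/wolf-snps" "$demo/results" "$demo/scripts"
touch "$demo/README.md"
cd "$demo"

# List everything up to two levels below the current directory.
# -mindepth 1 skips the "." entry itself; sort groups each folder's contents.
listing="$(find . -mindepth 1 -maxdepth 2 | sort)"
echo "$listing"

# Count only the directories, similar to the summary line that tree prints.
dir_count="$(find . -mindepth 1 -type d | wc -l | tr -d ' ')"
echo "directories: $dir_count"
```

The output is a list of relative paths rather than a drawn tree, but it shows the same nesting.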

As mentioned above, I like to start each project with a few basic folders:

| Folder | Purpose |
| --- | --- |
| `data` | Holds all initial and processed data files for the project. This can include raw data (if not too large), data for reference genomes, and sample information (all in descriptive sub-folders). This can also include sub-folders for each step of analysis. For example, I might create numbered sub-folders along the way for each step like `01-Read-trimming`, `02-Read-mapping`, `03-Variant-calling`, etc. |
| `results` | After data has been processed, the end goal should be something easily analyzable, e.g. summary statistics in comma separated (.csv) or tab delimited (.tab, .tsv) format whenever possible. I like to keep these main results files here for easy access. |
| `scripts` | Any code that I write for a project goes here. |
| `etc` | Oftentimes project files accumulate that don't fit into one of these categories, such as an important figure from a paper I saved or notes I make. I usually lump all of those into the `etc` folder. |

Other folders might become important as a project progresses. For instance, I might add a folder called manuscript as I start writing, which itself contains sub-folders for figs, tables, and scripts.
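The numbered step folders and the manuscript folder described above could be added later with a couple of mkdir calls. This is only a sketch: the project location is a hypothetical scratch directory, and the folder names are the examples from the text.

```shell
# Hypothetical project location inside a scratch directory.
project="$(mktemp -d)/my-project"
mkdir -p "$project/data"
cd "$project"

# One numbered sub-folder per analysis step, inside data/.
mkdir -p data/01-Read-trimming data/02-Read-mapping data/03-Variant-calling

# A manuscript folder with its own sub-folders, added once writing starts.
mkdir -p manuscript/figs manuscript/tables manuscript/scripts

# ls sorts names, so the numeric prefixes keep the steps in pipeline order.
first_step="$(ls -1 data | head -n 1)"
ls -1 data
ls -1 manuscript
```

The numeric prefixes are what make the ordering work: without them, the step folders would be listed alphabetically rather than in the order the analysis runs.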

Next, let's talk a bit about processing textual data with commands and the more conceptual advantages of manipulating data in this fashion.

This is the 2021 version of this workshop.