This is the 2021 version of this workshop! For the most up to date version click here. To view the archive of all previous versions click here.

Definitions of frequently used terms

Sometimes the hardest part of learning a new topic is learning the terminology or jargon that those within the community commonly use. Here is a table of some terms that are common, but may be unfamiliar to someone new to the field of data science. Some of these are my attempt to define abstract terms. If you want any terms defined or added to the list, or you feel the definitions are inaccurate, please contact me.

Importantly, while some terms technically have different meanings, they are often used synonymously. I have tried to indicate these terms with a matching number of asterisks.

| Term | Definition |
| --- | --- |
| Terminal* | The window into which you type commands to be interpreted by the shell. |
| Console* | Similar to a terminal, but full screen with no graphical component. |
| Shell* | The program that interprets commands typed into a terminal. In your terminal you can type echo $SHELL to check which shell is loaded. |
| Bash | A common shell program. |
| Command line* | The location where commands are typed within the terminal window. |
| Command prompt* | The information displayed on the command line before the cursor. |
| Command | A set of instructions (code) that can be interpreted by the shell. |
| Program | A set of code designed for a specific task. Similar to command, but more general (i.e., a program is not limited to the shell's scripting syntax). |
| Library | Files containing general code blocks that can be used widely by different programs. |
| Dependency | A program or library that is required for another program to run. |
| Package | A program and all its dependencies. |
| Module | Similar to library. |
| Argument | Options specified in the command line when running a program or command. |
| Directory | A named location on a computer that contains files and/or other directories. |
| File | A named location on a computer that contains data, commonly in the form of plain text. |
| Script | A type of file whose contents are code or commands to be interpreted by the shell or another interpreter (e.g., Python). |
| Repository/Repo | From git, a directory of files, possibly including code, documentation, or data. |
| File system | The way in which files and directories are organized in a nesting, tree-like structure. |
| Root | The base level of the file system, under which all directories and files are stored. Critical system files are stored close to the root. Usually located at / and usually only accessible by the computer's administrators. |
| Home | A user's home directory is where that user has read, write, and execute permissions within the file system. |
| Path | The location of a file or directory within the file system, with directories separated by slash characters (/). |
| Absolute path | The full name of a file or directory that includes all directories and sub-directories starting from the root of the file system to the specified file or directory. |
| Relative path | The name of a file or directory that includes all directories and sub-directories starting from the user's current location. |
| Operating system | The software that interfaces between the computer's hardware and other user-facing software. |
| Server | A computer set up to have users connect and work on it remotely, usually with more resources than personal computers to accommodate more resource-intensive commands and multiple users. |
| Cluster | An interconnected collection of servers set up such that users can connect to one and specify high-resource commands to run, which are distributed to the others based on available resources. |
| Node | One computer within a cluster. |
| Login node | The node within the cluster which users connect to and interact with. |
| Head node | The node within the cluster which handles job scheduling and resource allocation. |
| Job | A submitted command or set of commands passed from the user to the job scheduler on a cluster. |

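
To make the distinction between root, absolute paths, and relative paths more concrete, here is a minimal Python sketch. The directory and file names (/home/user/project, data/reads.fastq) are hypothetical and chosen only for illustration; nothing is read from or written to disk.

```python
from pathlib import PurePosixPath

# Hypothetical locations, used only to illustrate the definitions above.
current_dir = PurePosixPath("/home/user/project")   # the user's current location
relative = PurePosixPath("data/reads.fastq")        # a path relative to that location

# Joining the current location and a relative path gives an absolute path.
absolute = current_dir / relative

print(relative.is_absolute())             # False: it does not start at the root
print(absolute.is_absolute())             # True: it starts at the root (/)
print(absolute)                           # /home/user/project/data/reads.fastq
print(absolute.parts[0])                  # / (the root of the file system)
print(absolute.relative_to(current_dir))  # data/reads.fastq
```

The same relative path can point to different files depending on the current location, while an absolute path always points to the same place.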
Common bioinformatics file formats
| Format | Use | Link | Specs |
| --- | --- | --- | --- |
| .txt | General plain text. May be formatted in some way that is unspecified by this extension (i.e., a .txt file could contain comma separated values). | Wikipedia | NA |
| .csv | General data storage with rows and columns. Columns are separated by commas. | Wikipedia | Link |
| .tab/.tsv | General data storage with rows and columns. Columns are separated by tabs. | Wikipedia | NA |
| FASTA | Stores sequence data. | Wikipedia | NA |
| FASTQ | Stores sequence data and quality scores. | Wikipedia | Link |
| SAM | Sequence Alignment Map format. Stores information about reads mapped to a reference genome. | Wikipedia | Link |
| BAM | Binary Alignment Map format. The compressed binary version of SAM format. | Wikipedia | Link |
| CRAM | Another compressed format to store read mapping information. | Wikipedia | Link |
| VCF | Variant Call Format. Used to store information about variants inferred for a given sample(s). .vcf files are a specific type of tab delimited format. | Wikipedia | Link |
| BCF | Binary variant Call Format. The binary compressed version of a VCF. | Wikipedia | Link |
| BED | Stores coordinates of regions of interest. .bed files are a specific type of tab delimited format. | Wikipedia | Link |
| GFF | Stores annotation information from a genome. .gff files are a specific type of tab delimited format. | Wikipedia | Link |
| GTF | Stores annotation information from a genome. GTF is an earlier version of GFF, but still commonly used, notably by the Ensembl database. .gtf files are a specific type of tab delimited format. | Wikipedia | Link |
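
Since several of the formats above are described as specific types of tab delimited files, a short Python sketch may help show what that means in practice. It reads a few BED-style lines and computes the length of each region. The example data, the read_bed helper, and the region names are made up for illustration; the sketch assumes the usual BED layout of chromosome, start, and end in the first three columns, with a 0-based start and an end that is not included in the region.

```python
import csv
import io

# Hypothetical BED content: chromosome, start, end, and an optional name column.
bed_text = "chr1\t100\t200\tregionA\nchr2\t50\t75\tregionB\n"

def read_bed(handle):
    """Parse tab delimited BED lines into (chrom, start, end, name) tuples."""
    regions = []
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines, comments, and optional header lines.
        if not row or row[0].startswith(("#", "track", "browser")):
            continue
        chrom, start, end = row[0], int(row[1]), int(row[2])
        name = row[3] if len(row) > 3 else None
        regions.append((chrom, start, end, name))
    return regions

# io.StringIO stands in for an open file so the example is self-contained.
regions = read_bed(io.StringIO(bed_text))

# With a 0-based start and an excluded end, the length is simply end - start.
lengths = [end - start for _, start, end, _ in regions]
print(regions)
print(lengths)
```

For real analyses, dedicated tools and libraries handle the many optional columns and edge cases of these formats; this sketch only shows that the underlying files are plain rows of tab separated columns.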