Definitions of frequently used terms in bioinformatics

Sometimes the hardest part of learning a new topic is learning the terminology or jargon that those within the community commonly use. Here is a table of some terms that are common, but may be unfamiliar to someone new to the field of data science. Some of these are my attempt to define abstract terms. If you want any terms defined or added to the list, or you feel the definitions are inaccurate, please contact me.

Importantly, while some terms technically have different meanings, they are often used synonymously. I have tried to indicate these terms with a matching number of asterisks.

| Term | Definition |
| --- | --- |
| Terminal* | The window in which you type commands to be interpreted by the shell. |
| Console* | Similar to terminal, but full screen with no graphical component. |
| Shell* | The program that interprets commands typed into a terminal. In your terminal you can type `echo $SHELL` to check which shell is loaded. |
| Bash | A common shell program. |
| Command line* | The location where commands are typed within the terminal window. |
| Command prompt* | The information displayed on the command line before the cursor. |
| Command | A set of instructions (code) that can be interpreted by the shell. |
| Program | A set of code designed for a specific task. Similar to command, but more general (i.e., a program is not limited to the shell's scripting syntax). |
| Library | Files containing general code blocks that can be used widely by different programs. |
| Dependency | A program or library that is required for another program to run. |
| Package | A program and all its dependencies. |
| Module | Similar to library. |
| Argument | Options specified in the command line when running a program or command. |
| Directory | A named location on a computer that contains files and/or other directories. |
| File | A named location on a computer that contains data, commonly in the form of plain text. |
| Script | A type of file whose contents are code or commands to be interpreted by the shell or another interpreter (e.g., Python). |
| Repository/Repo | From Git, a directory of files, possibly including code, documentation, or data. |
| File system | The way in which files and directories are organized in a nested, tree-like structure. |
| Root | The top-level directory of the file system, which contains all other directories and files. Critical system files are stored close to the root. Usually located at / and usually only writable by the computer's administrators. |
| Home | A user's home directory is where that user has read, write, and execute permissions within the file system. |
| Path | The location of a file or directory within the file system, with directories separated by slash characters (/). |
| Absolute path | The full name of a file or directory that includes all directories and sub-directories starting from the root of the file system to the specified file or directory. |
| Relative path | The name of a file or directory that includes all directories and sub-directories starting from the user's current location. |
| Operating system | The software that interfaces between the computer's hardware and other user-facing software. |
| Server | A computer set up to have users connect and work on it remotely, usually with more resources than personal computers to accommodate more resource-intensive commands and multiple users. |
| Cluster | An interconnected collection of servers set up such that users can connect to one and specify high-resource commands to run, which are distributed to the others based on available resources. |
| Node | One computer within a cluster. |
| Login node | The node within the cluster which users connect to and interact with. |
| Head node | The node within the cluster which handles job scheduling and resource allocation. |
| Job | A submitted command or set of commands passed from the user to the job scheduler on a cluster. |
| Job Scheduler | Software that runs on a cluster that monitors and configures resource usage. Users directly interact with the job scheduler to submit jobs to be run, and the scheduler decides when and on which node they will be run. On our Cannon cluster, our job scheduling software is SLURM. |

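
The difference between an absolute and a relative path can be illustrated with a minimal Python sketch. The function name `to_absolute` and the example directories are hypothetical and chosen only for illustration; the sketch assumes Unix-style paths separated by slashes.

```python
from pathlib import PurePosixPath


def to_absolute(path: str, cwd: str) -> str:
    """Turn a path into an absolute path, given the current location.

    `cwd` is the user's current directory (hypothetical here). An absolute
    path already starts at the root (/), so it is returned unchanged; a
    relative path is joined onto the current directory.
    """
    p = PurePosixPath(path)
    if p.is_absolute():
        return str(p)
    return str(PurePosixPath(cwd) / p)


# Hypothetical current location and file names, for illustration only.
current_dir = "/home/user/project"
print(to_absolute("data/genome.fa", current_dir))
print(to_absolute("/etc/hosts", current_dir))
```

The same relative path points to a different file when the current location changes, which is why scripts that are run from different directories often use absolute paths.
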
Definitions of frequently used terms in phylogenetics
| Term | Definition |
| --- | --- |
| Comparative genomics | The field in which DNA sequences from different species are compared to each other, often in the context of their phylogeny, to identify variants in the sequence that may play a role in adaptation. |
| Homologs | Sequences that share ancestry. |
| Orthologs | Homologs that descend from a speciation event. |
| Paralogs | Homologs that descend from a duplication event. |
| Alignment | When DNA or amino acid sequences are arrayed in a matrix such that rows represent different sequences and columns represent individual sites, with adjustment for insertions and deletions by gap insertion. In comparative genomics, alignments are usually performed among sequences that share ancestry (are homologs). |
| Whole genome alignment | An alignment of complete genome sequences to one another. |
| Locus | A general name for any discrete region of the genome. Plural: loci. |
| Conserved non-coding elements | Regions of the genome that do not code for proteins but are nevertheless conserved relative to the rest of the genome. Examples include regulatory regions or non-coding RNAs. |
| Phylogeny | A branching representation of the evolution of species, genomes, or individual loci. |
| Newick format | A way to represent phylogenies in text format using nested parentheses, e.g. `((A,B),C);` |
| Node | The joining point of two branches in a phylogeny, representing the ancestor of the two descending branches. |
| Branch length | A number that represents the length of a branch in a phylogeny, commonly in units of either relative number of substitutions that occurred on that branch or absolute time. In Newick format, branch lengths are indicated by a colon and a number, e.g. `((A:1,B:2):3,C:2);` |
| Substitution rate | The rate at which mutations fix in a population, becoming the dominant allele. |
| 4-fold degenerate sites | Sites within the protein coding region of a gene that result in the same amino acid translation regardless of the nucleotide present. Often used to estimate the neutral rate of evolution, as these sites are assumed to be unconstrained by selection. |
| Species tree | A phylogeny inferred for a set of species by combining information across the genome in order to represent the history of speciation. Common methods for species tree inference are maximum likelihood on concatenated gene alignments or summaries of gene trees. |
| Gene tree | A phylogeny inferred for a set of species from the alignment of a single protein coding gene. Gene trees can capture speciation history as well as gene duplication events. Gene tree is also an umbrella term for phylogenies inferred from the alignment of any short genomic region (e.g. conserved non-coding elements). |
| Phylogenetic discordance | Processes like ancestral polymorphism and introgression, combined with recombination over generations, lead to different regions of the genome having different evolutionary histories. As a result, a phylogeny inferred from one region of the genome may differ from one inferred from another region (i.e. gene trees may differ from one another). The term also refers to an individual gene tree differing from the species tree. |
| Concordance factors | A method to assess the underlying concordance of a phylogeny by measuring how many gene trees or alignment sites agree with the inferred topology. |

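
To make the Newick and branch length definitions concrete, here is a minimal sketch of a parser for simple Newick strings. It is an illustration only: the function names `parse_newick` and `tips` are hypothetical, and the sketch assumes unquoted labels and no comments, so real files are better handled by a dedicated phylogenetics library.

```python
def parse_newick(text):
    """Parse a simple Newick string into nested dictionaries.

    Each node is a dict with a name (may be empty), a branch length
    (None if absent), and a list of child nodes.
    """
    s = text.strip()
    if not s.endswith(";"):
        raise ValueError("a Newick string must end with ';'")
    s = s[:-1]
    pos = 0

    def parse_node():
        nonlocal pos
        children = []
        if pos < len(s) and s[pos] == "(":
            pos += 1  # consume '(' that opens a group of descendants
            while True:
                children.append(parse_node())
                if pos >= len(s):
                    raise ValueError("unbalanced parentheses")
                if s[pos] == ",":
                    pos += 1  # another sibling follows
                elif s[pos] == ")":
                    pos += 1  # group is complete
                    break
                else:
                    raise ValueError(f"unexpected character {s[pos]!r}")
        # Optional label, then an optional ':' followed by a branch length.
        start = pos
        while pos < len(s) and s[pos] not in "(),:":
            pos += 1
        name = s[start:pos].strip()
        length = None
        if pos < len(s) and s[pos] == ":":
            pos += 1
            start = pos
            while pos < len(s) and s[pos] not in "(),":
                pos += 1
            length = float(s[start:pos])
        return {"name": name, "length": length, "children": children}

    root = parse_node()
    if pos != len(s):
        raise ValueError("trailing characters after the tree")
    return root


def tips(node):
    """Return the names of the tips (leaves) below a node, left to right."""
    if not node["children"]:
        return [node["name"]]
    return [name for child in node["children"] for name in tips(child)]


tree = parse_newick("((A:1,B:2):3,C:2);")
print(tips(tree))                      # tip names in order
print(tree["children"][0]["length"])   # length of the branch leading to (A,B)
```

Each pair of parentheses corresponds to one internal node, and the number after a colon is the length of the branch leading to the node just before it.
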
Definitions of frequently used terms in the context of PhyloAcc
| Term | Definition |
| --- | --- |
| .mod file | The output file from phyloFit that contains the transition rate matrix for the probability of each nucleotide changing to another, as well as the input phylogeny with branch lengths estimated from likely neutrally evolving sites (e.g. 4-fold degenerate sites). |
| Model | For each locus, the probability of each of three models (M0, M1, and M2) is calculated. |
| M0 | A model which constrains substitution rates to either background/neutral or conserved rates for all branches. |
| M1 (target model) | A model which allows a pre-specified set of target branches to have accelerated substitution rates. |
| M2 (full or free model) | A model which allows any branch to have any substitution rate (background, conserved, or accelerated). |
| Bayes factor | The ratio of the marginal probabilities of two models, used to assess which model better fits the data. |
| BF1 or logBF1 | The Bayes factor comparing M1 to M0: BF1 = P(M1) / P(M0), and logBF1 is the logarithm of this ratio. |
| BF2 or logBF2 | The Bayes factor comparing M1 to M2: BF2 = P(M1) / P(M2), and logBF2 is the logarithm of this ratio. |
| BF3 or logBF3 | The Bayes factor comparing M2 to M0: BF3 = P(M2) / P(M0), and logBF3 is the logarithm of this ratio. |
| Bayes factor cut-off | A number specified such that a model is considered supported relative to another if the Bayes factor is above it. |
| Conservation state | For any given locus, each branch in the input phylogeny is estimated to be in one of 3 states regarding its substitution rate: conserved, background (neutral), or accelerated. |
| Z matrix | The matrix that represents conservation states for each branch in the input phylogeny. Conservation states are coded as 0 = background, 1 = conserved, and 2 = accelerated. |
| Z score | The coded conservation state for any given branch in a locus (e.g. 0, 1, or 2). |

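
The relationship between the three Bayes factors can be sketched in a few lines of Python. This is an illustration under stated assumptions, not PhyloAcc's implementation: the function names and the numeric values are hypothetical, and the inputs are assumed to be the log marginal probabilities of M0, M1, and M2 for a single locus. Because the logarithm of a ratio is a difference of logarithms, each log Bayes factor is a simple subtraction.

```python
def log_bayes_factors(log_m0, log_m1, log_m2):
    """Compute the three log Bayes factors from log marginal probabilities."""
    return {
        "logBF1": log_m1 - log_m0,  # M1 (target model) versus M0
        "logBF2": log_m1 - log_m2,  # M1 (target model) versus M2 (full model)
        "logBF3": log_m2 - log_m0,  # M2 (full model) versus M0
    }


def is_supported(log_bf, cutoff):
    """A model is considered supported if the Bayes factor is above the cut-off."""
    return log_bf > cutoff


# Hypothetical log marginal probabilities for one locus.
bfs = log_bayes_factors(log_m0=-120.0, log_m1=-105.0, log_m2=-110.0)
print(bfs)

# Hypothetical cut-off, applied here to logBF1 only.
print(is_supported(bfs["logBF1"], cutoff=10.0))
```

Note that logBF3 always equals logBF1 minus logBF2, so any two of the three values determine the third.
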
Common bioinformatics file formats
| Format | Use | Link | Specs |
| --- | --- | --- | --- |
| .txt | General plain text. May be formatted in some way that is unspecified by this extension (i.e., a .txt file could contain comma separated values). | Wikipedia | NA |
| .csv | General data storage with rows and columns. Columns are separated by commas. | Wikipedia | Link |
| .tab/.tsv | General data storage with rows and columns. Columns are separated by tabs. | Wikipedia | NA |
| FASTA | Stores sequence data. | Wikipedia | NA |
| FASTQ | Stores sequence data and quality scores. | Wikipedia | Link |
| SAM | Sequence Alignment Map format. Stores information about reads mapped to a reference genome. | Wikipedia | Link |
| BAM | Binary Alignment Map format. The compressed binary version of SAM format. | Wikipedia | Link |
| CRAM | Another compressed format to store read mapping information. | Wikipedia | Link |
| VCF | Variant Call Format. Used to store information about variants inferred for one or more samples. .vcf files are a specific type of tab delimited format. | Wikipedia | Link |
| BCF | Binary variant Call Format. The binary compressed version of VCF. | Wikipedia | Link |
| BED | Stores coordinates of regions of interest. .bed files are a specific type of tab delimited format. | Wikipedia | Link |
| GFF | Stores annotation information from a genome. .gff files are a specific type of tab delimited format. | Wikipedia | Link |
| GTF | Stores annotation information from a genome. GTF is an earlier version of GFF, but still commonly used, notably by the Ensembl database. .gtf files are a specific type of tab delimited format. | Wikipedia | Link |
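
As an example of how simple some of these formats are, the following minimal sketch reads FASTA records, in which each sequence is introduced by a header line starting with `>`. The function name `read_fasta` and the example sequences are hypothetical; the sketch assumes plain, uncompressed text and keeps only the first word of each header as the sequence name.

```python
def read_fasta(lines):
    """Collect FASTA records from an iterable of lines.

    Returns a dict mapping each sequence name to its full sequence.
    Sequences split over several lines are joined back together.
    """
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""
            records[name] = []
        elif name is None:
            raise ValueError("sequence data found before the first header line")
        else:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


# Hypothetical two-sequence FASTA content, for illustration only.
example = """>seq1 example description
ACGTAC
GGT
>seq2
TTTTAA
"""
for seq_name, seq in read_fasta(example.splitlines()).items():
    print(seq_name, len(seq))
```

For a real file, the same function can be given an open file handle instead of a list of lines, since a file handle also yields one line at a time.
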