PhyloAcc OEB275R - Fall 2022

Definitions of frequently used terms in bioinformatics

Sometimes the hardest part of learning a new topic is learning the terminology or jargon that those within the community commonly use. Here is a table of some terms that are common, but may be unfamiliar to someone new to the field of data science. Some of these are my attempt to define abstract terms. If you want any terms defined or added to the list, or you feel the definitions are inaccurate, please contact me.

Importantly, while some terms technically have different meanings, they are often used synonymously. I have tried to mark these terms with a matching number of asterisks.

Term	Definition
Terminal*	The window in which you type commands to be interpreted by the shell.
Console*	Similar to terminal, but full screen with no graphical component.
Shell*	The program that interprets commands typed into a terminal. In your terminal you can type `echo $SHELL` to check which shell is loaded.
Bash	A common shell program.
Command line*	The location where commands are typed within the terminal window
Command prompt*	The information displayed on the command line before the cursor
Command	A set of instructions (code) that can be interpreted by the shell
Program	A set of code designed for a specific task. Similar to command, but more general (i.e., a program is not limited to the shell's scripting syntax).
Library	Files containing general code blocks that can be used widely by different programs
Dependency	A program or library that is required for another program to run.
Package	A program and all its dependencies.
Module	Similar to library.
Argument	Options specified in the command line when running a program or command.
Directory	A named location on a computer that contains files and/or other directories.
File	A named location on a computer that contains data, commonly in the form of plain text.
Script	A type of file whose contents are code or commands to be interpreted by the shell or another interpreter (e.g., Python).
Repository/Repo	From git, a directory of files, possibly including code, documentation, or data.
File system	The way in which files and directories are organized in a nesting, tree-like structure.
Root	The lowest level in the file system in which all directories and files are stored. Critical system files are stored close to the root. Usually located at `/` and usually only accessible by the computer's administrators.
Home	A user's home directory is where that user has read, write, and execute permissions within the file system.
Path	The location of a file or directory within the file system, with directories separated by slash characters (`/`).
Absolute path	The full name of a file or directory that includes all directories and sub-directories starting from the root of the file system to the specified file or directory.
Relative path	The name of a file or directory that includes all directories and sub-directories starting from the user's current location.
Operating system	The software that interfaces between the computer's hardware and other user facing software.
Server	A computer set up for users to connect to and work on remotely, usually with more resources than a personal computer to accommodate resource-intensive commands and multiple users.
Cluster	An interconnected collection of servers set up so that users can connect to one of them and submit resource-intensive commands, which are then distributed to the others based on available resources.
Node	One computer within a cluster.
Login node	The node within the cluster which users connect to and interact with.
Head node	The node within the cluster which handles job scheduling and resource allocation.
Job	A submitted command or set of commands passed from the user to the job scheduler on a cluster.
Job Scheduler	Software that runs on a cluster that monitors and configures resource usage. Users directly interact with the job scheduler to submit jobs to be run and the scheduler delegates when and on what node they will be run. On our Cannon cluster, our job scheduling software is SLURM.
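
The path terms in the table above (root, home, absolute path, relative path) can be illustrated with a minimal Python sketch. The directory and file names are hypothetical and chosen only for illustration; `PurePosixPath` is used so that no real files are needed.

```python
from pathlib import PurePosixPath

# Hypothetical locations, chosen only for illustration.
home = PurePosixPath("/home/user")
absolute = PurePosixPath("/home/user/project/data/seqs.fa")

# An absolute path starts at the root of the file system.
print(absolute.is_absolute())   # True
print(absolute.anchor)          # /

# The same file, named relative to the user's current location (here, home).
relative = absolute.relative_to(home)
print(relative)                 # project/data/seqs.fa
print(relative.is_absolute())   # False

# Joining the current location and the relative path recovers the absolute path.
print(home / relative == absolute)   # True
```

The relative path only identifies the file when combined with a starting location, which is why the same relative path can point to different files depending on where you are in the file system.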

Definitions of frequently used terms in phylogenetics

Term	Definition
Comparative genomics	The field in which DNA sequences from different species are compared to each other, often in the context of their phylogeny, to identify variants in the sequence that may play a role in adaptation.
Homolog	Sequences that share ancestry
Orthologs	Homologs that descend from a speciation event
Paralogs	Homologs that descend from a duplication event
Alignment	When DNA or amino acid sequences are arrayed in a matrix such that rows represent different sequences and columns represent individual sites, with adjustment for insertions and deletions by gap insertion. In comparative genomics, alignments are usually performed among sequences that share ancestry (are homologs).
Whole genome alignment	An alignment of the complete genome sequences of two or more species or individuals.
Locus	A general name for any discrete region of the genome. Plural: loci.
Conserved non-coding elements	Regions of the genome that do not code for proteins but are nevertheless conserved relative to the rest of the genome. Examples include regulatory regions or non-coding RNAs.
Phylogeny	A branching representation of the evolution of species, genomes, or individual loci
Newick format	A way to represent phylogenies in text format using nested parentheses, e.g. ((A,B),C);
Node	The joining point of two branches in a phylogeny, representing the ancestor of the two descending branches.
Branch length	A number that represents the length of a branch in a phylogeny, commonly in units of either relative number of substitutions that occurred on that branch or absolute time. In Newick format branch lengths are indicated by a colon and a number, e.g. ((A:1,B:2):3,C:2);
Substitution rate	The rate at which mutations fix in a population, becoming the dominant allele.
4-fold degenerate sites	Sites within the protein coding region of a gene that result in the same amino acid translation regardless of the nucleotide present. Often used to estimate the neutral rate of evolution as these sites are assumed to be unconstrained by selection.
Species tree	A phylogeny inferred for a set of species by combining information across the genome in order to represent the history of speciation. Common methods for species tree inference are maximum likelihood on concatenated gene alignments or summaries of gene trees.
Gene tree	A phylogeny inferred for a set of species from the alignment of a single protein coding gene. Gene trees can capture speciation history as well as gene duplication events. Gene tree is also an umbrella term to refer to phylogenies inferred from the alignment of ANY short genomic region (e.g. conserved non-coding elements).
Phylogenetic discordance	Processes like ancestral polymorphism and introgression, combined with recombination over generations, lead to different regions of the genome having different evolutionary histories. As a result, a phylogeny inferred from one region of the genome may differ from one inferred from another region (i.e. gene trees may differ from one another). The term also covers an individual gene tree differing from the species tree.
Concordance factors	A method to assess the underlying concordance of a phylogeny by measuring how many gene trees or alignment sites agree with the inferred topology.
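
To make the Newick and branch length entries above concrete, here is a minimal sketch of a Newick reader in Python. It is an illustration only: the function names `parse_newick` and `tip_distances` are hypothetical, and the parser handles just the simple syntax shown in the table (nested parentheses, optional labels, optional `:length`), not quoted labels or comments.

```python
def parse_newick(text):
    """Parse a simple Newick string into nested dicts with name, length, children."""
    text = text.strip().rstrip(";")
    pos = 0

    def parse_node():
        nonlocal pos
        children = []
        if pos < len(text) and text[pos] == "(":
            pos += 1  # step past "("
            while True:
                children.append(parse_node())
                if pos >= len(text):
                    raise ValueError("unbalanced parentheses")
                if text[pos] == ",":
                    pos += 1  # another sibling follows
                elif text[pos] == ")":
                    pos += 1  # this group of descendants is complete
                    break
                else:
                    raise ValueError(f"unexpected character at position {pos}")
        # Optional label, read up to the next delimiter.
        start = pos
        while pos < len(text) and text[pos] not in ",():":
            pos += 1
        name = text[start:pos]
        # Optional branch length, introduced by a colon.
        length = None
        if pos < len(text) and text[pos] == ":":
            pos += 1
            start = pos
            while pos < len(text) and text[pos] not in ",()":
                pos += 1
            length = float(text[start:pos])
        return {"name": name, "length": length, "children": children}

    return parse_node()


def tip_distances(node, dist=0.0):
    """Sum branch lengths from the root to each tip (missing lengths count as 0)."""
    dist += node["length"] or 0.0
    if not node["children"]:
        return {node["name"]: dist}
    out = {}
    for child in node["children"]:
        out.update(tip_distances(child, dist))
    return out


tree = parse_newick("((A:1,B:2):3,C:2);")
print(tip_distances(tree))  # {'A': 4.0, 'B': 5.0, 'C': 2.0}
```

Each pair of parentheses becomes one internal node, so the example tree has a root and one internal node (the ancestor of A and B). For real analyses an established phylogenetics library is a better choice than a hand-written parser.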

Definitions of frequently used terms in the context of PhyloAcc

Term	Definition
.mod file	The output file from phyloFit that contains the transition rate matrix for the probability of each nucleotide to change to another as well as the input phylogeny with branch lengths estimated from likely neutrally evolving sites (e.g. 4-fold degenerate sites).
Model	The probability of three models is calculated for each locus, M0, M1, and M2.
M0	A model which constrains substitution rates to either background/neutral or conserved rates for all branches.
M1 (target model)	A model which allows a pre-specified set of target branches to have accelerated substitution rates.
M2 (full or free model)	A model which allows any branch to have any substitution rate (background, conserved, or accelerated).
Bayes factor	The ratio of the marginal likelihoods of the data under two models, used to assess which model fits the data better.
BF1 or logBF1	The Bayes factor comparing M1 to M0. BF1 = P(M1) / P(M0); logBF1 = log(P(M1) / P(M0))
BF2 or logBF2	The Bayes factor comparing M1 to M2. BF2 = P(M1) / P(M2); logBF2 = log(P(M1) / P(M2))
BF3 or logBF3	The Bayes factor comparing M2 to M0. BF3 = P(M2) / P(M0); logBF3 = log(P(M2) / P(M0))
Bayes factor cut-off	A number specified such that a model is considered supported relative to another if the Bayes factor is above it.
Conservation state	For any given locus, each branch in the input phylogeny is estimated to be in one of 3 states regarding its substitution rate: conserved, background (neutral), or accelerated.
Z matrix	The matrix that represents conservation states for each branch in the input phylogeny. Conservation states are coded as 0 = background, 1 = conserved, and 2 = accelerated.
Z score	The coded conservation state for any given branch in a locus (e.g. 0, 1, or 2).
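
The three log Bayes factors defined above can be computed as differences of log marginal likelihoods. The sketch below is a hedged illustration in Python: the numbers are made up, the cut-offs are hypothetical, and none of the names come from PhyloAcc itself.

```python
# Hypothetical log marginal likelihoods for one locus under each model.
log_p = {"M0": -1250.0, "M1": -1225.0, "M2": -1228.0}

# A log of a ratio is a difference of logs.
log_bf1 = log_p["M1"] - log_p["M0"]  # M1 vs. M0
log_bf2 = log_p["M1"] - log_p["M2"]  # M1 vs. M2
log_bf3 = log_p["M2"] - log_p["M0"]  # M2 vs. M0


def supported(log_bf, cutoff):
    """A model is considered supported if its log Bayes factor exceeds the cut-off."""
    return log_bf > cutoff


# Hypothetical cut-offs, chosen only for illustration.
cutoff_1, cutoff_2 = 10.0, 1.0

print(log_bf1, log_bf2, log_bf3)      # 25.0 3.0 22.0
print(supported(log_bf1, cutoff_1))   # True: M1 is supported over M0
print(supported(log_bf2, cutoff_2))   # True: M1 is supported over M2
```

Because all three factors are built from the same three quantities, logBF3 always equals logBF1 minus logBF2, so any two of them determine the third.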

Common bioinformatics file formats

Format	Use	Link	Specs
.txt	General plain text. May be formatted in some way that is unspecified by this extension (e.g., a .txt file could contain comma separated values).	Wikipedia	NA
.csv	General data storage with rows and columns. Columns are separated by commas.	Wikipedia	Link
.tab/.tsv	General data storage with rows and columns. Columns are separated by tabs.	Wikipedia	NA
FASTA	Stores sequence data.	Wikipedia	NA
FASTQ	Stores sequence data and quality scores.	Wikipedia	Link
SAM	Sequence Alignment Map format. Stores information about reads mapped to a reference genome.	Wikipedia	Link
BAM	Binary Alignment Map format. The compressed binary version of SAM format.	Wikipedia	Link
CRAM	Another compressed format to store read mapping information.	Wikipedia	Link
VCF	Variant Call Format. Used to store information about variants inferred for a given sample(s). .vcf files are a specific type of tab delimited format.	Wikipedia	Link
BCF	Binary variant Call Format. The binary compressed version of a VCF.	Wikipedia	Link
BED	Stores coordinates of regions of interest. .bed files are a specific type of tab delimited format.	Wikipedia	Link
GFF	Stores annotation information from a genome. .gff files are a specific type of tab delimited format.	Wikipedia	Link
GTF	Stores annotation information from a genome. GTF is an earlier version of GFF, but still commonly used, notably by the Ensembl database. .gtf files are a specific type of tab delimited format.	Wikipedia	Link
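
Several of the formats above are described as specific types of tab delimited format. As a minimal sketch of what that means in practice, the Python example below reads a few BED-style lines with the standard `csv` module. The region names and coordinates are hypothetical, and only the first four BED columns (chromosome, start, end, name) are assumed to be present.

```python
import csv
import io

# Hypothetical BED content: chromosome, start, end, name (tab-separated).
bed_text = "chr1\t100\t250\telementA\nchr2\t0\t75\telementB\n"

regions = []
for row in csv.reader(io.StringIO(bed_text), delimiter="\t"):
    chrom, start, end, name = row[0], int(row[1]), int(row[2]), row[3]
    # BED starts are 0-based and ends are exclusive, so length = end - start.
    regions.append(
        {"chrom": chrom, "start": start, "end": end, "name": name, "length": end - start}
    )

for region in regions:
    print(region["name"], region["chrom"], region["length"])
# elementA chr1 150
# elementB chr2 75
```

The same pattern works for any plain tab delimited file. Formats such as VCF and GFF add header or comment lines beginning with `#`, which would need to be skipped before parsing the columns.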