Sometimes the hardest part of learning a new topic is learning the terminology or jargon that those within the community commonly use. Here is a table of some terms that are common, but may be unfamiliar to someone new to the field of data science. Some of these are my attempt to define abstract terms. If you want any terms defined or added to the list, or you feel the definitions are inaccurate, please contact me.
Importantly, while some terms technically have different meanings, they are often used synonymously. I have tried to indicate these terms with a matching number of asterisks.
Term | Definition |
---|---|
Terminal* | The window in which you type commands to be interpreted by the shell. |
Console* | Similar to terminal, but full screen with no graphical component. |
Shell* | The program that interprets commands typed into a terminal. In your terminal you can type echo $SHELL to check which shell is loaded. |
Bash | A common shell program. |
Command line* | The location where commands are typed within the terminal window. |
Command prompt* | The information displayed on the command line before the cursor |
Command | A set of instructions (code) that can be interpreted by the shell |
Program | A set of code designed for a specific task. Similar to command, but more general (i.e., a program is not limited to the shell's scripting syntax). |
Library | Files containing general code blocks that can be used widely by different programs |
Dependency | A program or library that is required for another program to run. |
Package | A program and all its dependencies. |
Module | Similar to library. |
Argument | Options specified in the command line when running a program or command. |
Directory | A named location on a computer that contains files and/or other directories. |
File | A named location on a computer that contains data, commonly in the form of plain text. |
Script | A type of file whose contents are code or commands to be interpreted by the shell or another interpreter (e.g., Python). |
Repository/Repo | From git, a directory of files, possibly including code, documentation, or data. |
File system | The way in which files and directories are organized in a nesting, tree-like structure. |
Root | The base (top-level directory) of the file system, in which all other directories and files are stored. Critical system files are stored close to the root. Usually located at / and usually writable only by the computer's administrators. |
Home | A user's personal directory within the file system, where that user has read, write, and execute permissions. Often abbreviated as ~ in the shell. |
Path | The location of a file or directory within the file system, with directories separated by slash characters (/ ). |
Absolute path | The full name of a file or directory that includes all directories and sub-directories starting from the root of the file system to the specified file or directory. |
Relative path | The name of a file or directory that includes all directories and sub-directories starting from the user's current location. |
Operating system | The software that interfaces between the computer's hardware and other user-facing software. |
Server | A computer set up to have users connect and work on it remotely, usually with more resources than personal computers to accommodate more resource-intensive commands and multiple users. |
Cluster | An interconnected collection of servers set up such that users can connect to one and specify high-resource commands to run, which are distributed to the others based on available resources. |
Node | One computer within a cluster. |
Login node | The node within the cluster which users connect to and interact with. |
Head node | The node within the cluster which handles job scheduling and resource allocation. |
Job | A submitted command or set of commands passed from the user to the job scheduler on a cluster. |
Job Scheduler | Software that runs on a cluster that monitors and configures resource usage. Users directly interact with the job scheduler to submit jobs to be run and the scheduler delegates when and on what node they will be run. On our Cannon cluster, our job scheduling software is SLURM. |
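The difference between absolute and relative paths defined in the table above can be illustrated with a short Python sketch. The function name `resolve_path` and the example directories are hypothetical and chosen only for illustration; the sketch uses the standard library's `posixpath` module so that it behaves the same on any operating system.

```python
import posixpath


def resolve_path(current_dir: str, path: str) -> str:
    """Return the absolute form of `path`, reading relative paths from `current_dir`."""
    # posixpath.join discards `current_dir` if `path` is already absolute (starts with "/").
    joined = posixpath.join(current_dir, path)
    # normpath collapses "." and ".." components without touching the disk.
    return posixpath.normpath(joined)


cwd = "/home/jdoe/project"  # hypothetical current location of the user

print(posixpath.isabs("data/reads.fastq"))     # False: a relative path
print(resolve_path(cwd, "data/reads.fastq"))   # /home/jdoe/project/data/reads.fastq
print(resolve_path(cwd, "../notes.txt"))       # /home/jdoe/notes.txt
print(resolve_path(cwd, "/etc/hosts"))         # /etc/hosts (already absolute)
```

The same relative path points to different files depending on the current location, whereas an absolute path always points to the same file.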
Term | Definition |
---|---|
Comparative genomics | The field in which DNA sequences from different species are compared to each other, often in the context of their phylogeny, to identify variants in the sequence that may play a role in adaptation. |
Homologs | Sequences that share ancestry. |
Orthologs | Homologs that descend from a speciation event |
Paralogs | Homologs that descend from a duplication event |
Alignment | When DNA or amino acid sequences are arrayed in a matrix such that rows represent different sequences and columns represent individual sites, with adjustment for insertions and deletions by gap insertion. In comparative genomics, alignments are usually performed among sequences that share ancestry (are homologs). |
Whole genome alignment | An alignment of the complete genome sequences of two or more species or individuals. |
Locus | A general name for any discrete region of the genome. Plural: loci. |
Conserved non-coding elements | Regions of the genome that do not code for proteins but are nevertheless conserved relative to the rest of the genome. Examples include regulatory regions or non-coding RNAs. |
Phylogeny | A branching representation of the evolution of species, genomes, or individual loci |
Newick format | A way to represent phylogenies in text format using nested parentheses, e.g. ((A,B),C); |
Node | The joining point of two branches in a phylogeny, representing the ancestor of the two descending branches. |
Branch length | A number that represents the length of a branch in a phylogeny, commonly in units of either relative number of substitutions that occurred on that branch or absolute time. In Newick format branch lengths are indicated by a colon and a number, e.g. ((A:1,B:2):3,C:2); |
Substitution rate | The rate at which new mutations become fixed in a population, replacing the ancestral allele. |
4-fold degenerate sites | Sites within the protein coding region of a gene that result in the same amino acid translation regardless of the nucleotide present. Often used to estimate the neutral rate of evolution as these sites are assumed to be unconstrained by selection. |
Species tree | A phylogeny inferred for a set of species by combining information across the genome in order to represent the history of speciation. Common methods for species tree inference are maximum likelihood on concatenated gene alignments or summaries of gene trees. |
Gene tree | A phylogeny inferred for a set of species from the alignment of a single protein coding gene. Gene trees can capture speciation history as well as gene duplication events. Gene tree is also an umbrella term to refer to phylogenies inferred from the alignment of ANY short genomic region (e.g. conserved non-coding elements). |
Phylogenetic discordance | Disagreement among phylogenies inferred from different regions of the genome. Processes like ancestral polymorphism and introgression, combined with recombination over generations, lead to different regions of the genome having different evolutionary histories. As a result, a phylogeny inferred from one region of the genome may differ from that of another region (i.e., gene trees may differ from one another). This also refers to an individual gene tree differing from the species tree. |
Concordance factors | A method to assess the underlying concordance of a phylogeny by measuring how many gene trees or alignment sites agree with the inferred topology. |
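To make the Newick format and branch length entries above more concrete, here is a minimal Python sketch that parses the example string `((A:1,B:2):3,C:2);` and sums branch lengths from the root to each tip. The function names `parse_newick` and `root_to_tip` are hypothetical, and the parser is deliberately simplified: it assumes unquoted labels and no bracketed comments, so it is not a replacement for a full phylogenetics library.

```python
def parse_newick(text):
    """Parse a simple Newick string into nested dicts with name, length, and children."""
    text = text.strip()
    if not text.endswith(";"):
        raise ValueError("Newick strings end with a semicolon")
    pos = 0

    def parse_node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1  # step past "("
            while True:
                children.append(parse_node())
                if text[pos] == ",":
                    pos += 1  # another sibling follows
                elif text[pos] == ")":
                    pos += 1  # this group of children is complete
                    break
                else:
                    raise ValueError(f"unexpected character at position {pos}")
        # Optional label (tip name or internal node name).
        start = pos
        while text[pos] not in ",():;":
            pos += 1
        name = text[start:pos]
        # Optional branch length, introduced by a colon.
        length = None
        if text[pos] == ":":
            pos += 1
            start = pos
            while text[pos] not in ",();":
                pos += 1
            length = float(text[start:pos])
        return {"name": name, "length": length, "children": children}

    root = parse_node()
    if text[pos] != ";":
        raise ValueError("unbalanced parentheses")
    return root


def root_to_tip(node, dist=0.0, out=None):
    """Return a dict mapping each tip name to its summed branch length from the root."""
    if out is None:
        out = {}
    here = dist + (node["length"] or 0.0)  # a missing length counts as 0
    if not node["children"]:
        out[node["name"]] = here
    for child in node["children"]:
        root_to_tip(child, here, out)
    return out


tree = parse_newick("((A:1,B:2):3,C:2);")
print(root_to_tip(tree))  # {'A': 4.0, 'B': 5.0, 'C': 2.0}
```

Each pair of parentheses corresponds to one internal node of the phylogeny, and the number after each colon is the length of the branch leading to that node or tip.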
Term | Definition |
---|---|
.mod file | The output file from phyloFit that contains the transition rate matrix for the probability of each nucleotide changing to another, as well as the input phylogeny with branch lengths estimated from likely neutrally evolving sites (e.g. 4-fold degenerate sites). |
Model | For each locus, the probability of three models is calculated: M0, M1, and M2. |
M0 | A model which constrains substitution rates to either background/neutral or conserved rates for all branches. |
M1 (target model) | A model which allows a pre-specified set of target branches to have accelerated substitution rates. |
M2 (full or free model) | A model which allows any branch to have any substitution rate (background, conserved, or accelerated). |
Bayes factor | The ratio of the marginal probabilities of two models to assess which model better fits the data. |
BF1 or logBF1 | The Bayes factor comparing M1 to M0. BF1 = P(M1) / P(M0), and logBF1 = log(P(M1) / P(M0)). |
BF2 or logBF2 | The Bayes factor comparing M1 to M2. BF2 = P(M1) / P(M2), and logBF2 = log(P(M1) / P(M2)). |
BF3 or logBF3 | The Bayes factor comparing M2 to M0. BF3 = P(M2) / P(M0), and logBF3 = log(P(M2) / P(M0)). |
Bayes factor cut-off | A number specified such that a model is considered supported relative to another if the Bayes factor is above it. |
Conservation state | For any given locus, each branch in the input phylogeny is estimated to be in one of 3 states regarding its substitution rate: conserved, background (neutral), or accelerated. |
Z matrix | The matrix that represents conservation states for each branch in the input phylogeny. Conservation states are coded as 0 = background, 1 = conserved, and 2 = accelerated. |
Z score | The coded conservation state for any given branch in a locus (e.g. 0, 1, or 2). |
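Because a Bayes factor is a ratio of model probabilities, a log Bayes factor is simply a difference of log probabilities. The sketch below shows that arithmetic for the three comparisons in the table above. The function names, the input values, and the cut-off values are all hypothetical and chosen only for illustration; this is not the code or the default settings of any particular program.

```python
def log_bayes_factors(log_p_m0, log_p_m1, log_p_m2):
    """Return the three log Bayes factors from the log marginal probability of each model."""
    return {
        "logBF1": log_p_m1 - log_p_m0,  # M1 (target) vs. M0
        "logBF2": log_p_m1 - log_p_m2,  # M1 (target) vs. M2 (full)
        "logBF3": log_p_m2 - log_p_m0,  # M2 (full) vs. M0
    }


def is_supported(log_bf, cutoff):
    """A model is considered supported if its log Bayes factor is above the cut-off."""
    return log_bf > cutoff


# Hypothetical log marginal probabilities for one locus.
bfs = log_bayes_factors(log_p_m0=-1050.0, log_p_m1=-1030.0, log_p_m2=-1032.5)
print(bfs)  # {'logBF1': 20.0, 'logBF2': 2.5, 'logBF3': 17.5}

# Hypothetical cut-offs: require M1 to beat both M0 and M2.
target_supported = is_supported(bfs["logBF1"], 10.0) and is_supported(bfs["logBF2"], 1.0)
print(target_supported)  # True
```

Working on the log scale avoids numerical underflow, since the probabilities themselves are usually far too small to represent directly.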
Format | Use | Link | Specs |
---|---|---|---|
.txt | General plain text. May be formatted in some way that is unspecified by this extension (e.g., a .txt file could contain comma separated values). | Wikipedia | NA |
.csv | General data storage with rows and columns. Columns are separated by commas. | Wikipedia | Link |
.tab/.tsv | General data storage with rows and columns. Columns are separated by tabs. | Wikipedia | NA |
FASTA | Stores sequence data. | Wikipedia | NA |
FASTQ | Stores sequence data and quality scores. | Wikipedia | Link |
SAM | Sequence Alignment Map format. Stores information about reads mapped to a reference genome. | Wikipedia | Link |
BAM | Binary Alignment Map format. The compressed binary version of SAM format. | Wikipedia | Link |
CRAM | Another compressed format to store read mapping information. | Wikipedia | Link |
VCF | Variant Call Format. Used to store information about variants inferred for one or more samples. .vcf files are a specific type of tab delimited format. | Wikipedia | Link |
BCF | Binary variant Call Format. The binary compressed version of a VCF. | Wikipedia | Link |
BED | Stores coordinates of regions of interest. .bed files are a specific type of tab delimited format. | Wikipedia | Link |
GFF | Stores annotation information from a genome. .gff files are a specific type of tab delimited format. | Wikipedia | Link |
GTF | Stores annotation information from a genome. GTF is an earlier version of GFF, but still commonly used, notably by the Ensembl database. .gtf files are a specific type of tab delimited format. | Wikipedia | Link |
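As an illustration of what "a specific type of tab delimited format" means in the table above, here is a minimal Python sketch that reads BED-style text. The function name `read_bed` and the example regions are hypothetical. The sketch only reads the first four columns (chromosome, start, end, and an optional name) and relies on the BED convention that start coordinates are 0-based and end coordinates are exclusive, so a region's length is end minus start.

```python
import io


def read_bed(handle):
    """Read BED-style lines into a list of (chrom, start, end, name) tuples."""
    regions = []
    for line in handle:
        line = line.rstrip("\n")
        # Skip blank lines, comments, and optional track/browser header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        name = fields[3] if len(fields) > 3 else None
        regions.append((chrom, start, end, name))
    return regions


# Hypothetical regions of interest, written as tab delimited text.
bed_text = "chr1\t100\t250\telement_1\nchr1\t900\t1000\telement_2\nchr2\t0\t50\n"

regions = read_bed(io.StringIO(bed_text))
for chrom, start, end, name in regions:
    print(chrom, start, end, name, end - start)
```

For a real file, the `io.StringIO` object would be replaced by an opened file handle, e.g. `open("regions.bed")`, where the filename is again only an example.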