README

This section describes the program and its usage. For background about the algorithm see the About section.

GRAMPA: Gene-tree Reconciliation Algorithm with MUL-trees for Polyploid Analysis

Installation

Clone or download the GRAMPA GitHub repo.

The only dependency is Python 3 or higher. You may want to add the GRAMPA folder to your $PATH variable for ease of use!

Usage

The first thing you should do when you try to run GRAMPA is make sure everything is working with some test files. You can do this easily by running GRAMPA with the --tests option:

python grampa.py --tests

If all tests pass, then you're good to go! Basic usage in a real case would be:

python grampa.py -s [species tree file] -g [gene trees file] -o [output directory]

This would perform a full search for the optimal (lowest scoring) MUL-tree on the input species tree.

Input

There are two main inputs for the program:

  1. A file or string containing a Newick formatted rooted species tree (-s). This can be a singly labeled tree or a MUL-tree.
  2. A file containing one or more Newick formatted rooted gene trees (one tree per line) (-g).

Important: the tip labels of the gene tree MUST be formatted such that they end with _[species label], where [species label] corresponds to a tip label in the species tree.
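
The labeling rule above can be checked before a run. The following is a minimal sketch, not part of GRAMPA: tip_labels and unmatched_tips are hypothetical helper names, and the regular expression assumes plain Newick strings without quoted labels.

```python
import re

def tip_labels(newick):
    """Return the tip labels of a Newick string (internal labels and branch lengths are ignored)."""
    # A tip label directly follows "(" or "," and runs until the next structural character.
    return re.findall(r"[(,]\s*([^(),:;\s]+)", newick)

def unmatched_tips(gene_tree, species):
    """Return gene-tree tips that do not end with _[species label] for any known species."""
    return [t for t in tip_labels(gene_tree)
            if not any(t.endswith("_" + s) for s in species)]

# Species labels taken from the example species tree used in this README.
species = set(tip_labels("(((a,(x,(y,z))),b),(c,d))"))

# Hypothetical gene tree: the last tip ends in _q, which is not a species label.
gene_tree = "((gene1_a,gene2_x),(gene3_y,gene4_q));"
print(unmatched_tips(gene_tree, species))  # ['gene4_q']
```

An empty list means every tip in that gene tree can be assigned to a species in the species tree.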

Output

All output files will be placed in the directory specified with -o.

GRAMPA creates four output files, a log file, and a filtered tree file (if necessary).

GRAMPA also creates a directory within the output directory called groups_dir. This just stores the gene tree groupings for each MUL-tree (in pickled format) so GRAMPA doesn't eat up a lot of RAM during reconciliations. This directory can be ignored or deleted.

  1. grampa-scores.txt

    This is the main output file and contains the total reconciliation score for each MUL-tree considered, sorted in ascending order.

    The first line of this file contains the headers, defined as follows for each subsequent row:

    Column    Description
    mul.tree  The ID of the MUL-tree
    h1.node   The H1 node in the species tree for the current MUL-tree
    h2.node   The H2 node in the species tree for the current MUL-tree
    score     The total parsimony score for the current MUL-tree
    mul.tree  The Newick formatted tree string for the MUL-tree, with hybrid clades indicated with *

    Please note that the input singly-labeled species tree always has the ID of 0.

  2. grampa-detailed.txt

    The secondary output file contains detailed output describing the reconciliation scores from each gene tree to the lowest scoring MUL-tree.

    The first line of this file contains the headers, defined as follows for each subsequent row:

    Column       Description
    mul.tree     The ID of the MUL-tree
    gene.tree    The ID of the gene tree being reconciled to the MUL-tree
    dups         The number of duplications on this gene tree given this MUL-tree
    losses       The number of losses on this gene tree given this MUL-tree
    total score  The sum of dups and losses for this gene tree and MUL-tree

    Note that for some gene-tree/MUL-tree combinations, the lowest score can be reached by multiple maps. In these cases, we report all possible scores.

  3. grampa-dup-counts.txt

    For the 6 lowest scoring MUL-trees, GRAMPA counts the number of duplications along each branch in the MUL-tree summed over all gene trees.

    The first line of this file contains the headers, defined as follows for each subsequent row:

    Column    Description
    mul.tree  The ID of the MUL-tree
    node      The node ID for the current MUL-tree
    dups      The total number of duplications over all gene trees along the branch above the node in the MUL-tree

  4. grampa-checknums.txt

    GRAMPA must calculate how many combinations of maps there are for each gene-tree/MUL-tree pair and filter out those that are over the group cap in any combo before any reconciliations can be done. This filtering ensures that all MUL-trees are reconciled to the same set of gene-trees. The number of groups for each gene-tree/MUL-tree is recorded in this file.

    The first line of this file contains the headers, defined as follows for each subsequent row:

    Column             Description
    mul.tree           The ID of the MUL-tree
    gene.tree          The ID of the gene tree to be reconciled to the MUL-tree
    groups             The number of distinct hybrid clades in the gene tree
    fixed              The number of hybrid clades in the gene tree that also group with a sister species from the singly-labeled tree
    combinations       The total number of mappings to try for the gene tree with this MUL-tree
    over.cap.filtered  Either Y or N to indicate whether the number of groups exceeds the number set with -c

  5. grampa-trees-filtered.txt

    A text file with the gene trees used for this GRAMPA run, after filtering by the group cap. One tree per line.

  6. grampa.log

    A log file containing run time information and a summary of the lowest scoring MUL-tree.

Options Table


Option Description
-s A file or string containing a bifurcating, rooted species tree in Newick format. This tree can either be singly-labeled or MUL.
-g A file containing one or more bifurcating, rooted, Newick formatted gene trees. Gene trees with polytomies are currently not supported and will be automatically filtered from the analysis.
-h1 A space separated list of nodes to search as the polyploid clade. If nothing is entered all nodes will be considered.
-h2 A space separated list of nodes to search as possible parental lineages for all nodes specified with -h1. If nothing is entered all possible nodes for the current h1 will be considered.
-c The maximum number of initial groups to consider for any gene tree. Default: 8, Max value: 18
-o Output directory name. If the directory is not present, GRAMPA will create it for you.
-f By default, all output files created by GRAMPA will have the prefix 'grampa-'. You can specify a different prefix with this option.
-v Control the amount of output printed to the screen. 0: print nothing. 1: print only some info at the start. 2: print all log info to screen. 3 (default): print all log info to the screen as well as progress updates for certain steps.
-p The number of processes GRAMPA should use for reconciliations.
--multree Set this flag if your input species tree is a MUL-tree.
--labeltree The program will simply label your input species tree.
--numtrees The program will simply count the number of possible MUL-trees given -s. -h1 and -h2 may also be supplied.
--buildmultrees Build MUL-trees given -s, -h1, and -h2 and write them to the log file.
--checknums If this flag is entered, the program will just calculate the number of groups per gene tree and exit. No reconciliations will be done.
--st-only Only do reconciliations to the input singly-labeled species tree.
--no-st Skip doing reconciliations to the input singly-labeled species tree.
--maps Output the node maps for each reconciliation in addition to the scores. The maps will be placed in the detailed output file.
--version Print out version info and exit.
--tests Run the tests script.

Detailed options


-s : A rooted, Newick formatted species tree. This tree can be singly-labeled or MUL.

    The tree can be in a file, in which case you enter the file name here, or you can simply paste the tree string into the command line.

    Entering a singly-labeled tree means you wish to search for the most parsimonious polyploidy scenario. GRAMPA will build MUL-trees based on this singly-labeled tree and calculate reconciliation scores. You can specify the range of MUL-trees to build with the -h1 and -h2 options.

    Example singly-labeled species tree:

    (((a,(x,(y,z))),b),(c,d))

    Entering a MUL-tree is the equivalent of entering a singly-labeled tree and specifying a single H1 and single H2 node. It represents a single scenario of polyploidy and can be used if you wish to count the number of duplications and losses on gene trees given that scenario.

    NOTE: If a MUL-tree is entered, the --multree flag must be set.

    Example MUL-tree:

    ((((a,(x,(y,z))),b),(x,(y,z))),(c,d))

-g : A file containing Newick formatted gene trees.

    This file should contain one or more bifurcating, Newick formatted gene trees, with one tree per line in the file. Currently, gene trees with unresolved nodes (polytomies) are not supported as they falsely increase the number of losses counted in that tree.

    The tip labels in the gene trees must end with _[species label] where [species label] matches a tip label in the species tree. This is necessary so GRAMPA can initialize the mappings correctly.

    Alternatively, if you wish to reconcile to only a single gene tree, you can simply paste the tree string into the command line.

-h1 and -h2 : GRAMPA's search parameters.

    H1 and H2 are nodes in the singly-labeled species tree that define how to build a MUL-tree. H1 is the node that represents the polyploid clade. The subtree rooted at H1 and the branch that H1 subtends will be copied onto the branch that H2 subtends:

    In the above example, H1 is node 2 and H2 is node 5 in the singly-labeled tree. This leads to the MUL-tree on the right.

    H1 and H2 can be input in 2 different, equivalent ways:

    -h1 "2" -h2 "5" and -h1 "x,y,z" -h2 "c,d"

    The first way relies on internal node labels. To label your species tree, use the --labeltree option.
    IMPORTANT: For now, only use node labels as specified by --labeltree. Custom labels will not work.

    The second way uses a list of the species that define that node. Species must be comma delimited.

    H2 cannot be located below H1 in the species tree! If this occurs, GRAMPA will just tell you that it's not possible and move on.
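
The copy step described above can be illustrated with nested tuples. This is a minimal sketch, not GRAMPA's implementation: parse, tips, find, build_mul_tree and to_newick are hypothetical helpers, trees are assumed to be label-only Newick strings, and H1 and H2 are given as sets of species. The sketch assumes that "below H1" means a clade strictly inside the H1 clade.

```python
def parse(newick):
    """Parse a label-only Newick string into nested tuples of tip names."""
    s = newick.strip().rstrip(";")
    pos = 0

    def node():
        nonlocal pos
        if s[pos] == "(":
            pos += 1                      # skip "("
            children = [node()]
            while s[pos] == ",":
                pos += 1                  # skip ","
                children.append(node())
            pos += 1                      # skip ")"
            return tuple(children)
        start = pos
        while pos < len(s) and s[pos] not in "(),":
            pos += 1
        return s[start:pos]

    return node()

def tips(tree):
    """Return the set of tip names under a node."""
    return {tree} if isinstance(tree, str) else set().union(*(tips(c) for c in tree))

def find(tree, clade):
    """Return the subtree whose tip set equals clade, or None if there is none."""
    if tips(tree) == clade:
        return tree
    if isinstance(tree, str):
        return None
    for child in tree:
        hit = find(child, clade)
        if hit is not None:
            return hit
    return None

def build_mul_tree(tree, h1, h2):
    """Attach a copy of the H1 subtree as sister to the H2 subtree."""
    h1_sub = find(tree, h1)
    if h1_sub is None or find(tree, h2) is None:
        raise ValueError("H1 and H2 must both define nodes in the species tree")
    if h2 < h1:  # assumption: H2 strictly inside the H1 clade counts as "below H1"
        raise ValueError("H2 cannot be located below H1")

    def rebuild(t):
        if tips(t) == h2:
            return (t, h1_sub)            # copy of H1 lands on the branch above H2
        if isinstance(t, str):
            return t
        return tuple(rebuild(c) for c in t)

    return rebuild(tree)

def to_newick(tree):
    """Write nested tuples back out as a Newick string (no trailing semicolon)."""
    return tree if isinstance(tree, str) else "(" + ",".join(to_newick(c) for c in tree) + ")"

species_tree = parse("(((a,(x,(y,z))),b),(c,d))")
mul = build_mul_tree(species_tree, set("x,y,z".split(",")), set("a,b,x,y,z".split(",")))
print(to_newick(mul))  # ((((a,(x,(y,z))),b),(x,(y,z))),(c,d))
```

With H1 given as x,y,z and H2 as a,b,x,y,z, the result has the same topology as the example MUL-tree shown under -s.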

    Multiple H1 and H2 nodes can be entered as a space delimited list:

    -h1 "2 3" -h2 "5 6" and -h1 "x,y,z a,x,y,z" -h2 "c,d a,b,c,d,x,y,z" are equivalent.

    Entering this means that GRAMPA will first set H1 as node 2 and try both nodes 5 and 6 as H2. Then H1 will be set to node 3 and will try nodes 5 and 6 as H2.

    If -h1 and -h2 are not specified, GRAMPA will try all possible node combinations of H1 and H2!
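
The search order described above for space-delimited lists can be sketched as a Cartesian product. This is only an illustration; parse_nodes and search_order are hypothetical names and do not correspond to functions in GRAMPA.

```python
from itertools import product

def parse_nodes(arg):
    """Split a space-delimited -h1/-h2 style string into individual node specifications."""
    return arg.split()

def search_order(h1_arg, h2_arg):
    """Pair each H1 with every H2; all H2 nodes are tried before moving to the next H1."""
    return list(product(parse_nodes(h1_arg), parse_nodes(h2_arg)))

print(search_order("2 3", "5 6"))
# [('2', '5'), ('2', '6'), ('3', '5'), ('3', '6')]

print(search_order("x,y,z a,x,y,z", "c,d a,b,c,d,x,y,z"))
# [('x,y,z', 'c,d'), ('x,y,z', 'a,b,c,d,x,y,z'), ('a,x,y,z', 'c,d'), ('a,x,y,z', 'a,b,c,d,x,y,z')]
```

Each pair corresponds to one candidate MUL-tree, so the number of pairs is an upper bound on the number of MUL-trees built for a given -h1 and -h2.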

-c : The group cap

    GRAMPA uses the standard LCA reconciliation algorithm on MUL-trees, meaning that some genes have more than one possible mapping. We get around this by trying ALL possible initial mappings and picking the one with the lowest score. This works, but it also means the program's runtime is exponential in the number of genes from polyploid species in any given gene tree. We reduce the number of mappings in several ways by collapsing and fixing groups (see paper), but there can still be many groups. This parameter sets the maximum number of groups to consider for any gene tree. If a gene tree has more than this number of groups, it will be skipped.

    Default is 8 groups, with a max setting of 18.
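
To see why a cap is needed, consider a toy version of the exhaustive search. The sketch below is not GRAMPA's algorithm: it assumes, purely for illustration, that each group can be placed on one of two copies of the hybrid clade, and toy_score is a hypothetical stand-in for a real reconciliation score.

```python
from itertools import product

def best_mapping(groups, score, cap=8):
    """Try every assignment of groups to the two hybrid-clade copies and return the cheapest."""
    if len(groups) > cap:
        return None  # over the group cap: this gene tree would be skipped
    candidates = product(("copy1", "copy2"), repeat=len(groups))
    return min(candidates, key=score)

def toy_score(mapping):
    """Hypothetical score: one penalty point for every group placed on copy2."""
    return sum(m == "copy2" for m in mapping)

print(best_mapping(["g1", "g2", "g3"], toy_score))  # ('copy1', 'copy1', 'copy1')
print(best_mapping(["g%d" % i for i in range(9)], toy_score))  # None (9 groups > cap of 8)

# Under the two-placement assumption, the number of candidates doubles with each group.
print(2 ** 8, 2 ** 18)  # 256 262144
```

Under this assumption the default cap of 8 allows at most 256 candidate mappings per gene tree, while the maximum cap of 18 allows 262144.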

-o : Output directory

    GRAMPA creates several output files, so it is easiest just to place them all in a single directory. That directory can be specified with this option, and will be created for you if it doesn't exist. If this option is not specified, the default output directory is "grampa_[date]-[time]".

-f : Output file prefix

    By default, all output files created by GRAMPA will have the prefix 'grampa-'. You can specify a different prefix with this option. For example, a run with -f test will generate the following output files, all within the output directory:

    test_out.txt, test_det.txt, test_checknums.txt

--multree : Input MUL-tree flag

    GRAMPA can accept both singly-labeled and MUL-trees as input. If your input species tree (-s) is a MUL-tree, you must set this flag so GRAMPA knows to read it as a MUL-tree. A MUL-tree represents a single possible polyploid scenario and it is equivalent to entering a singly-labeled tree with a single H1 and H2 node specified.

--labeltree : Species tree labeling

    This option can be used in conjunction with -s to simply add internal node labels to a species tree and print it to the screen. For example, if the file species.tree contains the following tree:

    (((a,(x,(y,z))),b),(c,d))

    Then the command:

    python grampa.py -s species.tree --labeltree

    Will simply print this to the screen as output:

    (((a,(x,(y,z)<1>)<2>)<3>,b)<4>,(c,d)<5>)<6>
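
The labels in this example are consistent with numbering internal nodes in the order in which their closing parentheses appear. A minimal sketch under that assumption (label_internal_nodes is a hypothetical helper, not GRAMPA's own code, and the input is assumed to carry no internal labels of its own):

```python
def label_internal_nodes(newick):
    """Append <n> after each closing parenthesis, counting from 1 in order of appearance."""
    out, n = [], 0
    for ch in newick:
        out.append(ch)
        if ch == ")":
            n += 1
            out.append("<%d>" % n)
    return "".join(out)

print(label_internal_nodes("(((a,(x,(y,z))),b),(c,d))"))
# (((a,(x,(y,z)<1>)<2>)<3>,b)<4>,(c,d)<5>)<6>
```

Because the numbering depends only on the order of closing parentheses, the root always receives the highest label.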

--numtrees : Counting MUL-trees

    This option quickly calculates how many MUL-trees will be built for a given set of H1 and H2 nodes. If neither H1 nor H2 is set, it will display the total number of MUL-trees possible for the input species tree. This information is printed to the screen.

--buildmultrees : Building MUL-trees

    This option can be used with -s, -h1, and -h2 to build MUL-trees from a standard species tree. For example, if the file species.tree contains the following tree:

    (((a,(x,(y,z))),b),(c,d))

    Then the command:

    python grampa.py -s species.tree -h1 "2" -h2 "4" -o multree_ex --buildmultrees

    Will yield the following output in the main output file (multree_ex/grampa-out.txt):

    ((((a,(x+,(y+,z+)<1>)<2>)<3>,b)<4>,(x*,(y*,z*)<5>)<6>)<7>,(c,d)<8>)<9>

    The MUL-trees are written to the log file.

--checknums : Group counting

    With this set, the program will run normally with the specified options, except no reconciliations will be done. Instead, only the checknums output file will be created and will contain the number of polyploid groups for each gene tree. Use this to decide the best setting for -c.

--st-only : Reconciling to input tree only

    By default, GRAMPA reconciles the gene trees to all specified MUL-trees as well as the singly-labeled input species tree. Set this option to ONLY do reconciliations to the singly-labeled input species tree.

--no-st : Exclude reconciling to the input tree

    By default, GRAMPA reconciles the gene trees to all specified MUL-trees as well as the singly-labeled input species tree. Set this option to SKIP reconciliations to the singly-labeled input species tree.

--maps : Output node mappings

    This option adds a column to grampa-detailed.txt with the actual LCA node mappings for each gene tree and MUL-tree combination. The column contains a Newick formatted version of the gene tree with nodes labeled as follows:

    Node[Map-Dups]

    Where Map indicates the node in the MUL-tree that this gene tree node maps to and Dups the number of duplications this mapping incurs. These trees can be rendered with a tree viewer such as SeaView or FigTree.