TTree

Introduction
Given a reference set of many transcripts (e.g. >25,000), it is difficult to visualize the relationships between all samples in an analysis (possibly many, e.g. >100) using normalized gene expression patterns as a metric. TTree is a tool to aid in this task. The user defines a transcript length range and a normalized read count range; transcripts outside either range are discarded. Then, for every pair of samples, the absolute differences in normalized read counts are summed across the remaining transcripts, giving a pairwise matrix indicative of general expression differences within the selected ranges. The resulting tree can be visualized in standard tree viewing software such as FigTree. The result is similar to a PCA plot (see our paper) but provides more information on the internal relationships between samples. The method was designed and implemented by Diana Lobo, Raquel Godinho and John Archer.
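
The pairwise summation described above can be sketched in a few lines of Python. This is a minimal illustration, not TTree's actual implementation: the function name `pairwise_distance_matrix`, the dictionary layout and the toy counts are all assumptions, and the sketch presumes that the length and read count filters have already been applied. How TTree handles a transcript that is missing from one sample of a pair is not stated here, so the sketch simply compares transcripts present in both.

```python
from itertools import combinations


def pairwise_distance_matrix(samples):
    """Sum absolute differences in normalized counts for every pair of samples.

    `samples` maps a sample name to a dict of {transcript: normalized count}.
    Only transcripts present in both samples of a pair are compared
    (an assumption made for this sketch).
    """
    names = sorted(samples)
    # Start with zeros everywhere, including the diagonal.
    matrix = {a: {b: 0.0 for b in names} for a in names}
    for a, b in combinations(names, 2):
        shared = samples[a].keys() & samples[b].keys()
        distance = sum(abs(samples[a][t] - samples[b][t]) for t in shared)
        matrix[a][b] = matrix[b][a] = distance  # the matrix is symmetric
    return matrix


# Toy, made-up normalized counts for three samples and three transcripts.
toy = {
    "Dg1": {"t1": 10.0, "t2": 4.0, "t3": 7.0},
    "Wf1": {"t1": 12.0, "t2": 1.0, "t3": 7.0},
    "FT1": {"t1": 2.0, "t2": 4.0, "t3": 9.0},
}
matrix = pairwise_distance_matrix(toy)
print(matrix["Dg1"]["Wf1"])  # |10-12| + |4-1| + |7-7| = 5.0
```

A larger value in the matrix means the two samples differ more in overall expression across the retained transcripts.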

Download and Installation
An executable jar file, the corresponding source code and an example usage scenario are available within the compressed software directory ttree.zip.

Example usage

  1. Download the compressed file called ttree.zip.

  2. Unzip the folder. The contents are as follows:
    1. code - contains the source code of the tool.
    2. jar - contains an executable jar file.
    3. example - contains an example config file and some example input data.

      Example Input Data
      Example input data is contained in the directory ./example/data. It consists of one file per sample, each containing un-normalized counts of reads that have been mapped to a common reference transcriptome. In this example the reference transcriptome used was dog cDNA (Hoeppner et al., 2014) and the samples consist of read counts from four dogs (Dg), four wolves (Wf), four tame foxes (FT) and four aggressive foxes (FA). Additionally, a single two-column file is required, in which the first column contains the name of each transcript in the reference and the second contains the corresponding length (see the example file contig_length.txt). This file must be generated by the user.

      Config File
      The lines of the config file are as follows:

      The first line is the path to the directory where the input data files reside. In this example it is: /user_path/example/data/

      The second line is the name of the file containing the lengths of the transcripts. This file must be created by the user and must reside in the same folder as the input data.

      The third and fourth lines specify the upper and lower normalized read count values. Transcripts with values outside this range will be removed prior to creating the tree.

      These values must be obtained prior to tree creation by pre-running the software with a value of -1 on the third line. In this case the software outputs all normalized read count values for each transcript across each dataset to a file called crossSampleNormalizedReadCounts.txt. The values are sorted in ascending order and each is associated with its rounded percentile (second column). Using these percentiles, the user can select the band within which transcripts are kept, for example between the 40th and 60th percentiles.

      The fifth line indicates the minimum number of samples for which a transcript must have read count data in order to be retained.

      The sixth and seventh lines are where the user specifies the range of lengths for which transcripts will be retained. Transcripts outside this range will be removed from the tree creation process. When the software is pre-run (see above), a file called contig_length_percentiles.txt is generated. It helps the user select the values that define the length range, in a similar manner to the normalized read counts.

      The eighth line of the config file defines headers for the two columns that make up all subsequent lines. These subsequent lines list the files containing the count data for the transcripts in the reference, together with the name to be used for each on the tree. The first column contains the file name; all files must reside in the path specified on line 1. The second column is the name used within the .ph tree file.


  3. Once the config file has been set up and the data is in place the software can be run with the following command:

    java -jar TTree.jar path-to-config-file

    Note: in this example the normalized read count range and the transcript length range have been preselected. Normally the user will need to pre-run the software, as described in the config file section above, in order to obtain these thresholds.

  4. The output tree.ph file will be created within the folder containing the input data. This file can be visualized within external tree viewing software, for example FigTree.
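
The percentile-based choice of thresholds described in the config file section can be illustrated with a short Python sketch. It does not read TTree's output files, whose exact layout is not described here; it works on an in-memory list of made-up values instead. The function names and the percentile formula (rank divided by the number of values, then rounded) are assumptions for illustration and may differ from how TTree computes its rounded percentiles.

```python
def with_percentiles(values):
    """Return (value, rounded percentile) pairs sorted in ascending order."""
    ordered = sorted(values)
    n = len(ordered)
    # Percentile of the i-th smallest value (1-based rank), rounded to an integer.
    return [(v, round(100 * (i + 1) / n)) for i, v in enumerate(ordered)]


def band_thresholds(values, low_pct, high_pct):
    """Return the (lower, upper) values bounding the chosen percentile band."""
    in_band = [v for v, p in with_percentiles(values) if low_pct <= p <= high_pct]
    if not in_band:
        raise ValueError("no values fall inside the requested percentile band")
    return in_band[0], in_band[-1]


# Ten made-up normalized read counts.
counts = [0.5, 1.0, 2.0, 3.5, 5.0, 8.0, 13.0, 21.0, 34.0, 55.0]
lower, upper = band_thresholds(counts, 40, 60)
print(lower, upper)  # 3.5 8.0
```

The two values returned for the chosen band would then be entered as the normalized read count thresholds in the config file; the same approach applies to transcript lengths.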

Requirements
The jar file requires the Java Runtime Environment (JRE), version 1.8 or higher, to be installed.

References

  1. Hoeppner, M. P., Lundquist, A., Pirun, M., Meadows, J. R. S., Zamani, N., Johnson, J., … Grabherr, M. G. (2014). An improved canine genome and a comprehensive catalogue of coding genes and non-coding transcripts. PLoS ONE, 9(3). doi:10.1371/journal.pone.0091172