TVScript

Introduction
TVscript is an implementation of a method that removes from an analysis those transcripts that display high intra-condition variation within normalized read count profiles. It was designed to allow datasets derived from different studies to be incorporated into a single differential expression analysis pipeline. The method calculates pairwise distances between read counts for each transcript across all samples for a given condition and uses the variation within these distances to filter transcripts against a threshold selected from the overall distribution. Full details and a case study involving transcriptomics data associated with canid domestication are available in our submitted paper (link available shortly): "Taming the wild: a new tool for cross study RNA-seq analysis applied to behavioral traits in dogs and tame foxes". The method was designed and implemented by Diana Lobo, Raquel Godinho and John Archer.
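The filtering idea described above can be illustrated with a minimal Python sketch. This is not the TVscript source code. It assumes the counts are already normalized, uses the absolute difference as the pairwise distance and the population variance as the measure of variation, and drops a transcript when either condition exceeds the threshold. All function names and the toy values are hypothetical.

```python
from itertools import combinations
from statistics import pvariance


def intra_condition_variance(values):
    """Variance of all pairwise absolute differences between samples.

    Assumption: absolute difference is used as the distance; the tool
    itself may use a different distance or variance definition.
    """
    distances = [abs(a - b) for a, b in combinations(values, 2)]
    return pvariance(distances)


def filter_transcripts(profiles, threshold):
    """Keep transcripts whose variance is at or below the threshold in
    every condition.

    profiles: {transcript: {condition: [normalized count per sample]}}
    """
    kept = []
    for name, by_condition in profiles.items():
        worst = max(intra_condition_variance(v) for v in by_condition.values())
        if worst <= threshold:
            kept.append(name)
    return kept


# Toy data: two conditions (1 and 2) with four samples each.
profiles = {
    "tx_stable": {1: [10.0, 11.0, 10.5, 10.2], 2: [20.0, 21.0, 19.5, 20.3]},
    "tx_noisy": {1: [10.0, 90.0, 12.0, 11.0], 2: [20.0, 21.0, 19.5, 20.3]},
}
kept = filter_transcripts(profiles, threshold=5.0)
print(kept)
```

In this toy input, one outlying sample in condition 1 inflates the spread of the pairwise distances for `tx_noisy`, so only `tx_stable` passes the filter.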

Download and Installation
An executable jar file, the corresponding source code and an example usage scenario are available within the compressed file tvscript.zip.

Example usage

  1. Download the compressed file called tvscript.zip.

  2. Unzip the file. This will produce three subdirectories:
    1. code - contains the source code of the tool.
    2. jar - contains an executable jar file that can be readily used.
    3. example - test input data and example config.txt file.

      The Example Input Data

      Example input data is contained in the directory ./example/data. It consists of one file per sample, each containing un-normalized counts of reads that have been mapped to a common reference transcriptome. In this example the reference transcriptome used was dog cDNA (Hoeppner et al., 2014) and the samples consist of read counts from four dog datasets and four wolf datasets, corresponding to the conditions that will eventually be used for differential expression. See our paper for more details on these datasets. Additionally, a single file containing two columns is required, where the first column contains the name of each transcript in the reference and the second contains the corresponding length; see the example file contig_length.txt. This file must be generated by the user.

      The Config File

      The first line of the config file is the path to the directory where the input data files reside. In this example it is: /user_path/example/data/

      The second line is the name of the file containing the lengths of the transcripts. This file needs to be created by the user and must be in the same folder as the input data.

      The third line sets the intra-condition variance threshold above which transcripts will be removed. This value is obtained prior to analysis by pre-running the software with a value of -1 on this third line. In that case the software outputs the distribution of intra-condition variance values and their associated percentiles in a file called intraContigVarianceData.txt. The threshold can then be selected from the first column of this file, using the third column as a percentile guide. The second column identifies the condition, as defined by the user in subsequent lines of the config file. The selected threshold is then entered on the third line and the software is run again, after which transcripts are filtered based on their intra-condition variation relative to that threshold.

      The fourth line of the config file defines headers for the three columns that make up all subsequent lines. These subsequent lines list the files containing count data for the transcripts in the reference, along with their associated condition. The first column contains the file name. All files must reside in the path specified on line 1. The second column is the name that will be given to the output file after filtering. The third column takes a value of either 1 or 2, depending on which condition the user assigns to the dataset.
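The two user-prepared inputs described in step 2 (the transcript length file and the config file) could be generated with a short script. The Python sketch below is only an illustration under stated assumptions: the column separator (tab), the header labels on line 4, and the sample file names are all hypothetical, so check them against the example config.txt and contig_length.txt shipped in the example directory before use.

```python
def fasta_lengths(fasta_text):
    """Return (name, length) pairs for each record in FASTA-formatted text."""
    lengths = []
    name, size = None, 0
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                lengths.append((name, size))
            # Use the first whitespace-delimited token of the header as the name.
            parts = line[1:].split()
            name, size = (parts[0] if parts else ""), 0
        elif line and name is not None:
            size += len(line)
    if name is not None:
        lengths.append((name, size))
    return lengths


def build_config(data_dir, lengths_file, threshold, samples, sep="\t"):
    """Build the config text line by line, following the layout in step 2.

    samples: list of (input_file, output_name, condition), condition 1 or 2.
    The separator and the header labels are assumptions, not taken from the tool.
    """
    lines = [data_dir, lengths_file, str(threshold)]
    lines.append(sep.join(["file", "output", "condition"]))  # hypothetical labels
    for input_file, output_name, condition in samples:
        if condition not in (1, 2):
            raise ValueError("condition must be 1 or 2")
        lines.append(sep.join([input_file, output_name, str(condition)]))
    return "\n".join(lines) + "\n"


# Tiny made-up reference with two transcripts.
fasta = ">tx1 example\nACGTAC\nGT\n>tx2\nACG\n"
pairs = fasta_lengths(fasta)
length_table = "".join(f"{name}\t{length}\n" for name, length in pairs)

# A threshold of -1 requests the pre-run that reports the variance distribution.
config = build_config(
    "/user_path/example/data/",
    "contig_length.txt",
    -1,
    [("dog1.txt", "dog1_filtered.txt", 1), ("wolf1.txt", "wolf1_filtered.txt", 2)],
)
print(length_table)
print(config)
```

The transcript names written to the length file are assumed to match the identifiers used in the per-sample count files.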


  3. Once the config file has been set up and the data is in place the software can be run with the following command:

    java -jar TVscript.jar path-to-config-file

    Note: in this example we have preselected the threshold to correspond to the 95th percentile of variance values. This means that the 5% of transcripts with the highest intra-condition variances will be removed by filtering. Normally the user will need to pre-run the software, as described in the config file section above, in order to obtain this threshold.

  4. Output files, containing the transcripts whose intra-condition variation is below the selected threshold, will be created within the folder containing the input data, using the names specified in the config file. These output files can then be used in external software, for example DESeq2 (Love et al., 2014), for differential expression analysis.
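Choosing the threshold from the pre-run output, as described in the config file section, can be sketched in Python as follows. The column order (variance, condition, percentile) follows the description above. The whitespace separator, the optional header row, the 0 to 100 percentile scale and the sample values are assumptions made for illustration, so inspect a real intraContigVarianceData.txt before relying on this.

```python
def pick_threshold(text, target_percentile=95.0):
    """Return the smallest variance whose percentile reaches the target.

    Expects three whitespace-separated columns per row:
    variance, condition, percentile (assumed layout).
    """
    candidates = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        try:
            variance, percentile = float(fields[0]), float(fields[2])
        except ValueError:
            continue  # skip a header or malformed row
        if percentile >= target_percentile:
            candidates.append(variance)
    if not candidates:
        raise ValueError("no row reaches the requested percentile")
    return min(candidates)


# Made-up rows in the assumed layout; not real TVscript output.
sample = """variance condition percentile
0.8 1 50.0
2.5 2 90.0
4.1 1 95.0
9.7 2 99.0
"""
threshold = pick_threshold(sample)
print(threshold)
```

The returned value would then be placed on the third line of the config file, replacing the -1 used for the pre-run.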

Requirements
The jar file requires the Java Runtime Environment, version 1.8 or higher, to be installed.

References

  1. Hoeppner, M. P., Lundquist, A., Pirun, M., Meadows, J. R. S., Zamani, N., Johnson, J., … Grabherr, M. G. (2014). An improved canine genome and a comprehensive catalogue of coding genes and non-coding transcripts. PLoS ONE, 9(3). doi:10.1371/journal.pone.0091172
  2. Love, M. I., Huber, W., & Anders, S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology. doi:10.1186/s13059-014-0550-8