This repository packages the original Watershed command-line scripts from the upstream repository as an R package, which streamlines installation and dependency management.


Watershed is an unsupervised probabilistic framework that integrates genomic annotations and RNA-seq outlier calls to estimate the probability that a rare variant has a functional effect on a particular RNA-seq outlier phenotype. Examples of outlier phenotypes include, but are not limited to, total expression, splicing, and allele-specific expression (ASE). Watershed extends our previous model, RIVER (which can also be run via this package), by incorporating multiple outlier phenotypes into one model, so that predictions of functional effects in one outlier phenotype are informed by observed outlier calls in another. Please see our publication in Science for more details.

Installation

Install this R package from the GitHub repository:

if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}
devtools::install_github("nicolerg/WatershedR")

Input data

For details about the input file format, see the docs:

library(WatershedR)
?evaluate_watershed

An example input file with 18 genomic annotations and 3 outlier p-values can be found in example_data/watershed_example_data.txt.

Another example input file with 18 genomic annotations and 1 outlier p-value can be found in example_data/river_example_data_pheno_1.txt.
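
Before running a model, it can help to check the shape and column names of an input file. The sketch below is one way to do that in base R. The helper peek_input() is hypothetical and not part of WatershedR, and the code assumes the example file is tab-delimited with a header row and that the path is relative to a local clone of the repository. See ?evaluate_watershed for the authoritative format.

```r
# Hypothetical helper (not part of WatershedR): summarize an input table.
# Assumption: the file is tab-delimited text with a header row.
peek_input <- function(path) {
  stopifnot(file.exists(path))
  dat <- read.delim(path, header = TRUE, stringsAsFactors = FALSE,
                    check.names = FALSE)
  list(
    n_rows  = nrow(dat),      # number of data rows in the file
    n_cols  = ncol(dat),      # number of columns in the file
    columns = colnames(dat)   # column names as written in the header
  )
}

# Assumed location of the example file inside a local clone of the repository.
path <- "example_data/watershed_example_data.txt"
if (file.exists(path)) {
  str(peek_input(path))
}
```

The call is guarded with file.exists() so the snippet does nothing when run outside a clone of the repository.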

Running Watershed

This package provides two functions useful to users looking to apply Watershed to their data:

  1. evaluate_watershed(): Trains a Watershed model on non-N2 pairs and evaluates it on held-out N2 pairs (pairs of individuals who share the same rare variant near the same gene). This gives the user an estimate of how accurate Watershed is on their data.

  2. predict_watershed(): Trains a Watershed model on training data, then uses the parameters optimized in training to predict Watershed posterior probabilities for all gene-individual pairs in a much larger prediction data set.

Both of these functions can be run with three different models:

  • Watershed_exact: Watershed where parameters are optimized via exact inference. This approach is tractable and recommended when the number of dimensions (E) is small. As a general rule of thumb, use exact inference if E is less than or equal to 4.

  • Watershed_approximate: Watershed where parameters are optimized using approximate inference. This approach is tractable when the number of dimensions (E) is large. For example, we used this to model the related outlier signals from 49 tissues (see our publication).

  • RIVER: A previously published method. Used if the number of dimensions (E) is 1.
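
The rule of thumb in the list above can be written as a small helper. This is a minimal sketch: choose_watershed_model() and its exact_max argument are hypothetical names and not part of WatershedR, and the cutoff of 4 is the guideline stated above rather than a hard limit.

```r
# Hypothetical helper (not part of WatershedR): map the number of
# outlier dimensions E to one of the three model names listed above.
choose_watershed_model <- function(E, exact_max = 4L) {
  # E must be a single positive whole number.
  stopifnot(is.numeric(E), length(E) == 1, !is.na(E), E >= 1, E == round(E))
  if (E == 1) {
    "RIVER"                   # single outlier phenotype
  } else if (E <= exact_max) {
    "Watershed_exact"         # small E: exact inference is tractable
  } else {
    "Watershed_approximate"   # large E: approximate inference
  }
}

choose_watershed_model(3)   # "Watershed_exact"
choose_watershed_model(49)  # "Watershed_approximate"
```

Exposing the cutoff as an argument lets the threshold be raised or lowered if exact inference turns out to be faster or slower than expected on a given data set.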

See the function documentation for details about the outputs. See the Get Started vignette for examples of how to run these functions.

Citation

If you use this R package, please cite our publication:

N.M. Ferraro, B.J. Strober, J. Einson, N.S. Abell, F. Aguet, A.N. Barbeira, M. Brandt, M. Bucan, S.E. Castel, J.R. Davis, et al., TOPMed Lipids Working Group, GTEx Consortium, Transcriptomic signatures across human tissues identify functional rare genetic variation. Science 369, eaaz5900 (2020).