Watershed is an unsupervised probabilistic framework that integrates genomic annotations and RNA-seq outlier calls to identify the probability a rare variant has a functional effect on a particular RNA-seq outlier phenotype (examples of outlier phenotypes can be, but are not limited to total expression, splicing, or ASE). Watershed extends our previous model RIVER (which can also be run via this package) by incorporating information from multiple outlier phenotypes into one model, where predictions for functional effects in one outlier phenotype are informed by observed outlier calls in another phenotype. Please see our publication in Science for more details.
See evaluate_watershed()
for details of the input file
format. For all the examples in this vignette, we will use the sample
input data provided here: https://raw.githubusercontent.com/BennyStrobes/Watershed/master/example_data/watershed_example_data.txt.
# For all examples, use example data that has 3 E outlier p-value columns,
# which corresponds to number_of_dimensions = 3
input = paste0("https://raw.githubusercontent.com/BennyStrobes/Watershed/",
"master/example_data/watershed_example_data.txt")
This package provides two user-level functions:
evaluate_watershed()
and predict_watershed()
.
Here we provide examples of how to run each of these functions with the
recommended default parameters.
This function is used to train a Watershed model on non-N2 pairs and evaluate model on held-out N2-pairs. This allows the user to get an idea of the accuracy of Watershed applied to their data.
res = evaluate_watershed(input_file = input,
model_name = "Watershed_approximate",
number_of_dimensions = 3,
output_prefix = "~/watershed_approximate_n3")
res = evaluate_watershed(input_file = input,
model_name = "Watershed_exact",
number_of_dimensions = 3,
output_prefix = "~/watershed_exact_n3")
res = evaluate_watershed(input_file = input,
model_name = "RIVER",
number_of_dimensions = 3,
output_prefix = "~/river_n3")
This function trains a Watershed model on training data and predicts Watershed posterior probabilities (using Watershed parameters optimized in training) on all gene-individual in a much larger prediction data set.
predict_watershed(training_input = input,
prediction_input = input,
model_name = "Watershed_approximate",
number_dimensions = 3,
output_prefix = "~/watershed_approximate_n3")
predict_watershed(training_input = input,
prediction_input = input,
model_name = "Watershed_exact",
number_dimensions = 3,
output_prefix = "~/watershed_exact_n3")
predict_watershed(training_input = input,
prediction_input = input,
model_name = "RIVER",
number_dimensions = 3,
output_prefix = "~/river_n3")