Train a Watershed model on training data, then use the Watershed parameters optimized during training to predict Watershed posterior probabilities for all gene-individual pairs in a (typically much larger) prediction data set.
predict_watershed(
  training_input,
  prediction_input,
  number_dimensions = 1,
  model_name = "Watershed_exact",
  dirichlet_prior_parameter = 10,
  l2_prior_parameter = 0.1,
  output_prefix = "watershed",
  binary_pvalue_threshold = 0.1,
  lambda_costs = c(0.1, 0.01, 0.001),
  nfolds = 5,
  vi_step_size = 0.8,
  vi_threshold = 1e-08
)
training_input
String. The Watershed input file containing the instances used to train the model. Either a file path or a URL. For the required format, see the details of evaluate_watershed().
prediction_input
String. The Watershed input file containing the instances to predict on. Either a file path or a URL. For the required format, see the details of evaluate_watershed().
number_dimensions
Integer. The number of outlier types, sometimes referred to as E in our documentation. Default: 1
model_name
String. Identifier of the model to use. Options are "RIVER", "Watershed_exact", and "Watershed_approximate". Default: "Watershed_exact"
dirichlet_prior_parameter
Float. Parameter of the Dirichlet distribution that acts as a prior on Phi (the model parameters defining E|Z). Default: 10
l2_prior_parameter
Float. Parameter of the L2 (Gaussian) distribution that acts as a prior on the parameters defining the conditional random field P(Z|G). If set to NULL, Watershed runs a grid search on held-out data to select an optimal L2 prior. Default: 0.1
output_prefix
String. Prefix of all output files generated by this function. Default: "watershed"
binary_pvalue_threshold
Float. Absolute p-value threshold used to create the binary outliers used for the genomic annotation model. Default: 0.1
lambda_costs
Numeric vector of length 3. If l2_prior_parameter is NULL, the grid search evaluates these values of lambda to determine the optimal one. Default: c(0.1, 0.01, 0.001)
nfolds
Integer. If l2_prior_parameter is NULL, the number of folds used in K-fold cross-validation for the genomic annotation model. Default: 5
vi_step_size
Float. Step size parameter used for variational optimization. Only applies if model_name == "Watershed_approximate". Default: 0.8
vi_threshold
Float. Threshold parameter used for variational optimization. Only applies if model_name == "Watershed_approximate". Default: 1e-8
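As a rough illustration of how the grid-search arguments fit together, the sketch below assembles a call in which l2_prior_parameter is NULL, so that lambda_costs and nfolds take effect. The names grid_args and "watershed_grid" are placeholders chosen for this example, and the input URL is the example data set used in this page's examples. The call itself is guarded by if (FALSE) because it needs the package and network access.

```r
# Example data set referenced in this page's examples
input <- paste0("https://raw.githubusercontent.com/BennyStrobes/Watershed/",
                "master/example_data/watershed_example_data.txt")

# Placeholder argument list; list() keeps the explicit NULL entry
grid_args <- list(
  training_input = input,
  prediction_input = input,
  number_dimensions = 3,
  model_name = "Watershed_exact",
  l2_prior_parameter = NULL,            # NULL requests the grid search
  lambda_costs = c(0.1, 0.01, 0.001),   # candidate lambda values
  nfolds = 5,                           # folds for K-fold cross-validation
  output_prefix = "watershed_grid"      # placeholder prefix
)

if (FALSE) {
  # Not run: requires the package that provides predict_watershed()
  do.call(predict_watershed, grid_args)
}
```

Building the arguments as a list first makes it easy to inspect or reuse the same settings for several runs before handing them to do.call().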
"Watershed_exact" is Watershed where parameters are optimized via exact inference
(tractable and recommended when E
is small. A general rule of thumb is if E
is
less than equal to 4, exact inference should be used). "Watershed_approximate"
is Watershed where parameters are optimized using approximate inference.
This approach is tractable when E
is large.
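The rule of thumb above can be expressed as a small helper. This is only a sketch: choose_model_name() and its exact_max argument are hypothetical names introduced here, not part of the package, and the cutoff of 4 simply mirrors the guidance in this section.

```r
# Hypothetical helper (not part of the package): pick model_name from E
choose_model_name <- function(number_dimensions, exact_max = 4) {
  stopifnot(is.numeric(number_dimensions),
            length(number_dimensions) == 1,
            number_dimensions >= 1)
  if (number_dimensions <= exact_max) {
    "Watershed_exact"        # exact inference for small E
  } else {
    "Watershed_approximate"  # approximate inference for large E
  }
}

choose_model_name(3)
choose_model_name(10)
```

The returned string could then be passed as the model_name argument of predict_watershed().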
This function saves a tab-separated file to ${output_prefix}posterior_probability.txt. Each line of this file corresponds to an instance (a line) in the prediction input file $prediction_input. The "sample_names" column provides the identifier of the gene-individual pair corresponding to the given line. There is one additional column for each of the E outliers, where the column corresponding to outlier e contains the Watershed marginal posterior probability for outlier e.
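A minimal sketch of how the output file might be read back into R follows. Because the real file only exists after a run, the example first writes a made-up stand-in with E = 2. The identifiers, the outlier column names (outlier_1, outlier_2), the values, and the 0.8 cutoff are all invented for illustration; only the "sample_names" column and the file name pattern come from the description above.

```r
# Made-up stand-in for an output file with E = 2; real column names may differ
output_prefix <- tempfile("watershed_demo_")
output_file <- paste0(output_prefix, "posterior_probability.txt")
mock <- data.frame(
  sample_names = c("indA:gene1", "indB:gene1", "indA:gene2"),
  outlier_1 = c(0.92, 0.05, 0.40),
  outlier_2 = c(0.10, 0.03, 0.85)
)
write.table(mock, output_file, sep = "\t", quote = FALSE, row.names = FALSE)

# Read the tab-separated file; check.names = FALSE keeps headers unchanged
posteriors <- read.delim(output_file, check.names = FALSE,
                         stringsAsFactors = FALSE)

# Every column other than "sample_names" is treated as one outlier type
outlier_cols <- setdiff(names(posteriors), "sample_names")

# Largest marginal posterior per gene-individual pair
posteriors$max_posterior <- apply(posteriors[, outlier_cols, drop = FALSE],
                                  1, max)

# Pairs with at least one posterior above an arbitrary cutoff of 0.8
high <- posteriors$sample_names[posteriors$max_posterior > 0.8]
```

Selecting the outlier columns with setdiff() avoids hard-coding their names, which this page does not specify.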
# Note: for convenience, the training file is the same as the prediction file.
# This does not have to be the case.
input = paste0("https://raw.githubusercontent.com/BennyStrobes/Watershed/",
               "master/example_data/watershed_example_data.txt")
# All examples use example data that has E = 3 outlier p-value columns,
# which corresponds to number_dimensions = 3
if (FALSE) {
  # Run using Watershed approximate inference
  predict_watershed(training_input = input,
                    prediction_input = input,
                    model_name = "Watershed_approximate",
                    number_dimensions = 3,
                    output_prefix = "watershed_approximate_n3")
  # Run using Watershed exact inference
  predict_watershed(training_input = input,
                    prediction_input = input,
                    model_name = "Watershed_exact",
                    number_dimensions = 3,
                    output_prefix = "watershed_exact_n3")
  # Run using RIVER
  predict_watershed(training_input = input,
                    prediction_input = input,
                    model_name = "RIVER",
                    number_dimensions = 3,
                    output_prefix = "river_n3")
}