Train a Watershed model on training data and predict Watershed posterior probabilities (using the Watershed parameters optimized in training) for all gene-individual pairs in a much larger prediction data set.

predict_watershed(
  training_input,
  prediction_input,
  number_dimensions = 1,
  model_name = "Watershed_exact",
  dirichlet_prior_parameter = 10,
  l2_prior_parameter = 0.1,
  output_prefix = "watershed",
  binary_pvalue_threshold = 0.1,
  lambda_costs = c(0.1, 0.01, 0.001),
  nfolds = 5,
  vi_step_size = 0.8,
  vi_threshold = 1e-08
)

Arguments

training_input

String. The Watershed input file containing instances used to train the model. Either a file path or a URL. For required format, see details of evaluate_watershed().

prediction_input

String. The Watershed input file containing instances to predict on. Either a file path or a URL. For required format, see details of evaluate_watershed().

number_dimensions

Integer representing the number of outlier types. Sometimes referred to as E in our documentation.

model_name

String identifier corresponding to the model to use. Options are "RIVER", "Watershed_exact", and "Watershed_approximate".

dirichlet_prior_parameter

Float. Parameter defining the Dirichlet distribution that acts as a prior on Phi (the model parameters defining E|Z).

l2_prior_parameter

Float defining the L2 (Gaussian) distribution that acts as a prior on the parameters defining the conditional random field P(Z|G). If set to NULL, Watershed will run a grid search on held-out data to select an optimal L2 prior. Default: 0.1

output_prefix

String corresponding to the prefix of all output files generated by this function.

binary_pvalue_threshold

Float. Absolute p-value threshold used to create the binary outliers used by the genomic annotation model. Default: 0.1

lambda_costs

Numeric vector of length 3. If l2_prior_parameter is NULL, a grid search over these values of lambda is performed to determine the optimal lambda. Default: c(0.1, 0.01, 0.001)

nfolds

Integer. If l2_prior_parameter is NULL, the number of folds used in K-fold cross-validation for the genomic annotation model. Default: 5

vi_step_size

Float. Parameter used for Variational Optimization. Only applies if model_name == "Watershed_approximate". Default: 0.8

vi_threshold

Float. Parameter used for Variational Optimization. Only applies if model_name == "Watershed_approximate". Default: 1e-8
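
As a rough illustration of what binary_pvalue_threshold controls, the sketch below binarizes p-values by their absolute value. The p-values, the variable names, and the choice of `<=` as the comparison are assumptions made for illustration; they are not taken from the package source.

```r
# Hypothetical p-values for five gene-individual pairs. Negative values are
# included on the assumption that p-values may carry a sign, which is why the
# threshold is described as "absolute".
pvalues <- c(0.5, -0.03, 0.1, -0.8, 0.002)
binary_pvalue_threshold <- 0.1

# Assumption: an instance counts as a binary outlier when the absolute
# p-value is at or below the threshold.
binary_outliers <- as.integer(abs(pvalues) <= binary_pvalue_threshold)
print(binary_outliers)
```

A smaller threshold yields fewer binary outliers, so fewer positive instances are passed to the genomic annotation model.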

Details

"Watershed_exact" is Watershed where parameters are optimized via exact inference. It is tractable and recommended when E is small; as a general rule of thumb, use exact inference if E is less than or equal to 4. "Watershed_approximate" is Watershed where parameters are optimized using approximate inference. This approach remains tractable when E is large.
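
The rule of thumb described above can be sketched as a small helper. `choose_model_name()` and its `max_exact` argument are hypothetical and not part of the package; they only encode the guidance that exact inference suits E of 4 or fewer.

```r
# Hypothetical helper (not part of the package): choose a model_name from the
# number of outlier types E, following the rule of thumb described above.
choose_model_name <- function(number_dimensions, max_exact = 4) {
  stopifnot(is.numeric(number_dimensions), number_dimensions >= 1)
  if (number_dimensions <= max_exact) {
    "Watershed_exact"
  } else {
    "Watershed_approximate"
  }
}

print(choose_model_name(3))
print(choose_model_name(10))
```

The returned string could then be passed as the model_name argument of predict_watershed().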

This function saves a tab-separated file to ${output_prefix}posterior_probability.txt. Each line of this file corresponds to an instance (a line) in the prediction input file given by prediction_input. The "sample_names" column provides the identifier for the gene-individual pair corresponding to the given line. There is an additional column for each of the E outlier types, where the column corresponding to outlier type e holds the Watershed marginal posterior probability for that outlier type.
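
To show how this output might be consumed, the sketch below writes a small mock file in the layout described above, reads it back, and ranks instances by their largest posterior. Only the "sample_names" column name comes from this page; the posterior column names, the identifier format, the values, and the output prefix are assumptions made so the example is self-contained.

```r
# Build a mock output file so the sketch runs without the package.
# Assumed: posterior column names, identifier format, and all values.
output_prefix <- file.path(tempdir(), "watershed_demo_")
posterior_file <- paste0(output_prefix, "posterior_probability.txt")
mock <- data.frame(
  sample_names = c("ind1:geneA", "ind2:geneA", "ind3:geneB"),
  outlier_1 = c(0.92, 0.05, 0.40),
  outlier_2 = c(0.10, 0.03, 0.85),
  stringsAsFactors = FALSE
)
write.table(mock, posterior_file, sep = "\t", quote = FALSE, row.names = FALSE)

# Read the tab-separated file; every column other than sample_names is
# treated as a marginal posterior for one outlier type.
posteriors <- read.delim(posterior_file, stringsAsFactors = FALSE)
posterior_cols <- setdiff(names(posteriors), "sample_names")

# Rank gene-individual pairs by their highest posterior across outlier types.
posteriors$max_posterior <- apply(posteriors[posterior_cols], 1, max)
ranked <- posteriors[order(-posteriors$max_posterior), ]
print(ranked)
```

Selecting posterior columns by exclusion avoids hard-coding their names, which this page does not specify.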

Examples

# Note: for convenience, the training file is the same as the prediction file.
# This does not have to be the case.
input = paste0("https://raw.githubusercontent.com/BennyStrobes/Watershed/",
     "master/example_data/watershed_example_data.txt")

# All examples use example data with 3 outlier p-value columns (E = 3),
# which corresponds to number_dimensions = 3

if (FALSE) {
# Run using Watershed approximate inference
predict_watershed(training_input = input, 
                  prediction_input = input,
                  model_name = "Watershed_approximate", 
                  number_dimensions = 3,
                  output_prefix = "watershed_approximate_n3")

# Run using Watershed exact inference
predict_watershed(training_input = input, 
                  prediction_input = input,
                  model_name = "Watershed_exact", 
                  number_dimensions = 3,
                  output_prefix = "watershed_exact_n3")

# Run using RIVER
predict_watershed(training_input = input, 
                  prediction_input = input,
                  model_name = "RIVER", 
                  number_dimensions = 3,
                  output_prefix = "river_n3")
}