Train model on non-N2 pairs and evaluate model on held-out N2-pairs. This allows the user to get an idea of the accuracy of Watershed applied to their data.

evaluate_watershed(
  input_file,
  model_name,
  number_of_dimensions = 1,
  dirichlet_prior = 10,
  l2_prior_parameter = NULL,
  output_prefix = "watershed",
  n2_pair_pvalue_fraction = 0.1,
  binary_pvalue_threshold = 0.1,
  lambda_costs = c(0.1, 0.01, 0.001),
  nfolds = 5,
  vi_step_size = 0.8,
  vi_threshold = 1e-08
)

Arguments

input_file

String corresponding to the file name of the Watershed input file. Either a file path or a URL. See details for required format.

model_name

String identifier corresponding to the model to use. Options are "RIVER", "Watershed_exact", and "Watershed_approximate"

number_of_dimensions

Integer representing the number of outlier types. Sometimes referred to as E in our documentation.

dirichlet_prior

Float parameter defining Dirichlet distribution that acts as a prior a Phi (the model parameters defining E|Z)

l2_prior_parameter

Float defining the L2 (gaussian) distribution that acts as a prior on the parameters defining the conditional random field P(Z|G). If set to NULL, Watershed will run a grid search on held-out data to select an optimal L2 prior. We recommend setting this parameter to NULL. Default: NULL

output_prefix

String corresponding to the prefix of all output files generated by this function

n2_pair_pvalue_fraction

Float. Fraction of outliers (based on rank) that are considered an outlier for N2 pair analysis (this is done so each outlier type/signal has approximately the same distribution of positive outlier examples). Default: 0.1

binary_pvalue_threshold

Float. Absolute p-value threshold used to create binary outliers used for Genomic Annotation Model. Default: 0.1

lambda_costs

Numeric vector of length 3. If l2_prior_parameter is NULL, perform grid search over the following values of lambda to determine optimal lambda. Default: c(.1, .01, 1e-3)

nfolds

Integer. If l2_prior_parameter is NULL, Number of folds to be used in K-fold cross validation for Genomic annotation model. Default: 5

vi_step_size

Float. Parameter used for Variational Optimization. Only applies if model_name == "Watershed_approximate". Default: 0.8

vi_threshold

Float. Parameter used for Variational Optimization. Only applies if model_name == "Watershed_approximate". Default: 1e-8

Value

A named list:

  • model_paramscontains all of the optimized Watershed parameters

  • gam_model_paramscontains all optimized GAM parameters

  • auccontains information on evaluation (precion-recall stats, etc)

    • auc[[1]]$evaROC$watershed_pr_aucgives the area under the precision recall curve (for Watershed/RIVER) in the first outlier dimension (base 1)

    • auc[[3]]$evaROC$watershed_pr_aucgives the area under the precision recall curve (for Watershed/RIVER) in the third outlier dimension (base 1)

    • auc[[1]]$evaROC$GAM_pr_aucgives the area under the precision recall curve (for GAM) in the first outlier dimension (base 1)

    • auc[[1]]$evaROC$watershed_precisiongives a vector of precision values (for Watershed/RIVER) in the first outlier dimension (base 1). The vector spans posterior thresholds generated by the R library "PRROC"

    • auc[[1]]$evaROC$watershed_recallgives a vector of recall values (for Watershed/RIVER) in the first outlier dimension (base 1). This vector is the same length as precision vector above (and can be therefore used together to make a PR-curve). The vector spans posterior thresholds generated by the R library "PRROC"

    • auc[[1]]$evaROC$GAM_precisiongives a vector of precision values (for GAM) in the first outlier dimension (base 1). The vector spans posterior thresholds generated by PRROC.

    • auc[[1]]$evaROC$GAM_recallgives a vector of recall values (for GAM) in the first outlier dimension (base 1). This vector is the same length as precision vector above (and can be therefore used together to make a PR-curve). The vector spans posterior thresholds generated by PRROC.

Details

"Watershed_exact" is Watershed where parameters are optimized via exact inference (tractable and recommended when the number of dimensions (E) is small. A general rule of thumb is if the number of dimensions (E) is less than equal to 4, exact inference should be used). "Watershed_approximate" is Watershed where parameters are optimized using approximate inference. This approach is tractable when the number of dimensions (E) is large.

Watershed models instances of gene-individual pairs. Therefore each line in the Watershed input file must be a gene-individual pair. Each line must include a minimum of one genomic annotation describing the rare variants nearby the gene for that individual. Each line must also include a minimum of one outlier score (p-value) describing the gene-individual pair. To evaluate Watershed performance we rely on N2 pairs, or pairs of individuals that share the same rare genetic variant nearby a particular gene (see the publication for more details). Therefore, each line in the input file must contain a column concerning whether the gene-individual is an N2 pair. If the gene-individual is not an N2 pair, it can be represented as "NA". If the gene-individual is in an N2 pair, it can be represented as a positive integer that represents the unique identifier for that N2 pair.

Note: if there are more than two individuals with the same rare genetic variant, you must randomly select two individuals and ignore all other individuals.

More formally, assuming G genomic annotations and E outlier p-values, the columns of the Watershed input file are as follows (column ordering is essential):

SubjectID

Identifier for the individual in the gene-individual pair

GeneName

Identifier for the gene in the gene-individual pair

[Genomic Annotation]

A column for each of the G genomic annotations. Header names of these columns can be entirely user specified. Values are real valued.

[E outlier p-values]

A column for each of the E outlier p-values. Header names of these columns can be entirely user specified. Values are p-values (i.e., [0,1]). It is okay to have missing/unobserved values. If an outlier is missing for a gene-individual pair, use "NA" for that element.

N2pair

Identifier for whether gene-individual is in an N2 pair. If it is not, use "NA". If it is an N2 pair, use a unique positive integer to identify the pair.

Note on outlier p-values: We can explicitly model outlier calls with sign/direction (ie. total expression outliers are generated from a zscore and can therefor be over-expression or under-expression). To model this, if a particular outlier signal has sign/direction, simply put a negative sign in front of the p-value for instances that are under-expression and keep the over-expression outliers as postive valued. An example of this can be seen in the pheno_2_pvalue column of https://raw.githubusercontent.com/BennyStrobes/Watershed/master/example_data/watershed_example_data.txt.

Examples

if (FALSE) {
# For all examples, use example data that has 3 E outlier p-value columns, 
# which corresponds to number_of_dimensions = 3
input = paste0("https://raw.githubusercontent.com/BennyStrobes/Watershed/",
     "master/example_data/watershed_example_data.txt")

# Run using Watershed approximate inference
evaluate_watershed(input_file = input,
                   model_name = "Watershed_approximate", 
                   number_of_dimensions = 3,
                   output_prefix = "watershed_approximate_n3")

# Run using Watershed exact inference
evaluate_watershed(input_file = input, 
                   model_name = "Watershed_exact", 
                   number_of_dimensions = 3,
                   output_prefix = "watershed_exact_n3")

# Run using RIVER
evaluate_watershed(input_file = input, 
                   model_name = "RIVER", 
                   number_of_dimensions = 3,
                   output_prefix = "river_n3")
}