Train model on non-N2 pairs and evaluate model on held-out N2-pairs. This allows the user to get an idea of the accuracy of Watershed applied to their data.
evaluate_watershed(
input_file,
model_name,
number_of_dimensions = 1,
dirichlet_prior = 10,
l2_prior_parameter = NULL,
output_prefix = "watershed",
n2_pair_pvalue_fraction = 0.1,
binary_pvalue_threshold = 0.1,
lambda_costs = c(0.1, 0.01, 0.001),
nfolds = 5,
vi_step_size = 0.8,
vi_threshold = 1e-08
)
String corresponding to the file name of the Watershed input file. Either a file path or a URL. See details for required format.
String identifier corresponding to the model to use. Options are "RIVER", "Watershed_exact", and "Watershed_approximate"
Integer representing the number of outlier types.
Sometimes referred to as E
in our documentation.
Float parameter defining Dirichlet distribution that acts
as a prior a Phi (the model parameters defining E|Z
)
Float defining the L2 (gaussian) distribution that acts
as a prior on the parameters defining the conditional random field P(Z|G)
.
If set to NULL, Watershed will run a grid search on held-out data to select an
optimal L2 prior. We recommend setting this parameter to NULL. Default: NULL
String corresponding to the prefix of all output files generated by this function
Float. Fraction of outliers (based on rank) that are considered an outlier for N2 pair analysis (this is done so each outlier type/signal has approximately the same distribution of positive outlier examples). Default: 0.1
Float. Absolute p-value threshold used to create binary outliers used for Genomic Annotation Model. Default: 0.1
Numeric vector of length 3. If l2_prior_parameter
is NULL,
perform grid search over the following values of lambda to determine optimal lambda.
Default: c(.1, .01, 1e-3)
Integer. If l2_prior_parameter
is NULL, Number of folds
to be used in K-fold cross validation for Genomic annotation model. Default: 5
Float. Parameter used for Variational Optimization.
Only applies if model_name == "Watershed_approximate"
. Default: 0.8
Float. Parameter used for Variational Optimization.
Only applies if model_name == "Watershed_approximate"
. Default: 1e-8
A named list:
model_params
contains all of the optimized Watershed parameters
gam_model_params
contains all optimized GAM parameters
auc
contains information on evaluation (precion-recall stats, etc)
auc[[1]]$evaROC$watershed_pr_auc
gives the area under the precision recall curve
(for Watershed/RIVER) in the first outlier dimension (base 1)
auc[[3]]$evaROC$watershed_pr_auc
gives the area under the precision recall curve
(for Watershed/RIVER) in the third outlier dimension (base 1)
auc[[1]]$evaROC$GAM_pr_auc
gives the area under the precision recall curve
(for GAM) in the first outlier dimension (base 1)
auc[[1]]$evaROC$watershed_precision
gives a vector of precision values
(for Watershed/RIVER) in the first outlier dimension (base 1). The vector
spans posterior thresholds generated by the R library "PRROC"
auc[[1]]$evaROC$watershed_recall
gives a vector of recall values (for Watershed/RIVER)
in the first outlier dimension (base 1). This vector is the same length as precision vector above
(and can be therefore used together to make a PR-curve).
The vector spans posterior thresholds generated by the R library "PRROC"
auc[[1]]$evaROC$GAM_precision
gives a vector of precision values (for GAM)
in the first outlier dimension (base 1). The vector spans posterior
thresholds generated by PRROC
.
auc[[1]]$evaROC$GAM_recall
gives a vector of recall values
(for GAM) in the first outlier dimension (base 1). This vector is the same
length as precision vector above (and can be therefore used together to make a PR-curve).
The vector spans posterior thresholds generated by PRROC
.
"Watershed_exact" is Watershed where parameters are optimized via exact inference
(tractable and recommended when the number of dimensions (E
) is small.
A general rule of thumb is if the number of dimensions (E
) is less than equal to 4,
exact inference should be used). "Watershed_approximate" is Watershed where parameters
are optimized using approximate inference. This approach is tractable when the
number of dimensions (E
) is large.
Watershed models instances of gene-individual pairs. Therefore each line in the Watershed input file must be a gene-individual pair. Each line must include a minimum of one genomic annotation describing the rare variants nearby the gene for that individual. Each line must also include a minimum of one outlier score (p-value) describing the gene-individual pair. To evaluate Watershed performance we rely on N2 pairs, or pairs of individuals that share the same rare genetic variant nearby a particular gene (see the publication for more details). Therefore, each line in the input file must contain a column concerning whether the gene-individual is an N2 pair. If the gene-individual is not an N2 pair, it can be represented as "NA". If the gene-individual is in an N2 pair, it can be represented as a positive integer that represents the unique identifier for that N2 pair.
Note: if there are more than two individuals with the same rare genetic variant, you must randomly select two individuals and ignore all other individuals.
More formally, assuming G
genomic annotations and E
outlier p-values, the columns
of the Watershed input file are as follows (column ordering is essential):
SubjectID
Identifier for the individual in the gene-individual pair
GeneName
Identifier for the gene in the gene-individual pair
[Genomic Annotation]
A column for each of the G genomic annotations. Header names of these columns can be entirely user specified. Values are real valued.
[E outlier p-values]
A column for each of the E outlier p-values.
Header names of these columns can be entirely user specified. Values are p-values (i.e., [0,1]
).
It is okay to have missing/unobserved values. If an outlier is missing for a gene-individual pair,
use "NA" for that element.
N2pair
Identifier for whether gene-individual is in an N2 pair. If it is not, use "NA". If it is an N2 pair, use a unique positive integer to identify the pair.
Note on outlier p-values: We can explicitly model outlier calls with sign/direction
(ie. total expression outliers are generated from a zscore and can therefor be over-expression
or under-expression). To model this, if a particular outlier signal has sign/direction,
simply put a negative sign in front of the p-value for instances that are under-expression
and keep the over-expression outliers as postive valued.
An example of this can be seen in the pheno_2_pvalue
column of
https://raw.githubusercontent.com/BennyStrobes/Watershed/master/example_data/watershed_example_data.txt.
if (FALSE) {
# For all examples, use example data that has 3 E outlier p-value columns,
# which corresponds to number_of_dimensions = 3
input = paste0("https://raw.githubusercontent.com/BennyStrobes/Watershed/",
"master/example_data/watershed_example_data.txt")
# Run using Watershed approximate inference
evaluate_watershed(input_file = input,
model_name = "Watershed_approximate",
number_of_dimensions = 3,
output_prefix = "watershed_approximate_n3")
# Run using Watershed exact inference
evaluate_watershed(input_file = input,
model_name = "Watershed_exact",
number_of_dimensions = 3,
output_prefix = "watershed_exact_n3")
# Run using RIVER
evaluate_watershed(input_file = input,
model_name = "RIVER",
number_of_dimensions = 3,
output_prefix = "river_n3")
}