Title: Quality Control and Semantic Enrichment of Datasets
Description: A tool for the preparation and enrichment of health datasets for analysis (Toner et al. (2023) <doi:10.1093/gigascience/giad030>). Provides functionality for assessing data quality and for improving the reliability and machine interpretability of a dataset. 'eHDPrep' also enables semantic enrichment of a dataset where metavariables are discovered from the relationships between input variables determined from user-provided ontologies.
Authors: Tom Toner [aut], Ian Overton [aut, cre]
Maintainer: Ian Overton <[email protected]>
License: GPL-3
Version: 1.3.3.9000
Built: 2024-11-12 05:13:00 UTC
Source: https://github.com/overton-group/ehdprep
The primary high-level function for quality control. Applies several quality control functions in sequence to the input data frame (see Details for the individual functions).
apply_quality_ctrl(data, id_var, class_tbl, bin_cats = NULL, min_freq = 1, to_numeric_matrix = FALSE)
data: A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr).
id_var: An unquoted expression which corresponds to a variable (column) in data.
class_tbl: Data frame such as the output tibble from assume_var_classes followed by import_var_classes.
bin_cats: Optional named vector of user-defined values for binary values, using negative_finding = positive_finding syntax.
min_freq: Minimum frequency of occurrence used when extracting information from free-text variables (see extract_freetext).
to_numeric_matrix: Should QC'ed data be converted to a numeric matrix? Default: FALSE.
The wrapped functions are applied in the following order:
1. Standardise missing values (strings_to_NA)
2. Encode binary categorical variables (columns) (encode_binary_cats)
3. Encode (specific) ordinal variables (columns) (encode_ordinals)
4. Encode genotype variables (encode_genotypes)
5. Extract information from free-text variables (columns) (extract_freetext)
6. Encode non-binary categorical variables (columns) (encode_cats)
7. Encode output as numeric matrix (optional; encode_as_num_mat)
class_tbl is used to apply the above functions to the appropriate variables (columns).

data with several QC measures applied.
Other high level functionality: assess_quality(), review_quality_ctrl(), semantic_enrichment()
data(example_data)
require(tibble)

# create an example class_tbl object
# note that diabetes_type is classed as ordinal and is not modified as its
# levels are not pre-coded
tibble::tribble(~"var", ~"datatype",
                "patient_id", "id",
                "tumoursize", "numeric",
                "t_stage", "ordinal_tstage",
                "n_stage", "ordinal_nstage",
                "diabetes", "factor",
                "diabetes_type", "ordinal",
                "hypertension", "factor",
                "rural_urban", "factor",
                "marital_status", "factor",
                "SNP_a", "genotype",
                "SNP_b", "genotype",
                "free_text", "freetext") -> data_types

data_QC <- apply_quality_ctrl(example_data, patient_id, data_types,
                              bin_cats = c("No" = "Yes", "rural" = "urban"),
                              min_freq = 0.6)
Assesses and visualises completeness of the input data across both rows (samples) and columns (variables).
assess_completeness(data, id_var, plot = TRUE)
data: A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr).
id_var: An unquoted expression which corresponds to a variable (column) in data.
plot: Should plots be rendered when the function is run? (Default: TRUE)
Returns a list of completeness assessments:
- A tibble detailing completeness of variables (columns) (via variable_completeness).
- A tibble detailing completeness of rows (via row_completeness).
- A plot of row and variable (column) completeness (via plot_completeness).
- A clustered heatmap of cell completeness (via completeness_heatmap).
- A function which creates a clean canvas before plotting the completeness heatmap.
list of completeness tibbles and plots
Other measures of completeness: compare_completeness(), completeness_heatmap(), plot_completeness(), row_completeness(), variable_completeness()
data(example_data)
res <- assess_completeness(example_data, patient_id)

# variable completeness table
res$variable_completeness

# row completeness table
res$row_completeness

# show completeness of rows and variables as a bar plot
res$completeness_plot

# show dataset completeness in a clustered heatmap
# (this is similar to res$completeness_heatmap but ensures a blank canvas is
# first created)
res$plot_completeness_heatmap(res)
Provides information on the quality of a dataset. Assesses the dataset's completeness, internal consistency, and entropy.
assess_quality(data, id_var, consis_tbl)
data: A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr).
id_var: An unquoted expression which corresponds to a variable (column) in data.
consis_tbl: Data frame or tibble containing information on internal consistency rules (see "Consistency Table Requirements" section).
Wraps several quality assessment functions from eHDPrep and returns a nested list with the following structure:
- A list of completeness assessments:
  - Tibble of variable (column) completeness (via variable_completeness)
  - Tibble of row (sample) completeness (via row_completeness)
  - Plot of row and variable completeness (via plot_completeness)
  - Completeness heatmap (via completeness_heatmap)
  - A function which creates a clean canvas before plotting the completeness heatmap
- Tibble of internal inconsistencies, if any are present and if a consistency table is supplied (via identify_inconsistency)
- Names of variables (columns) with zero entropy (via zero_entropy_variables)
Nested list of quality measurements
The table must have exactly five character columns, ordered according to the list below which describes the values of each column:
1. First column name of data values that will be subject to consistency checking. String. Required.
2. Second column name of data values that will be subject to consistency checking. String. Required.
3. Logical test to compare columns one and two. One of: ">", ">=", "<", "<=", "==", "!=". String. Optional if columns 4 and 5 have non-NA values.
4. Either a single character string or a colon-separated range of numbers which should only appear in column A. Optional if column 3 has a non-NA value.
5. Either a single character string or a colon-separated range of numbers which should only appear in column B, given the value/range specified in column 4. Optional if column 3 has a non-NA value.

Each row should detail one test to make. Therefore, either column 3 or columns 4 and 5 must contain non-NA values.
Other high level functionality: apply_quality_ctrl(), review_quality_ctrl(), semantic_enrichment()
# general example
data(example_data)
res <- assess_quality(example_data, patient_id)

# example of internal consistency checks on a simpler dataset
# describing bean counts
require(tibble)

# creating `data`:
beans <- tibble::tibble(red_beans = 1:15,
                        blue_beans = 1:15,
                        total_beans = 1:15 * 2,
                        red_bean_summary = c(rep("few_beans", 9),
                                             rep("many_beans", 6)))

# creating `consis_tbl`
bean_rules <- tibble::tribble(~varA, ~varB, ~lgl_test, ~varA_boundaries, ~varB_boundaries,
                              "red_beans", "blue_beans", "==", NA, NA,
                              "red_beans", "total_beans", "<=", NA, NA,
                              "red_beans", "red_bean_summary", NA, "1:9", "few_beans",
                              "red_beans", "red_bean_summary", NA, "10:15", "many_beans")

# add some inconsistencies
beans[1, "red_bean_summary"] <- "many_beans"
beans[1, "red_beans"] <- 10

res <- assess_quality(beans, consis_tbl = bean_rules)

# variable completeness table
res$completeness$variable_completeness

# row completeness table
res$completeness$row_completeness

# show completeness of rows and variables as a bar plot
res$completeness$completeness_plot

# show dataset completeness in a clustered heatmap
res$completeness$plot_completeness_heatmap(res$completeness)

# show any internal inconsistencies
res$internal_inconsistency

# show any variables with zero entropy
res$vars_with_zero_entropy
Classes/data types of data variables are assumed with this function and exported to a .csv file for amendment. Any incorrect classes can then be corrected and imported using import_var_classes.
assume_var_classes(data, out_file = NULL)
data: Data frame.
out_file: File where variables and their assumed classes are stored for user verification.
Writes a .csv file containing the variables and their assumed data types / classes.
# example below assumes incorrectly for several variables
tmp <- tempfile(fileext = ".csv")
data(example_data)
assume_var_classes(example_data, tmp)
Adds colour highlighting to cell values if they are encoded as logical values. Output should then be passed to knitr's kable function.
cellspec_lgl(.data, rg = FALSE)
.data: Table to be highlighted.
rg: Should red and green be used for FALSE and TRUE cells, respectively? (Default: FALSE)
This is useful for identifying the encoding used in a value (e.g. the difference between the string "TRUE" and the logical value TRUE). This highlighting can also be useful when visually assessing cell values in a table. The colour naming format (HTML or LaTeX) is automatically detected.

There are four cell types considered:
- non-logical cells are coloured black
- TRUE cells are coloured red (default) or green if rg is TRUE
- FALSE cells are coloured cyan (default) or red if rg is TRUE
- NA cells are coloured gray

Note: when passed to kable(), the escape parameter should be FALSE for colours to be rendered correctly.
Table with cell colours specified.
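A minimal usage sketch (the small input table here is invented for illustration; assumes eHDPrep and knitr are installed and that cellspec_lgl and kable behave as described above):

```r
library(knitr)    # for kable()
library(eHDPrep)

# small table mixing logical values, logical-looking strings, and NA
tbl <- data.frame(a = c(TRUE, FALSE, NA),
                  b = c("TRUE", "FALSE", "maybe"))

# highlight logical cells, then render with escape = FALSE so the
# colour markup is not escaped away
kable(cellspec_lgl(tbl), escape = FALSE)
```

Passing rg = TRUE would switch to the red/green scheme described above.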
Produces a density plot comparing the completeness of two datasets (tbl_a and tbl_b) by variable (if dim == 2, the default) or by row (if dim == 1). The label used to identify each dataset's density curve can be specified using tbl_a_lab and tbl_b_lab.
compare_completeness(tbl_a, tbl_b, dim = 2, tbl_a_lab = NULL, tbl_b_lab = NULL)
tbl_a: First data frame to compare.
tbl_b: Second data frame to compare.
dim: Integer. Dimension to measure completeness on. 2 (default) measures completeness by variable; 1 measures completeness by row.
tbl_a_lab: String to be used to label tbl_a in the plot.
tbl_b_lab: String to be used to label tbl_b in the plot.
Plot showing densities of completeness across both datasets.
Other measures of completeness: assess_completeness(), completeness_heatmap(), plot_completeness(), row_completeness(), variable_completeness()
data(example_data)
compare_completeness(example_data, strings_to_NA(example_data), dim = 2,
                     "raw", "cleaned")
Used to quantify the amount of information loss, if any, which has occurred in a merging procedure between two discrete variables.
compare_info_content(input1, input2, composite)
input1: Character vector. First variable to compare.
input2: Character vector. Second variable to compare.
composite: Character vector. Composite variable, resulting from merging input1 and input2.
The function requires the two discrete variables which have been merged (input1 and input2) and the composite variable (composite). For each input, information content is calculated using information_content_discrete, along with each input's mutual information content with the composite variable using mi_content_discrete. The function returns a table describing these measures.
If the mutual information content between an input variable and the composite variable is equal to the information content of the input variable, it is confirmed that all information in the input variable has been incorporated into the composite variable. However, if one or both input variables' information content is not equal to their mutual information with the composite variables, information loss has occurred.
Table containing information content for input1 and input2 and their mutual information content with composite.
data(example_data)
require(dplyr)
require(magrittr)

example_data %>%
  mutate(diabetes_merged = coalesce(diabetes_type, diabetes)) %>%
  select(starts_with("diabetes")) -> merged_data

compare_info_content(merged_data$diabetes,
                     merged_data$diabetes_type,
                     merged_data$diabetes_merged)
This function requires the output from compare_info_content. It is used to visualise the amount of information loss, if any, which has occurred in a merging procedure between two discrete variables.
compare_info_content_plt(compare_info_content_res)
compare_info_content_res: Output from compare_info_content.
If the mutual information content between an input variable and the composite variable is equal to the information content of the input variable, it is confirmed that all information in the input variable has been incorporated into the composite variable.
Plot of measures calculated in compare_info_content.
data(example_data)
require(dplyr)
require(magrittr)

example_data %>%
  mutate(diabetes_merged = coalesce(diabetes_type, diabetes)) %>%
  select(starts_with("diabetes")) -> merged_data

compare_info_content(merged_data$diabetes,
                     merged_data$diabetes_type,
                     merged_data$diabetes_merged) %>%
  compare_info_content_plt()
Produces a heatmap visualising completeness across a dataset.
completeness_heatmap(data, id_var, annotation_tbl = NULL, method = 1, show_rownames = FALSE, ...)
data: Data frame to be analysed.
id_var: Character constant of row identifier variable name.
annotation_tbl: Data frame containing variable annotation data. Column 1 should contain variable names; column 2 should contain an annotation label.
method: Integer between 1 and 3. Default: 1. See Details for more information.
show_rownames: Boolean. Should row names be shown? Default: FALSE.
...: Parameters to be passed to pheatmap.
Method 1: Missing values are numerically encoded with a highly negative number, numerically distant from all values in data, using distant_neg_val. Values in categorical variables are replaced with the number of unique values in the variable. Clustering uses these values. Cells are coloured by presence (yellow = missing; blue = present).

Method 2: Same as Method 1, but cells are coloured by the values used to cluster.

Method 3: Values in data are encoded as Boolean values for clustering (present values = 1; missing values = 0). Cells are coloured by presence (yellow = missing; blue = present).
completeness heatmap
See the examples for how to plot using plot.new(); this ensures a new plot is created for the heatmap.
Kolde R (2019). _pheatmap: Pretty Heatmaps_. R package version 1.0.12, <https://CRAN.R-project.org/package=pheatmap>.
Other measures of completeness: assess_completeness(), compare_completeness(), plot_completeness(), row_completeness(), variable_completeness()
data(example_data)

# heatmap without variable category annotations:
hm <- completeness_heatmap(example_data, patient_id)
plot.new() # ensure new plot is created
hm

# heatmap with variable category annotations:
## create a dataframe containing variable annotations
tibble::tribble(~"var", ~"datatype",
                "patient_id", "id",
                "tumoursize", "numeric",
                "t_stage", "ordinal_tstage",
                "n_stage", "ordinal_nstage",
                "diabetes", "factor",
                "diabetes_type", "ordinal",
                "hypertension", "factor",
                "rural_urban", "factor",
                "marital_status", "factor",
                "SNP_a", "genotype",
                "SNP_b", "genotype",
                "free_text", "freetext") -> data_types

hm <- completeness_heatmap(example_data, patient_id, annotation_tbl = data_types)
plot.new() # ensure new plot is created
hm
Performs comparison of variables before and after a change has been applied in order to allow manual inspection and review of modifications made during the dataset preparation process.
count_compare(cols2compare, before_tbl = NULL, after_tbl = NULL, only_diff = FALSE, kableout = TRUE, caption = NULL, latex_wrap = FALSE)
cols2compare: Variables to compare between tables.
before_tbl: Data frame from before the modification was made.
after_tbl: Data frame from after the modification was made.
only_diff: Keep only rows which differ between the tables (good for variables with many unique values, such as numeric variables).
kableout: Should output be a kable? (Default: TRUE)
caption: Caption for the kable output.
latex_wrap: Should tables be aligned vertically rather than horizontally? Useful for a wide table which would otherwise run off a page in LaTeX format.
The purpose of this function is to summarise individual alterations in a dataset; it works best with categorical variables. The output contains two tables derived from the parameters before_tbl and after_tbl. Each table shows the unique combinations of values in the variables specified in the parameter cols2compare, if the variable is present. The tables are presented as two sub-tables and therefore share a single table caption. This caption is automatically generated, describing the content of the two sub-tables, when the parameter caption is not specified. The default output is a kable containing two sub-kables; however, if the parameter kableout is FALSE, a list containing the two tibbles is returned. This may be preferable for further analysis of the tables' contents.
Returns a list of two tibbles or a kable (see the kableout argument), each tallying unique values in the specified columns in each input table.
# merge data as the example modification
example_data_merged <- merge_cols(example_data, diabetes_type, diabetes,
                                  "diabetes_merged", rm_in_vars = TRUE)

# review the differences between the input and output of the variable merging
# step above:
count_compare(before_tbl = example_data,
              after_tbl = example_data_merged,
              cols2compare = c("diabetes", "diabetes_type", "diabetes_merged"),
              kableout = FALSE)
Compute mutual information between all rows of a matrix containing discrete outcomes.
discrete.mi(mat, progress.bar = FALSE)
mat: A matrix of discrete values.
progress.bar: Outputs status to the terminal when set to 'text'; no status updates are output when set to FALSE (default).
Note that only the lower triangle of the matrix is populated, for speed, as the result is symmetric.

A lower triangular matrix where element [i,j] contains the mutual information in bits between row i and row j of the input matrix.
Alexander Lyulph Robert Lubbock, Ian Overton
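A small sketch of the expected behaviour (example matrix invented here; assumes eHDPrep is installed):

```r
library(eHDPrep)

# three discrete variables, one per row of the matrix
mat <- rbind(x = c("a", "a", "b", "b"),
             y = c("a", "a", "b", "b"),  # duplicates x
             z = c("a", "b", "a", "b"))  # independent of x

mi <- discrete.mi(mat)

# only the lower triangle is populated: element [2, 1] holds the mutual
# information between rows y and x, which should equal the entropy of x
# (1 bit) since y duplicates x; element [3, 1] (z vs x) should be ~0 bits
mi[2, 1]
mi[3, 1]
```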
Returns a numeric value which is distant from the values in data.
distant_neg_val(data)
data: Data frame.
Numeric vector of length 1
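A brief usage sketch (invented data frame; assumes eHDPrep is installed):

```r
library(eHDPrep)

vals <- data.frame(x = c(-5, 0, 12), y = c(3, 7, NA))

# returns a single negative number well below every value in `vals`,
# suitable as a missing-value placeholder when clustering
# (see completeness_heatmap, method 1)
distant_neg_val(vals)
```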
An edge table, as a data frame, is converted to a directed tidygraph graph. Column 1 of the edge table is interpreted as a "from" column, column 2 is interpreted as a "to" column, and any further columns are interpreted as attributes of the entity/node recorded in column 1. Incomplete cases (rows) are removed from the edge table to avoid redundancy.
edge_tbl_to_graph(edge_tbl)
edge_tbl: Data frame containing 'from' nodes in column 1 and 'to' nodes in column 2, so that all nodes go 'towards' the root node.
tidygraph representation of the edge table.
# basic edge table
edge_tbl <- tibble::tribble(~from, ~to,
                            "Nstage", "TNM",
                            "Tstage", "TNM",
                            "Tumoursize", "property_of_tumour",
                            "Tstage", "property_of_tumour",
                            "property_of_tumour", "property_of_cancer",
                            "TNM", "property_of_cancer",
                            "property_of_cancer", "disease",
                            "disease", "root",
                            "root", NA)
graph <- edge_tbl_to_graph(edge_tbl)
graph
plot(graph)

# edge table with node attributes
## note that root node is included in final row to include its label
edge_tbl <- tibble::tribble(~from, ~to, ~label,
                            "Nstage", "TNM", "N stage",
                            "Tstage", "TNM", "T stage",
                            "Tumoursize", "property_of_tumour", "Tumour size",
                            "Tstage", "property_of_tumour", "T stage",
                            "property_of_tumour", "property_of_cancer", "Property of tumour",
                            "TNM", "property_of_cancer", "TNM",
                            "property_of_cancer", "disease", "Property of cancer",
                            "disease", "root", "Disease",
                            "root", NA, "Ontology Root")
graph <- edge_tbl_to_graph(edge_tbl)
graph
plot(graph)
Converts all columns to numeric and uses the row identifier column (id_var) as row names.
encode_as_num_mat(data, id_var)
data: A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr).
id_var: An unquoted expression which corresponds to a variable in data.
Numeric matrix with id_var values as row names.
require(dplyr)
require(magrittr)

mtcars %>%
  dplyr::as_tibble(rownames = "id") %>%
  encode_as_num_mat(id)
In a character vector, converts binary categories to factor levels.
encode_bin_cat_vec(x, values = NULL, numeric_out = FALSE)
x: Non-numeric input vector.
values: Optional named vector of user-defined values for binary values, using negative_finding = positive_finding syntax.
numeric_out: If TRUE, a numeric vector is returned. If FALSE, a factor is returned.
Binary categories to convert can be specified with a named character vector, specified in values. The syntax of the named vector is: negative_finding = positive_finding. If values is not provided, the default list will be used: "No" = "Yes", "No/unknown" = "Yes", "no/unknown" = "Yes", "Non-user" = "User", "Never" = "Ever", "WT" = "MT".
Factor with the false finding encoded as 1 and the true finding encoded as 2. Alternatively, a numeric vector if the numeric_out parameter is TRUE.
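A short sketch of the expected behaviour (assumes eHDPrep is installed; the custom rural/urban pair is illustrative):

```r
library(eHDPrep)

x <- c("Yes", "No", "No", "Yes")

# the default value pairs include "No" = "Yes"
encode_bin_cat_vec(x)                      # factor ordered "No" < "Yes"
encode_bin_cat_vec(x, numeric_out = TRUE)  # numeric form (1 = No, 2 = Yes)

# user-defined pair, negative_finding = positive_finding syntax
encode_bin_cat_vec(c("rural", "urban"), values = c("rural" = "urban"))
```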
In a data frame, converts binary categories to factors. Ordering of levels is standardised to: negative_finding, positive_finding. This embeds a standardised numeric relationship between the binary categories while preserving value labels.
encode_binary_cats(data, ..., values = NULL)
data: A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr).
...: <tidy-select> Variables (columns) to encode.
values: Optional named vector of user-defined values for binary values, using negative_finding = positive_finding syntax.
Binary categories to convert can be specified with a named character vector, specified in values. The syntax of the named vector is: negative_finding = positive_finding. If values is not provided, the default list will be used: "No" = "Yes", "No/unknown" = "Yes", "no/unknown" = "Yes", "Non-user" = "User", "Never" = "Ever", "WT" = "MT".
dataset with specified binary categories converted to factors.
# use built-in values. Note: rural_urban is not modified
# Note: diabetes is not modified because "missing" is interpreted as a third
# category. strings_to_NA() should be applied first
encode_binary_cats(example_data, hypertension, rural_urban)

# use custom values. Note: rural_urban is now modified as well.
encoded_data <- encode_binary_cats(example_data, hypertension, rural_urban,
                                   values = c("No" = "Yes", "rural" = "urban"))

# to demonstrate the new numeric encoding:
dplyr::mutate(encoded_data, hypertension_num = as.numeric(hypertension), .keep = "used")
Variables specified in ... are replaced with new variables describing the presence of each unique category. Generated variable names have space characters replaced with "_" and commas removed.
encode_cats(data, ...)
data: A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr).
...: <tidy-select> Variables (columns) to encode.
Tibble with converted variables.
require(magrittr)
require(dplyr)
data(example_data)

# encode one variable
encode_cats(example_data, marital_status) %>%
  select(starts_with("marital_status"))

# encode multiple variables
encoded <- encode_cats(example_data, diabetes, marital_status)
select(encoded, starts_with("marital_status"))

# diabetes_type included below but was not modified:
select(encoded, starts_with("diabetes"))
Standardises homozygous SNP alleles (e.g. recorded as 'A') to two-character form (e.g. 'A/A') and orders heterozygous SNP alleles alphabetically (e.g. "GA" becomes "A/G"). The SNP values are then converted from a character vector to an ordered factor, ordered by SNP allele frequency (e.g. the most frequent SNP allele is 1, the second most frequent is 2, and the least frequent is 3). This method embeds the numeric relationship between the SNP allele frequencies while preserving value labels.
encode_genotype_vec(x)
x: Input vector containing genotype data.
Ordered factor, ordered by allele frequency in the variable.
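A short sketch (invented genotype vector; assumes eHDPrep is installed):

```r
library(eHDPrep)

# mixed single-character homozygotes and unordered heterozygotes
snp <- c("GA", "A", "G/G", "AG", "A", "GG")

# "A" should become "A/A", "GA"/"AG" should become "A/G", and the result
# is an ordered factor whose level order follows allele frequency
encode_genotype_vec(snp)
```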
Standardises homozygous SNPs (e.g. recorded as "A") to two-character form (e.g. "A/A") and orders heterozygous SNPs alphabetically (e.g. "GA" becomes "A/G"). The SNP values are then converted from a character vector to an ordered factor, ordered by observed allele frequency (in the supplied cohort). The most frequent allele is assigned level 1, the second most frequent is assigned level 2, and the least frequent is assigned level 3. This method embeds the numeric relationship between the allele frequencies while preserving value labels.
encode_genotypes(data, ...)
data: A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr).
...: <tidy-select> Variables (columns) to encode.
'data' with variables (...) encoded as standardised genotypes.
data(example_data)
require(dplyr)
require(magrittr)

# one variable
encode_genotypes(example_data, SNP_a) %>% select(SNP_a)

# multiple variables
encode_genotypes(example_data, SNP_a, SNP_b) %>% select(SNP_a, SNP_b)

# using tidyselect helpers
encode_genotypes(example_data, dplyr::starts_with("SNP")) %>%
  select(starts_with("SNP"))
Converts character or factor variables in the input data frame to ordered factors, embedding the numeric relationship between values while preserving value labels.
encode_ordinals(data, ord_levels, ..., strict_levels = TRUE)
encode_ordinals(data, ord_levels, ..., strict_levels = TRUE)
data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). |
ord_levels |
character vector containing values in desired order (lowest to highest). |
... |
< |
strict_levels |
logical constant. If |
dataframe with specified variables encoded as ordered factors.
data(example_data) require(dplyr) require(magrittr) encode_ordinals(example_data, ord_levels = c("N0","N1","N2"), n_stage) # Note: "unequivocal" is present in t_stage but not in `ord_levels`. # with `strict_levels` TRUE, t_stage is unmodified and a warning message is given: encode_ordinals(example_data, ord_levels = c("T1","T2","T3a", "T3b", "T4"), strict_levels = TRUE, t_stage) %>% select(t_stage) # with `strict_levels` FALSE, it is replaced with NA: encode_ordinals(example_data, ord_levels = c("T1","T2","T3a", "T3b", "T4"), strict_levels = FALSE, t_stage) %>% select(t_stage)
data(example_data) require(dplyr) require(magrittr) encode_ordinals(example_data, ord_levels = c("N0","N1","N2"), n_stage) # Note: "unequivocal" is present in t_stage but not in `ord_levels`. # with `strict_levels` TRUE, t_stage is unmodified and a warning message is given: encode_ordinals(example_data, ord_levels = c("T1","T2","T3a", "T3b", "T4"), strict_levels = TRUE, t_stage) %>% select(t_stage) # with `strict_levels` FALSE, it is replaced with NA: encode_ordinals(example_data, ord_levels = c("T1","T2","T3a", "T3b", "T4"), strict_levels = FALSE, t_stage) %>% select(t_stage)
Calculates Shannon Entropy of a vector in bits (default) or natural units. Missing values are omitted from the calculation.
entropy(x, unit = c("bits"))
entropy(x, unit = c("bits"))
x |
Input vector |
unit |
Unit to measure entropy. Either "bits" (default) or "nats". |
Entropy of input variable
Shannon, C. E. A mathematical theory of communication. The Bell System Technical Journal 27, 379–423 (1948).
# no entropy: vec <- c(1,1,1,1,1,1) entropy(vec) # entropy vec <- c(1,2,3,4,5,6) entropy(vec)
# no entropy: vec <- c(1,1,1,1,1,1) entropy(vec) # entropy vec <- c(1,2,3,4,5,6) entropy(vec)
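The calculation behind the two examples above can be sketched in base R over empirical value frequencies. This is an illustrative reimplementation under that assumption, not eHDPrep's internals; use entropy() itself in practice:

```r
# Illustrative Shannon entropy over observed value frequencies
# (a sketch of the behaviour described above, not eHDPrep's internals).
shannon_entropy <- function(x, unit = "bits") {
  x <- x[!is.na(x)]                 # missing values omitted
  p <- table(x) / length(x)         # empirical probabilities
  log_fn <- if (unit == "bits") log2 else log
  -sum(p * log_fn(p))
}

shannon_entropy(c(1, 1, 1, 1, 1, 1))  # 0: a single repeated value carries no information
shannon_entropy(c(1, 2, 3, 4, 5, 6))  # log2(6), about 2.58 bits: all values equally likely
```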
Calculates KDE for a set of points exactly, rather than an approximation as per the density() core function.
exact.kde(x, bw, output.domain = x, na.rm = FALSE)
exact.kde(x, bw, output.domain = x, na.rm = FALSE)
x |
A numeric vector of values |
bw |
The bandwidth to use - either a single value, or a vector of
values the same length as |
output.domain |
The domain of values over which to estimate the
density. Defaults to |
na.rm |
Remove missing values if |
Only tractable for around 10,000 data points or fewer; otherwise consider using the density() core function for a close approximation.
The density() core function is normally a very good approximation, but some small values close to zero may become exactly zero rather than just very small. This makes it less suitable for mutual information estimation.
The exact kernel density estimate as a density
object,
compatible with R's density
function.
Alexander Lyulph Robert Lubbock, Ian Overton
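For intuition, an exact Gaussian KDE can be written directly as a mean of kernels centred on the data. The sketch below (function name exact_kde_sketch is hypothetical) illustrates the idea only; exact.kde itself additionally returns a density-compatible object and supports per-point bandwidths:

```r
# Minimal exact Gaussian KDE: the density at each output point is the mean
# of Gaussian kernels centred on the data points (sketch of the idea only).
exact_kde_sketch <- function(x, bw, output.domain = x, na.rm = FALSE) {
  if (na.rm) x <- x[!is.na(x)]
  sapply(output.domain, function(x0) mean(dnorm((x0 - x) / bw) / bw))
}

set.seed(1)
x <- rnorm(100)
y <- exact_kde_sketch(x, bw = 0.3, output.domain = seq(-3, 3, length.out = 50))
```

Unlike density(), nothing here is binned or approximated, which is why the cost grows with length(x) * length(output.domain).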
A dataset containing synthetic example values to demonstrate functionality of 'eHDPrep'
example_data
example_data
A data frame with 1,000 rows and 12 variables:
1 to 1000; effectively row numbers
double. Random values with a mean of 50 and SD of 20
character. T stage random values
character. N stage random values
character. Patient diabetes category
character. Patient diabetes type category
character. Patient hypertension category
character. Patient domestic address category
character. Patient marital status category
character. Single Nucleotide Polymorphism (SNP) of the patient
character. Another SNP of the patient
character. Sentences from the 'stringr' package as an example of short free-text variables
synthetic
A data frame describing semantic links (edges) between entities in 'example_ontology'. Used to demonstrate semantic enrichment.
example_edge_tbl
example_edge_tbl
A data frame:
character. Names of semantic concepts which have a directed relationship to concepts in 'to' column.
character. Names of semantic concepts which have a directed relationship from concepts in the 'from' column.
Used in documentation and creation of 'example_ontology' in 'eHDPrep'.
synthetic
A data frame containing mappings between variables in 'example_data' and 'example_ontology'. Used to demonstrate semantic enrichment.
example_mapping_file
example_mapping_file
A data frame:
character. names of variables in post-QC 'example_data'.
character. names of mapped entities in 'example_ontology'.
Maps variables in 'example_data' to 'example_ontology' in 'eHDPrep'.
synthetic
A small custom network graph to demonstrate semantic enrichment.
example_ontology
example_ontology
tidygraph graph
Contains semantic links of variables in 'eHDPrep's 'example_data' following quality control.
synthetic
Save dataset in .csv or .tsv format. A wrapper function for readr
's
write_csv
and write_tsv
.
export_dataset(x, file, format = "csv", ...)
export_dataset(x, file, format = "csv", ...)
x |
A data frame or tibble to write to disk. |
file |
File or connection to write to. |
format |
Character constant. "csv" (default) or "tsv" |
... |
x
saved to file
in selected format
Other import to/export from 'R' functions:
import_dataset()
data(example_data) tmp = tempfile(fileext = ".csv") export_dataset(example_data, tmp)
data(example_data) tmp = tempfile(fileext = ".csv") export_dataset(example_data, tmp)
Extracts information from specified free text variables (...
) which
occur in a minimum amount of rows (min_freq
) and appends new variables
to data
.
extract_freetext(data, id_var, min_freq = 1, ...)
extract_freetext(data, id_var, min_freq = 1, ...)
data |
Data frame to append skipgram variables to. |
id_var |
An unquoted expression which corresponds to a variable in
|
min_freq |
Minimum percentage frequency of skipgram occurrence to return. Default = 1. |
... |
Unquoted expressions of free text variable names from which to extract information. |
New variables report the presence of skipgrams (proximal words in the text)
with a minimum frequency (min_freq
, default = 1%).
data
with additional Boolean variables describing skipgrams in
...
Guthrie, D., Allison, B., Liu, W., Guthrie, L. & Wilks, Y. A Closer Look at Skip-gram Modelling. in Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06) (European Language Resources Association (ELRA), 2006).
Benoit K, Watanabe K, Wang H, Nulty P, Obeng A, Müller S, Matsuo A (2018). “quanteda: An R package for the quantitative analysis of textual data.” _Journal of Open Source Software_, *3*(30), 774. doi:10.21105/joss.00774 <https://doi.org/10.21105/joss.00774>, <https://quanteda.io>.
Feinerer I, Hornik K (2020). _tm: Text Mining Package_. R package version 0.7-8, <https://CRAN.R-project.org/package=tm>.
Ingo Feinerer, Kurt Hornik, and David Meyer (2008). Text Mining Infrastructure in R. Journal of Statistical Software 25(5): 1-54. URL: https://www.jstatsoft.org/v25/i05/.
Principal underlying function: tokens_ngrams
Other free text functions:
skipgram_append()
,
skipgram_freq()
,
skipgram_identify()
data(example_data) extract_freetext(example_data, patient_id, min_freq = 0.6, free_text)
data(example_data) extract_freetext(example_data, patient_id, min_freq = 0.6, free_text)
Tests pairs of variables for consistency between their values according to a table of rules or 'consistency table'.
identify_inconsistency(data = NULL, consis_tbl = NULL, id_var = NULL)
identify_inconsistency(data = NULL, consis_tbl = NULL, id_var = NULL)
data |
data frame which will be checked for internal consistency |
consis_tbl |
data frame or tibble containing information on internal consistency rules (see "Consistency Table Requirements" section) |
id_var |
An unquoted expression which corresponds to a variable in
|
Multiple types of checks for inconsistency are supported:
1. Comparing by logical operators (<, <=, ==, !=, >=, >)
2. Comparing permitted categories (e.g. cat1 in varA only if cat2 in varB)
3. Comparing permitted numeric ranges (e.g. 20-25 in varC only if 10-20 in varD)
4. Mixtures of 2 and 3 (e.g. cat1 in varA only if 20-25 in varC)
The consistency tests rely on such rules being specified in a
separate data frame (consis_tbl
; see section "Consistency Table Requirements").
Variable A is given higher priority than variable B when A is a category. If A (as a character value) does not equal the value in column 4, the check is not made. This accounts for one-way dependencies (e.g. varA is fruit, varB is apple).
tibble detailing any identified internal inconsistencies in
data
, if any are found. If no inconsistencies are found, data
is returned invisibly.
Table must have exactly five character columns. The columns should be ordered according to the list below which describes the values of each column:
1. Name of the first variable (column) whose data values will be subject to consistency checking. String. Required.
2. Name of the second variable (column) whose data values will be subject to consistency checking. String. Required.
3. Logical test to compare columns one and two. One of: ">", ">=", "<", "<=", "==", "!=". String. Optional if columns 4 and 5 have non-NA values.
4. Either a single character string or a colon-separated range of numbers which should only appear in column A. Optional if column 3 has a non-NA value.
5. Either a single character string or a colon-separated range of numbers which should only appear in column B given the value/range specified in column 4. Optional if column 3 has a non-NA value.
Each row should detail one test to make.
Therefore, either column 3 or columns 4 and 5 must contain non-NA
values.
Other internal consistency functions:
validate_consistency_tbl()
require(tibble) # example with synthetic dataset on number of bean counts # there is a lot going on in the function so a simple dataset aids this example # # creating `data`: beans <- tibble::tibble(red_beans = 1:15, blue_beans = 1:15, total_beans = 1:15*2, red_bean_summary = c(rep("few_beans",9), rep("many_beans",6))) # # creating `consis_tbl` bean_rules <- tibble::tribble(~varA, ~varB, ~lgl_test, ~varA_boundaries, ~varB_boundaries, "red_beans", "blue_beans", "==", NA, NA, "red_beans", "total_beans", "<=", NA,NA, "red_beans", "red_bean_summary", NA, "1:9", "few_beans", "red_beans", "red_bean_summary", NA, "10:15", "many_beans") identify_inconsistency(beans, bean_rules) # creating some inconsistencies as examples beans[1, "red_bean_summary"] <- "many_beans" beans[1, "red_beans"] <- 10 identify_inconsistency(beans, bean_rules)
require(tibble) # example with synthetic dataset on number of bean counts # there is a lot going on in the function so a simple dataset aids this example # # creating `data`: beans <- tibble::tibble(red_beans = 1:15, blue_beans = 1:15, total_beans = 1:15*2, red_bean_summary = c(rep("few_beans",9), rep("many_beans",6))) # # creating `consis_tbl` bean_rules <- tibble::tribble(~varA, ~varB, ~lgl_test, ~varA_boundaries, ~varB_boundaries, "red_beans", "blue_beans", "==", NA, NA, "red_beans", "total_beans", "<=", NA,NA, "red_beans", "red_bean_summary", NA, "1:9", "few_beans", "red_beans", "red_bean_summary", NA, "10:15", "many_beans") identify_inconsistency(beans, bean_rules) # creating some inconsistencies as examples beans[1, "red_bean_summary"] <- "many_beans" beans[1, "red_beans"] <- 10 identify_inconsistency(beans, bean_rules)
Imports a rectangular single table into R from a .xls, .xlsx, .csv, or .tsv file.
import_dataset(file, format = "excel", ...)
import_dataset(file, format = "excel", ...)
file |
Character constant. Path to file. |
format |
Character constant. "excel" (default, for .xls or .xlsx files), "csv", or "tsv". |
... |
Parameters to pass to |
First row is interpreted as column headers by default. For more details see
read_excel
(.xlsx/.xls), read_csv
(.csv), or
read_tsv
(.tsv).
data as a tibble
read_excel
for additional parameters for
importing .xls or .xlsx files, read_csv
for .csv
files, read_tsv
for .tsv files
Other import to/export from 'R' functions:
export_dataset()
## Not run: # This code will not run as it requires an xlsx file # ./dataset.xlsx should be replaced with path to user's dataset # excel import_dataset(file = "./dataset.xlsx", format = "excel") #csv import_dataset(file = "./dataset.csv", format = "csv") #tsv import_dataset(file = "./dataset.tsv", format = "tsv") ## End(Not run)
## Not run: # This code will not run as it requires an xlsx file # ./dataset.xlsx should be replaced with path to user's dataset # excel import_dataset(file = "./dataset.xlsx", format = "excel") #csv import_dataset(file = "./dataset.csv", format = "csv") #tsv import_dataset(file = "./dataset.tsv", format = "tsv") ## End(Not run)
Reads in output of assume_var_classes
, ensures all specified
datatypes are one of ("id", "numeric", "double", "integer", "character",
"factor","ordinal", "genotype", "freetext", "logical") as required for high
level 'eHDPrep' functions.
import_var_classes(file = "./datatypes.csv")
import_var_classes(file = "./datatypes.csv")
file |
character string. Path to output of
|
data frame containing the data type values of variables, as described
in file
tmp = tempfile(fileext = ".csv") data(example_data) assume_var_classes(example_data, tmp) import_var_classes(tmp)
tmp = tempfile(fileext = ".csv") data(example_data) assume_var_classes(example_data, tmp) import_var_classes(tmp)
Calculates information content of a continuous (numeric) vector in bits (default) or natural units. Missing values are omitted from the calculation.
information_content_contin(x, unit = c("bits"))
information_content_contin(x, unit = c("bits"))
x |
Input vector |
unit |
Unit to measure entropy. Either "bits" (default) or "nats". |
Information content of input variable
data(example_data) information_content_contin(example_data$tumoursize)
data(example_data) information_content_contin(example_data$tumoursize)
Calculates information content of a discrete (categorical or ordinal) vector in bits (default) or natural units. Missing values are omitted from the calculation.
information_content_discrete(x, unit = c("bits"))
information_content_discrete(x, unit = c("bits"))
x |
Input vector |
unit |
Unit to measure entropy. Either "bits" (default) or "nats". |
Information content of input variable
data(example_data) information_content_discrete(example_data$marital_status)
data(example_data) information_content_discrete(example_data$marital_status)
This function creates new nodes representing dataset variables and joins them
to an input ontology network using a mapping file. Prior to joining, the
information content of all nodes is calculated using node_IC_zhou
.
join_vars_to_ontol(ontol_graph, var2entity_tbl, mode = "in", root, k = 0.5)
join_vars_to_ontol(ontol_graph, var2entity_tbl, mode = "in", root, k = 0.5)
ontol_graph |
Graph containing the chosen ontology. Must be in
|
var2entity_tbl |
Edge table containing dataset variable names in first column and entities in ontologies to which they are mapped in the second column. |
mode |
Character constant specifying the directionality of the edges. One of "in" or "out". |
root |
name of root node identifier in column 1 to calculate node depth from. |
k |
numeric value to adjust the weight of the two items of information content equation (relative number of hyponyms/descendants and relative node depth). Default = 0.5 |
The user-defined mappings between variables in a dataset and
entities/terms in an ontology are provided in an edge table
(var2entity_tbl
).
A node attribute column, node_category
is
generated to describe if a node is one of "Dataset Variable", "Annotation", or
"Annotation Ancestor".
A tidygraph
resulting from the joining of var2entity_tbl
and ontol_graph
.
node_IC_zhou
Other semantic enrichment functions:
metavariable_agg()
,
metavariable_info()
,
metavariable_variable_descendants()
data(example_ontology) join_vars_to_ontol(example_ontology, example_mapping_file, root = "root", mode = "in")
data(example_ontology) join_vars_to_ontol(example_ontology, example_mapping_file, root = "root", mode = "in")
This low-level function is deployed as part of the semantic enrichment process. Calculates the maximum of values in a numeric vector (ignoring NAs). If all values in the input vector are NA, returns NA (rather than -Inf).
max_catchNAs(x)
max_catchNAs(x)
x |
numeric vector |
maximum value of x
This low-level function is deployed as part of the semantic enrichment process. Averages values in a numeric vector (ignoring NAs). If all values in the vector are NA, returns NA (rather than NaN).
mean_catchNAs(x)
mean_catchNAs(x)
x |
numeric vector |
mean of x
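The NA-catching pattern shared by the *_catchNAs helpers (max_catchNAs, mean_catchNAs, min_catchNAs) can be sketched as below. This is an illustration of the pattern, not the package internals; mean_catch_na is a hypothetical name:

```r
# Pattern behind the *_catchNAs helpers: if every value is NA, return NA
# rather than the NaN / -Inf / Inf that mean()/max()/min() produce with
# na.rm = TRUE on an all-NA vector.
mean_catch_na <- function(x) {
  if (all(is.na(x))) NA_real_ else mean(x, na.rm = TRUE)
}

mean_catch_na(c(1, 2, NA))  # 1.5
mean_catch_na(c(NA, NA))    # NA (mean(c(NA, NA), na.rm = TRUE) would give NaN)
```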
Merges two columns in a single data frame. The merging draws on the
functionality of 'dplyr'
's coalesce
where missing
values from one vector are replaced by corresponding values in a second
variable. The name of the merged variable is specified in
merge_var_name
. primary_var
and secondary_var
can be
removed with rm_in_vars
. Variables must be combinable (i.e. not a
combination of numeric and character).
merge_cols( data, primary_var, secondary_var, merge_var_name = NULL, rm_in_vars = FALSE )
merge_cols( data, primary_var, secondary_var, merge_var_name = NULL, rm_in_vars = FALSE )
data |
data frame containing |
primary_var |
Data variable
which contains the best quality / most detailed information. Missing values
will be supplied by values in corresponding rows from |
secondary_var |
Data variable
which will be used to fill missing values in |
merge_var_name |
character constant. Name for merged variable. Default:
[ |
rm_in_vars |
logical constant. Should |
data frame with coalesced primary_var
and secondary_var
data(example_data) # preserve input variables (default) res <- merge_cols(example_data, diabetes_type, diabetes) dplyr::select(res, dplyr::starts_with("diabetes")) # remove input variables res <- merge_cols(example_data, diabetes_type, diabetes, rm_in_vars = TRUE) dplyr::select(res, dplyr::starts_with("diabetes"))
data(example_data) # preserve input variables (default) res <- merge_cols(example_data, diabetes_type, diabetes) dplyr::select(res, dplyr::starts_with("diabetes")) # remove input variables res <- merge_cols(example_data, diabetes_type, diabetes, rm_in_vars = TRUE) dplyr::select(res, dplyr::starts_with("diabetes"))
Variables in a numeric data frame are aggregated into metavariables via
their most informative common ancestors identified in an ontological graph
object (see metavariable_info
). Metavariables are appended to
the data frame.
metavariable_agg(graph, data, label_attr = "name", normalize_vals = TRUE)
metavariable_agg(graph, data, label_attr = "name", normalize_vals = TRUE)
graph |
Graph containing ontological and dataset nodes. Must be in
|
data |
Numeric data frame or matrix containing variables which are also
in |
label_attr |
Node attribute containing labels used for column names when creating metavariable aggregations. Default: "name" |
normalize_vals |
Should values be normalized before aggregation? Default: TRUE |
Metavariables are created from the aggregation of data variables via their
most informative common ancestor (expected to have been calculated in
metavariable_info
). Metavariables are labelled using the
syntax: MV_[label_attr]_[Aggregation function]
. The data variables are
aggregated row-wise by their maximum, minimum, mean, sum, and product.
Metavariables with zero entropy (no information) are not appended to the
data. See examples for where this function should be applied in the semantic
enrichment workflow.
data
with semantic aggregations derived from common
ontological ancestry (metavariables) appended as new columns, each
prefixed with "MV_" and suffixed by their aggregation function (e.g. "_SUM").
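The row-wise aggregations described above can be sketched for one hypothetical metavariable ("example") whose descendant variables are var_a and var_b. This is an illustration only; metavariable_agg() derives the variable groupings from the ontology and applies normalization and entropy filtering as described:

```r
# Row-wise aggregation of two descendant variables into MV_* columns
# (sketch of the aggregation step; variable names are hypothetical).
df   <- data.frame(var_a = c(0.2, 0.5, 1.0), var_b = c(0.4, 0.1, 0.9))
vars <- as.matrix(df)
mv <- data.frame(
  MV_example_MAX  = apply(vars, 1, max),
  MV_example_MIN  = apply(vars, 1, min),
  MV_example_MEAN = rowMeans(vars),
  MV_example_SUM  = rowSums(vars),
  MV_example_PROD = apply(vars, 1, prod)
)
cbind(df, mv)  # metavariable columns appended to the data
```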
A warning may be shown regarding the '.add' argument being deprecated,
this is believed to be an issue with
tidygraph
which may be resolved in a
future release: <https://github.com/thomasp85/tidygraph/issues/131>.
Another warning may be shown regarding the 'neimode' argument being
deprecated, this is believed to be an issue with
tidygraph
which may be resolved in a
future release: <https://github.com/thomasp85/tidygraph/issues/156>. These
warning messages are not believed to have an effect on the functionality of
'eHDPrep'.
Other semantic enrichment functions:
join_vars_to_ontol()
,
metavariable_info()
,
metavariable_variable_descendants()
require(magrittr) require(dplyr) data(example_ontology) data(example_mapping_file) data(example_data) # define datatypes tibble::tribble(~"var", ~"datatype", "patient_id", "id", "tumoursize", "numeric", "t_stage", "ordinal_tstage", "n_stage", "ordinal_nstage", "diabetes_merged", "character", "hypertension", "factor", "rural_urban", "factor", "marital_status", "factor", "SNP_a", "genotype", "SNP_b", "genotype", "free_text", "freetext") -> data_types # create post-QC data example_data %>% merge_cols(diabetes_type, diabetes, "diabetes_merged", rm_in_vars = TRUE) %>% apply_quality_ctrl(patient_id, data_types, bin_cats =c("No" = "Yes", "rural" = "urban"), to_numeric_matrix = TRUE) %>% suppressMessages() -> post_qc_data # minimal example on first four columns of example data: dplyr::slice(example_ontology, 1:7,24) %>% join_vars_to_ontol(example_mapping_file[1:3,], root = "root") %>% metavariable_info() %>% metavariable_agg(post_qc_data[1:10,1:4]) -> res # see Note section of documentation for information on possible warnings. # summary of result: tibble::glimpse(res) # full example: example_ontology %>% join_vars_to_ontol(example_mapping_file, root = "root") %>% metavariable_info() %>% metavariable_agg(post_qc_data) -> res # see Note section of documentation for information on possible warnings. # summary of result: tibble::glimpse(res)
require(magrittr) require(dplyr) data(example_ontology) data(example_mapping_file) data(example_data) # define datatypes tibble::tribble(~"var", ~"datatype", "patient_id", "id", "tumoursize", "numeric", "t_stage", "ordinal_tstage", "n_stage", "ordinal_nstage", "diabetes_merged", "character", "hypertension", "factor", "rural_urban", "factor", "marital_status", "factor", "SNP_a", "genotype", "SNP_b", "genotype", "free_text", "freetext") -> data_types # create post-QC data example_data %>% merge_cols(diabetes_type, diabetes, "diabetes_merged", rm_in_vars = TRUE) %>% apply_quality_ctrl(patient_id, data_types, bin_cats =c("No" = "Yes", "rural" = "urban"), to_numeric_matrix = TRUE) %>% suppressMessages() -> post_qc_data # minimal example on first four columns of example data: dplyr::slice(example_ontology, 1:7,24) %>% join_vars_to_ontol(example_mapping_file[1:3,], root = "root") %>% metavariable_info() %>% metavariable_agg(post_qc_data[1:10,1:4]) -> res # see Note section of documentation for information on possible warnings. # summary of result: tibble::glimpse(res) # full example: example_ontology %>% join_vars_to_ontol(example_mapping_file, root = "root") %>% metavariable_info() %>% metavariable_agg(post_qc_data) -> res # see Note section of documentation for information on possible warnings. # summary of result: tibble::glimpse(res)
Calculates attributes for each node in a graph object pertaining to their
suitability and rank as metavariables; primarily if they are the most
informative common ancestor (see node_IC_zhou
) of a set of
nodes representing a dataset variable.
metavariable_info(graph, mode = "in", IC_threshold = 0)
metavariable_info(graph, mode = "in", IC_threshold = 0)
graph |
Graph containing ontological and dataset nodes. Must be in
|
mode |
Character constant specifying the directionality of the edges. One of: "in" or "out". |
IC_threshold |
Metavariables with IC less than this value will be omitted from output. Default = 0 (no omission). |
The added attributes are:
Integer. The minimum distance of an ontology node in the graph to a node representing a dataset variable.
Logical. If the node has at least two descendants in the graph which represent dataset variables.
List. The names of variables of which a node is an ancestor.
Integer. An identifier for the unique set of descendants in the graph which represent dataset variables. The assigned number corresponds to the order in which a unique set was identified when scanning through the node table.
Logical. If the node possesses the highest information
content of all other nodes which are common ancestors of the same variable
set. Information content is expected to have been calculated in
join_vars_to_ontol
.
A modified graph object with additional node attributes pertaining to their status as a metavariable.
Other semantic enrichment functions:
join_vars_to_ontol()
,
metavariable_agg()
,
metavariable_variable_descendants()
data(example_ontology) require(magrittr) example_ontology %>% join_vars_to_ontol(example_mapping_file, root = "root") -> joined_ontol metavariable_info(joined_ontol)
data(example_ontology) require(magrittr) example_ontology %>% join_vars_to_ontol(example_mapping_file, root = "root") -> joined_ontol metavariable_info(joined_ontol)
Formats the output of metavariable_info
for easier
interpretation of each metavariable's descendant variables
metavariable_variable_descendants(metavariable_info_output)
metavariable_variable_descendants(metavariable_info_output)
metavariable_info_output |
Output tibble of
|
Not part of the standard semantic enrichment pipeline as this function just
produces a simplified version of the output of metavariable_info
.
The output of metavariable_info
is converted to a tibble,
filtered to only include metavariables with highest information content for
the variable set. The tibble has three columns describing a metavariable, its
information content, and its descendant variables.
A tibble describing each metavariable, its information content, and its descendant variables
Other semantic enrichment functions:
join_vars_to_ontol()
,
metavariable_agg()
,
metavariable_info()
data(example_ontology) require(magrittr) example_ontology %>% join_vars_to_ontol(example_mapping_file, root = "root") -> joined_ontol mv_info <- metavariable_info(joined_ontol) metavariable_variable_descendants(mv_info)
data(example_ontology) require(magrittr) example_ontology %>% join_vars_to_ontol(example_mapping_file, root = "root") -> joined_ontol mv_info <- metavariable_info(joined_ontol) metavariable_variable_descendants(mv_info)
Calculates mutual information content between two variables in bits. Missing values are omitted from the calculation.
mi_content_discrete(x, y)
mi_content_discrete(x, y)
x |
First variable |
y |
Second variable |
Mutual information content of x
and y
data(example_data) mi_content_discrete(example_data$diabetes, example_data$diabetes_type)
data(example_data) mi_content_discrete(example_data$diabetes, example_data$diabetes_type)
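The quantity being computed can be sketched via the entropy identity I(X;Y) = H(X) + H(Y) - H(X,Y), in bits, with incomplete pairs omitted. This is an illustration of the concept, not eHDPrep's internals; h and mi_sketch are hypothetical names:

```r
# Sketch of discrete mutual information via entropies, in bits:
# I(X;Y) = H(X) + H(Y) - H(X,Y). Pairs with a missing value are omitted.
h <- function(v) {
  p <- table(v) / length(v)   # empirical probabilities
  -sum(p * log2(p))
}
mi_sketch <- function(x, y) {
  ok <- !is.na(x) & !is.na(y)
  x <- x[ok]; y <- y[ok]
  h(x) + h(y) - h(paste(x, y))  # paste() forms the joint distribution
}

x <- c("a", "a", "b", "b")
mi_sketch(x, x)                      # 1 bit: x shares all of its information with itself
mi_sketch(x, c("c", "d", "c", "d"))  # 0 bits: the two variables are independent
```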
This low-level function is deployed as part of the semantic enrichment process. Calculates the minimum of values in a numeric vector (ignoring NAs). If all values in the vector are NA, returns NA (rather than Inf).
min_catchNAs(x)
min_catchNAs(x)
x |
numeric vector |
minimum value of x
This function produces a table where each row represents a value in the cleaned dataset that has been modified. The identifier, the original and modified values, the modification type, and the variable names in the original and modified datasets are recorded.
mod_track(before_tbl, after_tbl, id_var, plot = FALSE, vars2compare)
mod_track(before_tbl, after_tbl, id_var, plot = FALSE, vars2compare)
before_tbl |
Data frame from before modifications were made. |
after_tbl |
Data frame from after modifications were made. |
id_var |
An unquoted expression which corresponds to a variable in both
|
plot |
Should a plot be returned instead of a table of results? Default:
|
vars2compare |
Character vectors of variable names to compare. |
Table containing row-level modification records or plot summarising modifications.
# merge data as the example modification require(magrittr) # example with one modification type (removal) # return table mod_track(example_data, strings_to_NA(example_data), patient_id) # return plot mod_track(example_data, strings_to_NA(example_data), patient_id, plot = TRUE) # example with multiple modification types (removal, substitution and addition) example_data %>% strings_to_NA() %>% merge_cols(diabetes_type, diabetes) -> modded_data # return table mod_track(example_data, modded_data, patient_id, vars2compare = c("t_stage", "diabetes_type_diabetes_merged" = "diabetes", "diabetes_type_diabetes_merged" = "diabetes_type"), plot = FALSE) # return plot mod_track(example_data, modded_data, patient_id, vars2compare = c("t_stage", "diabetes_type_diabetes_merged" = "diabetes", "diabetes_type_diabetes_merged" = "diabetes_type"), plot = TRUE)
# merge data as the example modification require(magrittr) # example with one modification type (removal) # return table mod_track(example_data, strings_to_NA(example_data), patient_id) # return plot mod_track(example_data, strings_to_NA(example_data), patient_id, plot = TRUE) # example with multiple modification types (removal, substitution and addition) example_data %>% strings_to_NA() %>% merge_cols(diabetes_type, diabetes) -> modded_data # return table mod_track(example_data, modded_data, patient_id, vars2compare = c("t_stage", "diabetes_type_diabetes_merged" = "diabetes", "diabetes_type_diabetes_merged" = "diabetes_type"), plot = FALSE) # return plot mod_track(example_data, modded_data, patient_id, vars2compare = c("t_stage", "diabetes_type_diabetes_merged" = "diabetes", "diabetes_type_diabetes_merged" = "diabetes_type"), plot = TRUE)
Computes the information content for each node in a directed graph according to the equation developed by Zhou et al. (2008).
node_IC_zhou(graph, mode = "in", root, k = 0.5)
node_IC_zhou(graph, mode = "in", root, k = 0.5)
graph |
|
mode |
Character constant specifying the directionality of the edges. One of "in" or "out". |
root |
name of root node identifier in column 1 to calculate node depth from. |
k |
numeric value to adjust the weight of the two items of information content equation (relative number of hyponyms/descendants and relative node depth). Default = 0.5 |
tidygraph with additional node attribute "information_content"
For use in semantic enrichment, this should be applied before joining an ontology with nodes representing data variables (i.e. before applying join_vars_to_ontol).
Zhou, Z., Wang, Y. & Gu, J. A New Model of Information Content for Semantic Similarity in WordNet. in 2008 Second International Conference on Future Generation Communication and Networking Symposia vol. 3 85–89 (2008).
data(example_ontology) node_IC_zhou(example_ontology, mode = "in", root = "root")
data(example_ontology) node_IC_zhou(example_ontology, mode = "in", root = "root")
Normalizes values in x
to be between 0 and 1 using min-max
normalization.
normalize(x, na.rm = TRUE)
normalize(x, na.rm = TRUE)
x |
numeric vector |
na.rm |
a logical indicating whether missing values should be removed. Default = TRUE. |
normalized x
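The min-max arithmetic can be sketched directly (illustration only; eHDPrep's normalize() wraps the same idea with its own argument handling, and min_max is a hypothetical name):

```r
# Min-max normalization: rescale to [0, 1] using the observed range.
min_max <- function(x, na.rm = TRUE) {
  rng <- range(x, na.rm = na.rm)
  (x - rng[1]) / (rng[2] - rng[1])
}

min_max(c(10, 15, 20))  # 0.0 0.5 1.0
```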
Replaces specified numbers in numeric columns with NA
.
nums_to_NA(data, ..., nums_to_replace = NULL)
nums_to_NA(data, ..., nums_to_replace = NULL)
data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). |
... |
< |
nums_to_replace |
numeric vector of values to be replaced with
|
Columns to process can be specified in ...
or the function will be
applied to all numeric columns.
data
with specified values replaced with NA
data(example_data) # replace all 1,2, and 3 from tumoursize and patient_id with NA. nums_to_NA(data = example_data, tumoursize, patient_id, nums_to_replace = c(1,2,3))
data(example_data) # replace all 1,2, and 3 from tumoursize and patient_id with NA. nums_to_NA(data = example_data, tumoursize, patient_id, nums_to_replace = c(1,2,3))
Uses one-hot encoding to convert a nominal vector to a tibble containing a variable for each unique value in the input vector.
onehot_vec(x, prefix)
x |
non-numeric vector |
prefix |
prefix to append to output variable names |
tibble
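A minimal base-R sketch of the one-hot scheme described above (the function name, and returning a plain matrix rather than a tibble, are illustrative):

```r
# One indicator column per unique value, named prefix_value.
onehot_sketch <- function(x, prefix) {
  lv <- sort(unique(x))
  out <- sapply(lv, function(l) as.integer(x == l))
  colnames(out) <- paste(prefix, lv, sep = "_")
  out
}

onehot_sketch(c("a", "b", "a"), "var")
#      var_a var_b
# [1,]     1     0
# [2,]     0     1
# [3,]     1     0
```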
This function enables preservation of the text labels for ordinal variables in a dataset in preparation for conversion to a numeric matrix. A table is produced which retains the mappings between the text labels and the numerical labels for future reference.
ordinal_label_levels(data, out_path = NULL)
data |
data frame with ordinal variables with labels and levels to be extracted. |
out_path |
Optional string. Path to write output to. If not supplied, R object will be returned. |
Tibble of text label and (numerical) level mappings
require(magrittr) # for %>%
# create an example class_tbl object
# note that diabetes_type is classed as ordinal yet is not modified as its
# levels are not pre-coded. It should instead be encoded with encode_ordinals().
tibble::tribble(~"var", ~"datatype",
                "patient_id", "id",
                "tumoursize", "numeric",
                "t_stage", "ordinal_tstage",
                "n_stage", "ordinal_nstage",
                "diabetes", "factor",
                "diabetes_type", "ordinal",
                "hypertension", "factor",
                "rural_urban", "factor",
                "marital_status", "factor",
                "SNP_a", "genotype",
                "SNP_b", "genotype",
                "free_text", "freetext") -> data_types

# show unique values for t_stage in pre-QC example_data
unique(example_data$t_stage)

# apply quality control to example_data
apply_quality_ctrl(example_data, patient_id, data_types,
                   bin_cats = c("No" = "Yes", "rural" = "urban"),
                   min_freq = 0.6) %>%
  ordinal_label_levels -> res

# examine the labels and levels of t_stage in post-QC example_data
dplyr::filter(res, variable == "t_stage")
Generates a bar plot of percentage completeness for one or both data frame dimensions (rows/columns).
plot_completeness(data, id_var, plot = c("variables", "rows"))
data |
Data frame in tidy format (see https://tidyr.tidyverse.org/). |
id_var |
Row identifier variable name. |
plot |
Character vector containing one or both of "variables" and "rows". |
Completeness bar plot.
Other measures of completeness:
assess_completeness()
,
compare_completeness()
,
completeness_heatmap()
,
row_completeness()
,
variable_completeness()
data(example_data)
plot_completeness(example_data, patient_id, "variables")
This low-level function is deployed as part of the semantic enrichment
process. It calculates the product of values in a numeric vector, ignoring
NAs. If all values in the vector are NA, it returns NA (rather than 1, the
empty-product value returned by prod()).
prod_catchNAs(x)
x |
numeric vector |
product of x
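The NA-catching behaviour can be sketched as (a guess at the general shape, not the package source):

```r
# All-NA input yields NA instead of prod()'s empty-product value.
prod_catchNAs_sketch <- function(x) {
  if (all(is.na(x))) NA else prod(x, na.rm = TRUE)
}

prod_catchNAs_sketch(c(2, NA, 3))  # 6
prod_catchNAs_sketch(c(NA, NA))    # NA
```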
Reports if variables have been added, removed, or are preserved between two data frames. Intended to be used to review quality control / data preparation.
report_var_mods(before_tbl = NULL, after_tbl = NULL)
before_tbl |
Data frame from before modifications were made. |
after_tbl |
Data frame from after modifications were made. |
Tibble containing two columns. 'variable' contains name of each
variable. 'presence' contains the presence of the variable in
after_tbl
.
example_data_merged <- merge_cols(example_data, diabetes_type, diabetes,
                                  "diabetes_merged", rm_in_vars = TRUE)
report_var_mods(example_data, example_data_merged)
Provides information on modifications made to a dataset at both variable (column) and value (sample) levels, designed for review of quality control measures.
review_quality_ctrl(before_tbl, after_tbl, id_var)
before_tbl |
Data frame from before modifications were made. |
after_tbl |
Data frame from after modifications were made. |
id_var |
An unquoted expression which corresponds to a variable in both
before_tbl and after_tbl which identifies each row. |
Modifications are identified by comparing the original and modified dataset.
QC review functions are applied in the following order:
Variable-level modifications (report_var_mods
)
Value-level modifications (mod_track
)
Value-level modifications (plot) (mod_track
)
A list containing each of these functions' outputs is returned.
List containing data for review of quality control
Other high level functionality:
apply_quality_ctrl()
,
assess_quality()
,
semantic_enrichment()
data(example_data)
require(tibble)
tibble::tribble(~"var", ~"datatype",
                "patient_id", "id",
                "tumoursize", "numeric",
                "t_stage", "ordinal_tstage",
                "n_stage", "ordinal_nstage",
                "diabetes", "factor",
                "diabetes_type", "ordinal",
                "hypertension", "factor",
                "rural_urban", "factor",
                "marital_status", "factor",
                "SNP_a", "genotype",
                "SNP_b", "genotype",
                "free_text", "freetext") -> data_types

# create QC'ed dataset
post_QC_example_data <- apply_quality_ctrl(example_data, patient_id, data_types,
                                           bin_cats = c("No" = "Yes", "rural" = "urban"),
                                           min_freq = 0.6)

# review QC
QC_review <- review_quality_ctrl(before_tbl = example_data,
                                 after_tbl = post_QC_example_data,
                                 id_var = patient_id)

# view variable level changes
QC_review$variable_level_changes

# view value level changes
QC_review$value_level_changes

# view value level changes as a plot
QC_review$value_level_changes_plt
Calculates the completeness of each row/observation in a data frame.
row_completeness(data, id_var)
data |
Data frame. |
id_var |
Row identifier variable. |
Row completeness is measured by comparing the number of NA
to
non-NA
values. Returns the count of NA
as well as the
percentage of NA
values and the percentage completeness.
Tibble detailing completeness statistics for each row in input data.
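The completeness arithmetic can be sketched in base R (column names below are illustrative):

```r
# Per-row NA count and percentage completeness, excluding the id column.
d <- data.frame(id = 1:3, a = c(1, NA, 3), b = c(NA, NA, "x"))
row_na       <- rowSums(is.na(d[-1]))              # NA count per row
completeness <- 100 * (1 - row_na / ncol(d[-1]))   # percent complete

completeness  # 50 0 100
```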
Other measures of completeness:
assess_completeness()
,
compare_completeness()
,
completeness_heatmap()
,
plot_completeness()
,
variable_completeness()
data(example_data)
row_completeness(example_data, patient_id)
Enriches a dataset with additional (meta-)variables derived from the semantic commonalities between variables (columns).
semantic_enrichment( data, ontology, mapping_file, mode = "in", root, label_attr = "name", ... )
data |
Required. Numeric data frame or matrix containing variables present in the mapping file. |
ontology |
Required. The ontology, supplied as an edge table (a data frame or a path to a .csv file) or as a graph in tidygraph format. |
mapping_file |
Required. Path to csv file or data frame containing mapping information. Should contain two columns only. The first column should contain column names, present in the data frame. The second column should contain the name of entities present in the ontology object. |
mode |
Character constant specifying the directionality of the edges. One of: "in" or "out". |
root |
Required. Name of root node identifier in column 1 to calculate node depth from. |
label_attr |
Node attribute containing labels used for column names when creating metavariable aggregations. Default: "name" |
... |
additional arguments to pass to |
Semantic enrichment generates meta-variables from the aggregation of data
variables (columns) via their most informative common ancestor. Meta-variables are
labelled using the syntax: MV_[label_attr]_[Aggregation function]
. The
data variables are aggregated row-wise by their maximum, minimum, mean, sum,
and product. Meta-variables with zero entropy (no information) are not
appended to the data.
See the "Semantic Enrichment" section in the vignette of 'eHDPrep' for more
information: vignette("Introduction_to_eHDPrep", package = "eHDPrep")
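The row-wise aggregation step can be sketched in base R; the `MV_tumour_*` names below follow the stated syntax, but the variables themselves are invented for illustration:

```r
# Two sibling data variables aggregated row-wise into meta-variables.
vars <- cbind(tumour_a = c(0.2, 0.5), tumour_b = c(0.4, 0.1))

mv <- cbind(
  MV_tumour_MAX  = apply(vars, 1, max),
  MV_tumour_MIN  = apply(vars, 1, min),
  MV_tumour_MEAN = rowMeans(vars),
  MV_tumour_SUM  = rowSums(vars),
  MV_tumour_PROD = apply(vars, 1, prod)
)
mv
```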
Semantically enriched dataset
A warning may be shown regarding the '.add' argument being deprecated, this is believed to be an issue with 'tidygraph' which may be resolved in a future release: <https://github.com/thomasp85/tidygraph/issues/131>. Another warning may be shown regarding the 'neimode' argument being deprecated, this is believed to be an issue with 'tidygraph' which may be resolved in a future release: <https://github.com/thomasp85/tidygraph/issues/156>. These warning messages are not believed to have an effect on the functionality of 'eHDPrep'.
Other high level functionality:
apply_quality_ctrl()
,
assess_quality()
,
review_quality_ctrl()
require(magrittr)
require(dplyr)
data(example_ontology)
data(example_mapping_file)
data(example_data)

# define datatypes
tibble::tribble(~"var", ~"datatype",
                "patient_id", "id",
                "tumoursize", "numeric",
                "t_stage", "ordinal_tstage",
                "n_stage", "ordinal_nstage",
                "diabetes_merged", "character",
                "hypertension", "factor",
                "rural_urban", "factor",
                "marital_status", "factor",
                "SNP_a", "genotype",
                "SNP_b", "genotype",
                "free_text", "freetext") -> data_types

# create post-QC data
example_data %>%
  merge_cols(diabetes_type, diabetes, "diabetes_merged", rm_in_vars = TRUE) %>%
  apply_quality_ctrl(patient_id, data_types,
                     bin_cats = c("No" = "Yes", "rural" = "urban"),
                     to_numeric_matrix = TRUE) %>%
  suppressMessages() -> post_qc_data

# minimal example on first four columns of example data:
semantic_enrichment(post_qc_data[1:10, 1:4],
                    dplyr::slice(example_ontology, 1:7, 24),
                    example_mapping_file[1:3, ],
                    root = "root") -> res
# see Note section of documentation for information on possible warnings.

# summary of result:
tibble::glimpse(res)

# full example:
res <- semantic_enrichment(post_qc_data, example_ontology,
                           example_mapping_file, root = "root")
# see Note section of documentation for information on possible warnings.
Adds new variables to data
which report the presence of skipgrams
(either those specified in skipgrams2append
or, if not specified,
skipgrams with a minimum frequency (min_freq
, default = 1)).
skipgram_append(skipgram_tokens, skipgrams2append, data, id_var, min_freq = 1)
skipgram_tokens |
Output of skipgram_identify(). |
skipgrams2append |
Which skipgrams in skipgram_tokens to append to data. |
data |
Data frame to append skipgram variables to. |
id_var |
An unquoted expression which corresponds to a variable in
data which identifies each row. |
min_freq |
Minimum percentage frequency of skipgram occurrence to return. Default = 1. |
data
with additional variables describing presence of
skipgrams
Guthrie, D., Allison, B., Liu, W., Guthrie, L. & Wilks, Y. A Closer Look at Skip-gram Modelling. in Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06) (European Language Resources Association (ELRA), 2006).
Benoit K, Watanabe K, Wang H, Nulty P, Obeng A, Müller S, Matsuo A (2018). “quanteda: An R package for the quantitative analysis of textual data.” _Journal of Open Source Software_, *3*(30), 774. doi:10.21105/joss.00774 <https://doi.org/10.21105/joss.00774>, <https://quanteda.io>.
Feinerer I, Hornik K (2020). _tm: Text Mining Package_. R package version 0.7-8, <https://CRAN.R-project.org/package=tm>.
Ingo Feinerer, Kurt Hornik, and David Meyer (2008). Text Mining Infrastructure in R. Journal of Statistical Software 25(5): 1-54. URL: https://www.jstatsoft.org/v25/i05/.
Principal underlying function: tokens_ngrams
Other free text functions:
extract_freetext()
,
skipgram_freq()
,
skipgram_identify()
data(example_data)
# identify skipgrams
toks_m <- skipgram_identify(x = example_data$free_text,
                            ids = example_data$patient_id,
                            max_interrupt_words = 5)
# add skipgrams by minimum frequency
skipgram_append(toks_m, id_var = patient_id, min_freq = 0.6, data = example_data)
# add specific skipgrams
skipgram_append(toks_m, id_var = patient_id,
                skipgrams2append = c("sixteen_week", "bad_strain"),
                data = example_data)
Measures the frequency of skipgrams (non-contiguous words in free text), reported in a tibble. Frequency is reported as both counts and percentages.
skipgram_freq(skipgram_tokens, min_freq = 1)
skipgram_tokens |
Output of skipgram_identify(). |
min_freq |
Minimum skipgram percentage frequency of occurrence to retain. Default = 1. |
Data frame containing frequency of skipgrams in absolute count and relative to the length of input variable.
Guthrie, D., Allison, B., Liu, W., Guthrie, L. & Wilks, Y. A Closer Look at Skip-gram Modelling. in Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06) (European Language Resources Association (ELRA), 2006).
Benoit K, Watanabe K, Wang H, Nulty P, Obeng A, Müller S, Matsuo A (2018). “quanteda: An R package for the quantitative analysis of textual data.” _Journal of Open Source Software_, *3*(30), 774. doi:10.21105/joss.00774 <https://doi.org/10.21105/joss.00774>, <https://quanteda.io>.
Feinerer I, Hornik K (2020). _tm: Text Mining Package_. R package version 0.7-8, <https://CRAN.R-project.org/package=tm>.
Ingo Feinerer, Kurt Hornik, and David Meyer (2008). Text Mining Infrastructure in R. Journal of Statistical Software 25(5): 1-54. URL: https://www.jstatsoft.org/v25/i05/.
Principal underlying function: tokens_ngrams
Other free text functions:
extract_freetext()
,
skipgram_append()
,
skipgram_identify()
data(example_data)
toks_m <- skipgram_identify(x = example_data$free_text,
                            ids = example_data$patient_id,
                            max_interrupt_words = 5)
skipgram_freq(toks_m, min_freq = 0.5)
Identifies words which appear near each other in the free-text variable
(var
), referred to as "Skipgrams". Supported languages for stop words
and stemming are danish
, dutch
, english
, finnish
,
french
, german
, hungarian
, italian
,
norwegian
, portuguese
, russian
, spanish
, and
swedish
.
skipgram_identify( x, ids, num_of_words = 2, max_interrupt_words = 2, words_to_rm = NULL, lan = "english" )
x |
Free-text character vector to query. |
ids |
Character vector containing IDs for each element of x. |
num_of_words |
Number of words to consider for each returned skipgram. Default = 2. |
max_interrupt_words |
Maximum number of words which can interrupt proximal words. Default = 2. |
words_to_rm |
Character vector of words which should not be considered. |
lan |
Language of x, used for stop-word removal and stemming. Default = "english". |
Tibble containing skipgrams as variables and patient values as rows.
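Conceptually, a skipgram pairs tokens separated by up to a fixed number of intervening words. The package delegates to quanteda's tokens_ngrams; the base-R sketch below (a hypothetical helper) only illustrates the idea for word pairs:

```r
# Generate word-pair skipgrams allowing up to max_skip intervening tokens.
skipgrams2 <- function(tokens, max_skip = 2) {
  out <- character(0)
  n <- length(tokens)
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):min(n, i + 1 + max_skip)) {
      out <- c(out, paste(tokens[i], tokens[j], sep = "_"))
    }
  }
  out
}

skipgrams2(c("bad", "viral", "strain"), max_skip = 1)
# "bad_viral" "bad_strain" "viral_strain"
```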
Guthrie, D., Allison, B., Liu, W., Guthrie, L. & Wilks, Y. A Closer Look at Skip-gram Modelling. in Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06) (European Language Resources Association (ELRA), 2006).
Benoit K, Watanabe K, Wang H, Nulty P, Obeng A, Müller S, Matsuo A (2018). “quanteda: An R package for the quantitative analysis of textual data.” _Journal of Open Source Software_, *3*(30), 774. doi:10.21105/joss.00774 <https://doi.org/10.21105/joss.00774>, <https://quanteda.io>.
Feinerer I, Hornik K (2020). _tm: Text Mining Package_. R package version 0.7-8, <https://CRAN.R-project.org/package=tm>.
Ingo Feinerer, Kurt Hornik, and David Meyer (2008). Text Mining Infrastructure in R. Journal of Statistical Software 25(5): 1-54. URL: https://www.jstatsoft.org/v25/i05/.
Principal underlying function: tokens_ngrams
Other free text functions:
extract_freetext()
,
skipgram_append()
,
skipgram_freq()
data(example_data)
skipgram_identify(x = example_data$free_text,
                  ids = example_data$patient_id,
                  max_interrupt_words = 5)
Replaces specified or pre-defined strings in non-numeric columns with
NA
.
strings_to_NA(data, ..., strings_to_replace = NULL)
data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). |
... |
<tidy-select> Optional. Columns to process. If none are supplied, all non-numeric columns are processed. |
strings_to_replace |
Character vector of values to be replaced with NA. If not supplied, a default set of strings is used (see Details). |
Columns to process can be specified in custom arguments (...); otherwise,
the function will be applied to all non-numeric columns.
Default strings which will be replaced with NA
are as follows:
"Undetermined", "unknown", "missing", "fail", "fail / unknown",
"equivocal", "equivocal / unknown", "*".
String search is made using grepl and supports
regex, so metacharacters (. \ | ( ) [ ] { } ^ $ * + ?)
should be escaped with a "\\" prefix.
Matches are case sensitive by default, but case can be ignored by passing
ignore.case = TRUE in the ... argument.
data with specified values replaced with NA.
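For exact, case-sensitive matches the replacement can be sketched in base R (a hypothetical helper; the package version uses grepl and so supports regex and ignore.case):

```r
# Replace chosen strings with NA in every non-numeric column.
strings_to_na_sketch <- function(df, strings) {
  df[] <- lapply(df, function(col) {
    if (!is.numeric(col)) replace(col, col %in% strings, NA) else col
  })
  df
}

d <- data.frame(status = c("Yes", "unknown", "No"))
strings_to_na_sketch(d, "unknown")$status  # "Yes" NA "No"
```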
data(example_data)
# original unique values in diabetes column:
unique(example_data$diabetes)
# using default values
res <- strings_to_NA(example_data)
unique(res$diabetes)

# original unique values in diabetes_type column:
unique(example_data$diabetes_type)
# using custom values
res <- strings_to_NA(example_data, strings_to_replace = "Type I")
unique(res$diabetes_type)
Sums values in x, ignoring NAs. If all values in x are NA, returns
NA (rather than 0).
sum_catchNAs(x)
x |
numeric vector |
sum of x
Runs a series of checks on a table of internal consistency rules
(see Consistency Table Requirements) in preparation for identify_inconsistency
.
validate_consistency_tbl(data, consis_tbl)
data |
data frame which will be checked for internal consistency |
consis_tbl |
data frame or tibble containing information on internal consistency rules (see "Consistency Table Requirements" section) |
Error message or successful validation message is printed. The dataset is returned invisibly.
Table must have exactly five character columns. The columns should be ordered according to the list below, which describes the values of each column:

First column: name of the first variable (column A) whose data values will be subject to consistency checking. String. Required.

Second column: name of the second variable (column B) whose data values will be subject to consistency checking. String. Required.

Third column: logical test to compare columns A and B. One of: ">", ">=",
"<", "<=", "==", "!=". String. Optional if columns 4 and 5 have non-NA
values.

Fourth column: either a single character string or a colon-separated range of
numbers which should only appear in column A. Optional if column 3 has a
non-NA value.

Fifth column: either a single character string or a colon-separated range of
numbers which should only appear in column B, given the value/range
specified in column 4. Optional if column 3 has a non-NA value.
Each row should detail one test to make.
Therefore, either column 3 or columns 4 and 5 must contain non-NA
values.
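The two rule forms can be sketched in base R, assuming the rule has already been parsed (check_rule is a hypothetical helper, not the package implementation; TRUE marks an inconsistent row):

```r
# Apply one consistency rule: either a pairwise logical test (form 1)
# or a value/range dependency between the two columns (form 2).
check_rule <- function(data, varA, varB, lgl_test = NA,
                       rangeA = NULL, valB = NULL) {
  if (!is.na(lgl_test)) {
    # form 1: rows failing the logical test are inconsistent
    !do.call(lgl_test, list(data[[varA]], data[[varB]]))
  } else {
    # form 2: rows where varA is in rangeA must carry valB in varB
    in_range <- data[[varA]] %in% rangeA
    in_range & data[[varB]] != valB
  }
}

beans <- data.frame(red = c(1, 5), blue = c(1, 4))
check_rule(beans, "red", "blue", "==")  # FALSE TRUE (row 2 inconsistent)
```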
Other internal consistency functions:
identify_inconsistency()
require(tibble)
# example with synthetic dataset on number of bean counters
# there is a lot going on in the function so a simple dataset aids this example
#
# creating `data`:
beans <- tibble::tibble(red_beans = 1:15,
                        blue_beans = 1:15,
                        total_beans = 1:15 * 2,
                        red_bean_summary = c(rep("few_beans", 9),
                                             rep("many_beans", 6)))
#
# creating `consis_tbl`:
bean_rules <- tibble::tribble(~varA, ~varB, ~lgl_test, ~varA_boundaries, ~varB_boundaries,
                              "red_beans", "blue_beans", "==", NA, NA,
                              "red_beans", "total_beans", "<=", NA, NA,
                              "red_beans", "red_bean_summary", NA, "1:9", "few_beans",
                              "red_beans", "red_bean_summary", NA, "10:15", "many_beans")

validate_consistency_tbl(beans, bean_rules)
Applies tests to a mapping table to ensure it is valid for use with the data frame and ontological graph, in preparation for semantic enrichment.
validate_mapping_tbl(mapping_tbl, data, ontol_graph)
mapping_tbl |
data frame. Contains two columns. First column contains variable names of a primary dataset. Second column contains entities in an ontological graph to which the primary dataset's variable names are mapped. |
data |
data frame. Primary dataset which contains variable names referred to in first column of the mapping table |
ontol_graph |
ontological graph which contains entity names/IDs referred to in second column of the mapping table |
Any warnings are printed and the mapping table is returned invisibly.
Performs tests on a graph object in preparation for semantic enrichment.
validate_ontol_nw(graph)
graph |
graph object to validate. |
The tests are:

Is the graph coercible to tidygraph format?

Is the graph directed?

Does the graph contain one component (i.e. is it a single ontology)?
input graph or validation errors
Calculates the completeness of each variable in a data frame.
variable_completeness(data)
data |
Data frame. |
This is achieved by comparing the number of NA
to non-NA
values. Returns the count of NA
as well as the percentage of NA
values and the percentage completeness.
Tibble
detailing completeness statistics for each variable.
Other measures of completeness:
assess_completeness()
,
compare_completeness()
,
completeness_heatmap()
,
plot_completeness()
,
row_completeness()
data(example_data)
variable_completeness(example_data)
Calculates Shannon entropy of all variables in a data frame in bits (default) or natural units. Missing values are omitted from the calculation.
variable_entropy(data, unit = "bits")
data |
Data Frame to compute on |
unit |
Unit to measure entropy. Either "bits" (default) or "nats". |
Named numeric vector containing entropy values
Shannon, C. E. A mathematical theory of communication. The Bell System Technical Journal 27, 379–423 (1948).
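For a single variable, Shannon entropy in bits reduces to a short base-R sketch (entropy_bits is a hypothetical helper):

```r
# H(X) = -sum(p * log2(p)) over the observed value frequencies,
# with missing values omitted (table() drops NA by default).
entropy_bits <- function(x) {
  p <- table(x) / length(x[!is.na(x)])
  -sum(p * log2(p))
}

entropy_bits(c("a", "a", "b", "b"))  # 1
entropy_bits(rep("a", 4))            # 0
```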
a <- matrix(c(1, 1, 1, 1, 1, 1,
              1, 2, 3, 4, 5, 6),
            ncol = 2,
            dimnames = list(seq(1, 6), c("no_entropy", "entropy")))
variable_entropy(as.data.frame(a))
Calculates variable bandwidth KDE using Abramson's two stage estimator.
variable.bw.kde(x, output.domain = x, na.rm = FALSE, adjust.factor = 0.5)
x |
A numeric vector of values for estimating density |
output.domain |
The domain of values over which to estimate the
density. Defaults to |
na.rm |
Remove missing values if TRUE |
adjust.factor |
A scaling factor (exponent) applied to the variable bandwidth calculation. Larger factors result in greater deviation from the fixed bandwidth (a value of 0 gives the fixed bandwidth case). |
Bandwidth is first calculated using Silverman's estimator, then refined in a second stage to allow local bandwidth variations in the data based on the initial estimate.
The kernel density estimate as a density
object, compatible
with R's density
function.
Alexander Lyulph Robert Lubbock, Ian Overton
Abramson, I. S. On Bandwidth Variation in Kernel Estimates-A Square Root Law. Ann. Statist. 10, 1217-1223 (1982).
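A condensed base-R sketch of the two-stage scheme (Silverman pilot bandwidth, then local scaling by the pilot density raised to the negative adjust factor, normalised by its geometric mean); details such as the pilot-density interpolation are assumptions, not the package source:

```r
# Stage 1: fixed pilot bandwidth; stage 2: per-point variable bandwidths.
abramson_bw <- function(x, adjust.factor = 0.5) {
  h0 <- stats::bw.nrd0(x)                                # Silverman's rule
  f0 <- approx(stats::density(x, bw = h0), xout = x)$y   # pilot density at each point
  g  <- exp(mean(log(f0)))                               # geometric mean of pilot density
  h0 * (f0 / g)^(-adjust.factor)                         # local bandwidths
}

set.seed(1)
bw <- abramson_bw(rnorm(100))
# bandwidths widen in sparse regions and narrow near the mode
```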
Internal function. Warns if the dots (...) argument has not been supplied.
warn_missing_dots(test)
test |
expression to test. |
Warning to the user that no values were modified.
Calculates Shannon entropy of variables in a data frame in bits (default) or natural units. Missing values are omitted from the calculation. Names of variables with zero entropy are returned.
zero_entropy_variables(data, unit = "bits")
data |
Data Frame to compute on |
unit |
Unit to measure entropy. Either "bits" (default) or "nats". |
Character vector of variable names with zero entropy
Shannon, C. E. A mathematical theory of communication. The Bell System Technical Journal 27, 379–423 (1948).
data(example_data)
zero_entropy_variables(example_data)