Title: | Interactive and Reproducible Data Cleaning |
---|---|
Description: | Flexible and efficient cleaning of data with interactivity. 'datacleanr' facilitates best practices in data analyses and reproducibility with built-in features and by translating interactive/manual operations to code. The package is designed for interoperability, and so seamlessly fits into reproducible analyses pipelines in 'R'. |
Authors: | Alexander Hurley [cre, aut, cph] , Richard Peters [ctb] , Christoforos Pappas [ctb] |
Maintainer: | Alexander Hurley <[email protected]> |
License: | GPL-3 |
Version: | 1.0.4 |
Built: | 2024-11-01 03:24:22 UTC |
Source: | https://github.com/the-hull/datacleanr |
Applies grouping to data set conditionally
apply_data_set_up(df, group)
df |
data frame |
group |
supply reactive output from group selector |
returns df either grouped or not
Used for adjusting layout of plotly plot based on selected groups in group_selector_table; currently used in viz tab
calc_limits_per_groups(dframe, group_index, xvar, yvar, scaling = 0.02)
dframe |
dataframe/tibble, grouped/ungrouped |
group_index |
numeric, group indices for which to return lims |
xvar |
character, name of x var for plot (must exist in dframe) |
yvar |
character, name of y var for plot (must exist in dframe) |
scaling |
numeric, 1 +/- |
list with xlim and ylim
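A minimal base R sketch of how padded axis limits could be derived for selected groups. The helper `padded_limits()` and its `group` argument are hypothetical, and `scaling` is read here as multiplying the lower limit by `1 - scaling` and the upper limit by `1 + scaling`; the package's internal calculation may differ.

```r
# Hypothetical helper, not part of datacleanr: pads the x/y ranges of the
# rows that belong to the requested group indices.
padded_limits <- function(dframe, group, group_index, xvar, yvar, scaling = 0.02) {
  idx <- as.integer(factor(group))                 # group index per row
  rows <- dframe[idx %in% group_index, , drop = FALSE]
  pad <- function(v) {
    r <- range(v, na.rm = TRUE)
    # assumption: values are positive, so scaling widens the range
    c(r[1] * (1 - scaling), r[2] * (1 + scaling))
  }
  list(xlim = pad(rows[[xvar]]), ylim = pad(rows[[yvar]]))
}

lims <- padded_limits(iris, iris$Species, group_index = c(1, 2),
                      xvar = "Sepal.Length", yvar = "Sepal.Width")
```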
Check for internet connection
can_internet(url = "http://www.google.com")
url |
character, valid path to url - user responsible |
logical - TRUE or FALSE
check if a filter statement is valid
check_individual_statement(df, statement)
df |
data frame / tibble to be filtered |
statement |
character string, |
logical, did filter statement work?
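One way such a check could work is to evaluate the statement against the data frame and treat any error as a failure. The sketch below uses base R only; `statement_is_valid()` is a hypothetical name, and the package's own check may apply further rules.

```r
# Hypothetical helper: TRUE if the statement parses, evaluates against df
# without error, and yields a logical vector; FALSE otherwise.
statement_is_valid <- function(df, statement) {
  tryCatch(
    {
      res <- eval(parse(text = statement), envir = df)
      is.logical(res)
    },
    error = function(e) FALSE
  )
}

ok  <- statement_is_valid(iris, "Species == 'setosa'")  # column exists
bad <- statement_is_valid(iris, "Spec == 'setosa'")     # column 'Spec' does not exist
```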
datacleanr server function
datacleanr_server(input, output, session, dataset, df_name, is_on_disk)
input, output, session |
standard |
dataset |
data.frame, tibble or data.table that needs cleaning |
df_name |
character, name of dataset or file_path passed into shiny app |
is_on_disk |
logical, whether df was read from file |
Launches the datacleanr app for interactive and reproducible cleaning. See Details for more information.
dcr_app(dframe, browser = TRUE)
dframe |
Character, a string naming a |
browser |
logical, should app start in OS's default browser? (default |
datacleanr provides an interactive data overview, and allows reproducible filtering and (manual, interactive) visual outlier detection and annotation across multiple app tabs:
Overview and Set-up: set groups (see below) and generate an exploratory summary of dframe
Filtering: provide and apply filter statements (groupwise, see below and filter_scoped_df)
Visualization and Annotating: interactive visualization allowing outlier highlighting, annotating and before/after histograms of displayed (numeric) variables
Extraction: generates Reproducible Recipe and outputs
For data sets exceeding 1.5 million rows, we suggest splitting the data, if possible, by a grouping factor. This is because at this volume interactive visualizations using plotly stretch the limits of what modern web browsers can handle. A simple example using iris is:
iris_split <- split(iris, iris$Species)
dcr_app(iris_split[[1]])
# or
lapply(iris_split, dcr_app)
Extensive documentation is provided on each of the tabs for individual procedures in help links.
datacleanr relies on generating a column of unique IDs (.dcrkey) and on subsetting dframe into sub-groups (generated in-app, added as column .dcrindex) for filtering and visualization. These groups are composed of unique combinations of columns in the data set (must be factor) and are passed to group_by, and are carried through the app for exploratory analyses (tab Overview and Set-up), filtering (tab Filtering) and plotting (tab Visualization).
These groups should ideally be chosen to facilitate a convenient filtering and viewing/cleaning process.
For example, a data set with time series of multiple sensors could be grouped by sensor and/or additional columns,
such that periods of interest can be visualized and cleaned simultaneously in the interactive plot.
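The two helper columns can be pictured with a small base R sketch. It only mimics the idea of a unique row key and an integer group index for one factor column; in the app these columns are generated internally (the text above refers to group_by for the grouping), so the exact values may differ.

```r
# Illustration only: a unique row key and a group index for iris,
# grouped by the factor column Species.
df <- iris
df$.dcrkey <- seq_len(nrow(df))                                   # unique ID per row
df$.dcrindex <- as.integer(interaction(df$Species, drop = TRUE))  # 1..n groups

# number of rows per group index
group_sizes <- table(df$.dcrindex)
```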
Filtering is achieved by providing expressions that evaluate to TRUE/FALSE, and can be applied to the entire data set, or individual/all groups via scoped filtering (see filter_scoped_df).
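The idea of scoped filtering can be sketched in base R as follows. `filter_sketch()` is a hypothetical helper, not the package's implementation; it assumes that `scope_at = NULL` applies the statement to the whole data set and that rows of groups outside the scope are kept untouched.

```r
# Hypothetical helper: keep rows for which `statement` evaluates to TRUE.
# With scope_at given, the statement is evaluated separately within each
# listed group index; all other groups pass through unfiltered.
filter_sketch <- function(df, statement, group, scope_at = NULL) {
  expr <- parse(text = statement)[[1]]
  idx <- as.integer(factor(group))
  keep <- rep(TRUE, nrow(df))
  row_sets <- if (is.null(scope_at)) {
    list(seq_len(nrow(df)))
  } else {
    lapply(scope_at, function(g) which(idx == g))
  }
  for (rows in row_sets) {
    res <- eval(expr, envir = df[rows, , drop = FALSE])
    keep[rows] <- !is.na(res) & res   # treat NA as "do not keep"
  }
  df[keep, , drop = FALSE]
}

all_rows <- filter_sketch(iris, "Sepal.Width > 3", iris$Species)
scoped   <- filter_sketch(iris, "Petal.Length > quantile(Petal.Length, 0.8)",
                          iris$Species, scope_at = c(1, 2))
```

In the scoped call the quantile is computed within each of the two selected groups, so the threshold differs per group.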
The interactive visualization allows selecting and deselecting points with lasso and box select tools, as well as interactive zooming (toolbar or clicking on legend items or group overview table, see tab in-app) and panning (toolbar and hover over plot's axes). Data formats supported are:
Observational (numeric), time series (POSIXct) and categorical data in x and y dimensions/axes
Observational (numeric) data in z dimension (point size)
Spatial data, when lon and lat in decimal degrees are present in x and y.
Displaying spatial data requires a Mapbox account, from which an access token needs to be copied into your .Renviron (e.g. MAPBOX_TOKEN=your_copied_token).
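Whether the token is visible to the current session can be checked before launching the app. This is a small convenience sketch using base R, not a step the package requires.

```r
# Check whether MAPBOX_TOKEN is set in the current R session.
# Entries in .Renviron are only read at start-up, so R needs a restart
# after the file is edited.
token <- Sys.getenv("MAPBOX_TOKEN")
has_token <- nzchar(token)

if (!has_token) {
  message("MAPBOX_TOKEN is not set; add it to .Renviron and restart R.")
}
```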
Note that when a column .dcrflag (logical, TRUE/FALSE) is present in dframe, respective observations are given contrasting symbols (FALSE = circle, TRUE = star-triangle). This column is employed as a cross-referencing tool for e.g. other outlier detection or data-processing algorithms that were applied prior.
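As an illustration, a .dcrflag column could be created from any prior outlier rule before the data are passed to the app. The sketch below uses a simple 1.5 x IQR fence on one variable; the rule and the choice of variable are assumptions made for the example only.

```r
# Flag observations outside the 1.5 * IQR fences of Sepal.Width.
df <- iris
q <- unname(quantile(df$Sepal.Width, c(0.25, 0.75)))
fence <- 1.5 * (q[2] - q[1])
df$.dcrflag <- df$Sepal.Width < q[1] - fence | df$Sepal.Width > q[2] + fence

# dcr_app(df)  # flagged rows would then be drawn with the contrasting symbol
```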
The tab Extraction provides code to reproduce the entire procedure (a Reproducible Recipe), which
can be copied, or sent directly to an active RStudio script when used interactively (i.e. when dframe is an object in R's environment),
can be saved to disk with intermediate outputs (filter statements and selected outliers), where file names are based on the input file and configurable suffixes when dframe is a path.
When datacleanr is ended by clicking on Close in the app's navigation bar, a list is invisibly returned with the following items:
df_name: character, object name/file path passed into dcr_app
dcr_df: tibble, filtered data set with additional columns .dcrkey, .dcrindex, .annotation - the latter is NA for non-outliers, an empty string for outliers without annotation, and a custom string for annotated outliers
dcr_selected_outliers: data.frame, contains the outlier .dcrkey, the .annotation and a selection_count (integer, count incrementer) column
dcr_groups: character, a vector defining the groups (via group_by) used throughout datacleanr
dcr_condition_df: tibble, with columns filter (character, statement used for filtering) and group (list, of integers), defining groups that correspond to .dcrindex
dcr_code: character string, containing Reproducible Recipe
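The .annotation convention described above can be used to separate outliers from the remaining rows. The sketch below works on a small mock data frame standing in for dcr_df, since the real object is only available after an interactive session; the column `value` and all values are made up for illustration.

```r
# Mock stand-in for the returned dcr_df (illustrative values only).
dcr_df <- data.frame(
  .dcrkey = 1:4,
  value = c(1, 50, 2, 80),
  .annotation = c(NA, "", NA, "sensor drift"),
  stringsAsFactors = FALSE,
  check.names = FALSE
)

is_outlier <- !is.na(dcr_df$.annotation)          # NA marks non-outliers
clean      <- dcr_df[!is_outlier, ]
annotated  <- dcr_df[is_outlier & nzchar(dcr_df$.annotation), ]
```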
Initial checks for data set
dcr_checks(dframe)
dframe |
dframe supplied to |
extend brewer palette
extend_palette(n)
n |
numeric, number of colors |
color vector of length n
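Extending a short palette to n colors is commonly done by interpolation. The sketch below uses grDevices::colorRampPalette on three fixed hex colors chosen for illustration; `extend_sketch()` is a hypothetical helper, and the palette actually used by the package may differ.

```r
# Hypothetical helper: interpolate a short base palette to n colors.
extend_sketch <- function(n, base = c("#1B9E77", "#D95F02", "#7570B3")) {
  grDevices::colorRampPalette(base)(n)
}

cols <- extend_sketch(12)
```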
Apply filter based on a statement, scoped to dplyr groups
filter_scoped(dframe, statement, scope_at = NULL)
dframe |
data.frame/tbl, grouped or ungrouped |
statement |
character, statement for filtering (only VALID expressions; use |
scope_at |
numeric, group indices to apply filter statements to |
List, containing item filtered_df, a data.frame filtered based on statements and scope.
Filter/subset data dplyr-groupwise
filter_scoped_df subsets rows of a data frame based on grouping structure (see group_by). Filtering statements are provided in a separate tibble where each row represents a combination of a logical expression and a list of groups to which the expression should be applied (corresponding to indices from cur_group_id).
filter_scoped_df(dframe, condition_df)
dframe |
A grouped or ungrouped |
condition_df |
A |
This function is applied in the "Filtering" tab of the datacleanr app, and applied in the reproducible code recipe in the "Extract" tab. Note that multiple checks for valid statements are performed in the app (and only valid operations printed in the "Extract" tab). It is therefore not advisable to manually alter this code or use this function interactively.
An object of the same type as dframe. The output is a subset of the input, with groups and rows appearing in the same order, and an additional column .dcrindex representing the group indices. The output may have fewer groups than the input, depending on subsetting.
# set-up condition_df
cdf <- dplyr::tibble(
  statement = c(
    "Sepal.Width > quantile(Sepal.Width, 0.1)",
    "Petal.Width > quantile(Petal.Width, 0.1)",
    "Petal.Length > quantile(Petal.Length, 0.8)"
  ),
  scope_at = list(NULL, NULL, c(1, 2))
)

fdf <- filter_scoped_df(
  dplyr::group_by(iris, Species),
  condition_df = cdf
)

# Example of invalid expression:
# column 'Spec' does not exist in iris
# "Spec == 'setosa'"
Identify columns carrying non-numeric values
get_factor_cols_idx(x)
x |
data.frame |
logical, is column in x non-numeric?
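A base R sketch of such a column check, assuming "non-numeric" simply means that is.numeric() returns FALSE for the column; the package function may use a different test.

```r
# One logical per column: TRUE where the column is not numeric.
non_numeric <- !vapply(iris, is.numeric, logical(1))

# names of the non-numeric columns
non_numeric_names <- names(iris)[non_numeric]
```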
Single outlier trace is added to plotly; interactive select/deselect was implemented by adjusting selected_points, and subsequently adding, or deleting+adding the (modified) trace at the end of the existing JS data array. Requires tracemap with trace names and corresponding indices.
Simple check for re-execution was implemented by passing on the selection keys to compare against on pertinent plotly_event.
handle_add_outlier_trace(sp, dframe, ok, selectors, trace_map, source = "scatterselect", session)
sp |
selected points |
dframe |
plot data |
ok |
reactive, old keys |
selectors |
reactive input selectors |
trace_map |
numeric, max trace id |
source |
plotly source |
session |
active session |
Wrapper for adjusting axis lims and hiding traces
handle_restyle_traces(source_id, session, dframe, scaling = 0.05, xvar, yvar, trace_map, max_id_group_trace, input_sel_rows, flush = TRUE)
source_id |
character, plotly source id |
session |
session object |
dframe |
data frame/tibble (grouped/ungrouped) |
scaling |
numeric, 1 +/- scaling applied to x lims for xvar and yvar |
xvar |
character, name of xvar, must be in dframe |
yvar |
character, name of yvar, must be in dframe |
trace_map |
matrix, with columns for trace name (col 1) and trace id (col 2) |
max_id_group_trace |
numeric, max id of plotly trace from original data (not outlier traces) |
input_sel_rows |
numeric, input from DT grouptable |
flush |
character, |
Used for its side effect - no return
Handle selection of outliers (with select - unselect capacity)
handle_sel_outliers(sel_old_df, sel_new)
sel_old_df |
data.frame of selection info |
sel_new |
data.frame, event data from plotly, must have column |
updated selection data frame
Provide trace ids to set to invisible
hide_trace_idx(trace_map, max_groups, selected_groups)
trace_map |
matrix, with cols trace name (col 1), trace id (col 2) |
max_groups |
numeric, number of groups in grouptable |
selected_groups |
groups highlighted in grouptable |
Provides the indices (JS notation, starting at 0) for traces that are set to visible = 'legendonly' through plotly.restyle
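The shift from R's 1-based group numbers to 0-based JS trace indices can be sketched as follows. `hidden_idx()` is a hypothetical helper and assumes that the group traces occupy the first max_groups positions of the plot's data array in group order, which may not hold for the package's actual trace map.

```r
# Hypothetical helper: 0-based indices of group traces that are NOT selected,
# i.e. candidates for visible = 'legendonly'.
hidden_idx <- function(max_groups, selected_groups) {
  setdiff(seq_len(max_groups), selected_groups) - 1L
}

idx <- hidden_idx(max_groups = 5, selected_groups = c(2, 4))
```

With five groups and groups 2 and 4 selected, the unselected groups 1, 3 and 5 map to indices 0, 2 and 4.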
Make grouping overview table
make_group_table(dframe)
dframe |
data.frame |
tibble with one row per group
Wrapper for saving files
make_save_filepath(save_dir, input_filepath, suffix, ext)
save_dir |
character, selected save dir |
input_filepath |
character, original file path to folder |
suffix |
character, e.g. 'CLEAN' or 'cleaning_script' |
ext |
character, file extension, no dot!! |
OS-conform file path for saving
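A base R sketch of how such a path could be assembled from the arguments. `make_path_sketch()` is a hypothetical helper, and the underscore between file stem and suffix is an assumption; the package's naming scheme may differ.

```r
# Hypothetical helper: <save_dir>/<input file stem>_<suffix>.<ext>
make_path_sketch <- function(save_dir, input_filepath, suffix, ext) {
  stem <- tools::file_path_sans_ext(basename(input_filepath))
  file.path(save_dir, paste0(stem, "_", suffix, ".", ext))
}

p <- make_path_sketch("out", "data/iris.Rds", "CLEAN", "Rds")
# p is "out/iris_CLEAN.Rds"
```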
Server Module: apply / reset filter
module_server_apply_reset(input, output, session, df_filtered, df_original)
input, output, session |
standard |
df_filtered |
reactive, filtered df |
df_original |
reactive, original df |
Server Module: box for str filter condition
module_server_box_str_filter(input, output, session, selector, actionbtn)
input, output, session |
standard |
selector |
character, html selector for placement |
actionbtn |
reactive, action button counter |
Server Module: checkbox rendering
module_server_checkbox(input, output, session, text)
input, output, session |
standard |
text |
Character, appears next to checkbox (or coerced) |
Server Module: filter info text and filtered df output
module_server_df_filter(input, output, session, dframe, condition_df)
input, output, session |
standard |
dframe |
data frame/tibble for filtering |
condition_df |
data frame/tibble with filtering conditions and grouping scope |
df, either filtered or original, based on validity of statements in condition_df
Server Module: Selection Annotator
module_server_extract_code(input, output, session, df_label, filter_df, gvar, statements, sel_points, overwrite, is_on_disk, out_path)
input, output, session |
standard |
df_label |
string, name of original df input |
filter_df |
reactiveValue data frame with filter statements and scoping lvl |
gvar |
reactive character, grouping vars for |
statements |
reactive, lgl, vector of working statements |
sel_points |
reactiveValue, data frame with selected point keys, annotations, and selection count |
overwrite |
reactive value, TRUE/FALSE from checkbox input |
is_on_disk |
Logical, whether df represented by |
out_path |
reactive, List, with character strings providing directory paths and file names for saving/reading in code output |
Server Module: Extraction File selection menu
module_server_extract_code_fileconfig(input, output, session, df_label, is_on_disk, has_processed)
input, output, session |
standard |
df_label |
character, name of original df input |
is_on_disk |
Logical, whether df represented by |
has_processed |
reactive, logical, TRUE if filtered / selected points |
Server Module: box for str filter condition
module_server_filter_str(input, output, session, dframe)
input, output, session |
standard |
dframe |
data frame passed into dcr app |
provides UI text box element
Server Module: Selection Annotator
module_server_group_relayout_buttons(input, output, session, startscatter)
input, output, session |
standard |
startscatter |
reactive, actionbutton value |
provides UI text box element
reactive values with input xvar, yvar and actionbutton counter
Server Module: group selection
module_server_group_select(input, output, session, dframe)
input, output, session |
standard |
dframe |
data frame for filtering |
Server Module: box for str filter condition
module_server_group_selector_table(input, output, session, df, df_label, ...)
input, output, session |
standard |
df |
data frame (either from overview or filtering tab) |
df_label |
character, original input data frame |
... |
arguments passed to |
provides UI text box element
Server Module: dynamic histogram output for n vars str filter condition
module_server_histograms(input, output, session, dframe, selector_inputs, sel_points)
input, output, session |
standard |
dframe |
df |
selector_inputs |
reactive vals from above-plot controls, |
sel_points |
reactive, provides .dcrkey of selected points |
provides UI buttons for deleting last / entire outlier selection
reactive values with input xvar, yvar and actionbutton counter
Server Module: box for str filter condition
module_server_lowercontrol_btn(input, output, session, selector_inputs, action_track)
input, output, session |
standard |
selector_inputs |
reactive vals from above-plot controls, used to determine if plot is a map (lon/lat) |
action_track |
reactive, logical - has plot been pressed? |
provides UI buttons for deleting last / entire outlier selection
reactive values with input xvar, yvar and actionbutton counter
Server Module: DT for annotation
module_server_plot_annotation_table(input, output, session, dframe, sel_points)
input, output, session |
standard |
dframe |
df used for plotting |
sel_points |
numeric, vector of .dcrkeys selected in plot |
df with .dcrkeys and annotations
Server Module: box for str filter condition
module_server_plot_selectable(input, output, session, selector_inputs, df, sel_points, mapstyle)
input, output, session |
standard |
selector_inputs |
reactive, output from module_plot_selectorcontrols |
df |
reactive df |
sel_points |
reactive, provides .dcrkey of selected points |
mapstyle |
reactive, selected mapstyle from below-plot controls |
provides plot, note, that data set needs a column .dcrkey, added in initial processing step
Server Module: box for str filter condition
module_server_plot_selectorcontrols(input, output, session, df)
input, output, session |
standard |
df |
df (not reactive - prevent re-execution of observer) |
provides UI text box element
reactive values with input xvar, yvar and actionbutton counter
Server Module: data summary
module_server_summary(input, output, session, dframe, df_label, start_clicked, group_var_check)
input, output, session |
standard |
dframe |
reactive, input data frame |
df_label |
character, name of initial data set |
start_clicked |
reactive holding start action button |
group_var_check |
reactive holding group check output |
Server Module: Selection Annotator
module_server_text_annotator(input, output, session, sel_data)
input, output, session |
standard |
sel_data |
reactive df |
provides UI text box element
reactive values with input xvar, yvar and actionbutton counter
UI Module: Apply/Reset Filtering
module_ui_apply_reset(id)
id |
Character, identifier for variable selection |
UI Module: box for str filter condition
module_ui_box_str_filter(id, actionbtn)
id |
Character, identifier for variable selection |
actionbtn |
reactive, action button counter |
UI Module: data summary
module_ui_checkbox(id, cond_id)
id |
shiny standard |
cond_id |
character, |
UI Module: filter info text output
module_ui_df_filter(id)
id |
character, shiny namespacing |
UI text element giving number of failed filters and percent of filtered rows
UI Module: Extraction Text output
module_ui_extract_code(id)
id |
Character string |
UI Module: Extraction File selection menu
module_ui_extract_code_fileconfig(id)
id |
Character string |
UI Module: box for str filter condition
module_ui_filter_str(id)
id |
Character string |
UI Module: Grouptable Relayout Buttons
module_ui_group_relayout_buttons(id)
id |
Character string |
UI Module: group selection
module_ui_group_select(id)
id |
Character, identifier for variable selection |
UI Module: box for str filter condition
module_ui_group_selector_table(id)
id |
Character string |
UI Module: dynamic histogram output for n vars
module_ui_histograms(id)
id |
Character string |
UI Module: Delete selection buttons
module_ui_lowercontrol_btn(id)
id |
Character string |
UI Module: DT for annotation
module_ui_plot_annotation_table(id)
id |
Character string |
UI Module: plotly plot
module_ui_plot_selectable(id)
id |
Character string |
UI Module: selector controls
module_ui_plot_selectorcontrols(id)
id |
Character string |
UI Module: data summary
module_ui_summary(id)
id |
shiny standard |
UI Module: Selection Annotator
module_ui_text_annotator(id)
id |
Character string |
Method for printing dcr_code output
## S3 method for class 'dcr_code'
print(x, ...)
x |
character, code output from |
... |
additional arguments passed to |
Split data.frame/tibble based on grouping
split_groups(dframe)
dframe |
data.frame |
list of data frames
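For comparison, base R's split() produces the same kind of result for a single grouping factor. This is only an analogy and not the package's implementation, which works on the grouping structure of dframe.

```r
# One data frame per level of the grouping factor.
parts <- split(iris, iris$Species, drop = TRUE)

# number of rows in each part
part_rows <- vapply(parts, nrow, integer(1))
```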