blueprintr is a framework for managing your data assets in a reproducible fashion. Built on top of targets or drake, it adds automated steps for tabular dataset documentation and testing, giving researchers a replicable workflow that prevents programming issues from silently affecting analysis results.
# install.packages("blueprintr", repos = "https://nyuglobalties.r-universe.dev")
library(blueprintr)
blueprintr provides your data with guardrails typically found in software engineering workflows, letting you test and document datasets before deploying them to production.
The top level of the blueprintr workflow is a “blueprints” directory, consisting of .R and .csv files.
Each blueprint has two components:

- A .R file that instructs drake or targets on how to build a specific dataset.
- A .csv file that incorporates any mapping files and checks that need to be run on the dataset.

To create a blueprint, we use the blueprint() function. This function takes three arguments: name (the name of your generated dataset), description (a description of your dataset), and command (any functions that need to be applied in order to build the dataset).
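For instance, a minimal sketch of a blueprint (the name and command here are purely illustrative):

blueprint(
  "my_dataset",
  description = "A small example dataset",
  command = head(mtcars)
)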
A project may need only a few blueprints, but more likely you’ll need nested blueprints to transform the data.
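Because each blueprint becomes an ordinary target, one blueprint can build on another by referring to its name in command. A minimal sketch that extends the example above (the cleanup step is hypothetical):

blueprint(
  "my_dataset_complete",
  description = "my_dataset with incomplete rows removed",
  # Referring to `my_dataset` makes this blueprint depend on its final target
  command = na.omit(my_dataset)
)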
blueprintr generates six “steps” (targets) per blueprint:
| Target name | Description |
|---|---|
| {blueprint}_initial | The result of running the blueprint’s command |
| {blueprint}_blueprint | A copy of the blueprint to be used throughout the plan |
| {blueprint}_meta | A copy of the dataset metadata; if the metadata file doesn’t exist, it will be created in this step |
| {blueprint}_meta_path | Creates the metadata file or loads it |
| {blueprint}_checks | Runs all tests on the {blueprint}_initial target |
| {blueprint} | The built dataset after running some cleanup tasks |
It’s best to refer to the final {blueprint} target rather than the {blueprint}_initial step, since the latter could have problems which are only discovered in the {blueprint}_checks step.
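Downstream code should therefore read the final target rather than the intermediate one. A usage sketch, assuming a blueprint named my_dataset (hypothetical) and a targets pipeline:

# Load the final, checked dataset from the targets store
targets::tar_read(my_dataset)

# The unchecked intermediate also exists, but avoid depending on it:
# targets::tar_read(my_dataset_initial)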
Let’s take a well-known dataset, mtcars, and create a blueprint for it.
# Keeping the row names under the column `rn`
our_mtcars <- mtcars |> tidytable::as_tidytable(rownames = "rn")
# Inspecting our mtcars dataset
head(our_mtcars)
When we ingest data from various sources, it’s usually helpful to outline the expected metadata for each source. At TIES, we document this metadata in a user-created “mapping file,” which records any variable name changes as well as changes to categorical variable codings.
mapping_file <- system.file("mapping/mtcars_item_mapping.csv", package = "blueprintr", mustWork = TRUE)
# Read this csv file:
item_mapping <- mapping_file |>
readr::read_csv(
col_types = readr::cols(
name_1 = readr::col_character(),
description_1 = readr::col_character(),
coding_1 = readr::col_character(),
panel = readr::col_character(),
homogenized_name = readr::col_character(),
homogenized_coding = readr::col_character(),
homogenized_description = readr::col_character()
)
)
item_mapping
#> # A tibble: 12 × 7
#> name_1 description_1 coding_1 panel homogenized_name homogenized_coding
#> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 rn Name of car NA MTCA… name NA
#> 2 mpg Miles per gallon NA MTCA… mpg NA
#> 3 cyl Number of cylinders NA MTCA… cyl NA
#> 4 disp Displacement NA MTCA… disp NA
#> 5 hp Gross horsepower NA MTCA… hp NA
#> 6 drat Rear axle ratio NA MTCA… drat NA
#> 7 wt Weight NA MTCA… wt NA
#> 8 qsec Quarter mile time NA MTCA… qsec NA
#> 9 vs Engine "coding… MTCA… vs "coding(code(\"1\…
#> 10 am Transmission "coding… MTCA… am "coding(code(\"1\…
#> 11 gear Number of forward … NA MTCA… gear NA
#> 12 carb Number of carburet… NA MTCA… carb NA
#> # ℹ 1 more variable: homogenized_description <chr>
Then, we typically use a tool such as panelcleaner to attach our mapping file to the mtcars dataset. This is the command executed in the dataset construction spec:
blueprint(
  "mt_cars",
  description = "mtcars database with attached metadata",
  annotate = TRUE,
  command = {
    # Attach the mapping, homogenize names and codings, and collapse the
    # panel into a plain data frame
    pnl <- panelcleaner::enpanel("MTCARS_PANEL", our_mtcars) |>
      panelcleaner::add_mapping(item_mapping) |>
      panelcleaner::homogenize_panel() |>
      panelcleaner::bind_waves() |>
      as.data.frame()

    # Carry the panel name and mapping attributes over to the result and
    # restore the mapped_df class
    pnl_name <- get_attr(pnl, "panel_name")
    pnl_mapping <- get_attr(pnl, "mapping")

    pnl <- pnl
    class(pnl) <- c("mapped_df", class(pnl))
    set_attrs(pnl, mapping = pnl_mapping, panel_name = pnl_name)
  }
) |>
  bp_include_panelcleaner_meta()
#> <blueprint: 'mt_cars'>
#>
#> Description: mtcars database with attached metadata
#> Annotations: ENABLED
#> Metadata location: '/home/runner/work/blueprintr/blueprintr/blueprints/mt_cars.csv'
#>
#> -- Command --
#> Workflow command:
#> {
#> pnl <- as.data.frame(panelcleaner::bind_waves(panelcleaner::homogenize_panel(panelcleaner::add_mapping(panelcleaner::enpanel("MTCARS_PANEL",
#> our_mtcars), item_mapping))))
#> pnl_name <- get_attr(pnl, "panel_name")
#> pnl_mapping <- get_attr(pnl, "mapping")
#> pnl <- pnl
#> class(pnl) <- c("mapped_df", class(pnl))
#> set_attrs(pnl, mapping = pnl_mapping, panel_name = pnl_name)
#> }
#>
#> Raw command:
#> {
#> pnl <- as.data.frame(panelcleaner::bind_waves(panelcleaner::homogenize_panel(panelcleaner::add_mapping(panelcleaner::enpanel("MTCARS_PANEL",
#> our_mtcars), item_mapping))))
#> pnl_name <- get_attr(pnl, "panel_name")
#> pnl_mapping <- get_attr(pnl, "mapping")
#> pnl <- pnl
#> class(pnl) <- c("mapped_df", class(pnl))
#> set_attrs(pnl, mapping = pnl_mapping, panel_name = pnl_name)
#> }
Save this script with a filename of your choice inside the “blueprints” directory of your project. We’ll assume you are using targets:
./
_targets.R
blueprints/
... all blueprint R and CSV files go here ...
R/
... all associated R function definitions are here ...
project.Rproj
...
When running this code with either targets or drake, the blueprint metadata is automatically created. For our mtcars example, it looks like this:
#> Rows: 13 Columns: 4
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (4): name, type, description, coding
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 13 × 4
#> name type description coding
#> <chr> <chr> <chr> <chr>
#> 1 name character Name of Car NA
#> 2 mpg double Miles per gallon NA
#> 3 cyl double Number of cylinders NA
#> 4 disp double Displacement NA
#> 5 hp double Gross horsepower NA
#> 6 drat double Rear axle ratio NA
#> 7 wt double Weight NA
#> 8 qsec double Quarter mile time NA
#> 9 vs character Engine "coding(code(\"straight\",\"1\"), co…
#> 10 am character Transmission "coding(code(\"manual\",\"1\"), code…
#> 11 gear double Number of forward gears NA
#> 12 carb double Number of carburetors NA
#> 13 wave character NA NA
Manually editing this metadata file allows you to add tests that check variable types and values.
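The file can be edited in a spreadsheet or text editor; as a small programmatic sketch (the edit itself is illustrative, and the path matches the metadata location printed above):

# Illustrative only: fill in the missing description for the `wave` variable,
# then write the file back so future runs document (and check) the dataset
# against the edited metadata
meta <- readr::read_csv("blueprints/mt_cars.csv")
meta$description[meta$name == "wave"] <- "Data collection wave"
readr::write_csv(meta, "blueprints/mt_cars.csv")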
The last step of our work is to load this blueprint into either targets or drake. For this example, we’ll use targets, as drake is deprecated. A full discussion of targets is beyond the scope of this vignette, but the targets documentation provides an excellent walkthrough. The only detail needed here is to add blueprintr::tar_blueprints() to your _targets.R file:
# _targets.R
library(targets)
# ...
list(
  tar_target(
    item_mapping,
    readr::read_csv("where/your/mapping/file/is/stored.csv")
  ),
  blueprintr::tar_blueprints(),
  # Other targets for your project!
)
This will load all blueprints in the “blueprints” directory. If you have a nested directory structure, use blueprintr::tar_blueprints(recurse = TRUE).
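With the blueprints loaded, run the pipeline as you would any other targets pipeline; the sketch below assumes the project layout shown above:

# Build all targets, including each blueprint's generated steps
targets::tar_make()

After the run, the final mt_cars target holds the checked dataset, and blueprints/mt_cars.csv holds its metadata.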
And there you have it! You have created your first blueprint for the mtcars dataset. When running a pipeline with blueprintr, the checks warn researchers of any issues at an early stage, helping them produce replicable results.