blueprintr is a framework for managing your data assets in a reproducible fashion. Built on top of targets or drake, it adds automated steps for tabular dataset documentation and testing, giving researchers a replicable workflow that prevents programming issues from silently affecting analysis results.
# install.packages("blueprintr", repos = "https://nyuglobalties.r-universe.dev")
library(blueprintr)
blueprintr provides your data with guardrails typically found in software engineering workflows, letting you test and document datasets before deploying them to production.
The top level of the blueprintr workflow is a “blueprints” directory, consisting of .R and .csv files.
Each blueprint has two components:

- A .R file that instructs drake or targets on how to build a specific dataset.
- A .csv file that incorporates any mapping files and checks that need to be run on the dataset.

To create a blueprint, we use the blueprint() function. This function takes three arguments: name (the name of your generated dataset), description (a description of your dataset), and command (any functions that need to be applied in order to build the dataset).
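For instance, a minimal sketch of a blueprint (the name and command here are purely illustrative):

blueprint(
  "my_dataset",
  description = "A small example dataset",
  command = head(mtcars)
)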
A project may need only a few blueprints, but more likely you’ll need nested blueprints to transform the data.
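Because each blueprint becomes an ordinary target, one blueprint can build on another by referring to its name in command. A minimal sketch that extends the example above (the cleanup step is hypothetical):

blueprint(
  "my_dataset_complete",
  description = "my_dataset with incomplete rows removed",
  # Referring to `my_dataset` makes this blueprint depend on its final target
  command = na.omit(my_dataset)
)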
blueprintr generates six “steps” (targets) per blueprint:
| Target name | Description |
|---|---|
| {blueprint}_initial | The result of running the blueprint’s command |
| {blueprint}_blueprint | A copy of the blueprint to be used throughout the plan |
| {blueprint}_meta | A copy of the dataset metadata; if the metadata file doesn’t exist, it will be created in this step |
| {blueprint}_meta_path | Creates the metadata file or loads it |
| {blueprint}_checks | Runs all tests on the {blueprint}_initial target |
| {blueprint} | The built dataset after running some cleanup tasks |
It’s best to refer to the final {blueprint} target rather than the {blueprint}_initial step, since the latter could have problems which are only discovered in the {blueprint}_checks step.
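Downstream code should therefore read the final target rather than the intermediate one. A usage sketch, assuming a blueprint named my_dataset (hypothetical) and a targets pipeline:

# Load the final, checked dataset from the targets store
targets::tar_read(my_dataset)

# The unchecked intermediate also exists, but avoid depending on it:
# targets::tar_read(my_dataset_initial)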
Let’s take a well-known dataset, mtcars, and create a blueprint for it.
# Keeping the row names under the column `rn`
our_mtcars <- mtcars |> tidytable::as_tidytable(rownames = "rn")
# Inspecting our mtcars dataset
head(our_mtcars)
When we ingest data from various sources, it’s usually helpful to outline the expected metadata for each source. At TIES, we document this metadata in a user-created “mapping file,” which records any variable name changes as well as changes to categorical variable codings.
mapping_file <- system.file("mapping/mtcars_item_mapping.csv", package = "blueprintr", mustWork = TRUE)
# Read this csv file:
item_mapping <- mapping_file |>
readr::read_csv(
col_types = readr::cols(
name_1 = readr::col_character(),
description_1 = readr::col_character(),
coding_1 = readr::col_character(),
panel = readr::col_character(),
homogenized_name = readr::col_character(),
homogenized_coding = readr::col_character(),
homogenized_description = readr::col_character()
)
)
item_mapping
#> # A tibble: 12 × 7
#> name_1 description_1 coding_1 panel homogenized_name homogenized_coding
#> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 rn Name of car NA MTCA… name NA
#> 2 mpg Miles per gallon NA MTCA… mpg NA
#> 3 cyl Number of cylinders NA MTCA… cyl NA
#> 4 disp Displacement NA MTCA… disp NA
#> 5 hp Gross horsepower NA MTCA… hp NA
#> 6 drat Rear axle ratio NA MTCA… drat NA
#> 7 wt Weight NA MTCA… wt NA
#> 8 qsec Quarter mile time NA MTCA… qsec NA
#> 9 vs Engine "coding… MTCA… vs "coding(code(\"1\…
#> 10 am Transmission "coding… MTCA… am "coding(code(\"1\…
#> 11 gear Number of forward … NA MTCA… gear NA
#> 12 carb Number of carburet… NA MTCA… carb NA
#> # ℹ 1 more variable: homogenized_description <chr>
Then, we typically use a tool such as panelcleaner to attach our mapping file to the mtcars dataset. This is the command executed in the dataset construction spec:
blueprint(
  "mt_cars",
  description = "mtcars database with attached metadata",
  annotate = TRUE,
  command = {
    # Attach the mapping, homogenize names and codings, and collapse the
    # panel into a plain data frame
    pnl <- panelcleaner::enpanel("MTCARS_PANEL", our_mtcars) |>
      panelcleaner::add_mapping(item_mapping) |>
      panelcleaner::homogenize_panel() |>
      panelcleaner::bind_waves() |>
      as.data.frame()

    # Carry the panel name and mapping attributes over to the result and
    # restore the mapped_df class
    pnl_name <- get_attr(pnl, "panel_name")
    pnl_mapping <- get_attr(pnl, "mapping")

    pnl <- pnl
    class(pnl) <- c("mapped_df", class(pnl))
    set_attrs(pnl, mapping = pnl_mapping, panel_name = pnl_name)
  }
) |>
  bp_include_panelcleaner_meta()
#> <blueprint: 'mt_cars'>
#>
#> Description: mtcars database with attached metadata
#> Annotations: ENABLED
#> Metadata location: '/home/runner/work/blueprintr/blueprintr/blueprints/mt_cars.csv'
#>
#> -- Command --
#> Workflow command:
#> {
#> pnl <- as.data.frame(panelcleaner::bind_waves(panelcleaner::homogenize_panel(panelcleaner::add_mapping(panelcleaner::enpanel("MTCARS_PANEL",
#> our_mtcars), item_mapping))))
#> pnl_name <- get_attr(pnl, "panel_name")
#> pnl_mapping <- get_attr(pnl, "mapping")
#> pnl <- pnl
#> class(pnl) <- c("mapped_df", class(pnl))
#> set_attrs(pnl, mapping = pnl_mapping, panel_name = pnl_name)
#> }
#>
#> Raw command:
#> {
#> pnl <- as.data.frame(panelcleaner::bind_waves(panelcleaner::homogenize_panel(panelcleaner::add_mapping(panelcleaner::enpanel("MTCARS_PANEL",
#> our_mtcars), item_mapping))))
#> pnl_name <- get_attr(pnl, "panel_name")
#> pnl_mapping <- get_attr(pnl, "mapping")
#> pnl <- pnl
#> class(pnl) <- c("mapped_df", class(pnl))
#> set_attrs(pnl, mapping = pnl_mapping, panel_name = pnl_name)
#> }
Save this script with a filename of your choice inside the “blueprints” directory of your project. We’ll assume you are using targets:
./
_targets.R
blueprints/
... all blueprint R and CSV files go here ...
R/
... all associated R function definitions are here ...
project.Rproj
...
When running this code with either targets or drake, the blueprint metadata is automatically created. For our mtcars example, it looks like this:
#> Rows: 13 Columns: 4
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (4): name, type, description, coding
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 13 × 4
#> name type description coding
#> <chr> <chr> <chr> <chr>
#> 1 name character Name of Car NA
#> 2 mpg double Miles per gallon NA
#> 3 cyl double Number of cylinders NA
#> 4 disp double Displacement NA
#> 5 hp double Gross horsepower NA
#> 6 drat double Rear axle ratio NA
#> 7 wt double Weight NA
#> 8 qsec double Quarter mile time NA
#> 9 vs character Engine "coding(code(\"straight\",\"1\"), co…
#> 10 am character Transmission "coding(code(\"manual\",\"1\"), code…
#> 11 gear double Number of forward gears NA
#> 12 carb double Number of carburetors NA
#> 13 wave character NA NA
Manually editing this metadata file allows you to add tests that check variable types and values.
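The file can be edited in a spreadsheet or text editor; as a small programmatic sketch (the edit itself is illustrative, and the path matches the metadata location printed above):

# Illustrative only: fill in the missing description for the `wave` variable,
# then write the file back so future runs document (and check) the dataset
# against the edited metadata
meta <- readr::read_csv("blueprints/mt_cars.csv")
meta$description[meta$name == "wave"] <- "Data collection wave"
readr::write_csv(meta, "blueprints/mt_cars.csv")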
The last step of our work is to load this blueprint into either targets or drake. For this example, we’ll use targets, as drake is deprecated. A full discussion of targets is beyond the scope of this vignette, but the targets documentation provides an excellent walkthrough. The only detail needed here is to add blueprintr::tar_blueprints() to your _targets.R file:
# _targets.R
library(targets)
# ...
list(
  tar_target(
    item_mapping,
    readr::read_csv("where/your/mapping/file/is/stored.csv")
  ),
  blueprintr::tar_blueprints(),
  # Other targets for your project!
)
This will load all blueprints in the “blueprints” directory. If you have a nested directory structure, use blueprintr::tar_blueprints(recurse = TRUE).
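With the blueprints loaded, run the pipeline as you would any other targets pipeline; the sketch below assumes the project layout shown above:

# Build all targets, including each blueprint's generated steps
targets::tar_make()

After the run, the final mt_cars target holds the checked dataset, and blueprints/mt_cars.csv holds its metadata.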
And there you have it! You have created your first blueprint for the mtcars dataset. When running a pipeline with blueprintr, the checks warn researchers of any issues at an early stage, helping them produce replicable results.