Project Structure

The Data Team uses a variety of technologies, mainly in the R ecosystem, to create its data projects. Each project has a similar file structure to maintain consistency and replicability across projects:

.github/         # All GitHub configuration [optional]
  workflows/     # All CI workflow definition files [optional]
blueprints/      # Contains all "blueprints" of datasets
codebooks/       # Contains exported codebooks of select datasets, depending on blueprint definition
config/
  environment.R  # Definitions of all environment variables used in this project
  packages.R     # Any `library()` for packages to be available across the project
R/               # All definitions for custom functions employed in the pipeline
.Rprofile        # Supplemental file that primarily sets up `renv`
_targets.R       # The main workflow orchestration definition file
renv.lock        # Package dependency state capture file for `renv`

What is important to note from the outset is that data are not stored inside these projects. Each project is versioned with git and hosted on TIES’ GitHub organization page. Most of our data are sensitive to some degree, so our operational practice is to load data directly via APIs or read them from NYU Box.

Project templates can be created with the internal tool dtproj.

blueprints

The Data Team uses its blueprintr package to build, test, and document datasets. blueprintr is akin to dbt, but it is designed to manage a whole host of metadata, a necessary task for dissemination of project findings and data publication. Moreover, blueprintr operates without a connection to a data warehouse, given the low connectivity and limited technical availability that the Data Team assumes in its operating environments.

Each blueprint is a pair of files:

  1. The blueprint definition file: an R script with a single blueprint() command. This file details how to generate the desired dataset and optionally includes arbitrary metadata at dataset/table level.
  2. The blueprint metadata file: a CSV file with, at minimum, the columns name, type, and description. This file enumerates the variable-level metadata.

Here is an example blueprint definition file:

blueprint(
  "ch_scales",
  description = "Self-Regulation & Self-Regulated Learning",
  command =
    .TARGET("verified_child_data") %>%
      select(
        unique_id,
        c_id_01,
        starts_with("c_sr_"),
        starts_with("c_srl_")
      ) %>%
      shorten_domain_prefixes() %>%
      enumerator_regulation_score() %>%
      basic_number_renaming() %>%
      drop_underscore_in_vars(
        c("sr", "srl"),
        "^.*{var}_(\\d+)r?$"
      )
)

And here is an example metadata CSV for the same blueprint:

name,type,title,description,coding,tests,scale,section
unique_id,character,,Survey submission ID,,,,IDs
cid_1,character,,Child ID,,has_no_duplicates(),,IDs
csr4,integer,csr4,"Some kids find it easy to sit still when they are bored BUT Other kids find it hard to sit still when they are bored Are you more like the kids that find it easy or hard?","coding(code(""Really easy"", 3), code(""Kind of easy"", 2), code(""Kind of hard"", 1), code(""Really hard"", 0))","in_set(c(0:3, NA))",Child Self-Regulation,Child Self-Regulation
csr5,integer,csr5,"Some kids find it easy to remember what they are supposed to do BUT Other kids find it hard to remember what they are supposed to do Are you more like the kids that find it easy or hard?","coding(code(""Really easy"", 3), code(""Kind of easy"", 2), code(""Kind of hard"", 1), code(""Really hard"", 0))","in_set(c(0:3, NA))",Child Self-Regulation,Child Self-Regulation
csr8,integer,csr8,"Some kids find it easy to obey rules BUT Other kids find it hard to obey rules Are you more like the kids that find it easy or hard?","coding(code(""Really easy"", 3), code(""Kind of easy"", 2), code(""Kind of hard"", 1), code(""Really hard"", 0))","in_set(c(0:3, NA))",Child Self-Regulation,Child Self-Regulation
csr10,integer,csr10,"Some kids find it easy to be careful BUT Other kids find it hard to be careful Are you more like the kids that find it easy or hard?","coding(code(""Really easy"", 3), code(""Kind of easy"", 2), code(""Kind of hard"", 1), code(""Really hard"", 0))","in_set(c(0:3, NA))",Child Self-Regulation,Child Self-Regulation
csr12,integer,csr12,"Some kids find it easy to think before they act BUT Other kids find it hard to think before they act Are you more like the kids that find it easy or hard?","coding(code(""Really easy"", 3), code(""Kind of easy"", 2), code(""Kind of hard"", 1), code(""Really hard"", 0))","in_set(c(0:3, NA))",Child Self-Regulation,Child Self-Regulation
csr13,integer,csr13,"Some kids find it easy to pay attention to their schoolwork BUT Other kids find it hard to pay attention to their schoolwork Are you more like the kids that find it easy or hard?","coding(code(""Really easy"", 3), code(""Kind of easy"", 2), code(""Kind of hard"", 1), code(""Really hard"", 0))","in_set(c(0:3, NA))",Child Self-Regulation,Child Self-Regulation
csr15,integer,csr15,"Some kids find it easy to focus on things that are important BUT Other kids find it hard to focus on things that are important Are you more like the kids that find it easy or hard?","coding(code(""Really easy"", 3), code(""Kind of easy"", 2), code(""Kind of hard"", 1), code(""Really hard"", 0))","in_set(c(0:3, NA))",Child Self-Regulation,Child Self-Regulation
csr16,integer,csr16,"Some kids find it easy to work on a project until they are finished BUT Other kids find it hard to work on a project until they are finished Are you more like the kids that find it easy or hard?","coding(code(""Really easy"", 3), code(""Kind of easy"", 2), code(""Kind of hard"", 1), code(""Really hard"", 0))","in_set(c(0:3, NA))",Child Self-Regulation,Child Self-Regulation
csr17,integer,csr17,"Some kids find it easy to concentrate on one thing for a long time BUT Other kids find it hard to concentrate on one thing for a long time Are you more like the kids that find it easy or hard?","coding(code(""Really easy"", 3), code(""Kind of easy"", 2), code(""Kind of hard"", 1), code(""Really hard"", 0))","in_set(c(0:3, NA))",Child Self-Regulation,Child Self-Regulation
csr18,integer,csr18,"Some kids find it easy to stay focused on their goals BUT Other kids find it hard to stay focused on their goals Are you more like the kids that find it easy or hard?","coding(code(""Really easy"", 3), code(""Kind of easy"", 2), code(""Kind of hard"", 1), code(""Really hard"", 0))","in_set(c(0:3, NA))",Child Self-Regulation,Child Self-Regulation
csrl1,integer,csrl1,"Some kids make sure no one disturbs them when they study at home. Would you say you are like them? [pause for response]. Now that you decided that you [ARE/ARE NOT] like them. [If Yes] Are you ""A lot"" or ""Kind of "" like them? [If No] Are you ""A little"" or ""Not at all "" like them? [pause for response]","coding(code(""A lot like them"", 3), code(""Kind of like them"", 2), code(""A little like them"", 1), code(""Not at all like them"", 0))","in_set(c(0:3, NA))",Child Self-Regulated Learning,Child Self-Regulated Learning
csrl3,integer,csrl3,"Some kids try to find a quiet place to study at home. Would you say you are like them? [pause for response]. Now that you decided that you [ARE/ARE NOT] like them. [If Yes] Are you ""A lot"" or ""Kind of "" like them? [If No] Are you ""A little"" or ""Not at all "" like them? [pause for response]","coding(code(""A lot like them"", 3), code(""Kind of like them"", 2), code(""A little like them"", 1), code(""Not at all like them"", 0))","in_set(c(0:3, NA))",Child Self-Regulated Learning,Child Self-Regulated Learning
csrl4,integer,csrl4,"Some kids ask their friends or family for help when they are struggling with their homework. Would you say you are like them? [pause for response]. Now that you decided that you [ARE/ARE NOT] like them. [If Yes] Are you ""A lot"" or ""Kind of "" like them? [If No] Are you ""A little"" or ""Not at all "" like them? [pause for response]","coding(code(""A lot like them"", 3), code(""Kind of like them"", 2), code(""A little like them"", 1), code(""Not at all like them"", 0))","in_set(c(0:3, NA))",Child Self-Regulated Learning,Child Self-Regulated Learning
csrl10,integer,csrl10,"Some kids encourage themselves or tell themselves ""you can do it"" when they are struggling with their homework. Would you say you are like them? [pause for response]. Now that you decided that you [ARE/ARE NOT] like them. [If Yes] Are you ""A lot"" or ""Kind of "" like them? [If No] Are you ""A little"" or ""Not at all "" like them? [pause for response]","coding(code(""A lot like them"", 3), code(""Kind of like them"", 2), code(""A little like them"", 1), code(""Not at all like them"", 0))","in_set(c(0:3, NA))",Child Self-Regulated Learning,Child Self-Regulated Learning
csrl15,integer,csrl15,"Some kids review the instructions before starting their homework. Would you say you are like them?","coding(code(""A lot like them"", 3), code(""Kind of like them"", 2), code(""A little like them"", 1), code(""Not at all like them"", 0))","in_set(c(0:3, NA))",Child Self-Regulated Learning,Child Self-Regulated Learning
csrl16,integer,csrl16,"Some kids try to calm down or take a deep breath when they are struggling with their homework. Would you say you are like them?","coding(code(""A lot like them"", 3), code(""Kind of like them"", 2), code(""A little like them"", 1), code(""Not at all like them"", 0))","in_set(c(0:3, NA))",Child Self-Regulated Learning,Child Self-Regulated Learning
csrl17,integer,csrl17,"Some kids try to ""take their time"" and do their homework with patience. Would you say you are like them?","coding(code(""A lot like them"", 3), code(""Kind of like them"", 2), code(""A little like them"", 1), code(""Not at all like them"", 0))","in_set(c(0:3, NA))",Child Self-Regulated Learning,Child Self-Regulated Learning
csrl18,integer,csrl18,"Some kids take little breaks when working on challenging homework. Would you say you are like them?","coding(code(""A lot like them"", 3), code(""Kind of like them"", 2), code(""A little like them"", 1), code(""Not at all like them"", 0))","in_set(c(0:3, NA))",Child Self-Regulated Learning,Child Self-Regulated Learning
csrl19,integer,csrl19,"Some kids gather their notebooks or any materials they need before they start their homework. Would you say you are like them?","coding(code(""A lot like them"", 3), code(""Kind of like them"", 2), code(""A little like them"", 1), code(""Not at all like them"", 0))","in_set(c(0:3, NA))",Child Self-Regulated Learning,Child Self-Regulated Learning
csrl20,integer,csrl20,"Some kids look for information in their notes, videos, books, internet or exercises when they are struggling with their homework. Would you say you are like them?","coding(code(""A lot like them"", 3), code(""Kind of like them"", 2), code(""A little like them"", 1), code(""Not at all like them"", 0))","in_set(c(0:3, NA))",Child Self-Regulated Learning,Child Self-Regulated Learning
csrl21,integer,csrl21,"Some kids prepare for an exam by reviewing their notes or making study materials. Would you say you are like them?","coding(code(""A lot like them"", 3), code(""Kind of like them"", 2), code(""A little like them"", 1), code(""Not at all like them"", 0))","in_set(c(0:3, NA))",Child Self-Regulated Learning,Child Self-Regulated Learning
csrl24,integer,csrl24,"Some kids look to see how hard their homework is before deciding whether they will work on their homework or do something fun. Would you say you are like them?","coding(code(""A lot like them"", 3), code(""Kind of like them"", 2), code(""A little like them"", 1), code(""Not at all like them"", 0))","in_set(c(0:3, NA))",Child Self-Regulated Learning,Child Self-Regulated Learning
csrl22,integer,csrl22,"Some kids prepare for an exam by doing practice tests to see where they are having trouble. Would you say you are like them?","coding(code(""A lot like them"", 3), code(""Kind of like them"", 2), code(""A little like them"", 1), code(""Not at all like them"", 0))","in_set(c(0:3, NA))",Child Self-Regulated Learning,Child Self-Regulated Learning

These metadata CSVs have three required fields:

  • name: The variable name
  • type: The variable type (usually “character”, “integer”, “double”, or “logical”)
  • description: Description of the variable content. If the dataset corresponds to a survey, this is usually the question wording in English.
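
As a minimal sketch of what the required-field rule implies, the base R snippet below reads a small, made-up metadata CSV and reports which required columns are absent. The helper missing_meta_columns() and the inline CSV rows are hypothetical illustrations; blueprintr performs its own checks, and this is not its implementation.

```r
# Columns every metadata CSV is expected to carry (from the list above).
required_cols <- c("name", "type", "description")

# Hypothetical helper: return the required columns missing from a metadata table.
missing_meta_columns <- function(meta) {
  setdiff(required_cols, names(meta))
}

# A made-up two-row metadata file, inlined so the example is self-contained.
csv_lines <- c(
  "name,type,description,tests",
  "unique_id,character,Survey submission ID,",
  "cid_1,character,Child ID,has_no_duplicates()"
)
meta <- read.csv(text = csv_lines, stringsAsFactors = FALSE)

# All required columns are present, so nothing is reported.
complete_check <- missing_meta_columns(meta)

# Dropping the description column makes the check report it.
incomplete_check <- missing_meta_columns(meta[, c("name", "type")])
print(incomplete_check)
```

A check of this kind can be run before a pipeline build so that an incomplete metadata file fails early rather than midway through a run.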

There are often other columns:

  • coding: If the variable is categorical, this contains the label-value mapping for the variable, written with rcoder syntax.
  • tests: Any content tests on the data
  • scale: If the variable belongs to a psychometric scale, the name of the scale. This is used for identifying variable groups for psychometric descriptive statistics.
  • title: A shorter description for the variable, used as a variable label when exported to Stata
  • section: Codebook section; if no section is assigned, the codebook will place the variable into the “Other” section
  • section_description: Description for the section. Useful for providing extra context for the codebook section.
  • group: Codebook subsection / variable group. Useful for combining a collection of variables, e.g. the items of a scale
  • group_description: Description for the variable group. Useful for adding an introductory statement asked before each question in the group.
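
To make the coding and tests columns more concrete, here is a rough base R sketch of what entries such as coding(code("Really easy", 3), ...) and in_set(c(0:3, NA)) express. It deliberately avoids rcoder and blueprintr: the vector responses is made-up data, and factor() and %in% only stand in for the label-value mapping and the content test. It is an illustration of the idea, not how either package works internally.

```r
# Made-up responses for one survey item; NA marks a skipped question.
responses <- c(3L, 0L, 2L, NA, 1L)

# Content test, analogous in spirit to in_set(c(0:3, NA)):
# every value must be one of 0, 1, 2, 3, or NA.
allowed <- c(0:3, NA)
passes_test <- all(responses %in% allowed)

# A value outside the allowed set makes the same check fail.
fails_test <- all(c(4L, 1L) %in% allowed)

# Label-value mapping, analogous in spirit to the coding() entry.
value_labels <- c(
  "Really hard" = 0L,
  "Kind of hard" = 1L,
  "Kind of easy" = 2L,
  "Really easy" = 3L
)
labelled_responses <- factor(
  responses,
  levels = value_labels,
  labels = names(value_labels)
)
print(as.character(labelled_responses))
```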

codebooks

“Codebooks” are essentially data dictionaries aimed at social science research. They commonly include enumerations of variables in a dataset, as well as their descriptions and (when applicable) categorical codings. Some codebooks also include methodology descriptions and other descriptive statistics of the data.

The codebooks folder contains HTML codebook exports of selected blueprints, as indicated by the presence of blueprintr::bp_export_codebook() in the blueprint definition file:

blueprint(
  "ch_scales",
  description = "Self-Regulation & Self-Regulated Learning",
  command =
    some_command()
) |>
  bp_export_codebook()

Unless otherwise agreed upon, these codebooks are for internal purposes only. They are mainly present to support TIES’ members in their research.

config

The config folder has two main R files:

  • environment.R
  • packages.R

Other project-specific files, like YAML configuration, may be stored in this folder.

environment.R

Sensitive information that the pipeline needs in order to run, like API keys and passwords, must be stored as environment variables and never checked into version control. Environment variables are generally stored in a personal .Renviron file, but it is our practice to load them into global variables at the start of the pipeline to avoid unnecessary calls to Sys.getenv().

Here is an example environment.R:

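# `unset` must be a single character string or NA (NULL is not accepted);
# NA signals that the variable has not been set.
BOX_PATH <- Sys.getenv("BOX_PATH", unset = NA)

# Feature flags default to FALSE when the variable is not set
F_RUN_TESTS <- as.logical(Sys.getenv("F_RUN_TESTS", unset = "FALSE"))
F_NIGHTLY <- as.logical(Sys.getenv("F_NIGHTLY", unset = "FALSE"))

CACHE_PASSPHRASE <- Sys.getenv("CACHE_PASSPHRASE", unset = NA)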
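
One way to make this pattern a little more defensive is sketched below. The helper env_flag() and the variable names F_DEMO_ON and F_DEMO_MISSING are hypothetical and not part of the project template. The helper reads a logical flag and stops early when the value cannot be parsed, instead of silently yielding NA.

```r
# Hypothetical helper: read a TRUE/FALSE environment variable with a default.
env_flag <- function(name, default = FALSE) {
  raw <- Sys.getenv(name, unset = NA)
  # Treat an unset or empty variable as "use the default".
  if (is.na(raw) || !nzchar(raw)) {
    return(default)
  }
  value <- as.logical(raw)
  # as.logical() returns NA for strings it cannot interpret, e.g. "maybe".
  if (is.na(value)) {
    stop("Environment variable ", name, " is not a logical value: ", raw)
  }
  value
}

# Demonstration with made-up variable names.
Sys.setenv(F_DEMO_ON = "TRUE")
Sys.unsetenv("F_DEMO_MISSING")

flag_on <- env_flag("F_DEMO_ON")            # variable is set to "TRUE"
flag_missing <- env_flag("F_DEMO_MISSING")  # falls back to the default
print(c(flag_on, flag_missing))
```

Failing early on an unparseable value tends to be easier to debug than an NA flag that surfaces later inside a conditional.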

packages.R

This file serves two purposes:

  1. Capture soft dependencies in the project code (i.e. packages that are required but not directly referred to in the code)
  2. Attach packages via library() to make those packages’ exported functions available across the entire pipeline

Our project structure uses renv to manage the specific versions of packages employed in our pipeline, which improves replicability. renv captures these dependencies via code inspection; however, a soft dependency (one that is not explicitly stated in the code) sometimes occurs, e.g. when a package depends on another for some plotting routine. To capture these dependencies, we place a reference to one of the package’s functions in packages.R so that renv treats that package as a hard dependency.
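
The toy inspector below illustrates why a bare reference such as labelled::to_labelled is enough for static code inspection to notice a package. It is a simplified base R sketch: find_pkg_refs() and the sample source lines are hypothetical, and renv's real discovery (exposed through renv::dependencies()) is considerably more thorough.

```r
# Toy inspector: collect package names from pkg::fn references and library() calls.
find_pkg_refs <- function(expr) {
  if (!is.call(expr)) {
    return(character(0))
  }
  head <- expr[[1]]
  if (is.name(head) && as.character(head) %in% c("::", ":::", "library")) {
    # The first argument is the package name (a symbol or a string).
    if (length(expr) >= 2) {
      return(as.character(expr[[2]]))
    }
    return(character(0))
  }
  # Otherwise, look inside every part of the call.
  unlist(lapply(as.list(expr), find_pkg_refs))
}

# Made-up file contents, parsed but never evaluated.
source_lines <- c(
  "labelled::to_labelled   # soft dependency pinned by a bare reference",
  "library(targets)",
  "x <- 1 + 2"
)
exprs <- parse(text = source_lines, keep.source = FALSE)

pkgs <- unique(unlist(lapply(as.list(exprs), find_pkg_refs)))
print(pkgs)
```

Because the code is only parsed, the referenced packages do not need to be installed for the reference to be detected.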

Use of library() should be restricted to packages that are used extensively. As stated in @ref(rstyle-funcs-package-deps), package::func() syntax is preferred when writing functions; the same style is also preferred throughout most of the pipeline definition, for clarity and long-term maintenance.

Example of packages.R:

# Retain suggested packages in renv
labelled::to_labelled
kableExtra::kable_as_image
styler::style_dir # Format-on-save capability in VSCode
languageserver::run # Necessary for VSCode to work in renv projects

# Attach packages used in the entire pipeline here
library(targets)
library(tarchetypes)
library(tidytable)
library(blueprintr)
library(rcoder)

See also