Dataset Curation Standards

curation

standards

The goal of the Data Team when archiving, sharing, and preserving data products (datasets and derived data products) is to reduce data uncertainty and serve specific designated communities including the source community. Datasets and derived data products are archived according to the Open Archival Information System (OAIS) and the FAIR standards of scientific data management. They are shared in a way that protects and respects TIES survey respondents.

The Data Team seeks to curate datasets actively. This involves building tools and processes to assist in capturing and refining metadata at the study and variable level as early as possible in the research data lifecycle.

	Raw Datasets	Internal Datasets	Replication/Verification Datasets	Reuse Datasets
Designated Communities	TIES Data Team	TIES Staff	Academic and scholarly users	Academic and scholarly users, host country (government) users, NGO users
Metadata Requirements	None	limited	targeted towards academic use including academic schemas	complete DDI codebook
Language / Internationalization standards	None	English and language of host country	English and language of article	English and language of host country
Derived data products	Internal datasets	Internal reports, policy briefs, presentations	academic articles	policy briefs, presentations
Repository / Storage solution	according to DUA/DMP	according to DUA/DMP	journal approved or sponsored repository	Harvard Dataverse
Deidentification/Anonymization level	None	Deidentified	Anonymized	Anonymized
Quality Assurance Level	Low	Medium	High	High

Below are a list of general concepts followed by various standards for different kinds of datasets. The appendix has background material/definitions for common curation practices.

Active Curation

The process of capturing and data and metadata early and refining it throughout the data life cycle.¹

Data Uncertainty

This is the degree to which data is inaccurate, imprecise, untrusted, or unknown. All complex systems have a degree of uncertainty inherent within them. Practically this means that when designing data products, efforts should be made to reduce or remove missingness and ensure that data products are properly described with metadata/documentation where applicable.

Desginated Community

This is the audience for a data product and can be an imagined group or persona. The term is borrowed from the Open Archival Information System (OAIS) Reference Model and defined as “an identified group of potential Consumers who should be able to understand a particular set of information.” Each designated community has its own Knowledge Base. The Knowledge Base of a DC can change over time to the point where the archive would need to update and enhance preserved information to make it understandable again for the DC.

Conceptual Structure

The conceptual structure of a dataset exists in the metadata and consists of controlled vocabularies that exist as permanent URLs. It defines the concepts that underlies the elements and variables and increases interoperability.

Raw Dataset Standards

Datasets of raw data from the data collector in whatever format it’s received. Has identifiable information and can be multiple files.

Basic study level metadata is collected or collated from preregistrations and pre-analysis plans. Variable level metadata is limited to variable names and labels. At this stage there is no conceptual structure.

Due to the sensitive nature of these data, Data use agreements and data management plans disallow the sharing of these datasets. The only derived product from raw data are in process deidentified datasets designated for internal use.

Internal Datasets Standards

Data that is being processed by the Data Team but available to other Global TIES staffmembers. This data is in the process of being cleaned and deidentified. The output of this cleaning is in a csv file.

At this point study and variable level metadata has been moved into a YAML file and follows the DDI-Codebook (2.5) schema. The metadata is focused on the needs of TIES staff and partners, therefore it does not yet include a conceptual structure for interoperability.

Terms of use are still determined by the data use agreement and data management plan. Typically, this data is not shared outside of partners or through an application procedure. The data is also held according to the DUA and DMP, usually in an NYU storage solution (Box) or in private repositories.

As the data is prepared, deidentified, and getting close to finalized, derived data products may be created from it, namely data dictionaries, policy briefs, presentations, and funder reports. Metadata shared with funder reports may be multilingual in the language of the funder or partner.

Replication/Verification Datasets Standards

These datasets are generated in support of an academic derived data product (article, presentation, poster). Data is limited to the variables and measures used in the academic work and is produced in a single csv file. These datasets may include code used to generate the derived data product.

Metadata is generated in a DDI-Codebook and as documentation in a pdf. Study level metadata is focused on discovery and linked to the published article. Variable level metadata have labels and categories. A conceptual structure constructed from academic controlled vocabularies is used.

Data is licensed using a standard or custom license depending on the sensitivity of the data.

Datasets for Reuse Standards

These datasets are designed for reuse on a wider scale than a replication/verification dataset. To remove data uncertainty, multiple csv files may be used. In these files, all variables in a study are included, as long as there’s no potential harm to the data subject. In cases where the inclusion of a variable may lead to the risk of identification, it may be embargoed or recoded to limit risk.

The metadata is generated in a DDI-Codebook and as documentation in a pdf. Study level metadata is complete, with conceptual structures to address a larger designated community than other datasets. Variable level metadata includes a data dictionary, variable grouping, conceptual structures, and summary statistics, including visualizations where appropriate.

Datasets are deposited in FAIR repositories that provide the most discoverability. Currently, we deposit reuse datasets in Harvard Dataverse² because it is indexed in Google Dataset Search. The data is licensed under a custom license to increase protection of our data subjects.

Title	Categories
Active Curation	curation
Curation
Data Management Plans	standards, curation, dmp