Other Data Team Services
Documentation Generation
To generate documentation and reports, the Data Team uses literate programming, a programming philosophy that interweaves (or knits) together a markup language with a programming language in order to generate human readable documents. These documents can include codebooks, dataset documentation, or reports.
In order to generate the documentation, the information needed for the report has to be provided by the PIs or requester in a structured format e.g. a YAML file, JSON file, or a set of tables.
R Markdown files
R Markdown files are native to RStudio1 and use a notebook interface to weave together natural language, markup language, and code. Notebook is a document with programmatic chunks that can be executive independently of the generated document. Jupyter notebook is another notebook format, more typically used with Python and in data science.
Posit2 provides a cheat sheet3 for R Markdown. Other useful packages include {bookdown}4, {kableExtra}5 (for tables), and {flextable}6 (for tables).
Markup languages
To create static documents we rely on three markup languages.
- markdown: Lightweight markup language to add formatting to natural language data. Can be used in multiple output formats (word docs, pdf, web pages, presentations, etc.There’s also various flavors of markdown including Github and Trello. Cheat sheet: https://www.markdownguide.org/cheat-sheet/
- HyperText Markup Language (HTML): Used to structure a web page and its content. Consists of elements that can enclose, wrap, or format natural langauge data or other values. To set a uniform format for documents, the Data Team can use Cascading Style Sheets (CSS). Cheat sheet: https://web.stanford.edu/group/csp/cs21/htmlcheatsheet.pdf
- LaTeX: Document preparation system used for the communication and publication of scientific documents. Outputs pdf and allows for more intricate formatting, especially of scientific/mathematical formulas. LaTeX is more complex than markdown and HTML, allowing for more intricate customization, including custom macros. A file called preface.tex can be used to uniformly format documents. An example from the Urban Institute can be found in this Github repository. Overleaf is an online LaTeX editor that can be used for collaborative writing.
Research Impact
The impact of a dataset, publication, or author can be measured using a variety of metrics including the H-Index, times, cited, journal impact factor, or social media and internet reach. This is called bibliometrics or sciencometrics. Nontraditional impact metrics are known as altmetrics. The Data Team can help you measure and track the impact of your research, data, or other derived data products on their own or against other benchmarks, usually a bibliographic corpus around a theme or academic field.
Caveat
Impact is determined by the data source. Don’t be surprised if your H-Index changes depending on the corpus of research documents/database being used. Also, there are various ways of hacking the below basic terms, especially H-Index so always take research impact metrics with a grain of salt.
It’s also important to note that every academic work has a half life (the period in which it reaches half of its reach) and velocity, the speed at which it gets to its half life and full reach. Once an article reaches its full reach, it can become obsolete. The period to reach this point though has been growing due to the ease of access in online databases.
Bibliometric Terms
- H-Index: an author metric that matches the most times an article has been cited and the number of articles an author has published. There’s a slew of adaptations to the H-Index as well as Google Scholar’s I10 Index (the number of articles cited 10 times).
- Times cited: the number of times an article has been cited by other peer reviewed articles within a database
- Journal Impact Factor: ratio between the number of times articles in a journal has been cited and the number of citable articles.
- Immediacy index: the average number of times a journal article was cited in the year it was published.
Altmetrics
Altmetrics are alternative metrics that measure an article’s non-academic impact in social media, the news, Wikipedia, public policy, teaching, etc. They’re viewed as being complimentary to bibliometrics.
Common altmetrics used:
- Altmetric Attention score: The Altmetric score is a score that weights the different kinds of non-academic mentions an article received. A list of the sources is at https://www.altmetric.com/about-our-data/our-sources/. This score is generated by the Altmetric.com, a subsidiary of Digital Science.
- Mendeley readers: Mendeley is a social bookmarking site and citation manager for academic articles. They track the number of saves on the site as Mendeley Readers. From the user provided information they then provide metrics on their readers that have saved a specific article. Mendeley readers has been shown to have some predictive power as to the impact of an article.
Data Visualization
Data visualizations, the practice of using charts, graphs and plots to transmit information about data, is one of the services that the Data Team can provide to TIES staff.
Data visualization allows staff to use visual storytelling, making complex analytic and informational points about a dataset in an easier more accessible format. Unlike text, visualizations clearer and more direct points about a data.
Typically, the data team will build visualizations using the R package ggplot2
.
Types of data visualization
Below are the four general types of data visualizations.
- Comparative: Compares two different sets of data (either two different datasets, two subsets of the dame datasets, or even two different research questions)
- Compositional: Describes parts of a whole in a dataset
- Distributional: Shows all possible values of a dataset and where and how they occur in an instance space
- Relational: How variables in a dataset influence one another
Parts of a data visualizaiton
Besides the visualization itself, a data visualization can have any of the below aspects. Each aspect gives the viewer information on the data that the visualization is representing.
- axis
- title/subtitle
- legend
- labels
- notes
Types of visualizations
For examples of visualizations that the Data Team can build see https://www.data-to-viz.com/.
Dashboards
Dashboards are a collection of visualizations typically used for business intelligence analytics. They use visualizations, legends, drop down menus, number and timelines and other graphics as filters for the entire dashboard. This allows the user to explore metrics or KPIs within a dataset.
Dashboards also usually allow the user to join datasets from different sources, using joins from SQL/RDMS to merge them together. This allows the dashboard designer to find new actionable insights.
Dashboard Applications
The Data Team will usually use Google Data Studio to build a dashboard because it’s free to use and incorporates the same privacy features and capabilities as the Google Platform.