The goal of dataspice-tutorial is to provide a practical exercise in creating metadata for an example field-collected data product using the dataspice package.

  • Understand basic metadata and why it is important.

  • Understand where and how to store them.

  • Understand how they can feed into more complex metadata objects.

dataspice workflow

see introductory slides for further background



Setup

create workshop project

Let’s create a new RStudio project in which to work:

File > New Project > New Directory > New Project > practical-data-management

In our new project, let’s create a data/ folder in which to store the data.

dir.create("data")

Let’s also open a new R script in which to work:

File > New File > R Script

Save it in the project root, e.g. as metadata_dev.R


Packages

Let’s install and load all the packages we’ll need for the workshop:

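install.packages("readr")
install.packages("devtools")
# here, jsonlite and listviewer are called with :: later in the tutorial
install.packages(c("here", "jsonlite", "listviewer"))
devtools::install_github("ropenscilabs/dataspice")
library(readr)
library(dataspice)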



🚦 Data

For more information on the data source check the tutorial README

Get data

The readr::read_csv() function allows us to download and read in raw csv data from a URL.

mt <- read_csv("https://raw.githubusercontent.com/annakrystalli/dataspice-tutorial/master/data/vst_mappingandtagging.csv")

ppy <- read_csv("https://raw.githubusercontent.com/annakrystalli/dataspice-tutorial/master/data/vst_perplotperyear.csv")

Inspect data

You can inspect any object in your environment in RStudio using the View() function.

# vst_mappingandtagging
View(mt)
# vst_perplotperyear
View(ppy)
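
View() needs an interactive RStudio session. As a minimal sketch for scripts or a plain console, base R offers text-based ways to inspect a table. The small demo data frame below is a hypothetical stand-in for mt or ppy; its values are made up for illustration.

```r
# Hypothetical stand-in for `mt` or `ppy`; values are illustrative only.
demo <- data.frame(
  plotID = c("PLOT_001", "PLOT_002", "PLOT_003"),
  date = c("2015-06-06", "2015-07-21", "2015-11-18"),
  decimalLatitude = c(42.4, 43.1, 44.0),
  stringsAsFactors = FALSE
)

dim(demo)      # number of rows and columns
names(demo)    # variable names
str(demo)      # type and first values of each column
head(demo, 2)  # first two rows
summary(demo)  # per-column summaries
```

The same calls work unchanged on mt and ppy once they have been read in.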

Save data

write_csv(mt, here::here("data", "vst_mappingandtagging.csv"))

write_csv(ppy, here::here("data", "vst_perplotperyear.csv"))

Once we’ve saved our data files in the data folder, we can use functions in the dataspice package to create metadata files and complete them.



🚦 Create metadata files

We’ll start by creating the basic metadata .csv files in which to collect metadata related to our example dataset using function dataspice::create_spice().

create_spice()

This creates a metadata folder in your project’s data folder (although you can specify a different directory if required) containing 4 .csv files in which to record your metadata.


  • access.csv: record details about where your data can be accessed.
  • attributes.csv: record details about the variables in your data.
  • biblio.csv: record dataset level metadata like title, description, licence and spatial and temporal coverage.
  • creators.csv: record creator details.
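
To confirm that the four files listed above were created, you can list the contents of the metadata folder. This is only a sketch: it assumes the default data/metadata location, and expected_files and missing_files are names introduced here for illustration.

```r
expected_files <- c("access.csv", "attributes.csv", "biblio.csv", "creators.csv")
metadata_dir <- file.path("data", "metadata")  # assumed default location

# list.files() returns character(0) if the folder does not exist yet,
# in which case all four files are reported as missing.
missing_files <- setdiff(expected_files, list.files(metadata_dir))

if (length(missing_files) == 0) {
  message("All four metadata templates are present.")
} else {
  message("Missing: ", paste(missing_files, collapse = ", "))
}
```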



Record metadata


creators.csv

The creators.csv contains details of the dataset creators.

Let’s start with a quick and easy file to complete, the creators. We can open and edit the file in an interactive shiny app using dataspice::edit_creators().

Although we did not collect this data, just complete with your own details for the purposes of this tutorial.

edit_creators()

Remember to click on Save when you’re done editing.


access.csv

The access.csv contains details about where the data can be accessed.

Before manually completing any details in the access.csv, we can use dataspice’s dedicated function prep_access() to extract relevant information from the data files themselves.

prep_access()

Next, we can use function edit_access() to complete the final details required, namely the URL from which each dataset can be downloaded. Use the URL from which we downloaded each data file in the first place (hint ☝️).

We can also edit details such as the name field to something more informative if required.

edit_access()

Remember to click on Save when you’re done editing.
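
If you would rather not copy the URLs by hand, they can be rebuilt from the file names, since both files were downloaded from the same folder of the tutorial repository. This is a small sketch; base_url, file_names and content_urls are names introduced here, and the resulting strings still need to be pasted into the app yourself.

```r
# Folder of the tutorial repository from which both files were downloaded
base_url <- "https://raw.githubusercontent.com/annakrystalli/dataspice-tutorial/master/data"
file_names <- c("vst_mappingandtagging.csv", "vst_perplotperyear.csv")

# One download URL per data file
content_urls <- paste(base_url, file_names, sep = "/")
content_urls
```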


🚦 biblio.csv

The biblio.csv contains dataset level metadata like title, description, licence and spatial and temporal coverage.

Before we start filling this table in, we can use some base R functions to extract some of the information we require. In particular we can use function range() to extract the temporal and spatial extents of our data from the columns containing temporal and spatial data.

get temporal extent

Although dates are stored as text strings, they are in ISO format (YYYY-MM-DD), so sorting them results in correct chronological ordering. If your temporal data are not in ISO format, consider converting them (see package lubridate).

range(ppy$date) 
## [1] "2015-06-06" "2015-11-18"
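
To see why the ISO format matters, here is a minimal sketch with hypothetical dates written as DD/MM/YYYY. It uses base R's as.Date() rather than lubridate so that it runs without extra packages; raw_dates is made-up data, not part of the tutorial dataset.

```r
# Hypothetical dates in DD/MM/YYYY format (not from the tutorial data)
raw_dates <- c("18/11/2015", "06/06/2015", "21/07/2015")

# Sorting as text compares character by character, so the "latest"
# date comes out as 21/07/2015 instead of 18/11/2015.
range(raw_dates)

# Converting to Date first gives the correct chronological range,
# and format() prints it back in ISO format.
iso_dates <- as.Date(raw_dates, format = "%d/%m/%Y")
temporal_extent <- format(range(iso_dates))
temporal_extent
```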

get geographical extent

The lat/lon coordinates are in decimal degrees, so it is again easy to calculate the range in each dimension.

South/North boundaries
range(ppy$decimalLatitude)
## [1] 42.39229 44.06795
West/East boundaries
range(ppy$decimalLongitude)
## [1] -72.26573 -71.28145

NB: you can also supply the geographic boundaries of your data as a single well-known text string in field wktString instead of supplying the four boundary coordinates.
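
As a rough sketch of what such a well-known text (WKT) string could look like, the four boundary values printed above can be assembled into a bounding-box polygon with base R. Treating the extent as a simple rectangle is an assumption made here for illustration; lat, lon, corners and wkt_string are names introduced for this example.

```r
# Boundary values taken from the range() output above
lat <- c(south = 42.39229, north = 44.06795)
lon <- c(west = -72.26573, east = -71.28145)

# WKT lists coordinates as "x y", i.e. longitude then latitude,
# and a polygon ring must end on the point where it started.
corners <- rbind(
  c(lon["west"], lat["south"]),
  c(lon["east"], lat["south"]),
  c(lon["east"], lat["north"]),
  c(lon["west"], lat["north"]),
  c(lon["west"], lat["south"])
)

ring <- paste(sprintf("%.5f %.5f", corners[, 1], corners[, 2]), collapse = ", ")
wkt_string <- sprintf("POLYGON((%s))", ring)
wkt_string
```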


Now that we’ve got the values for our temporal and spatial extents, we can complete the fields in the biblio.csv file using function dataspice::edit_biblio().

edit_biblio()

Remember to click on Save when you’re done editing.

🔍 metadata hunt

Additional information required to complete these fields can be found on the NEON data portal page for this dataset and the dataspice-tutorial repository README

Here’s an example to get you started


🚦 attributes.csv

The attributes.csv contains details about the variables in your data.

Again, dataspice provides functionality to populate the attributes.csv by extracting the variable names from individual data files using function dataspice::prep_attributes().

The function is vectorised and maps over each .csv file in our data/ folder.

prep_attributes()
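
To make the idea of mapping over each .csv file concrete, here is a base R sketch that collects the variable names from every .csv file in a folder. It is an illustration of the concept only, not dataspice's actual implementation; the demo folder, the two small files and the column names of the variables table are all made up for this example.

```r
# Build a throwaway folder with two tiny, made-up csv files
demo_dir <- file.path(tempdir(), "dataspice-demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(plotID = "A", date = "2015-06-06"),
          file.path(demo_dir, "plots.csv"), row.names = FALSE)
write.csv(data.frame(tagID = 1, taxonID = "ACRU"),
          file.path(demo_dir, "tags.csv"), row.names = FALSE)

# Map over every csv file and record one row per variable
csv_files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)
variables <- do.call(rbind, lapply(csv_files, function(path) {
  data.frame(
    fileName = basename(path),
    variableName = names(read.csv(path, nrows = 1)),
    stringsAsFactors = FALSE
  )
}))
variables
```

The result has one row per variable per file, which is the kind of skeleton that then needs descriptions and units filled in by hand.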


We can now use dataspice::edit_attributes() to fill in the final details, namely the description and units associated with the variables.

edit_attributes()

Remember to click on Save when you’re done editing.

🔍 metadata hunt

Descriptions

Additional information required to complete these fields can be found on the NEON data portal page for this dataset and the dataspice-tutorial repository README. Useful naming conventions that apply throughout NEON data products can be found here

Units

For dataspice, we have opted to use unit specifications that can be parsed by the R package units, which provides a class for maintaining unit metadata and functionality for checking compatibility and conversion. units is itself based on UDUNITS-2, a C library for units of physical quantities with a unit-definition and value-conversion utility.

You can install the units package and use it to search for units (have a look at the package vignette), but for quicker browsing and searching you can use the handy “Units and Symbols Found in the UDUNITS2 Database” web app by the North Carolina Institute for Climate Studies to identify prefixes and base unit definitions.

For now, use the name rather than the symbol from the UDUNITS definition. More complex units can be defined arithmetically by combining base units (e.g. meter cubed could be specified as m3 or m^3; see the units vignette for more details).
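
One way to check a unit string before typing it into attributes.csv is to ask the units package whether it can parse it. The sketch below assumes the units package is installed (it is skipped otherwise) and that units::as_units() raises an error for strings it does not recognise; candidate_units is a made-up list, and no particular result is claimed for any entry.

```r
# Made-up candidates; the last one is deliberately not a real unit
candidate_units <- c("meter", "centimeter", "m^3", "not_a_unit")

if (requireNamespace("units", quietly = TRUE)) {
  # TRUE where the string could be turned into a units object
  parses <- vapply(candidate_units, function(u) {
    !inherits(try(units::as_units(u), silent = TRUE), "try-error")
  }, logical(1))
  print(parses)
} else {
  message("Install the units package to run this check.")
}
```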


Here’s an example to get you started



🚦 Create metadata json-ld file

Now that all our metadata files are complete, we can compile them into a structured dataspice.json file in our data/metadata/ folder.

write_spice()

Here’s an interactive view of the dataspice.json file we just created:

spice <- jsonlite::read_json(here::here("data", "metadata", "dataspice.json"))
listviewer::jsonedit(spice)
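
If you are not working interactively, or do not have listviewer installed, a plain-text look at the top level of the file is often enough. This sketch assumes the file sits at the default data/metadata/dataspice.json path and that jsonlite is installed; it does nothing if either is missing. json_path and spice_list are names introduced here.

```r
json_path <- file.path("data", "metadata", "dataspice.json")  # assumed default path

if (file.exists(json_path) && requireNamespace("jsonlite", quietly = TRUE)) {
  spice_list <- jsonlite::read_json(json_path)
  print(names(spice_list))        # top-level fields of the metadata object
  str(spice_list, max.level = 1)  # one-line summary of each field
} else {
  message("No dataspice.json found yet, or jsonlite is not installed.")
}
```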

Publishing this file on the web means it can be indexed by Google Dataset Search! 😃 👍



Build README site

Finally, we can use the dataspice.json file we just created to produce an informative README web page to include with our dataset for humans to enjoy! 🤩

We use function dataspice::build_site() which creates file index.html in the docs/ folder of your project (which it creates if it doesn’t already exist).

build_site()
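
To open the generated page straight from the console, base R's browseURL() can be used. This is a small sketch that assumes the default output location docs/index.html described above; site_path is a name introduced here.

```r
site_path <- file.path("docs", "index.html")  # assumed default output location

if (!file.exists(site_path)) {
  message("No site found at ", site_path, "; run build_site() first.")
} else if (interactive()) {
  # Opens the page in the default web browser
  utils::browseURL(site_path)
}
```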


View the resulting file here


Here’s a screen shot!


back to the outro slides