An Introduction to using R package Dataspice to collect metadata for EMODnet Biology Data Products
The goal of this tutorial is to provide a practical exercise in creating metadata for an example field-collected data product using the dataspice package.
Understand basic metadata and why it is important.
Understand where and how to store them.
Understand how they can feed into more complex metadata objects.
To follow this tutorial, you will need to have a recent version of R and ideally RStudio Desktop installed.
You will also need to install package dataspice as well as a few additional helper packages.
install.packages(c("dataspice", "here", "jsonlite", "listviewer"))
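Before moving on, it can help to confirm that the packages are actually available. This is a minimal sketch using only base R; the variable names pkgs, installed and missing_pkgs are just illustrative.

```r
# Packages used in this tutorial (taken from the install.packages() call above)
pkgs <- c("dataspice", "here", "jsonlite", "listviewer")

# requireNamespace() returns TRUE if a package can be loaded, without attaching it
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  message("Still to install: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All packages are available.")
}
```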
You will also need to fork (make a copy of) and clone (download your fork) the example Data Product EMODnet/EMODnet-Biology-Benthic-Habitats-Occurrences-Traits I will be working with in this tutorial.
dataspice workflow
The example Data Product I’m working with is EMODnet/EMODnet-Biology-Benthic-Habitats-Occurrences-Traits and has the following structure:
├── EMODnet_occs_habs.Rproj
├── README.html
├── README.md
├── analysis
│ ├── benthic data habitat matching.Rmd
│ ├── benthic species habitat dashboards.Rmd
│ └── species and traits.Rmd
├── data
│ ├── derived_data
│ └── raw_data
├── docs
│ ├── benthic-data-habitat-matching.html
│ ├── benthic-data-habitat-matching.pdf
│ ├── benthic-species-habitat-dashboards.html
│ ├── benthic-species-habitat-dashboards.pdf
│ ├── benthic-and-traits.html
│ ├── species-and-traits.pdf
│ └── species
├── product
│ ├── benthic_sampling_event_habitat_summaries.csv
│ ├── benthic_specicies_habitat_summaries.csv
│ └── species_hab_plots
└── scripts
  ├── biotic traits data.R
  ├── get_event_habitats.R
  ├── get_species_habitats.R
  ├── get_worms_fgrp.R
  └── plot_species_habitats.R
Generally, it is not necessary to document raw or intermediate data files. It is important to document any outputs or analytical data produced.
We therefore focus on the contents of the product/ directory.
In particular, we will document the following two files:
benthic_sampling_event_habitat_summaries.csv
benthic_specicies_habitat_summaries.csv
We’ll start by creating the basic metadata .csv files in which to collect metadata related to our example dataset, using function dataspice::create_spice().
The default behaviour of the function (called without providing a dir argument) is to create a metadata directory inside the data directory at the root of the project.
However, because I want to document the files in the product directory, I override the default write-out location by providing a dir argument.
dataspice::create_spice(dir = "product")
This creates a metadata directory in the project’s product directory containing 4 .csv files in which to record metadata.
product
├── benthic_sampling_event_habitat_summaries.csv
├── benthic_specicies_habitat_summaries.csv
├── metadata
│ ├── access.csv
│ ├── attributes.csv
│ ├── biblio.csv
│ ├── creators.csv
│ └── dataspice.json
└── species_hab_plots
  ├── Aphia103228_habitat_plot.pdf
  └── Aphia141433_habitat_plot.pdf
creators.csv
The creators.csv file contains details of the dataset creators.
Let’s start with a quick and easy file to complete, the creators. We can open and edit the file in an interactive shiny app using dataspice::edit_creators().
The default behaviour of the function (called without providing a metadata_dir argument) is to look for a creators.csv file in the data/metadata/ directory. Because our metadata lives elsewhere, we override this behaviour by providing an explicit path to the metadata directory.
metadata_dir <- here::here("product", "metadata")
dataspice::edit_creators(metadata_dir = metadata_dir)
Once the details are complete, remember to click on Save to save the changes to creators.csv.
access.csv
The access.csv file contains details about where the data can be accessed.
Before manually completing any details in access.csv, we can use dataspice’s dedicated function prep_access() to extract relevant information from the data files themselves.
dataspice::prep_access(data_path = "product", access_path = "product/metadata/access.csv")
The following fileNames have been added to the access file: benthic_sampling_event_habitat_summaries.csv, benthic_specicies_habitat_summaries.csv
Next, we can use function edit_access() to view and edit the contents of access.csv.
dataspice::edit_access(metadata_dir)
Remember to click on Save when you’re done editing.
Alternatively, since the files are plain CSVs, you can edit them in your favourite editor instead.
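For instance, a metadata CSV can be read, amended and written back with base R. The sketch below works on a temporary stand-in file rather than the real access.csv; the column names and the URL are assumptions made for illustration, so check them against your own file before adapting it.

```r
# Stand-in for product/metadata/access.csv, written to a temporary directory.
# Column names here are assumed for illustration; inspect your own file first.
access_path <- file.path(tempdir(), "access.csv")
write.csv(
  data.frame(
    fileName   = c("benthic_sampling_event_habitat_summaries.csv",
                   "benthic_specicies_habitat_summaries.csv"),
    contentUrl = NA_character_
  ),
  access_path, row.names = FALSE
)

# Read the table, fill in a field, and write it back
access <- read.csv(access_path, stringsAsFactors = FALSE)
base_url <- "https://example.org/product"   # hypothetical location
access$contentUrl <- paste(base_url, access$fileName, sep = "/")
write.csv(access, access_path, row.names = FALSE)

# Re-read to confirm the change was saved
read.csv(access_path)
```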
biblio.csv
The biblio.csv file contains dataset-level metadata such as title, description, licence, and spatial and temporal coverage.
See package wellknown for details on well-known text (WKT) strings.
To describe the spatial aspect of the dataset with a single point, provide the same coordinates for the east and west boundaries and for the north and south boundaries.
Before we start filling this table in, we can use some base R functions to extract some of the information we require from the data itself. In particular, we can use function range() to extract the temporal and spatial extents of our data from the columns containing temporal and spatial data.
Such information is contained in benthic_abundance_sampling_events_seabed_habs.csv in the data/derived_data directory.
Although the dates are stored as text strings, they are in ISO format (YYYY-MM-DD), so sorting them gives the correct chronological order. If your temporal data are not in ISO format, consider converting them (see package lubridate).
range(sampling_events$datecollected, na.rm = TRUE)
[1] "1980-03-17T00:00:00Z" "2018-07-04T00:00:00Z"
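As a sketch of such a conversion with base R alone (lubridate offers more convenient parsers), assume some hypothetical dates recorded in day/month/year order. The values below are invented for illustration.

```r
# Hypothetical dates in day/month/year order; as text these sort incorrectly
dates_dmy <- c("04/07/2018", "17/03/1980", "25/12/1999")
range(dates_dmy)  # text ordering, not chronological

# Parse to Date, then format as ISO (YYYY-MM-DD)
dates_iso <- format(as.Date(dates_dmy, format = "%d/%m/%Y"), "%Y-%m-%d")
range(dates_iso)  # text ordering now matches chronological ordering
```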
The lat/lon coordinates are in decimal degrees which again are easy to sort or calculate the range in each dimension.
range(sampling_events$decimallatitude)
[1] 43.34082 73.63362
range(sampling_events$decimallongitude)
[1] -9.39937 35.22617
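The calls above assume that sampling_events has already been read into R. Here is a self-contained sketch of the whole step, using a small made-up table in place of the real file; the values are invented and only the column names come from the text.

```r
# Toy stand-in for benthic_abundance_sampling_events_seabed_habs.csv
toy_path <- file.path(tempdir(), "toy_sampling_events.csv")
writeLines(c(
  "datecollected,decimallatitude,decimallongitude",
  "1995-06-01T00:00:00Z,55.10,-3.20",
  "1980-03-17T00:00:00Z,43.34,12.75",
  "2018-07-04T00:00:00Z,73.63,35.23"
), toy_path)

sampling_events <- read.csv(toy_path, stringsAsFactors = FALSE)

# Collect the extents in one list; na.rm = TRUE guards against missing values
extents <- list(
  temporal  = range(sampling_events$datecollected, na.rm = TRUE),
  latitude  = range(sampling_events$decimallatitude, na.rm = TRUE),
  longitude = range(sampling_events$decimallongitude, na.rm = TRUE)
)
str(extents)
```

Keeping the three ranges in a single list makes it easier to copy them into the corresponding biblio.csv fields later.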
NB: you can also supply the geographic boundaries of your data as a single well-known text string in field wktString instead of supplying the four boundary coordinates.
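As a sketch of what such a string could look like, a bounding box can be written as a WKT POLYGON with base R alone. The helper bbox_to_wkt() is hypothetical (it is not part of dataspice), and the coordinates are the ranges printed above. WKT lists each point as x y, that is, longitude then latitude. Whether a bounding polygon is the right geometry for your data is a judgment call.

```r
# Hypothetical helper: build a WKT POLYGON from a bounding box.
# WKT points are written "x y" (longitude latitude) and the ring must close
# on its starting point.
bbox_to_wkt <- function(west, south, east, north) {
  sprintf(
    "POLYGON ((%s %s, %s %s, %s %s, %s %s, %s %s))",
    west, south,  east, south,  east, north,  west, north,  west, south
  )
}

wkt <- bbox_to_wkt(west = -9.39937, south = 43.34082,
                   east = 35.22617, north = 73.63362)
wkt
```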
Now that we’ve got the values for our temporal and spatial extents, we can complete the rest of the fields in the biblio.csv file using function dataspice::edit_biblio().
I’ve taken much of the remaining information from the data product’s metadata page on the data catalogue. Ideally, the workflow would run the other way around, i.e. the metadata is collected by the data product producer and is then much more easily ingested into the data catalogue.
dataspice::edit_biblio(metadata_dir)
attributes.csv
The attributes.csv file contains details about the variables in your data. This is critical metadata to ensure data is understood and used correctly.
Again, dataspice provides functionality to populate attributes.csv by extracting the variable names from our data files using function dataspice::prep_attributes().
The function is vectorised: it maps over each .csv file in our product/ folder and populates the fileName and variableName columns.
dataspice::prep_attributes(data_path = "product",
                           attributes_path = file.path(metadata_dir, "attributes.csv"))
All column names in benthic_sampling_event_habitat_summaries.csv and benthic_specicies_habitat_summaries.csv have been successfully extracted into the variableName column of attributes.csv.
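Conceptually, this step amounts to reading the header row of each file and stacking the results. The base R sketch below illustrates the idea on two toy files. It is not dataspice's actual implementation, and the file and column names in the toy folder are invented.

```r
# Two toy CSV files in a temporary folder standing in for product/
toy_dir <- file.path(tempdir(), "toy_product")
dir.create(toy_dir, showWarnings = FALSE)
writeLines(c("event_id,habitat", "1,sand"), file.path(toy_dir, "events.csv"))
writeLines(c("aphia_id,species,habitat", "103228,sp_a,mud"),
           file.path(toy_dir, "species.csv"))

csv_files <- list.files(toy_dir, pattern = "\\.csv$", full.names = TRUE)

# For each file, read only the header row and record one row per column name
attributes_tbl <- do.call(rbind, lapply(csv_files, function(path) {
  header <- names(read.csv(path, nrows = 1, check.names = FALSE))
  data.frame(fileName = basename(path), variableName = header,
             stringsAsFactors = FALSE)
}))
attributes_tbl
```

The result has one row per variable per file, which is the shape that the description and unitText fields are then filled in against.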
Now, we can complete the description and unitText fields using:
dataspice::edit_attributes(metadata_dir)
Note: Because this is just a demo and I am not an expert in the data, the description is just a more readable version of each variable name and there are no units. This points to the importance of such metadata being produced by someone who is knowledgeable about the data, in order to ensure accuracy of descriptions.
Now that all our metadata files are complete, we can compile them into a structured dataspice.json file in our product/metadata/ folder.
dataspice::write_spice(metadata_dir)
This function takes all the information collected in the csv files and compiles it into a JSON-LD file whose structure complies with the schema.org Dataset metadata standard.
Publishing this file on the web makes the dataset discoverable by Google Dataset Search! 😃 👍
dataspice.json
Generating the dataspice.json file means we now have all our metadata in a standardised, interoperable, machine-readable format, which allows us to build further automation on top of it.
We can, for example, load it into R and review it:
dataspice_lst <- jsonlite::read_json(file.path(metadata_dir, "dataspice.json"))
listviewer::jsonedit(dataspice_lst)
We could also have a standard markdown snippet that extracts metadata from the dataspice.json file, and include it in a README.Rmd.
For example, the following snippet in an Rmd document:
### `r dataspice_lst$name`
##### Creator: `r dataspice_lst$creator[[1]]$name`
#### Data Product Description:
`r dataspice_lst$description`
would render to the following in the html/md version:
This data product links occurrences of benthic invertebrates from the EMODnet benthic numerical abundance product to two complementary seabed habitat products: the EMODnet broadscale habitat map 2019 (Vasquez et al., 2020), and Wilson et al.’s (2018) synthetic map of the north-west European Shelf sedimentary environment. We also process and incorporate qualitative information on the habitat and sediment preferences of species from the Biological Traits Information Catalogue BIOTIC database.
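To see how the inline expressions in the Rmd snippet above resolve, here is a sketch that parses a small hand-written JSON string shaped loosely like a schema.org Dataset. The content is invented for illustration and is far smaller than a real dataspice.json; the name toy_lst is hypothetical, and the sketch assumes jsonlite is installed.

```r
# Hand-written stand-in for dataspice.json (all values are placeholders)
toy_json <- '{
  "@context": "http://schema.org",
  "@type": "Dataset",
  "name": "Toy data product",
  "description": "A placeholder description.",
  "creator": [
    {"@type": "Person", "name": "A. Researcher"}
  ]
}'

# simplifyVector = FALSE keeps the nested-list structure,
# which is also what jsonlite::read_json() returns by default
toy_lst <- jsonlite::fromJSON(toy_json, simplifyVector = FALSE)

# The same accessors as in the Rmd snippet
toy_lst$name
toy_lst$creator[[1]]$name
toy_lst$description
```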
Finally, we can use the dataspice.json file we just created to produce an informative README web page to include with our dataset for humans to enjoy! 🤩
We use function dataspice::build_site(), which creates file index.html in the docs/ folder of your project (which it creates if it doesn’t already exist).
dataspice::build_site(path = file.path(metadata_dir, "dataspice.json"))
In the repository Settings page, select the Pages tab on the left-hand navigation panel.
To enable GitHub Pages, select the branch (master) and folder (docs/) containing the content to be served:
It’s useful practice to add the URL to the front matter of the repository:
index.html file online
Here’s a screenshot!
See the whole repository
Text and figures are licensed under Creative Commons Attribution CC BY-SA 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".