You got data. Is it enough?




#otherpeoplesdata dream match!

Thought experiment: Imagine a dream open data set

How would you locate it?

  • what details would you need to know to determine relevance?
  • what information would you need to know to use it?


metadata = data about data

Information that describes, explains, locates, or in some way makes it easier to find, access, and use a resource (in this case, data).


Backbone of digital curation

Without it a digital resource may be irretrievable, unidentifiable or unusable


Descriptive

  • enables identification, location and retrieval of data; often includes the use of controlled vocabularies for classification and indexing.

Technical

  • describes the technical processes used to produce, or required to use, a digital data object.

Administrative

  • used to manage administrative aspects of the digital object, e.g. intellectual property rights and acquisition.

Elements of metadata

  • Structured data files:

    • readable by machines and humans, accessible through the web
  • Controlled vocabularies, e.g. the NERC Vocabulary Server

    • allows for connectivity of data

KEY TO SEARCH FUNCTION

  • By structuring & adhering to controlled vocabularies, data can be combined, accessed and searched!
  • Different communities develop different standards which define both the structure and content of metadata
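
As a minimal sketch of why controlled vocabularies matter for search, the R snippet below checks free-text keywords against a small vocabulary. The vocabulary terms and the helper name match_vocab() are invented for illustration; a real vocabulary would come from a service such as the NERC Vocabulary Server.

```r
# Hypothetical controlled vocabulary (illustrative terms only)
vocab <- c("sea surface temperature", "chlorophyll", "salinity")

# Normalise keywords (case, surrounding whitespace) and flag
# which ones match a term in the vocabulary
match_vocab <- function(keywords, vocab) {
  cleaned <- tolower(trimws(keywords))
  data.frame(keyword = keywords,
             in_vocab = cleaned %in% vocab,
             stringsAsFactors = FALSE)
}

res <- match_vocab(c("Salinity ", "SST", "chlorophyll"), vocab)
res
```

Keywords that do not match (here the abbreviation "SST") are the ones that would be invisible to a search built on the vocabulary.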

Identifying the right metadata standard

Not an easy task


Seek help from support teams

Most university libraries have assistants dedicated to Research Data Management.

My suggestion to you: Ecological Metadata Language


Ecological Metadata Language (EML)

a metadata standard developed by and for the ecology discipline.

EML is a set of XML schema documents and controlled vocabularies that allow for the structural expression of metadata.

Harmonising ecological data

Wide adoption and use of EML will create exciting new opportunities for data discovery, access, integration and synthesis.

+ The eml2 R package can help you build an EML object from modular elements when you are ready.


EML

- eml
  - dataset
    - creator
    - title
    - publisher
    - pubDate
    - keywords
    - abstract 
    - intellectualRights
    - contact
    - methods
    - coverage
      - geographicCoverage
      - temporalCoverage
      - taxonomicCoverage
    - dataTable
      - entityName
      - entityDescription
      - physical
      - attributeList
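
To make the hierarchy above concrete, here is a rough sketch of part of the tree as a nested R list. It uses base R only and does not call any EML package functions; every value is a placeholder, and only a subset of the elements is shown.

```r
# Illustrative nested list mirroring part of the EML tree above.
# All values are placeholders, not real metadata.
eml_sketch <- list(
  dataset = list(
    title = "Placeholder dataset title",
    creator = list(individualName = list(givenName = "Jane", surName = "Doe")),
    pubDate = "2018",
    coverage = list(
      geographicCoverage = list(geographicDescription = "Placeholder study area"),
      temporalCoverage = list(),
      taxonomicCoverage = list()
    ),
    dataTable = list(entityName = "placeholder.csv",
                     entityDescription = "Placeholder table description")
  )
)

# Inspect the top two levels of the hierarchy
str(eml_sketch, max.level = 2)
names(eml_sketch$dataset)
```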

Documenting metadata:

the bare minimum

document coverage information

  • taxonomic coverage: a table containing taxonomic information on the species in the data.
    • also record authority / source
  • temporal coverage: temporal range and resolution details
  • spatial coverage:
    • a human readable geographic description of the study area
    • spatial range and resolution details
    • include depth (marine/freshwater) or altitudinal (terrestrial) information

Make sure to record units!
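
Much of the coverage information can be derived from the data themselves. The sketch below assumes a toy table of observations with made-up column names and values, and summarises the temporal range, a bounding box and the depth range, with units noted in the names and comments.

```r
# Toy observation table (made-up values, for illustration only)
obs <- data.frame(
  date = as.Date(c("2001-03-11", "2001-06-02", "2002-01-20")),
  lat = c(53.4, 53.9, 54.1),   # decimal degrees
  lon = c(-1.5, -1.2, -0.9),   # decimal degrees
  depth_m = c(5, 12, 30)       # metres
)

# Summarise the ranges needed for a coverage description
coverage <- list(
  temporal = range(obs$date),
  bounding_box = c(west = min(obs$lon), east = max(obs$lon),
                   south = min(obs$lat), north = max(obs$lat)),
  depth_range_m = range(obs$depth_m)
)
coverage
```

The human readable description of the study area still has to be written by hand.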


document protocols in a methods document

Keep a dynamic document used to plan, record and write up methods.

Any additional information other users would need to combine your data with theirs? Record it

the variable attribute table.

Maintain a metadata table of attributes for the variables in your data.

  • description
  • data type
  • units
  • description of date or factor data
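
As a rough illustration, such a table can be kept as a plain data frame with one row per variable. The variables, descriptions and column names below are invented for the example.

```r
# Attribute table for a toy dataset with three variables
# (variable names and descriptions are invented for illustration)
attr_tbl <- data.frame(
  variable = c("site", "sample_date", "body_mass"),
  description = c("site code", "date of sampling", "body mass of individual"),
  data_type = c("factor", "Date", "numeric"),
  units = c(NA, NA, "gram"),
  details = c("A = upland; B = lowland", "format YYYY-MM-DD", NA),
  stringsAsFactors = FALSE
)
attr_tbl
```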

metadatar package

I’m developing a package to help with extracting information and creating simple metadata files that can form the basis for building more complex metadata formats (e.g. EML).

  • extract info from dataset and create a meta_tbl in which to complete any further info required
install.packages("devtools")
devtools::install_github("annakrystalli/metadatar")

library(metadatar)
create_meta_shell(mtcars)
## # A tibble: 11 x 11
##    attributeName attributeDefiniti… columnClasses numberType unit  minimum
##    <chr>         <lgl>              <chr>         <lgl>      <lgl> <lgl>  
##  1 mpg           NA                 numeric       NA         NA    NA     
##  2 cyl           NA                 numeric       NA         NA    NA     
##  3 disp          NA                 numeric       NA         NA    NA     
##  4 hp            NA                 numeric       NA         NA    NA     
##  5 drat          NA                 numeric       NA         NA    NA     
##  6 wt            NA                 numeric       NA         NA    NA     
##  7 qsec          NA                 numeric       NA         NA    NA     
##  8 vs            NA                 numeric       NA         NA    NA     
##  9 am            NA                 numeric       NA         NA    NA     
## 10 gear          NA                 numeric       NA         NA    NA     
## 11 carb          NA                 numeric       NA         NA    NA     
## # ... with 5 more variables: maximum <lgl>, formatString <lgl>,
## #   definition <lgl>, code <lgl>, levels <lgl>

attributes columns

I use recognized column headers shown here to make it easier to create an EML object down the line. I focus on the core columns required but you can add additional ones for your own purposes.

names(create_meta_shell(mtcars))
##  [1] "attributeName"       "attributeDefinition" "columnClasses"      
##  [4] "numberType"          "unit"                "minimum"            
##  [7] "maximum"             "formatString"        "definition"         
## [10] "code"                "levels"

Attributes associated with all variables:

  • attributeName (required, free text field)
  • attributeDefinition (required, free text field)
  • columnClasses (required, "numeric", "character", "factor", "ordered", or "Date", case sensitive)
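
Because columnClasses is case sensitive, a quick check can catch typos early. The helper below is a hypothetical sketch (invalid_classes() is not part of metadatar) that returns the attribute names whose columnClasses value is missing or not one of the allowed values.

```r
allowed_classes <- c("numeric", "character", "factor", "ordered", "Date")

# Return the attribute names whose columnClasses value is
# missing or not one of the allowed (case sensitive) values
invalid_classes <- function(meta_tbl) {
  bad <- is.na(meta_tbl$columnClasses) |
    !(meta_tbl$columnClasses %in% allowed_classes)
  meta_tbl$attributeName[bad]
}

# Toy metadata table: "Factor" is wrongly capitalised
meta <- data.frame(
  attributeName = c("year", "country", "date"),
  columnClasses = c("numeric", "Factor", "Date"),
  stringsAsFactors = FALSE
)
invalid_classes(meta)
```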



columnClasses dependent attributes

  • For numeric (ratio or interval) data:
    • unit (required)
  • For character (textDomain) data:
    • definition (required)
  • For dateTime data:
    • formatString (required), e.g. for the date 11-03-2001 the formatString would be "DD-MM-YYYY"
  • I use the columns code and levels to store information on factors. I use ";" to separate code and level descriptions. These can be extracted by metadatar function extract_attr_factors() later on.
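
To show what the ";"-separated storage looks like in practice, here is a base R sketch that splits a code string and a levels string back into a lookup table. This is not the implementation of extract_attr_factors(); the helper split_field() and the example codes are assumptions made for illustration.

```r
# Codes and level descriptions stored as single ";"-separated strings
code_str  <- "A;B;C"
level_str <- "upland;lowland;coastal"

# Split a ";"-separated string into a trimmed character vector
split_field <- function(x) trimws(strsplit(x, ";", fixed = TRUE)[[1]])

# Pair each code with its description
factor_tbl <- data.frame(
  code = split_field(code_str),
  definition = split_field(level_str),
  stringsAsFactors = FALSE
)
factor_tbl
```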

Example attribute table structure

Start a new project to work in.

  • Call it gapminderRR

  • We will carry on in this project for the rest of the course


Set up the data/ folder

  • Create a data/ folder

  • Within the data/ folder create:
    • a raw/ folder
    • a clean/ folder
    • a metadata/ folder

Set up a scripts/ folder


gapminderRR folder structure should look like this

.
├── data
│   ├── clean
│   ├── metadata
│   └── raw
├── gapminderRR.Rproj
└── scripts
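
If you prefer to create the folders from R rather than by hand, a sketch along these lines should work. A temporary directory stands in for the project root here so the example can run anywhere; inside the real project you would set root to ".".

```r
# A temporary directory stands in for the gapminderRR project root;
# use root <- "." when running inside the project itself
root <- file.path(tempdir(), "gapminderRR")

# Folders to create, relative to the project root
dirs <- file.path(root, c("data/raw", "data/clean", "data/metadata", "scripts"))

# recursive = TRUE also creates the parent data/ folder
for (d in dirs) dir.create(d, recursive = TRUE, showWarnings = FALSE)

# List what was created
list.dirs(root, full.names = FALSE, recursive = TRUE)
```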

Data

In honour of Hans Rosling, we’ll use the gapminder data, which has been made easily accessible for use by Jenny Bryan’s gapminder package.

First, let’s install the package:

install.packages("gapminder")

Get the data

Let’s have a look:

gapminder::gapminder
## # A tibble: 1,704 x 6
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Afghanistan Asia       1957    30.3  9240934      821.
##  3 Afghanistan Asia       1962    32.0 10267083      853.
##  4 Afghanistan Asia       1967    34.0 11537966      836.
##  5 Afghanistan Asia       1972    36.1 13079460      740.
##  6 Afghanistan Asia       1977    38.4 14880372      786.
##  7 Afghanistan Asia       1982    39.9 12881816      978.
##  8 Afghanistan Asia       1987    40.8 13867957      852.
##  9 Afghanistan Asia       1992    41.7 16317921      649.
## 10 Afghanistan Asia       1997    41.8 22227415      635.
## # ... with 1,694 more rows

  • Like many real data sets, column headings are convenient for data entry and manipulation, but not particularly descriptive to a user not already familiar with the data.

  • More importantly, they don’t let us know what units they are measured in (or in the case of categorical / factor data, what the factor abbreviations refer to). So let us take a moment to be more explicit:


Get the data

Let’s open a script to work in and write a copy of the data to our raw data folder.

  • use the function here() in the package here to create robust paths relative to the project root
  • use :: to access a function from a package without loading the whole package through library()
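install.packages("here")
install.packages("tidyverse")  # includes readr
# write a copy of the data to data/raw/; the file path is passed by position
# because the `path` argument of write_csv() is deprecated in readr >= 1.4.0
readr::write_csv(gapminder::gapminder, here::here("data/raw/gapminder.csv"))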

From now on we’ll pretend that gapminder.csv is the raw data we have on disk. So let’s read it in and start documenting it.

gapminder_df <- readr::read_csv(here::here("data/raw/gapminder.csv"))

double check the data

gapminder_df
## # A tibble: 1,704 x 6
##    country     continent  year lifeExp      pop gdpPercap
##    <chr>       <chr>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Afghanistan Asia       1957    30.3  9240934      821.
##  3 Afghanistan Asia       1962    32.0 10267083      853.
##  4 Afghanistan Asia       1967    34.0 11537966      836.
##  5 Afghanistan Asia       1972    36.1 13079460      740.
##  6 Afghanistan Asia       1977    38.4 14880372      786.
##  7 Afghanistan Asia       1982    39.9 12881816      978.
##  8 Afghanistan Asia       1987    40.8 13867957      852.
##  9 Afghanistan Asia       1992    41.7 16317921      649.
## 10 Afghanistan Asia       1997    41.8 22227415      635.
## # ... with 1,694 more rows

Create meta_tbl shell

We can use function create_meta_shell() to create a metadata shell for our gapminder_df dataset.

?create_meta_shell
metadatar::create_meta_shell(gapminder_df, factor_cols = c("country", "continent"))
## # A tibble: 6 x 11
##   attributeName attributeDefinition columnClasses numberType unit  minimum
##   <chr>         <lgl>               <chr>         <lgl>      <lgl> <lgl>  
## 1 country       NA                  character     NA         NA    NA     
## 2 continent     NA                  character     NA         NA    NA     
## 3 year          NA                  numeric       NA         NA    NA     
## 4 lifeExp       NA                  numeric       NA         NA    NA     
## 5 pop           NA                  numeric       NA         NA    NA     
## 6 gdpPercap     NA                  numeric       NA         NA    NA     
## # ... with 5 more variables: maximum <lgl>, formatString <lgl>,
## #   definition <lgl>, code <chr>, levels <chr>

We specify that "country" and "continent" are columns that contain factors and the function automatically extracts and collapses the codes and levels for us.
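
The idea of collapsing levels into one string can be sketched in base R as follows. This is only a guess at the general approach, not the code metadatar actually uses; collapse_levels() is a hypothetical helper.

```r
# Collapse the unique values of a column into one ";"-separated string
collapse_levels <- function(x) {
  paste(sort(unique(as.character(x))), collapse = ";")
}

continent <- c("Asia", "Europe", "Asia", "Africa")
collapse_levels(continent)
```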


Create and save gapminder_meta_shell.csv

gapminder_meta_shell <- metadatar::create_meta_shell(gapminder_df, 
                                               factor_cols = c("country", "continent"))
write.csv(gapminder_meta_shell, file = here::here("data/metadata/gapminder_meta_shell.csv"), 
          row.names = FALSE)

Exercise: Complete the metadata shell.

  • Create a new script in the scripts/ folder. Save it as helper01_create-metadata.R

  • Create a metadata table shell and save it in the metadata/ folder as gapminder_meta_shell.csv

  • Complete it in your preferred spreadsheet editing software and save it as gapminder_meta.csv
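
Once the table is filled in, a quick consistency check can be useful. The sketch below uses small stand-in data frames so it runs on its own; in the exercise, the metadata would instead be read back from the completed gapminder_meta.csv. The definition text is a placeholder.

```r
# Stand-ins for the data and the completed metadata table
dat <- data.frame(year = 1952L, lifeExp = 28.8)
meta <- data.frame(
  attributeName = c("year", "lifeExp"),
  attributeDefinition = c("placeholder definition", NA),
  stringsAsFactors = FALSE
)

# Columns in the data that have no row in the metadata table
undocumented <- setdiff(names(dat), meta$attributeName)

# Attributes whose required definition is still empty
missing_def <- meta$attributeName[is.na(meta$attributeDefinition)]

undocumented
missing_def
```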