You got data. Is it enough?

@tomjwebb I see tons of spreadsheets that i don't understand anything (or the student), making it really hard to share.
— Erika Berenguer (@Erika_Berenguer) January 16, 2015

@tomjwebb @ScientificData “Document. Everything.” Data without documentation has no value.
— Sven Kochmann (@indianalytics) January 16, 2015

@tomjwebb Annotate, annotate, annotate!
— CanJFishAquaticSci (@cjfas) January 16, 2015

Document all the metadata (including protocols).@tomjwebb
— Ward Appeltans (@WrdAppltns) January 16, 2015

You download a zip file of #OpenData. Apart from your data file(s), what else should it contain?
— Leigh Dodds (@ldodds) February 6, 2017

#otherpeoplesdata dream match!

Thought experiment: Imagine a dream open data set

How would you locate it?

What details would you need to know to determine relevance?
What information would you need to know to use it?

metadata = data about data

Information that describes, explains, locates, or in some way makes it easier to find, access, and use a resource (in this case, data).

Data Reuse Checklist

http://mozillascience.github.io/checklist/

Backbone of digital curation

Without it a digital resource may be irretrievable, unidentifiable or unusable

Descriptive

enables identification, location and retrieval of data, often includes use of controlled vocabularies for classification and indexing.

Technical

describes the technical processes used to produce, or required to use a digital data object.

Administrative

used to manage administrative aspects of the digital object e.g. intellectual property rights and acquisition.

Elements of metadata

Structured data files:
- readable by machines and humans, accessible through the web
Controlled vocabularies, e.g. NERC Vocabulary server
- allows for connectivity of data

KEY TO SEARCH FUNCTION

By structuring & adhering to controlled vocabularies, data can be combined, accessed and searched!
Different communities develop different standards which define both the structure and content of metadata

Identifying the right metadata standard

Not an easy task

General: Dublin Core Metadata Initiative Specification
NERC Data Centers: Check with individual data centers for their metadata specification.
Re3data.org: Registry of Research Data Repositories.

Seek help from support teams

Most university libraries have assistants dedicated to Research Data Management:

@tomjwebb @ScientificData Talk to their librarian for data management strategies #datainfolit
— Yasmeen Shorish (@yasmeen_azadi) January 16, 2015

My suggestion to you: Ecological Metadata Language

Ecological Metadata Language (EML)

a metadata standard developed by and for the ecology discipline.

EML is a set of XML schema documents and controlled vocabularies that allow for the structural expression of metadata.

Harmonising ecological data

Wide adoption and use of EML will create exciting new opportunities for data discovery, access, integration and synthesis.

+ The eml2 R package can help you build an EML object from modular elements when you are ready.

EML

- eml
  - dataset
    - creator
    - title
    - publisher
    - pubDate
    - keywords
    - abstract 
    - intellectualRights
    - contact
    - methods
    - coverage
      - geographicCoverage
      - temporalCoverage
      - taxonomicCoverage
    - dataTable
      - entityName
      - entityDescription
      - physical
      - attributeList

Documenting metadata:

the bare minimum

document `coverage` information

taxonomic coverage: a table containing taxonomic information on the species in the data.
- also record authority / source
temporal coverage: temporal range and resolution details
spatial coverage:
- a human readable geographic description of the study area
- spatial range and resolution details
- include depth (marine/freshwater) or altitudinal (terrestrial) information

Make sure to record units!
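As a minimal sketch of how temporal and spatial ranges might be pulled out of a dataset and stored with their units: the data frame `obs`, its column names, its values and the unit labels below are all hypothetical, chosen only for illustration.

```r
# Hypothetical observations; column names and values are illustrative only
obs <- data.frame(
  date    = as.Date(c("2015-06-01", "2015-06-15", "2015-07-02")),
  lat     = c(53.38, 53.41, 53.35),  # decimal degrees (assumed)
  lon     = c(-1.47, -1.50, -1.44),  # decimal degrees (assumed)
  depth_m = c(0.5, 1.2, 0.8)         # metres below surface (assumed)
)

# Summarise the range of each coverage variable, keeping the unit alongside.
# Mixing dates and numbers in c() stores every value as text.
coverage <- data.frame(
  variable = c("date", "lat", "lon", "depth_m"),
  minimum  = c(format(min(obs$date)), min(obs$lat), min(obs$lon), min(obs$depth_m)),
  maximum  = c(format(max(obs$date)), max(obs$lat), max(obs$lon), max(obs$depth_m)),
  unit     = c("date (YYYY-MM-DD)", "decimal degrees", "decimal degrees", "metres"),
  stringsAsFactors = FALSE
)
coverage
```

A table like this only covers the ranges; the human-readable description of the study area and the resolution details still need to be written by hand.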

document protocols in a `methods` document

Keep a dynamic document used to plan, record and write up methods.

@tomjwebb record every detail about how/where/why it is collected
— Sal Keith (@Sal_Keith) January 16, 2015

Any additional information other users would need to combine your data with theirs? Record it

the variable `attribute` table.

Maintain a metadata table of attributes for the variables in your data.

description
data type
units
description of date or factor data
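A rough base R sketch of such a table, one row per variable. The dataset `survey`, its columns and the column headers of `attr_tbl` are made up for this example and do not follow any particular standard.

```r
# Hypothetical dataset; names and values are illustrative only
survey <- data.frame(
  site   = factor(c("A", "B", "A")),
  date   = as.Date(c("2015-06-01", "2015-06-15", "2015-07-02")),
  temp_c = c(12.1, 13.4, 15.0)
)

# One row per variable: the data type is read from the data,
# everything else is entered by hand
attr_tbl <- data.frame(
  variable    = names(survey),
  data_type   = unname(vapply(survey, function(x) class(x)[1], character(1))),
  description = c("Sampling site code", "Sampling date", "Water temperature"),
  units       = c(NA, NA, "degrees Celsius"),
  details     = c("A = upstream; B = downstream", "format YYYY-MM-DD", NA),
  stringsAsFactors = FALSE
)
attr_tbl
```

Only the `variable` and `data_type` columns can be filled automatically; the descriptions, units and details depend on knowledge that lives outside the data file.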

`metadatar` package

I’m developing a package to help with extracting information and creating simple metadata files that can form the basis of building more complex metadata formats (e.g. EML).

Extract information from a dataset and create a meta_tbl in which to complete any further information required.

install.packages("devtools")
devtools::install_github("annakrystalli/metadatar")

library(metadatar)
create_meta_shell(mtcars)

## # A tibble: 11 x 11
##    attributeName attributeDefiniti… columnClasses numberType unit  minimum
##    <chr>         <lgl>              <chr>         <lgl>      <lgl> <lgl>  
##  1 mpg           NA                 numeric       NA         NA    NA     
##  2 cyl           NA                 numeric       NA         NA    NA     
##  3 disp          NA                 numeric       NA         NA    NA     
##  4 hp            NA                 numeric       NA         NA    NA     
##  5 drat          NA                 numeric       NA         NA    NA     
##  6 wt            NA                 numeric       NA         NA    NA     
##  7 qsec          NA                 numeric       NA         NA    NA     
##  8 vs            NA                 numeric       NA         NA    NA     
##  9 am            NA                 numeric       NA         NA    NA     
## 10 gear          NA                 numeric       NA         NA    NA     
## 11 carb          NA                 numeric       NA         NA    NA     
## # ... with 5 more variables: maximum <lgl>, formatString <lgl>,
## #   definition <lgl>, code <lgl>, levels <lgl>

`attributes` columns

I use the recognized column headers shown here to make it easier to create an EML object down the line. I focus on the core columns required, but you can add additional ones for your own purposes.

names(create_meta_shell(mtcars))

##  [1] "attributeName"       "attributeDefinition" "columnClasses"      
##  [4] "numberType"          "unit"                "minimum"            
##  [7] "maximum"             "formatString"        "definition"         
## [10] "code"                "levels"

Attributes associated with all variables:

attributeName (required, free text field)
attributeDefinition (required, free text field)
columnClasses (required, "numeric", "character", "factor", "ordered", or "Date", case sensitive)

`columnClasses` dependent attributes

For numeric (ratio or interval) data:
- unit (required, see eml-unitTypeDefinitions and working with units)
For character (textDomain) data:
- definition (required)
For dateTime data:
- formatString (required), e.g. for the date 11-03-2001 the formatString would be "DD-MM-YYYY"

I use the columns code and levels to store information on factors. I use ";" to separate the individual codes and level descriptions. These can be extracted later on with the metadatar function extract_attr_factors().
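To illustrate the idea of storing factor information in two ";"-separated cells, here is a base R sketch. It assumes each cell holds all codes (or all descriptions) of one factor; the factor `habitat` and its descriptions are hypothetical, and this is not how metadatar itself is implemented.

```r
# Hypothetical factor column: codes as stored in the data
habitat <- factor(c("WD", "GR", "WD", "UR"))

# Human-readable description for each code, entered by hand
descriptions <- c(GR = "grassland", UR = "urban", WD = "woodland")

# Collapse into single ";"-separated strings, one per table cell
code_cell   <- paste(levels(habitat), collapse = ";")
levels_cell <- paste(descriptions[levels(habitat)], collapse = ";")

# Split the cells back into a lookup table of code and description
lookup <- data.frame(
  code  = strsplit(code_cell, ";", fixed = TRUE)[[1]],
  level = strsplit(levels_cell, ";", fixed = TRUE)[[1]],
  stringsAsFactors = FALSE
)
lookup
```

Keeping codes and descriptions in the same order is what makes the round trip work, so a description must never contain the separator itself.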

Example attribute table structure

Start a new project to work in.

Call it gapminderRR
We will carry on in this project for the rest of the course

Setup the `data/` folder

Create a data/ folder
Within the data/ folder create:
- a raw/ folder
- a clean/ folder
- a metadata/ folder

Setup a `scripts/` folder

gapminderRR folder structure should look like this

.
├── data
│   ├── clean
│   ├── metadata
│   └── raw
├── gapminderRR.Rproj
└── scripts
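The folders can also be created from R rather than by hand. A small sketch, where `setup_project_dirs()` is a hypothetical helper name invented for this example:

```r
# Hypothetical helper: create the data/ and scripts/ folders under a project root.
# `root` defaults to the current working directory.
setup_project_dirs <- function(root = ".") {
  dirs <- file.path(root, c("data/raw", "data/clean", "data/metadata", "scripts"))
  for (d in dirs) {
    # recursive = TRUE also creates data/; showWarnings = FALSE keeps re-runs quiet
    dir.create(d, recursive = TRUE, showWarnings = FALSE)
  }
  invisible(dirs)
}

# Run from the gapminderRR project root:
# setup_project_dirs()
```

Because existing folders are left untouched, the helper is safe to run more than once.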

Data

In honour of Hans Rosling, we’ll use the gapminder data, which has been made easily accessible for use by Jenny Bryan’s gapminder package.

First, let’s install the package:

install.packages("gapminder")

Get the data

Let’s have a look:

gapminder::gapminder

## # A tibble: 1,704 x 6
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Afghanistan Asia       1957    30.3  9240934      821.
##  3 Afghanistan Asia       1962    32.0 10267083      853.
##  4 Afghanistan Asia       1967    34.0 11537966      836.
##  5 Afghanistan Asia       1972    36.1 13079460      740.
##  6 Afghanistan Asia       1977    38.4 14880372      786.
##  7 Afghanistan Asia       1982    39.9 12881816      978.
##  8 Afghanistan Asia       1987    40.8 13867957      852.
##  9 Afghanistan Asia       1992    41.7 16317921      649.
## 10 Afghanistan Asia       1997    41.8 22227415      635.
## # ... with 1,694 more rows

Like many real data sets, column headings are convenient for data entry and manipulation, but not particularly descriptive to a user not already familiar with the data.
More importantly, they don’t let us know what units they are measured in (or in the case of categorical / factor data, what the factor abbreviations refer to). So let us take a moment to be more explicit:

Get the data

Let’s open a script to work in and write a copy of the data to our raw data folder.

Use the here() function from the here package to create robust paths relative to the project root.
Use :: to access a function from a package without loading the whole package through library().

install.packages("here")
install.packages("tidyverse")
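# pass the output path by position: the argument was named `path` in older readr and is `file` in newer versions
readr::write_csv(gapminder::gapminder, here::here("data/raw/gapminder.csv"))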

From now on we’ll pretend that gapminder.csv is raw data we have on disk. So let’s read it in and start documenting it.

gapminder_df <- readr::read_csv(here::here("data/raw/gapminder.csv"))

double check the data

gapminder_df

## # A tibble: 1,704 x 6
##    country     continent  year lifeExp      pop gdpPercap
##    <chr>       <chr>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Afghanistan Asia       1957    30.3  9240934      821.
##  3 Afghanistan Asia       1962    32.0 10267083      853.
##  4 Afghanistan Asia       1967    34.0 11537966      836.
##  5 Afghanistan Asia       1972    36.1 13079460      740.
##  6 Afghanistan Asia       1977    38.4 14880372      786.
##  7 Afghanistan Asia       1982    39.9 12881816      978.
##  8 Afghanistan Asia       1987    40.8 13867957      852.
##  9 Afghanistan Asia       1992    41.7 16317921      649.
## 10 Afghanistan Asia       1997    41.8 22227415      635.
## # ... with 1,694 more rows

Create `meta_tbl` shell

We can use function create_meta_shell() to create a metadata shell for our gapminder_df dataset.

?create_meta_shell

metadatar::create_meta_shell(gapminder_df, factor_cols = c("country", "continent"))

## # A tibble: 6 x 11
##   attributeName attributeDefinition columnClasses numberType unit  minimum
##   <chr>         <lgl>               <chr>         <lgl>      <lgl> <lgl>  
## 1 country       NA                  character     NA         NA    NA     
## 2 continent     NA                  character     NA         NA    NA     
## 3 year          NA                  numeric       NA         NA    NA     
## 4 lifeExp       NA                  numeric       NA         NA    NA     
## 5 pop           NA                  numeric       NA         NA    NA     
## 6 gdpPercap     NA                  numeric       NA         NA    NA     
## # ... with 5 more variables: maximum <lgl>, formatString <lgl>,
## #   definition <lgl>, code <chr>, levels <chr>

We specify that "country" and "continent" are columns that contain factors and the function automatically extracts and collapses the codes and levels for us.

Create and save `gapminder_meta_shell.csv`

gapminder_meta_shell <- metadatar::create_meta_shell(gapminder_df, 
                                               factor_cols = c("country", "continent"))

write.csv(gapminder_meta_shell, file = here::here("data/metadata/gapminder_meta_shell.csv"), 
          row.names = FALSE)

Exercise: Complete the metadata shell.

Create a new script to work in and save it in the scripts/ folder as helper01_create-metadata.R
Create a metadata table shell and save it in the metadata/ folder as gapminder_meta_shell.csv
Complete it in your preferred spreadsheet editing software and save it as gapminder_meta.csv
- Consult the documentation in the gapminder pkg GitHub repository for information on the dataset.
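Once the table has been filled in, a quick check can flag attributes that still lack required fields. The sketch below is only a suggestion: `find_incomplete()` is a hypothetical helper, the example values are made up, and it checks just two of the rules described earlier (a definition for every attribute, a unit for numeric ones).

```r
# Hypothetical helper: list attributes whose required fields are still empty.
# Column names follow the metadata shell shown earlier.
find_incomplete <- function(meta) {
  is_blank <- function(x) is.na(x) | trimws(as.character(x)) == ""
  missing_def <- is_blank(meta$attributeDefinition)
  # a unit is only required for numeric columns
  missing_unit <- meta$columnClasses == "numeric" & is_blank(meta$unit)
  meta$attributeName[missing_def | missing_unit]
}

# Illustrative, partly completed table (values are made up for the example)
meta <- data.frame(
  attributeName       = c("country", "year", "lifeExp"),
  attributeDefinition = c("Country name", NA, "Life expectancy at birth"),
  columnClasses       = c("character", "numeric", "numeric"),
  unit                = c(NA, "year", NA),
  stringsAsFactors    = FALSE
)
find_incomplete(meta)
```

For the exercise, the same function could be applied to the completed file after reading it in with read.csv().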

Metadata

ACCE Research Data and Project Management

01-02 May 2018, TUoS
