Maintain a metadata table of attributes for the variables in your data.
- description
- data type
- units
- description of date or factor data
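As a minimal sketch of such a table, the following builds one by hand in base R for a small hypothetical dataset. The variable names (`site`, `temp`, `date`), their descriptions and the column names of the table itself are illustrative assumptions, not a fixed standard.

```r
# Hypothetical metadata table: one row per variable in the data.
# All variable names and descriptions below are made up for illustration.
var_meta <- data.frame(
  variable    = c("site", "temp", "date"),
  description = c("Sampling site", "Water temperature", "Sampling date"),
  data_type   = c("factor", "numeric", "Date"),
  units       = c(NA, "celsius", NA),
  # Extra detail needed to interpret factor codes or date formats
  details     = c("A = upstream; B = downstream", NA, "format DD-MM-YYYY"),
  stringsAsFactors = FALSE
)

var_meta
```

Keeping this table in its own file next to the data means it can be read, checked and updated like any other dataset.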
@tomjwebb I see tons of spreadsheets that i don't understand anything (or the stduent), making it really hard to share.
— Erika Berenguer (@Erika_Berenguer) January 16, 2015
@tomjwebb @ScientificData “Document. Everything.” Data without documentation has no value.
— Sven Kochmann (@indianalytics) January 16, 2015
@tomjwebb Annotate, annotate, annotate!
— CanJFishAquaticSci (@cjfas) January 16, 2015
Document all the metadata (including protocols).@tomjwebb
— Ward Appeltans (@WrdAppltns) January 16, 2015
You download a zip file of #OpenData. Apart from your data file(s), what else should it contain?
— Leigh Dodds (@ldodds) February 6, 2017
Information that describes, explains, locates, or in some way makes it easier to find, access, and use a resource (in this case, data).
Not an easy task
Most university libraries have assistants dedicated to Research Data Management:
@tomjwebb @ScientificData Talk to their librarian for data management strategies #datainfolit
— Yasmeen Shorish (@yasmeen_azadi) January 16, 2015
My suggestion to you: Ecological Metadata Language
EML is a set of XML schema documents and controlled vocabularies that allow for the structural expression of metadata.
Wide adoption and use of EML will create exciting new opportunities for data discovery, access, integration and synthesis.
The eml2 R package can help you build an EML object from modular elements when you are ready.
- eml
  - dataset
    - creator
    - title
    - publisher
    - pubDate
    - keywords
    - abstract
    - intellectualRights
    - contact
    - methods
    - coverage
      - geographicCoverage
      - temporalCoverage
      - taxonomicCoverage
    - dataTable
      - entityName
      - entityDescription
      - physical
      - attributeList
coverage information
Make sure to record units!
methods document
Keep a dynamic document used to plan, record and write up methods.
@tomjwebb record every detail about how/where/why it is collected
— Sal Keith (@Sal_Keith) January 16, 2015
attribute table.
metadatar package
I’m developing a package to help extract information and create simple metadata files that can form the basis for building more complex metadata formats (e.g. EML).
Create a meta_tbl shell in which to complete any further info required:
install.packages("devtools")
devtools::install_github("annakrystalli/metadatar")
library(metadatar)
create_meta_shell(mtcars)
## # A tibble: 11 x 11
## attributeName attributeDefiniti… columnClasses numberType unit minimum
## <chr> <lgl> <chr> <lgl> <lgl> <lgl>
## 1 mpg NA numeric NA NA NA
## 2 cyl NA numeric NA NA NA
## 3 disp NA numeric NA NA NA
## 4 hp NA numeric NA NA NA
## 5 drat NA numeric NA NA NA
## 6 wt NA numeric NA NA NA
## 7 qsec NA numeric NA NA NA
## 8 vs NA numeric NA NA NA
## 9 am NA numeric NA NA NA
## 10 gear NA numeric NA NA NA
## 11 carb NA numeric NA NA NA
## # ... with 5 more variables: maximum <lgl>, formatString <lgl>,
## # definition <lgl>, code <lgl>, levels <lgl>
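For intuition, a shell like this can be roughly approximated in base R. This is only a sketch, assuming the shell is one row per column with mostly empty fields. It shows a subset of the columns and is not how create_meta_shell() is implemented.

```r
# Rough base R approximation of a metadata shell for mtcars:
# one row per column, with the class filled in and everything else left NA.
shell <- data.frame(
  attributeName       = names(mtcars),
  attributeDefinition = NA_character_,
  # class(x)[1] keeps a single class label per column
  columnClasses       = unname(vapply(mtcars, function(x) class(x)[1], character(1))),
  unit                = NA_character_,
  stringsAsFactors    = FALSE
)

head(shell, 3)
```

The empty fields are the ones that need a human: definitions and units cannot be inferred from the values alone.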
attributes columns
I use the recognized column headers shown here to make it easier to create an EML object down the line. I focus on the core columns required, but you can add additional ones for your own purposes.
names(create_meta_shell(mtcars))
## [1] "attributeName" "attributeDefinition" "columnClasses"
## [4] "numberType" "unit" "minimum"
## [7] "maximum" "formatString" "definition"
## [10] "code" "levels"
columnClasses (one of "numeric", "character", "factor", "ordered", or "Date", case sensitive)
columnClasses dependent attributes
numeric (ratio or interval) data:
character (textDomain) data:
dateTime data:
e.g. for 11-03-2001, formatString would be "DD-MM-YYYY"
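The following sketch shows why recording the format matters, using the date from the example above. The mapping from "DD-MM-YYYY" to R's `"%d-%m-%Y"` notation is my own translation for illustration and is not something metadatar or EML does for you.

```r
# The raw value as it might appear in a data file
raw_date <- "11-03-2001"

# "DD-MM-YYYY" means day-month-year, i.e. "%d-%m-%Y" in strptime notation
parsed <- as.Date(raw_date, format = "%d-%m-%Y")
format(parsed)   # "2001-03-11" (11 March 2001)

# Without the recorded format, a reader might assume month-day-year instead
misread <- as.Date(raw_date, format = "%m-%d-%Y")
format(misread)  # "2001-11-03" (3 November 2001)
```

Both calls succeed without any warning, so only the documented formatString tells you which reading is correct.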
Use code and levels to store information on factors. I use ";" to separate code and level descriptions. These can be extracted by the metadatar function extract_attr_factors() later on.
Create a new project and call it gapminderRR.
We will carry on in this project for the rest of the course
data/ folder
Create a data/ folder
In the data/ folder create:
- raw/ folder
- clean/ folder
- metadata/ folder
Create a scripts/ folder
The gapminderRR folder structure should look like this:
.
├── data
│ ├── clean
│ ├── metadata
│ └── raw
├── gapminderRR.Rproj
└── scripts
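If you prefer to create the folders from R rather than by hand, something like the sketch below would work. The helper name `create_project_dirs()` is hypothetical, and the demo runs in a temporary directory so it does not touch your real project.

```r
# Hypothetical helper: create the course folder structure under `root`.
create_project_dirs <- function(root = ".") {
  dirs <- file.path(root, c("data/raw", "data/clean", "data/metadata", "scripts"))
  for (d in dirs) {
    # recursive = TRUE also creates the parent data/ folder;
    # showWarnings = FALSE keeps reruns quiet if a folder already exists
    dir.create(d, recursive = TRUE, showWarnings = FALSE)
  }
  invisible(dirs)
}

# Try it in a temporary directory first
demo_root <- file.path(tempdir(), "gapminderRR")
create_project_dirs(demo_root)
list.dirs(demo_root, full.names = FALSE, recursive = TRUE)
```

Calling `create_project_dirs()` with no arguments from the project root would create the same folders in place.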
In honour of Hans Rosling, we’ll use the gapminder data, which has been made easily accessible for use by Jenny Bryan’s gapminder package.
First, let’s install the package:
install.packages("gapminder")
Let’s have a look:
gapminder::gapminder
## # A tibble: 1,704 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
## 7 Afghanistan Asia 1982 39.9 12881816 978.
## 8 Afghanistan Asia 1987 40.8 13867957 852.
## 9 Afghanistan Asia 1992 41.7 16317921 649.
## 10 Afghanistan Asia 1997 41.8 22227415 635.
## # ... with 1,694 more rows
Like many real data sets, column headings are convenient for data entry and manipulation, but not particularly descriptive to a user not already familiar with the data.
More importantly, they don’t let us know what units they are measured in (or in the case of categorical / factor data, what the factor abbreviations refer to). So let us take a moment to be more explicit:
Let’s open a script to work in and write a copy of the data to our raw data folder.
(I use function here in package here to create robust pathways relative to the project root.)
:: accesses a function from a package without loading the whole package through library().
install.packages("here")
install.packages("tidyverse")
# `path =` is deprecated in newer readr releases, so the file is passed positionally
readr::write_csv(gapminder::gapminder, here::here("data/raw/gapminder.csv"))
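To see what `::` does in isolation, here is a small sketch that uses only packages shipped with R. The choice of `tools::file_ext()` is just a convenient example and is not part of the workflow above.

```r
# Record which packages are attached before the call
attached_before <- search()

# Call a function from the tools package without library(tools)
ext <- tools::file_ext("data/raw/gapminder.csv")
ext  # "csv"

# The call did not attach anything: the search path is unchanged
attached_after <- search()
identical(attached_before, attached_after)
```

This keeps scripts explicit about where each function comes from and avoids name clashes between packages.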
From now on we’ll pretend that gapminder.csv is the raw data we have on disk. So let’s read it in and start documenting it.
gapminder_df <- readr::read_csv(here::here("data/raw/gapminder.csv"))
gapminder_df
## # A tibble: 1,704 x 6
## country continent year lifeExp pop gdpPercap
## <chr> <chr> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
## 7 Afghanistan Asia 1982 39.9 12881816 978.
## 8 Afghanistan Asia 1987 40.8 13867957 852.
## 9 Afghanistan Asia 1992 41.7 16317921 649.
## 10 Afghanistan Asia 1997 41.8 22227415 635.
## # ... with 1,694 more rows
meta_tbl shell
We can use function create_meta_shell() to create a metadata shell for our gapminder_df dataset.
?create_meta_shell
metadatar::create_meta_shell(gapminder_df, factor_cols = c("country", "continent"))
## # A tibble: 6 x 11
## attributeName attributeDefinition columnClasses numberType unit minimum
## <chr> <lgl> <chr> <lgl> <lgl> <lgl>
## 1 country NA character NA NA NA
## 2 continent NA character NA NA NA
## 3 year NA numeric NA NA NA
## 4 lifeExp NA numeric NA NA NA
## 5 pop NA numeric NA NA NA
## 6 gdpPercap NA numeric NA NA NA
## # ... with 5 more variables: maximum <lgl>, formatString <lgl>,
## # definition <lgl>, code <chr>, levels <chr>
We specify that "country" and "continent" are columns that contain factors, and the function automatically extracts and collapses the codes and levels for us.
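To make the idea of collapsing codes concrete, here is a rough base R sketch. It is not metadatar's actual implementation; the toy `continent` vector and the variable names are assumptions for illustration.

```r
# Toy factor-like column (illustrative values only)
continent <- c("Asia", "Europe", "Asia", "Africa")

# Collapse the unique values into a single ";"-separated string,
# so that all codes for one variable fit in one cell of the metadata table
codes <- paste(sort(unique(continent)), collapse = ";")
codes  # "Africa;Asia;Europe"

# Split the string back into individual codes when they are needed again
code_vec <- strsplit(codes, ";", fixed = TRUE)[[1]]
code_vec
```

The same pattern applies to the levels column, which would hold a description for each code in the same order.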
gapminder_meta_shell.csv
gapminder_meta_shell <- metadatar::create_meta_shell(gapminder_df,
factor_cols = c("country", "continent"))
write.csv(gapminder_meta_shell, file = here::here("data/metadata/gapminder_meta_shell.csv"),
          row.names = FALSE)
Create a new script to work in, in the scripts/ folder. Save it as helper01_create-metadata.R
Create a metadata table shell and save it in the metadata/ folder as gapminder_meta_shell.csv
See the gapminder pkg GitHub repository for information on the dataset.
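Once you have looked up the variables, completing the shell might look something like the sketch below. It uses a hand-built stand-in for the shell with only three of its columns. The definitions are paraphrased from the gapminder package documentation, and the unit label is an informal placeholder that has not been checked against the EML unit dictionary.

```r
# Stand-in for the metadata shell (subset of columns, built by hand)
meta <- data.frame(
  attributeName = c("country", "continent", "year", "lifeExp", "pop", "gdpPercap"),
  attributeDefinition = NA_character_,
  unit = NA_character_,
  stringsAsFactors = FALSE
)

# Definitions paraphrased from the gapminder package documentation
definitions <- c(
  year      = "year of observation (1952 to 2007, in 5-year increments)",
  lifeExp   = "life expectancy at birth",
  pop       = "total population",
  gdpPercap = "GDP per capita (US$, inflation-adjusted)"
)

# Fill in the definitions by matching on attribute name
idx <- match(names(definitions), meta$attributeName)
meta$attributeDefinition[idx] <- unname(definitions)

# Informal placeholder unit; replace with a recognised EML unit before building EML
meta$unit[meta$attributeName == "lifeExp"] <- "years"

meta
```

Matching on attributeName rather than row position keeps the script correct even if the columns of the data are reordered later.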