FAIR data: resources and details!
Avoiding data problems,
best practices for metadata
and controlled vocabularies

seaside chat August 3 2020

Sarah Gaichas
Ecosystem Dynamics and Assessment
Northeast Fisheries Science Center

1 / 10

Open Science, FAIR data, and multiple projects

Plan for interoperability

Common data structures/clear metadata
- Readable by different software
- Spatial, temporal scale defined
- Units defined
- Source defined

Resources
- https://en.wikipedia.org/wiki/FAIR_data
- https://www.go-fair.org/fair-principles/

Figure by SangyaPundir - Own work, CC BY-SA 4.0, Link

2 / 10

Following on from Creating R-packages

But not limited to R!

Brief Description: Resources and activities related to FAIR data and software

https://www.rd-alliance.org/system/files/2019-02-01-Top-10-FAIR-Data-and-Software-Things.pdf

Related to Fishery Condition Links, multispecies keyrun, and other projects

Discussion topics:

Data problems to avoid: https://github.com/Quartz/bad-data-guide
Metadata: https://data.research.cornell.edu/content/readme
Controlled vocabulary (thesaurus), e.g. https://guides.lib.utexas.edu/metadata-basics/controlled-vocabs

3 / 10

FAIR: What Findable and Accessible mean for us

Findable by us: the most up-to-date project inputs are in one central place and under version control

Findable by the world: has a DOI and lots of keywords

Accessible to us: platform-agnostic, in GitHub repositories (public or private), with basic instructions for data access in the README

Accessible to the world: same

4 / 10

FAIR: What Interoperable and Reusable mean for us

Interoperable for us: people using different models or analytical software can use a common base dataset to facilitate synthesis across analyses

Interoperable for the world: common data exchange format (e.g., JSON), vocabulary, and links to other data; see example framework

Reusable for us: we can reproduce previous results with documented methods, and use the same data in new analyses

Reusable for the world: has a license for use, includes provenance, meets community standards

(Meta)data are richly described with a plurality of accurate and relevant attributes

5 / 10

More concrete examples: lists by field

Australian Government Data example (public data requirements similar to those in the US)

Australian Research Data Commons FAIR summary graphic

Image: ARDC 2018 - CC-BY 4.0; link

Activities:

Difference between FAIR data and Open Data
How FAIR is our data? Assessment tool
Writing good dataset descriptions best practices
Identifiers like DOIs and licensing; general and US Government
Dirty data and cleanup--next slide
Metadata and controlled vocabularies--the slides after

6 / 10

Avoiding data problems

Guide to Bad Data

Key problems we can avoid/things data contributors need to supply:

data provenance
describe all fields (rich metadata)
units included
correct version
entry errors, etc., etc.
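
Several of the items above can be checked automatically when a dataset is contributed. A minimal sketch in base R, assuming the contributor supplies a data dictionary alongside the data; the `survey` table, the `dictionary` layout (`field`, `description`, `units`), and the `check_dictionary()` helper are all hypothetical names invented for illustration:

```r
# Hypothetical contributed dataset and its data dictionary
# (both invented here for illustration, not project data).
survey <- data.frame(
  year = c(2018, 2019),
  species = c("cod", "haddock"),
  weight_kg = c(1.2, 0.8)
)

dictionary <- data.frame(
  field = c("year", "species", "weight_kg"),
  description = c("Survey year", "Common name", "Mean individual weight"),
  units = c("calendar year", NA, "kg"),
  stringsAsFactors = FALSE
)

# Return a character vector of problems; an empty vector means the checks passed.
check_dictionary <- function(data, dictionary) {
  problems <- character(0)

  # Every column in the data needs an entry in the dictionary.
  undocumented <- setdiff(names(data), dictionary$field)
  if (length(undocumented) > 0) {
    problems <- c(problems, paste("No description for field:", undocumented))
  }

  # Every numeric column that is documented needs non-empty units.
  numeric_fields <- names(data)[vapply(data, is.numeric, logical(1))]
  missing_units <- dictionary$field[
    dictionary$field %in% numeric_fields &
      (is.na(dictionary$units) | dictionary$units == "")
  ]
  if (length(missing_units) > 0) {
    problems <- c(problems, paste("No units for numeric field:", missing_units))
  }

  problems
}

check_dictionary(survey, dictionary)
```

An empty result means every field is described and every numeric field has units. Provenance, version, and entry errors still need a human check.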

Tools/methods for cleaning datasets

Data documentation enforced in R packages

distribute data along with documentation
R data package examples
- babynames
- fueleconomy
- our own ecodata and its landing page

ecodata logo from GitHub

7 / 10

Metadata: descriptive and structural

What is it? Video link

Example: fueleconomy vehicles dataset from vehicle.R

#' Vehicle data
#'
#' Fuel economy data from the EPA, 1985-2015. This dataset contains
#' selected variables, and removes vehicles with incomplete data (e.g.
#' no drive train data)
#'
#' @format A data frame with variables:
#' \describe{
#' \item{id}{Unique EPA identifier}
#' \item{make}{Manufacturer}
#' \item{model}{Model name}
#' \item{year}{Model year}
#' \item{class}{EPA vehicle size class,
#'  \url{http://www.fueleconomy.gov/feg/ws/wsData.shtml#VClass}}
#' \item{trans}{Transmission}
#' \item{drive}{Drive train}
#' \item{cyl}{Number of cylinders}
#' \item{displ}{Engine displacement, in litres}
#' \item{fuel}{Fuel type}
#' \item{hwy}{Highway fuel economy, in mpg}
#' \item{cty}{City fuel economy, in mpg}
#' }
#'
#' @source \url{http://www.fueleconomy.gov/feg/download.shtml}
#' @examples
#' if (require("dplyr")) {
#' vehicles
#' vehicles %>% group_by(year) %>% summarise(cty = mean(cty))
#' }
#'
"vehicles"

8 / 10

Controlled vocabulary: standard terminology for a field

Example:

Climate and Forecasting NetCDF conventions

Thank you Kim Hyde!

http://cfconventions.org/

http://cfconventions.org/Data/cf-standard-names/73/build/cf-standard-name-table.html

https://www.unidata.ucar.edu/software/udunits/

xkcd comic "Standards" https://xkcd.com/927/

9 / 10

Starting point for Fish Condition and MS-Keyrun discussions: both will use R data packages

Input observational datasets

identify common needs and sources across all models: environmental, fish, economic
- NEFSC and other survey data
- fishery dependent/industry data
- satellite, ...
process at highest resolution needed by a model
- daily length and weight at age?
agree on format (e.g. attributes, long vs wide)
document and post in project-accessible area
- GitHub for public data
- Private repo for confidential data?
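
Agreeing on long versus wide format matters less when converting between the two is routine. A minimal sketch in base R, using an invented two-year table; the column names `year`, `variable`, and `value` and all of the numbers are assumptions for illustration, not project data:

```r
# Invented long-format table: one row per year and measured variable.
long <- data.frame(
  year = c(2018, 2018, 2019, 2019),
  variable = c("length_cm", "weight_kg", "length_cm", "weight_kg"),
  value = c(40, 1.2, 42, 1.3)
)

# Long to wide with base R: one row per year, one column per variable.
wide <- reshape(long, idvar = "year", timevar = "variable",
                direction = "wide")

# reshape() prefixes the new columns with "value."; strip it for readability.
names(wide) <- sub("^value\\.", "", names(wide))
wide
```

Long format keeps attributes such as units in their own columns; wide format is often what a model expects as input. Either choice works as long as it is documented.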

Linking or modeled "datasets"

identify model outputs that become other model inputs
identify resolution: aggregating from higher to lower resolution works, but not the converse
- Condition output is seasonal but Prices use daily input. Problem?
agree on format: how closely can output match needed input? who does additional wrangling?
document and post in project-accessible area
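
The resolution point can be sketched in base R. Daily values aggregate cleanly to a coarser period, while going back only repeats the coarse value for every day. Everything here is invented for illustration: the daily series, the column names, and the use of calendar quarters as a stand-in for seasons.

```r
# Invented daily series for one year (values are just the day of year).
daily <- data.frame(
  date = seq(as.Date("2019-01-01"), as.Date("2019-12-31"), by = "day")
)
daily$value <- seq_len(nrow(daily))

# Assumption: calendar quarters stand in for seasons.
daily$quarter <- quarters(daily$date)

# Higher -> lower resolution: a well-defined summary (here, the mean).
seasonal <- aggregate(value ~ quarter, data = daily, FUN = mean)

# Lower -> higher resolution: every day in a quarter gets the same number,
# so no within-quarter information is recovered.
expanded <- merge(daily[, c("date", "quarter")], seasonal, by = "quarter")

# Count the distinct values available per quarter after expanding.
tapply(expanded$value, expanded$quarter, function(x) length(unique(x)))
```

The last line reports a single distinct value per quarter, which is why a model needing daily input cannot be fed seasonal output without further assumptions.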
