FAIR data: resources and details!
Avoiding data problems,
best practices for metadata
and controlled vocabularies

Seaside chat, August 3, 2020

Sarah Gaichas
Ecosystem Dynamics and Assessment
Northeast Fisheries Science Center

Open Science, FAIR data, and multiple projects

Figure by SangyaPundir - Own work, CC BY-SA 4.0, Link

Following on from Creating R-packages

But not limited to R!

Brief Description: Resources and activities related to FAIR data and software

https://www.rd-alliance.org/system/files/2019-02-01-Top-10-FAIR-Data-and-Software-Things.pdf

Related to Fishery Condition Links, multispecies keyrun, and other projects

Discussion topics:

FAIR: What Findable and Accessible mean for us

FAIR: What Interoperable and Reusable mean for us

Interoperable for us: people using different models or analytical software can use a common base dataset to facilitate synthesis across analyses

Interoperable for the world: common data exchange format (e.g., JSON), vocabulary, and links to other data; see example framework

Reusable for us: we can reproduce previous results with documented methods, and use the same data in new analyses

Reusable for the world: has a license for use, includes provenance, meets community standards
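A minimal sketch of the "common data exchange format" idea, written in Python since none of this is limited to R. It bundles a small table with its license, provenance, and field descriptions into one JSON document and reads it back. The dataset, field names, values, and metadata keys are all hypothetical illustrations, not an established schema or community standard.

```python
import json

# Hypothetical example record: every name and value here is made up for
# illustration and does not follow any particular metadata standard.
dataset = {
    "metadata": {
        "title": "Example survey biomass index",
        "license": "CC-BY-4.0",
        "provenance": "Derived from an example bottom trawl survey",
        "fields": {
            "year": {"description": "Survey year", "units": "calendar year"},
            "biomass": {"description": "Mean biomass per tow", "units": "kg/tow"},
        },
    },
    "data": [
        {"year": 2018, "biomass": 12.5},
        {"year": 2019, "biomass": 10.0},
    ],
}


def to_json(record):
    """Serialize the record to a JSON string that other languages can parse."""
    return json.dumps(record, indent=2, sort_keys=True)


def from_json(text):
    """Parse the JSON string back into Python dictionaries and lists."""
    return json.loads(text)


text = to_json(dataset)
restored = from_json(text)
```

Keeping the license, provenance, and field descriptions in the same file as the values means the data cannot easily be separated from the information needed to reuse it.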

More concrete examples: lists by field

Australian Government Data example (public data requirements similar to those in the US)

Figure: Australian Research Data Commons FAIR summary graphic. Image: ARDC 2018, CC-BY 4.0; link

Activities:

Avoiding data problems

Guide to Bad Data

Key problems we can avoid, and what data contributors need to supply:

  • data provenance
  • descriptions of all fields (rich metadata)
  • units for every measured field
  • the correct version of the data
  • checks for entry errors, etc.
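Some of the items in the list above can be checked automatically before a dataset is accepted. The sketch below, in Python, assumes a hypothetical data dictionary that maps each column name to a description and units; the function name `find_metadata_gaps` and the example columns are illustrative only.

```python
def find_metadata_gaps(columns, data_dictionary):
    """Return a list of readable problems; an empty list means none were found."""
    problems = []
    for name in columns:
        entry = data_dictionary.get(name)
        if entry is None:
            # The column is not documented at all.
            problems.append(f"{name}: no description")
            continue
        if not entry.get("description"):
            problems.append(f"{name}: empty description")
        if not entry.get("units"):
            problems.append(f"{name}: no units")
    return problems


# Hypothetical dataset columns and a partly filled-in data dictionary.
columns = ["year", "biomass", "depth"]
data_dictionary = {
    "year": {"description": "Survey year", "units": "calendar year"},
    "biomass": {"description": "Mean biomass per tow"},  # units missing
}

gaps = find_metadata_gaps(columns, data_dictionary)
print(gaps)
```

A check like this only catches missing documentation. Provenance, versioning, and entry errors still need review by someone who knows the data.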

Tools/methods for cleaning datasets

Data documentation enforced in R packages

ecodata logo from GitHub

Metadata: descriptive and structural

What is it? Video link

Example: fueleconomy vehicles dataset from vehicle.R

#' Vehicle data
#'
#' Fuel economy data from the EPA, 1985-2015. This dataset contains
#' selected variables, and removes vehicles with incomplete data (e.g.
#' no drive train data).
#'
#' @format A data frame with variables:
#' \describe{
#' \item{id}{Unique EPA identifier}
#' \item{make}{Manufacturer}
#' \item{model}{Model name}
#' \item{year}{Model year}
#' \item{class}{EPA vehicle size class,
#' \url{http://www.fueleconomy.gov/feg/ws/wsData.shtml#VClass}}
#' \item{trans}{Transmission}
#' \item{drive}{Drive train}
#' \item{cyl}{Number of cylinders}
#' \item{displ}{Engine displacement, in litres}
#' \item{fuel}{Fuel type}
#' \item{hwy}{Highway fuel economy, in mpg}
#' \item{cty}{City fuel economy, in mpg}
#' }
#'
#' @source \url{http://www.fueleconomy.gov/feg/download.shtml}
#' @examples
#' if (require("dplyr")) {
#' vehicles
#' vehicles %>% group_by(year) %>% summarise(cty = mean(cty))
#' }
#'
"vehicles"

Controlled vocabulary: standard terminology for a field

Starting point for Fish Condition and MS-Keyrun discussions: both will use R data packages

Input observational datasets

  • identify common needs and sources across all models: environmental, fish, economic
    • NEFSC and other survey data
    • fishery dependent/industry data
    • satellite, ...
  • process at highest resolution needed by a model
    • daily length and weight at age?
  • agree on format (e.g. attributes, long vs wide)
  • document and post in project-accessible area
    • GitHub for public data
    • Private repo for confidential data?
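To make the "long vs wide" choice in the list above concrete, here is a small Python sketch that converts a wide table (one column per species) into a long one (one row per year and species). The helper `wide_to_long`, the species, and the values are hypothetical and only show the two layouts side by side.

```python
def wide_to_long(rows, id_field, variable_name="variable", value_name="value"):
    """Turn one row per id with many columns into one row per id-variable pair."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_field:
                continue  # the id is repeated on every long row, not unpivoted
            long_rows.append(
                {id_field: row[id_field], variable_name: column, value_name: value}
            )
    return long_rows


# Hypothetical wide table: one row per year, one column per species.
wide = [
    {"year": 2018, "cod": 1.2, "haddock": 3.4},
    {"year": 2019, "cod": 1.0, "haddock": 3.9},
]

long = wide_to_long(wide, id_field="year")
for row in long:
    print(row)
```

Long format tends to be easier to extend with new variables and attributes such as units, while wide format is often what a model reads directly. Agreeing on one format up front decides who does the reshaping.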

Linking or modeled "datasets"

  • identify model outputs that become other model inputs
  • identify resolution: aggregating from higher to lower resolution works, but not the converse
  • agree on format: how closely can output match needed input? who does additional wrangling?
  • document and post in project-accessible area
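The resolution point above can be illustrated with a short Python sketch that averages daily values into monthly ones. The function `daily_to_monthly`, the dates, and the values are hypothetical, and the sketch assumes ISO-formatted date strings (YYYY-MM-DD).

```python
from collections import defaultdict
from statistics import mean


def daily_to_monthly(daily):
    """Average daily values within each month; the daily detail is discarded."""
    by_month = defaultdict(list)
    for date_string, value in daily:
        year_month = date_string[:7]  # "YYYY-MM" prefix of an ISO date
        by_month[year_month].append(value)
    return {month: mean(values) for month, values in sorted(by_month.items())}


# Hypothetical daily series, e.g. a temperature in degrees C.
daily = [
    ("2020-07-30", 18.0),
    ("2020-07-31", 20.0),
    ("2020-08-01", 21.0),
    ("2020-08-02", 23.0),
]

monthly = daily_to_monthly(daily)
print(monthly)
```

Many different daily series produce the same monthly means, so the monthly output cannot be turned back into daily values. This is why linked datasets should be kept at the highest resolution any downstream model needs.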