class: right, middle, my-title, title-slide # FAIR data: resources and details!
Avoiding data problems,
best practices for metadata
and controlled vocabularies ## seaside chat August 3 2020 ### Sarah Gaichas
Ecosystem Dynamics and Assessment
Northeast Fisheries Science Center --- class: top, left # Open Science, FAIR data, and _multiple_ projects .pull-left[ - Plan for **interoperability** + [(Meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation](https://www.go-fair.org/fair-principles/i1-metadata-use-formal-accessible-shared-broadly-applicable-language-knowledge-representation/) + [(Meta)data use vocabularies that follow the FAIR principles](https://www.go-fair.org/fair-principles/i2-metadata-use-vocabularies-follow-fair-principles/) + [(Meta)data include qualified references to other (meta)data](https://www.go-fair.org/fair-principles/i3-metadata-include-qualified-references-metadata/) ![:img](EDAB_images/2560px-FAIR_data_principles.jpg) ] .pull-right[ - Common data structures/clear metadata + Readable by different software + Spatial, temporal scale defined + Units defined + Source defined - Resources + https://en.wikipedia.org/wiki/FAIR_data + https://www.go-fair.org/fair-principles/ .contrib[ Figure by <a href="//commons.wikimedia.org/w/index.php?title=User:SangyaPundir&action=edit&redlink=1" class="new" title="User:SangyaPundir (page does not exist)">SangyaPundir</a> - <span class="int-own-work" lang="en">Own work</span>, <a href="https://creativecommons.org/licenses/by-sa/4.0" title="Creative Commons Attribution-Share Alike 4.0">CC BY-SA 4.0</a>, <a href="https://commons.wikimedia.org/w/index.php?curid=53414062">Link</a> ] ] ??? --- # Following on from [Creating R-packages](https://noaa-edab.github.io/presentations/20191009_createRPackages_Beet.html#1) # But not limited to R! Brief Description: Resources and activities related to FAIR data and software https://www.rd-alliance.org/system/files/2019-02-01-Top-10-FAIR-Data-and-Software-Things.pdf Related to Fishery Condition Links, multispecies keyrun, and other projects Discussion topics: * Data problems to avoid; https://github.com/Quartz/bad-data-guide * Metadata; https://data.research.cornell.edu/content/readme * Controlled vocabulary--thesaurus e.g. https://guides.lib.utexas.edu/metadata-basics/controlled-vocabs --- # FAIR: What Findable and Accessible mean for us .pull-left[ Findable by us: the most up to date project inputs are in one central place, and version controlled Findable by the world: has a DOI and lots of keywords * [(Meta) data are assigned globally unique and persistent identifiers](https://www.go-fair.org/fair-principles/f1-meta-data-assigned-globally-unique-persistent-identifiers/) * [Data are described with rich metadata](https://www.go-fair.org/fair-principles/f2-data-described-rich-metadata/) * [Metadata clearly and explicitly include the identifier of the data they describe](https://www.go-fair.org/fair-principles/f3-metadata-clearly-explicitly-include-identifier-data-describe/) * [(Meta)data are registered or indexed in a searchable resource](https://www.go-fair.org/fair-principles/f4-metadata-registered-indexed-searchable-resource/) ] .pull-right[ Accessible to us: platform-agnostic in GitHub repositories (public or private) with basic instructions for data access in README Accessible to the world: same * [(Meta)data are retrievable by their identifier using a standardised communication protocol](https://www.go-fair.org/fair-principles/metadata-retrievable-identifier-standardised-communication-protocol/) + [The protocol is open, free and universally implementable](https://www.go-fair.org/fair-principles/a1-1-protocol-open-free-universally-implementable/) + [The protocol allows for an authentication and authorisation where necessary](https://www.go-fair.org/fair-principles/a1-2-protocol-allows-authentication-authorisation-required/) * [Metadata should be accessible even when the data are no longer available](https://www.go-fair.org/fair-principles/a2-metadata-accessible-even-data-no-longer-available/) ] --- # FAIR: What Interoperable and Reusable mean for us .pull-left[ Interoperable for us: people using different models or analytical software can use a common base dataset to facilitate synthesis across analyses Interoperable for the world: common data exchange format (e.g., [JSON](https://en.wikipedia.org/wiki/JSON)), vocabulary, and links to other data; [see example framework](https://github.com/FAIRDataTeam/FAIRDataPoint-Spec) * [(Meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation](https://www.go-fair.org/fair-principles/i1-metadata-use-formal-accessible-shared-broadly-applicable-language-knowledge-representation/) * [(Meta)data use vocabularies that follow the FAIR principles](https://www.go-fair.org/fair-principles/i2-metadata-use-vocabularies-follow-fair-principles/) * [(Meta)data include qualified references to other (meta)data](https://www.go-fair.org/fair-principles/i3-metadata-include-qualified-references-metadata/) ] .pull-right[ Reusable for us: we can reproduce previous results with documented methods, and use the same data in new analyses Reusable for the world: has a license for use, includes provenance, meets community standards * [(Meta)data are richly described with a plurality of accurate and relevant attributes](https://www.go-fair.org/fair-principles/r1-metadata-richly-described-plurality-accurate-relevant-attributes/) + [(Meta)data are released with a clear and accessible data usage license](https://www.go-fair.org/fair-principles/r1-1-metadata-released-clear-accessible-data-usage-license/) + [(Meta)data are associated with detailed provenance](https://www.go-fair.org/fair-principles/r1-2-metadata-associated-detailed-provenance/) + [(Meta)data meet domain-relevant community standards](https://www.go-fair.org/fair-principles/r1-3-metadata-meet-domain-relevant-community-standards/) ] --- # More concrete examples: [lists by field](https://www.rd-alliance.org/system/files/2019-02-01-Top-10-FAIR-Data-and-Software-Things.pdf) .pull-left[ Australian Government Data example (similar public data requirements as US) ![Australian Research Data Commons FAIR summary graphic](https://www.ands.org.au/__data/assets/image/0011/1416098/FAIR-Data-image-map-graphic-v2-721px.png) .contrib[Image: ARDC 2018 - CC-BY 4.0; [link](https://www.ands.org.au/__data/assets/image/0011/1416098/FAIR-Data-image-map-graphic-v2-721px.png)] ] .pull-right[ Activities: * Difference between [FAIR data and Open Data](https://www.go-fair.org/resources/faq/ask-question-difference-fair-data-open-data/) * How FAIR is our data? [Assessment tool](https://ardc.edu.au/resources/working-with-data/fair-data/fair-self-assessment-tool/) * Writing good dataset descriptions [best practices](https://documentation.ardc.edu.au/display/DOC/Description#Description-Bestpractice) * Identifiers like [DOIs](https://www.ands.org.au/__data/assets/pdf_file/0006/715155/Digital-Object-Identifiers.pdf) and licensing; [general](https://www.ands.org.au/__data/assets/pdf_file/0010/1498744/Research-Data-Rights-Management-Guide.pdf) and [US Government](https://resources.data.gov/open-licenses/) * Dirty data and cleanup--next slide * Metadata and controlled vocabularies--the slides after ] --- # Avoiding data problems .pull-left[ [Guide to Bad Data](https://github.com/Quartz/bad-data-guide) Key problems we can avoid/things data contributors need to supply: * data provenance * describe all fields (rich metadata) * units included * correct version * entry errors, etc., etc. Tools/methods for cleaning datasets * [Library Carpentry: OpenRefine](https://librarycarpentry.org/lc-open-refine/) * [The Data Retriever](https://frictionlessdata.io/blog/2017/05/24/the-data-retriever/) ] .pull-right[ Data documentation enforced in [R packages](http://r-pkgs.had.co.nz/) * distribute [data along with documentation](http://r-pkgs.had.co.nz/data.html) * R data package examples + [babynames](http://hadley.github.io/babynames/) + [fueleconomy](https://github.com/hadley/fueleconomy) + our own [ecodata](https://noaa-edab.github.io/ecodata/) and its [landing page](https://noaa-edab.github.io/ecodata/landing_page) .center[ ![:img ecodata logo from GitHub, 45%](https://github.com/NOAA-EDAB/ecodata/raw/master/ecodata_logo.png) ] ] --- # Metadata: descriptive and structural .pull-left[ What is it? [Video link](https://www.youtube.com/watch?v=L0vOg18ncWE&feature=youtu.be) <iframe width="560" height="315" src="https://www.youtube.com/embed/L0vOg18ncWE" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe> Example: [fueleconomy](https://github.com/hadley/fueleconomy) `vehicles` dataset from [vehicle.R](https://github.com/hadley/fueleconomy/blob/master/R/vehicle.R) ] .pull-right[ ``` #' Vehicle data #' #' Fuel economy data from the EPA, 1985-2015. This dataset contains #' selected varaibles, and removes vehicles with incomplete data (e.g. #' no drive train data) #' #' @format A data frame with variables: #' \describe{ #' \item{id}{Unique EPA identifier} #' \item{make}{Manufacturer} #' \item{model}{Model name} #' \item{year}{Model year} #' \item{class}{EPA vehicle size class, #' \url{http://www.fueleconomy.gov/feg/ws/wsData.shtml#VClass}} #' \item{trans}{Transmission} #' \item{drive}{Drive train} #' \item{cyl}{Number of cylinders} #' \item{displ}{Engine displacement, in litres} #' \item{fuel}{Fuel type} #' \item{hwy}{Highway fuel economy, in mpg} #' \item{cty}{City fuel economy, in mpg} #' } #' #' @source \url{http://www.fueleconomy.gov/feg/download.shtml} #' @examples #' if (require("dplyr")) { #' vehicles #' vehicles %>% group_by(year) %>% summarise(cty = mean(cty)) #' } #' "vehicles" ``` ] --- # Controlled vocabulary: standard terminology for a field .pull-left[ Example: Climate and Forecasting NetCDF conventions Thank you Kim Hyde! http://cfconventions.org/ http://cfconventions.org/Data/cf-standard-names/73/build/cf-standard-name-table.html https://www.unidata.ucar.edu/software/udunits/ ] .pull-right[ ![xkcd comic "Standards"](https://imgs.xkcd.com/comics/standards.png) https://xkcd.com/927/ ] --- background-image: url("EDAB_images/herrtopreds.png") background-size: 550px background-position: right ## Starting point for Fish Condition and MS-Keyrun discussions: both will use R data packages Input observational datasets + id. common needs and sources across all models: environmental, fish, economic + NEFSC and other survey data + fishery dependent/industry data + satellite, ... + process at highest resolution needed by a model + daily length and weight at age? + agree on format (e.g. attributes, long vs wide) + document and post in project-accessible area + [GitHub](https://github.com/NOAA-EDAB/FisheryConditionLinks) for public data + Private repo for confidential data? Linking or modeled "datasets" + id. model outputs that become other model inputs + id. resolution: higher → lower works, not converse + [Condition](https://drive.google.com/open?id=1eideEb83ztdA972a21MaVe-Ib4UJaH4Z) output is seasonal but [Prices](https://drive.google.com/drive/folders/1EOr5JqLh_ZYlPqPVB8RcIJXldty9ZAL_) use daily input. Problem? + agree on format: how closely can output match needed input? who does additional wrangling? + document and post in project-accessible area