Data from: Influence of different data cleaning solutions of point-occurrence records on downstream macroecological diversity models

Digital point-occurrence records from the Global Biodiversity Information Facility (GBIF) and other data providers enable a wide range of research in macroecology and biogeography. However, data errors may hamper immediate use. Manual data cleaning is time-consuming and often unfeasible, given that...

Bibliographic Details
Main Authors: Fuehrding-Potschkat, Petra, Ickert-Bond, Stefanie M.
Format: Other/Unknown Material
Language: unknown
Published: Zenodo 2022
Subjects:
Online Access: https://doi.org/10.5061/dryad.8pk0p2np4
id ftzenodo:oai:zenodo.org:6834791
record_format openpolar
spelling ftzenodo:oai:zenodo.org:6834791 2024-09-15T18:41:31+00:00 Data from: Influence of different data cleaning solutions of point-occurrence records on downstream macroecological diversity models Fuehrding-Potschkat, Petra Ickert-Bond, Stefanie M. 2022-07-14 https://doi.org/10.5061/dryad.8pk0p2np4 unknown Zenodo https://zenodo.org/communities/dryad https://doi.org/10.5061/dryad.8pk0p2np4 oai:zenodo.org:6834791 info:eu-repo/semantics/openAccess Creative Commons Zero v1.0 Universal https://creativecommons.org/publicdomain/zero/1.0/legalcode Professor of Botany and Curator of the UA Museum Herbarium (ALA) FNA Regional Coordinator Alaska-Yukon info:eu-repo/semantics/other 2022 ftzenodo https://doi.org/10.5061/dryad.8pk0p2np4 2024-07-27T04:11:11Z Other/Unknown Material Alaska Yukon Zenodo
institution Open Polar
collection Zenodo
op_collection_id ftzenodo
language unknown
topic Professor of Botany and Curator of the UA Museum Herbarium (ALA)
FNA Regional Coordinator Alaska-Yukon
description Digital point-occurrence records from the Global Biodiversity Information Facility (GBIF) and other data providers enable a wide range of research in macroecology and biogeography. However, data errors may hamper immediate use. Manual data cleaning is time-consuming and often unfeasible, given that the databases may contain thousands or millions of records. Automated data cleaning pipelines are therefore of high importance. This study examined the extent to which cleaned data from six pipelines using data cleaning tools (e.g., the GBIF web application, different R packages) affect downstream species distribution models. In addition, we assessed how the pipeline data differ from expert data. From 13,889 North American Ephedra observations in GBIF, the pipelines removed 31.7% to 62.7% false-positives, invalid coordinates, and duplicates, leading to data sets that included between 9,484 (GBIF application) and 5,196 records (manual-guided filtering). The expert data consisted of 703 thoroughly handpicked records, comparable to data from field studies. Although differences in the record numbers were relatively large, stacked species distribution models (sSDM) from the pipelines and the expert data were strongly related (mean Pearson's r across the pipelines: 0.9986, versus the expert data: 0.9173). The ever-stronger correlations resulted from occurrence information that became increasingly condensed in the course of the workflow (from individual occurrences to collectivized occurrences in grid cells to predicted probabilities in the sSDMs). In sum, our results suggest that the R package-based pipelines reliably identified invalid coordinates. In contrast, the GBIF-filtered data still contained both spatial and taxonomic errors. However, major drawbacks emerge from the fact that no pipeline fully discovered misidentified specimens without the assistance of expert taxonomic knowledge. 
We conclude that application-filtered GBIF data will still need additional review to achieve higher spatial data quality. Achieving ...
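The description above names three kinds of records that the pipelines removed: false positives, invalid coordinates, and duplicates. As a rough illustration of the two purely mechanical checks (coordinate validity and duplication), the following is a minimal Python sketch. It is not the authors' workflow and does not reproduce the GBIF application or any R package; the `Occurrence` type, the `clean` and `percent_removed` helpers, the validity rules, and the sample records are all hypothetical and chosen only for illustration.

```python
from typing import Iterable, NamedTuple


class Occurrence(NamedTuple):
    """Hypothetical minimal occurrence record (not a GBIF schema)."""
    species: str
    lat: float
    lon: float


def has_valid_coordinates(rec: Occurrence) -> bool:
    """Assumed rule: reject out-of-range values and the (0, 0) placeholder."""
    in_range = -90.0 <= rec.lat <= 90.0 and -180.0 <= rec.lon <= 180.0
    is_zero_zero = rec.lat == 0.0 and rec.lon == 0.0
    return in_range and not is_zero_zero


def clean(records: Iterable[Occurrence]) -> list[Occurrence]:
    """Drop invalid coordinates and exact duplicates, keeping first-seen order."""
    seen: set[Occurrence] = set()
    kept: list[Occurrence] = []
    for rec in records:
        if not has_valid_coordinates(rec) or rec in seen:
            continue
        seen.add(rec)
        kept.append(rec)
    return kept


def percent_removed(n_raw: int, n_clean: int) -> float:
    """Share of records removed by cleaning, in percent."""
    return 100.0 * (n_raw - n_clean) / n_raw


# Made-up sample records, for illustration only.
raw = [
    Occurrence("Ephedra viridis", 37.1, -112.9),
    Occurrence("Ephedra viridis", 37.1, -112.9),   # exact duplicate
    Occurrence("Ephedra nevadensis", 0.0, 0.0),    # placeholder coordinates
    Occurrence("Ephedra trifurca", 95.0, -106.5),  # latitude out of range
    Occurrence("Ephedra trifurca", 32.3, -106.5),
]
cleaned = clean(raw)
print(len(cleaned), percent_removed(len(raw), len(cleaned)))
```

Applied to the counts reported in the description, `percent_removed(13889, 9484)` gives roughly 31.7, which matches the lower bound stated for the GBIF application. Misidentified specimens cannot be caught by checks of this kind, which is consistent with the description's point that expert taxonomic knowledge is still required.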
format Other/Unknown Material
author Fuehrding-Potschkat, Petra
Ickert-Bond, Stefanie M.
title Data from: Influence of different data cleaning solutions of point-occurrence records on downstream macroecological diversity models
publisher Zenodo
publishDate 2022
url https://doi.org/10.5061/dryad.8pk0p2np4
genre Alaska
Yukon
op_relation https://zenodo.org/communities/dryad
https://doi.org/10.5061/dryad.8pk0p2np4
oai:zenodo.org:6834791
op_rights info:eu-repo/semantics/openAccess
Creative Commons Zero v1.0 Universal
https://creativecommons.org/publicdomain/zero/1.0/legalcode
op_doi https://doi.org/10.5061/dryad.8pk0p2np4