Forum Paper
Corresponding author: Dmitry Schigel (dschigel@gbif.org)
Academic editor: Dirk Steinke
© 2022 R. Henrik Nilsson, Anders F. Andersson, Andrew Bissett, Anders G. Finstad, Frode Fossøy, Marie Grosjean, Michael Hope, Thomas S. Jeppesen, Urmas Kõljalg, Daniel Lundin, Maria Prager, Saara Suominen, Cecilie S. Svenningsen, Dmitry Schigel.
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Nilsson RH, Andersson AF, Bissett A, Finstad AG, Fossøy F, Grosjean M, Hope M, Jeppesen TS, Kõljalg U, Lundin D, Prager M, Suominen S, Svenningsen CS, Schigel D (2022) Introducing guidelines for publishing DNA-derived occurrence data through biodiversity data platforms. Metabarcoding and Metagenomics 6: e84960. https://doi.org/10.3897/mbmg.6.84960
DNA sequencing efforts of environmental and other biological samples disclose unprecedented and largely untapped opportunities for advances in the taxonomy, ecology, and geographical distributions of our living world. To realise this potential, DNA-derived occurrence data (notably sequences with dates and coordinates) – much like traditional specimens and observations – need to be discoverable and interpretable through biodiversity data platforms. The Global Biodiversity Information Facility (GBIF) recently headed a community effort to assemble a set of guidelines for publishing DNA-derived data. These guidelines target the principles and approaches of exposing DNA-derived occurrence data in the context of broader biodiversity data. They cover a choice of terms using a controlled vocabulary, common pitfalls, and good practices, without going into platform-specific details. Our hope is that they will benefit anyone interested in better exposure of DNA-derived occurrence data through general biodiversity data platforms, including national biodiversity portals. This paper provides a brief rationale and an overview of the guidelines, an up-to-date version of which is maintained at https://doi.org/10.35035/doc-vf1a-nr22. User feedback and interaction are encouraged as new techniques and best practices emerge.
Keywords: biological data management, DNA sequences, metabarcoding, metagenomics, occurrence record, open data, scientific credit, scientific reproducibility
The last 30 years have brought an increased understanding of the immense power of molecular methods for documenting the diversity of life on earth. DNA-derived data enable us to also record inconspicuous and even undescribed species – taxa that typically fall below the radar of vetted protocols for field work, checklists, and depositions into natural science collections. Expanding the concept of biological occurrences to routinely include molecular detections is a hotly discussed topic that has only relatively recently moved beyond the conceptual stage, through the Global Biodiversity Information Facility’s (GBIF; www.gbif.org/) inclusion of fungal molecular occurrence data (
Our goal was to make the guide comprehensive enough to cover at least the most popular of the many DNA-based approaches used to characterise the world’s biota, with a primary focus on metabarcoding, metagenomics, and quantitative PCR (qPCR and ddPCR). The guide assumes the data to have been collected, processed, and analysed in appropriate ways (
The mapping process started with a spreadsheet comparison of (meta)data fields used in a selection of sequence-based datasets provided by GBIF including, e.g., output from the MGnify pipeline (
Blending individual elements from existing standards may risk jeopardising universality and inclusiveness of detail in the resulting mix but should improve interoperability and maximise the coverage of cases across biomes (the minimum standard approach, see
At the time of writing, none of GBIF, OBIS, or ALA is capable of directly ingesting biological samples from observation (taxon/operational taxonomic unit) contingency tables. Therefore, the mapping step in Fig.
Overall workflow for DNA sequence-derived biodiversity data as described in the guide (https://doi.org/10.35035/doc-vf1a-nr22). Chapter numbers refer to chapters in the guide.
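The mapping step from a contingency table to individual occurrence records can be illustrated with a minimal Python sketch. It assumes a small in-memory sample-by-OTU table of read counts. The function name `table_to_occurrences`, the sample and OTU identifiers, the read counts, and the decision to skip zero counts are all hypothetical choices made for illustration. The keys are Darwin Core term names, but which terms are required or recommended should be checked against the guide itself.

```python
from typing import Dict, List

# Hypothetical sample-by-OTU contingency table (made-up read counts).
otu_table: Dict[str, Dict[str, int]] = {
    "sample_A": {"OTU_1": 120, "OTU_2": 0, "OTU_3": 30},
    "sample_B": {"OTU_1": 0, "OTU_2": 45, "OTU_3": 5},
}


def table_to_occurrences(table: Dict[str, Dict[str, int]]) -> List[Dict[str, object]]:
    """Flatten a wide contingency table into one record per detected OTU per sample."""
    records: List[Dict[str, object]] = []
    for sample_id, counts in table.items():
        total_reads = sum(counts.values())  # total reads in this sample
        for otu_id, reads in counts.items():
            if reads == 0:
                continue  # assumption: a zero count is not reported as an occurrence
            records.append({
                "occurrenceID": f"{sample_id}:{otu_id}",  # illustrative identifier scheme
                "eventID": sample_id,
                "taxonID": otu_id,
                "organismQuantity": reads,
                "organismQuantityType": "DNA sequence reads",
                "sampleSizeValue": total_reads,
                "sampleSizeUnit": "DNA sequence reads",
            })
    return records


occurrences = table_to_occurrences(otu_table)
for row in occurrences:
    print(row)
```

Keeping the per-sample total next to each read count lets a data user compute relative abundance per sample without access to the original table.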
The guide is maintained at https://doi.org/10.35035/doc-vf1a-nr22. We intend it to be a living document that is updated as new techniques and best practices emerge, and for this reason the guide is not presented in a static version in the present publication. An overview of the aspects covered by the guide is provided in Fig.
Outline of a platform for reporting and publishing DNA sequences and associated metadata (green box) based on existing systems and data standards (grey boxes). An envisioned system for regular updates of results based on machine-to-machine reading of data (white box) can read and update either the Darwin Core Archive or various other administration systems. The data transfer between the various elements (black arrows) will require various degrees of data transformation and harmonisation and may include either mechanical or human quality assessment. The items “DNA-derived data extension” and “Measurements & Facts” refer to data that must, should, or could be bundled with occurrence data and are detailed in section 2.2 of the guide.
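As a rough illustration of what bundling extension data with occurrence data can look like, the following Python sketch writes one hypothetical record set as two linked tables: an occurrence core and a sequence-related extension, joined on a shared identifier. The record values (identifier, date, coordinates, sequence) are made-up placeholders, the helper `write_table` is hypothetical, and the selection of fields is an assumption rather than the guide's required set. A real Darwin Core Archive also needs a descriptor file and packaging, which are omitted here.

```python
import csv
import io
from typing import Dict, List

# Assumed split of fields between the occurrence core and the extension table.
CORE_FIELDS = ["occurrenceID", "eventDate", "decimalLatitude", "decimalLongitude"]
EXTENSION_FIELDS = ["occurrenceID", "DNA_sequence", "target_gene"]

# Placeholder record mixing occurrence and sequence information (not real data).
records: List[Dict[str, object]] = [
    {
        "occurrenceID": "sample_A:OTU_1",
        "eventDate": "2021-08-01",
        "decimalLatitude": 59.33,
        "decimalLongitude": 18.07,
        "DNA_sequence": "ACGTACGT",
        "target_gene": "ITS",
    },
]


def write_table(rows: List[Dict[str, object]], fields: List[str]) -> str:
    """Serialise only the requested fields of each row as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


core_csv = write_table(records, CORE_FIELDS)
extension_csv = write_table(records, EXTENSION_FIELDS)
print(core_csv)
print(extension_csv)
```

The shared `occurrenceID` column is what allows a consumer to rejoin each extension row with its core occurrence record.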
The stakeholders of biodiversity data platforms are very diverse. Users and data depositors include students, researchers, biodiversity data managers, governmental and private agencies, policy makers, and bioinformaticians. Not all stakeholders may be in the habit of approaching biological evidence through molecular means, but we sense that the interest in exploring DNA-derived data through biodiversity data platforms is growing steadily. While this highlights the need for a set of guidelines and recommendations of the present kind, it also suggests that situations and cases unforeseen by the authors and contributors of this guide are likely to surface. Similarly, recommendations and best practices are likely to change over time as new techniques and approaches emerge. We are, for instance, in the process of considering resources such as the BOLD Handbook, the Biological Observation Matrix (BIOM) format, and the EDAM ontology of bioscientific data analysis and data management (http://edamontology.org/page). Likewise, data formats that support more complex relational and hierarchical data – notably the Frictionless Data Format – are interesting and very relevant developments for the study of biodiversity. The guide has already seen a number of minor updates and improvements since its formal August 2021 release, and our ambition is to keep it updated over time. User feedback is a crucial component of this endeavour, and we warmly welcome user interaction at the URL provided in the Results section.
The purpose of exposing DNA-derived occurrence data through biodiversity platforms is to enable reuse of these data alongside other biodiversity data types. Connecting DNA sequences to traditional nomenclature through voucher specimen sequencing is still in progress in genetic reference databases. Indeed, recording sequences alongside occurrences will allow continuous update and reconfirmation of taxonomic classifications. To facilitate comparisons to traditional observations, links to databases of scientific names should be maintained. For example, OBIS adopted the present guidelines and additionally requires a direct link with Linnean names through the World Register of Marine Species (WoRMS; https://www.marinespecies.org) catalogue. As these guidelines are developed and adopted by multiple biodiversity data networks, sharing the large amounts of data arising from genetic studies will become easier, which should promote wider use of those data. Future plans include work to enable publishing datasets across both GBIF and OBIS through a single data submission instance.
A hurdle towards the goal of integrating DNA-based occurrences into routine biological practice is the somewhat poor track record of biology when it comes to making actual research data available to begin with (e.g.,
The participation of AFA, DL, and MP in this project was partly funded through the Swedish Biodiversity Data Infrastructure (SBDI), which is funded by its partner organisations and the Swedish Research Council VR through Grant No 2019-00242.
The authors have declared that no competing interests exist.
Valuable discussions with members of the ELIXIR, iBOL, GGBN, GLOMICON, and OBIS networks contributed to the compilation of this draft. We are especially grateful for input and encouragement from Kessy Abarenkov, Andrew Bentley, Matt Blissett, Pier Luigi Buttigieg, Kyle Copas, Camila A. Plata Corredor, Gabriele Dröge, Torbjørn Ekrem, Tobias Guldberg Frøslev, Birgit Gemeinholzer, Quentin Groom, Tim Hirsch, Donald Hobern, Hamish Holewa, Corinne Martin, Raissa Meyer, Chris Mungall, Daniel Noesgaard, Corinna Paeper, Pieter Provoost, Tim Robertson, Maxime Sweetlove, Andrew Young, John Waller, Ramona Walls, John Wieczorek, and Lucie Zinger, who contributed to the GBIF community review process. Finally, we acknowledge the important role of Andrew Young in instigating the guidelines effort. An anonymous reviewer is acknowledged for providing valuable feedback on an earlier draft of the manuscript.