Data Paper
Corresponding author: Adriana E. Radulovici (adriana.radulovici@mcgill.ca)
Corresponding author: Filipe O. Costa (fcosta@bio.uminho.pt)
Academic editor: Fedor Čiampor Jr
© 2021 Adriana E. Radulovici, Pedro E. Vieira, Sofia Duarte, Marcos A. L. Teixeira, Luisa M. S. Borges, Bruce E. Deagle, Sanna Majaneva, Niamh Redmond, Jessica A. Schultz, Filipe O. Costa.
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Radulovici AE, Vieira PE, Duarte S, Teixeira MAL, Borges LMS, Deagle BE, Majaneva S, Redmond N, Schultz JA, Costa FO (2021) Revision and annotation of DNA barcode records for marine invertebrates: report of the 8th iBOL conference hackathon. Metabarcoding and Metagenomics 5: e67862. https://doi.org/10.3897/mbmg.5.67862
The accuracy of specimen identification through DNA barcoding and metabarcoding relies on reference libraries containing records with reliable taxonomy and sequence quality. The considerable growth in barcode data requires stringent data curation, especially in taxonomically difficult groups such as marine invertebrates. A major effort in curating marine barcode data in the Barcode of Life Data Systems (BOLD) was undertaken during the 8th International Barcode of Life Conference (Trondheim, Norway, 2019). Major taxonomic groups (crustaceans, echinoderms, molluscs, and polychaetes) were reviewed to identify records showing disagreement between Linnaean names and Barcode Index Numbers (BINs). The records with disagreement were annotated with four tags: a) MIS-ID (misidentified, mislabeled, or contaminated records), b) AMBIG (ambiguous records unresolved with the existing data), c) COMPLEX (species names occurring in multiple BINs), and d) SHARE (barcodes shared between species). A total of 83,712 specimen records corresponding to 7,576 species were reviewed and 39% of the species were tagged (7% MIS-ID, 17% AMBIG, 14% COMPLEX, and 1% SHARE). High percentages (>50%) of AMBIG tags were recorded in gastropods, whereas COMPLEX tags dominated in crustaceans and polychaetes. The high proportion of tagged species reflects either flaws in the barcoding workflow (e.g., misidentification, cross-contamination) or taxonomic difficulties (e.g., synonyms, undescribed species). Although data curation is essential for barcode applications, such manual attempts to examine large datasets are unsustainable and automated solutions are highly desirable.
annotation, data curation, DNA barcoding, marine invertebrates, metabarcoding, reference libraries
Reference libraries, which are collections of compliant DNA sequences assigned to species, constitute the backbone of species identification systems based on DNA barcoding and metabarcoding, and are therefore a critical component in molecular biomonitoring and molecular taxonomy (
Along with this expansion, reports of inaccurate or discordant data have become more common (
Data discordances in reference libraries have multiple origins that can be split into two broad categories. First, there are discordances that are due to real biological complexities. Some of these are merely a reflection of the inherent uncertainties and dynamics of alpha taxonomy (e.g.,
A workbench such as BOLD, where data can be easily corrected if needed, brings great value to the barcoding and metabarcoding pipelines. Several tools for automated data quality control have been implemented in BOLD, including flags to indicate if sequences of barcode markers (COI, MatK, RbcL, RbcLa, trnH-psbA, ITS, ITS2) are barcode compliant or if the protein-coding genes include stop-codons or common contaminants (e.g., human, cow, mouse, pig, bacteria). In addition, several analytical tools allow data congruence verification. For instance, discordances between species names attributed by BOLD users and the Barcode Index Numbers (BINs,
Reference libraries have been populated in part through dispersed contributions, despite a few central core facilities providing major inputs (e.g., Canadian Centre for DNA Barcoding, https://ccdb.ca). As a result, DNA sequence data, and respective metadata, which are uploaded to genetic data repositories such as BOLD or GenBank, have varied components and levels of compliance. The research practice also differs among target taxonomic groups, affecting even the type of vouchering system and metadata typically collected and accompanying each specimen (
The rationale for reviewing barcode data for marine invertebrates is particularly relevant. Marine invertebrates are often studied as a community and are one of the customary targets for marine biomonitoring using metabarcoding (
To accomplish the ambitious goal of manually curating marine invertebrate barcode data, a hackathon was organized within the scope of the 8th International Barcode of Life (iBOL) conference (Trondheim, Norway, 2019). A group of researchers involved in marine invertebrate barcoding was convened to undertake a comprehensive review and annotation of the barcode records of the most representative marine invertebrate taxa currently available in BOLD. The choice to focus exclusively on this platform was based on three factors: it is the largest database designed primarily for DNA barcodes and their metadata, it has analytical tools embedded in the platform, and it routinely mines data from GenBank, so that all DNA barcodes are hosted in one place regardless of which repository individual researchers prefer. This report describes the approach, findings and implications for the curation of reference libraries of DNA barcodes.
BOLD is a global database structured by few mandatory fields (e.g., phylum, country of collection, and institution storing voucher specimens), including habitat as an optional field overlooked in many records, therefore a specific workflow (Fig.
Workflow employed for the review and annotation of selected marine invertebrate records in BOLD. A subset of targeted invertebrate taxa was created from the initial list downloaded from WoRMS. This list was cross-referenced with the available taxonomic list from BOLD. Subsequently, only public BOLD records assigned to a BIN were integrated in datasets and screened with two analytical tools (BIN discordance report and neighbour-joining (NJ) tree). Records deemed to be uncertain were annotated with four pre-established tags: MIS-ID (misidentification, mislabeling or contamination), AMBIG (ambiguous record), COMPLEX (species complex), SHARE (barcode sharing between species). Records suspected to be misidentified or contaminated were annotated and subsequently removed by the BOLD team from the BOLD identification engine (BOLD IDS). Records deemed reliable were not annotated.
The revision workflow (Fig.
a) MIS-ID (misidentification or contamination) – records believed to be misidentified, mislabeled or contaminated,
b) AMBIG (ambiguous) – records that could not be resolved with the existing data,
c) COMPLEX (species complex) – records belonging to species with multiple BINs and, therefore, indicative of hidden or undescribed diversity,
d) SHARE (shared barcodes) – records belonging to species known to be sharing barcodes, due to incomplete lineage sorting or hybridization, based on existing literature.
Each uncertain record was annotated with only one tag. Because all unflagged records are used for BOLD IDS, MIS-ID tags were considered the most important and took precedence whenever a record fell under multiple tags (e.g., MIS-ID and COMPLEX). Since tags were applied to records and species were usually represented by multiple records, any given record can have only one tag, whereas a species may have multiple tags.
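The one-tag-per-record rule can be illustrated with a minimal Python sketch. Only two points of the ordering come from the text: MIS-ID takes precedence, and AMBIG is described in the Discussion as a last resort. The relative order of COMPLEX and SHARE, the function names, and the example species names are assumptions made for illustration.

```python
# Precedence used to pick a single tag per record.
# MIS-ID first and AMBIG last follow the text; COMPLEX before SHARE is an assumption.
TAG_PRECEDENCE = ["MIS-ID", "COMPLEX", "SHARE", "AMBIG"]


def resolve_tag(candidate_tags):
    """Return the single tag kept for a record, or None if no tag applies."""
    for tag in TAG_PRECEDENCE:
        if tag in candidate_tags:
            return tag
    return None


def tags_per_species(records):
    """Collect the set of record-level tags seen for each species.

    records: iterable of (species_name, candidate_tags) pairs, one per record.
    """
    species_tags = {}
    for species, candidates in records:
        tag = resolve_tag(candidates)
        if tag is not None:
            species_tags.setdefault(species, set()).add(tag)
    return species_tags


# Hypothetical records: one species ends up with two different tags.
example = [
    ("Species A", {"MIS-ID", "COMPLEX"}),  # MIS-ID wins for this record
    ("Species A", {"COMPLEX"}),
    ("Species B", set()),                  # reliable record, not annotated
]
print(tags_per_species(example))
```

Each record keeps exactly one tag, while "Species A" accumulates both MIS-ID and COMPLEX across its records.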
The hackathon included only the inspection of COI sequences and not the inspection of morphological specimens stored around the world, which leaves a small degree of uncertainty in the general findings. For instance, if a BIN included dozens to hundreds of sequences of species A and one sequence of species B, the record of species B was tagged as MIS-ID, although other explanations are possible (species B is correct and species A is incorrect; both are incorrect; both are correct, in case of unknown shared barcodes). All the records tagged as MIS-ID were submitted to the BOLD team so that they could also be flagged and removed from the database used for BOLD IDS. BOLD allows all its users to insert tags as a tool for data curation by the barcoding community. In contrast, flags can only be added by the BOLD team since they affect BOLD IDS. All flags and tags can be removed by the BOLD team, if necessary.
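The majority-versus-outlier reasoning described above was applied manually by reviewers. As a rough sketch of how such cases might be pre-screened, the function below lists minority species names in a BIN as candidates for MIS-ID review. The function name and both thresholds are hypothetical; the text only speaks of "dozens to hundreds" of records against one.

```python
from collections import Counter


def flag_minority_names(bin_species, min_majority=12, max_minority=1):
    """Suggest species names in one BIN that may deserve a MIS-ID review.

    bin_species: list of species names, one entry per record in the BIN.
    min_majority and max_minority are illustrative thresholds, not values
    taken from the hackathon protocol.
    """
    counts = Counter(bin_species)
    if len(counts) < 2:
        return set()  # concordant or empty BIN: nothing to flag
    top_name, top_count = counts.most_common(1)[0]
    if top_count < min_majority:
        return set()  # no clearly dominant name, leave for manual review
    return {
        name
        for name, n in counts.items()
        if name != top_name and n <= max_minority
    }


# Hypothetical BIN: 40 records of species A and a single record of species B.
print(flag_minority_names(["Species A"] * 40 + ["Species B"]))
```

The output is only a shortlist of candidates. As noted above, the minority name may still be the correct one, so any such suggestion would need human confirmation.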
While detailed attention was given to discordant BINs, records in concordant BINs (i.e., BINs including records bearing only one species name) and singleton BINs (i.e., BINs represented by only one record) were also reviewed, especially in cases of species with multiple concordant BINs (COMPLEX tag). Singletons were not annotated unless they were part of a species complex. The review of records (i.e., assignment of tags as well as additional notes) was recorded directly in the spreadsheets generated by BOLD as matching files for the NJ trees. Formulas were inserted to summarize the findings (number of records tagged, number of records and species per tag type, and number of taxa reviewed at each taxonomic rank). Due to the large amount of data requiring verification and the short time available, the work initiated during the hackathon continued during the months following the event. The results were illustrated through bar graphs using GraphPad Prism 9.0 (San Diego, CA, USA).
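The three BIN categories defined above (discordant, concordant, singleton) and the multi-BIN criterion behind the COMPLEX tag can be expressed compactly. The sketch below assumes records are available as (BIN identifier, species name) pairs; the identifiers and names in the example are placeholders, not real BOLD data.

```python
from collections import defaultdict


def classify_bins(records):
    """Label each BIN as 'singleton', 'concordant' or 'discordant'.

    records: iterable of (bin_id, species_name) pairs, one per record.
    """
    names = defaultdict(set)
    sizes = defaultdict(int)
    for bin_id, species in records:
        names[bin_id].add(species)
        sizes[bin_id] += 1

    status = {}
    for bin_id in names:
        if sizes[bin_id] == 1:
            status[bin_id] = "singleton"   # BIN represented by one record
        elif len(names[bin_id]) == 1:
            status[bin_id] = "concordant"  # records bear only one species name
        else:
            status[bin_id] = "discordant"  # more than one species name
    return status


def complex_candidates(records):
    """Return species names that occur in more than one BIN."""
    bins_by_species = defaultdict(set)
    for bin_id, species in records:
        bins_by_species[species].add(bin_id)
    return {s for s, bins in bins_by_species.items() if len(bins) > 1}


# Placeholder records for illustration only.
example = [
    ("BIN-1", "Species A"), ("BIN-1", "Species A"),
    ("BIN-2", "Species A"), ("BIN-2", "Species B"),
    ("BIN-3", "Species C"),
]
print(classify_bins(example))
print(complex_candidates(example))
```

In this example "Species A" appears in two BINs and would be a COMPLEX candidate, while "BIN-3" is a singleton that would be left unannotated under the rule stated above.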
All the records reviewed can be found in BOLD (dx.doi.org/10.5883/DS-HACK2019 and dx.doi.org/10.5883/DS-MOLL2019), and all the files with annotations are available in the Suppl. material
The initial WoRMS download had over 600,000 names from all taxonomic levels, but only approximately 200,000 names were accepted animal species names. Further filtering to invertebrate taxa of interest reduced the species list to 79,251 names as follows: Crustacea – 15,148 species, Echinodermata – 7,404 species, Mollusca – 44,883 species, and Polychaeta – 11,816 species. Only a small percentage of these species (about 10%) had barcode representation in BOLD (Table
Distribution of the reviewed DNA barcode records among the major taxonomic groups, taxonomic ranks and BINs analyzed, together with the number of tagged (MIS-ID, AMBIG, COMPLEX, SHARE) DNA barcode records and species.
Taxonomic Group | Phyla | Orders | Families | Genera | Species | BINs | DNA barcode records | Tagged species | Tagged records |
---|---|---|---|---|---|---|---|---|---|
Bivalvia | Mollusca | 26 | 71 | 279 | 741 | 672 | 10,194 | 330 | 5,088 |
Gastropoda | Mollusca | 38 | 233 | 1,066 | 3,982 | 4,235 | 39,749 | 1,582 | 15,581 |
Crustacea | Arthropoda | 5 | 107 | 349 | 828 | 1,129 | 12,647 | 290 | 6,443 |
Echinodermata | Echinodermata | 34 | 123 | 447 | 1,053 | 1,228 | 12,756 | 390 | 6,155 |
Polychaeta | Annelida | 12 | 61 | 349 | 972 | 1,201 | 8,366 | 365 | 3,434 |
Total | 4 | 115 | 595 | 2,490 | 7,576 | 8,465 | 83,712 | 2,957 | 36,701 |
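The totals in the table and the percentages quoted in the text can be recomputed from the per-group rows. The short check below copies the species, BIN, record and tagged counts from the table; the variable and function names are arbitrary.

```python
# Per-group counts copied from the table:
# (species, BINs, records, tagged species, tagged records)
TABLE = {
    "Bivalvia":      (741, 672, 10194, 330, 5088),
    "Gastropoda":    (3982, 4235, 39749, 1582, 15581),
    "Crustacea":     (828, 1129, 12647, 290, 6443),
    "Echinodermata": (1053, 1228, 12756, 390, 6155),
    "Polychaeta":    (972, 1201, 8366, 365, 3434),
}

# Column sums reproduce the "Total" row: (7576, 8465, 83712, 2957, 36701).
totals = tuple(sum(column) for column in zip(*TABLE.values()))


def pct(part, whole):
    """Percentage of part in whole, rounded to one decimal place."""
    return round(100 * part / whole, 1)


species, bins, records, tagged_species, tagged_records = totals
print("tagged species (%):", pct(tagged_species, species))   # about 39%
print("tagged records (%):", pct(tagged_records, records))   # about 44%
print("Gastropoda share of records (%):", pct(TABLE["Gastropoda"][2], records))
print("Gastropoda share of species (%):", pct(TABLE["Gastropoda"][0], species))
```

The column sums match the "Total" row, and the shares agree with the rounded values reported in the abstract and results (39% of species tagged, about 44% of records tagged).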
Globally, the hackathon effort resulted in the review of 83,712 DNA barcode records, distributed across 8,465 BINs, corresponding to 7,576 marine invertebrate species from four phyla, 115 orders, 595 families and 2,490 genera (Table
Gastropoda was the taxonomic group with the highest number of reviewed records (47.5%) and the highest number of species (53%) in the dataset (Table
Number of species, BINs, discordant BINs, and singletons (species with only one DNA barcode record) for all groups analyzed and for each major taxonomic group separately. Numbers above bars indicate the percentage of discordant BINs and singletons, respectively.
The number of BINs was highest in Gastropoda and lowest in Bivalvia (Table
Across the entire dataset reviewed, approximately 22% of BINs displayed discordance (Fig.
Nearly 39% of all species in the dataset were deemed uncertain (Fig.
Distribution of the proportion of different tags in the reviewed dataset, in terms of species (A) and DNA barcode records (B). The total number of species (A) and records (B) are added below the chart.
Approximately 44% of all reviewed DNA barcode records were tagged with one of the four initially defined tags: MIS-ID (3%), AMBIG (10%), COMPLEX (29%) and SHARE (2%) (Table
Distribution of the MIS-ID, AMBIG, COMPLEX, and SHARE tags among the major taxonomic groups, considering either the total number of species (A) or the total number of DNA barcode records (B). Numbers above bars indicate the percentage recorded within the whole tagged dataset and within each major taxonomic group.
The hackathon on marine invertebrate barcodes fulfilled a variety of purposes beyond the immediate verification of the congruence between morphology and molecular data, and subsequent revision and annotation of records submitted to BOLD. To our knowledge, it constituted the first initiative of its kind for invertebrates (though see
The investigation highlighted an important proportion of BOLD marine records that may lead to erroneous species identification, particularly those records tagged with MIS-ID and AMBIG (24% of reviewed species), in the context in which only a fraction (10%) of the world marine invertebrate species (of taxa of interest here) had any representation in BOLD. On the other hand, the revision also identified a relatively high proportion of species harboring undescribed intraspecific diversity (14% of species tagged as COMPLEX). Species with MIS-ID tags constituted a relatively small portion among the full set reviewed here (7%), although it is possible that some AMBIG records are MIS-ID but could not be fully resolved with the information available (see also discussion further below regarding AMBIG). The review uncovered substantial differences in the proportion of MIS-ID between taxonomic groups, with incidence percentages up to five to six times greater in Bivalvia and certain Gastropoda groups. This finding suggests that continued efforts to audit these two groups in particular are required. MIS-ID tags were below 4% in the remaining taxonomic groups. Despite the fact that misidentifications are not a very concerning fraction of the records, depending on the taxonomy and context of the research where the data is used (e.g., detection of non-indigenous marine species, see
The fact that the majority of AMBIG tags were applied to uncertain data was not surprising because the review took a conservative approach, and this tag was used as a last resort when no other tag could be reliably assigned. This might have inflated the number of AMBIG tags that would have been assigned to other categories if this precautionary approach had not been taken, but it is impossible to ascertain to what extent. On the other hand, a detailed taxa-partitioned inspection of the AMBIG records unraveled a highly unbalanced distribution, with some particular taxonomic groups like Nudibranchia, Littorinimorpha and Pulmonata contributing disproportionately to the global numbers of tagged species (26%, 31% and 54%, respectively; Suppl. material
It is important, but challenging, to distinguish misleading data resulting from errors in the barcoding workflow from inaccurate data resulting from a lack of basic taxonomic knowledge, unsolved taxonomic conundrums, unrecognized synonyms or a taxon’s status being in flux. Some of the AMBIG tags may result from misidentifications, while others may simply indicate unresolved taxonomies that, if sorted out, may reveal congruence between molecular and non-molecular data. In some cases, the molecular data may even reveal the “true” species boundaries currently masked by complex morphological traits. AMBIG-tagged records should therefore be used with caution unless the end-user can find additional information to clarify them. A potential solution would be to avoid species-level identification when using these records, giving preference to higher rank assignments (although even errors at these ranks cannot be excluded with certainty). Recognizing taxonomic groups with a large number of AMBIG tags could provide a focus for more detailed taxonomic work to clarify the status of various species.
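For end-users of the annotated library, the cautious handling suggested above could look like the following sketch. It assumes each reference record is a dictionary with hypothetical keys `species`, `genus` and `tag`. Falling back to genus is one possible choice of "higher rank", and excluding MIS-ID records mirrors their removal from BOLD IDS described in the methods.

```python
def usable_identification(record):
    """Return (rank, name) to report for a reference record, or None to skip it.

    record: dict with keys 'species', 'genus' and 'tag' (tag may be None).
    The key names and the genus-level fallback are assumptions.
    """
    tag = record.get("tag")
    if tag == "MIS-ID":
        return None                        # suspected error: do not use
    if tag == "AMBIG":
        return ("genus", record["genus"])  # avoid a species-level call
    # Untagged, COMPLEX and SHARE records keep their species name here;
    # a stricter pipeline might treat SHARE like AMBIG.
    return ("species", record["species"])


# Placeholder record for illustration only.
ambiguous = {"species": "Genus-x species-y", "genus": "Genus-x", "tag": "AMBIG"}
print(usable_identification(ambiguous))
```

As the text cautions, even the higher-rank assignment is not guaranteed to be correct, so this rule reduces rather than eliminates the risk of error.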
The COMPLEX tag is the second most prevalent overall, but it is also the only one that does not necessarily preclude the accurate identification of specimens. It simply signals cases where possible undescribed intraspecific diversity was found. While usually COMPLEX meant a species split into two BINs, some cases of multiple splits were also found (e.g., Capitella neoaciculata with five BINs or Paracorophium excavatum with 15 associated BINs). Occurrences of multiple and highly divergent intraspecific lineages have been abundantly and increasingly reported in diverse groups of marine invertebrates, suggesting the existence of considerable hidden diversity (e.g.,
Although not so critical for the accuracy of identifications, at least according to the current status of taxonomic knowledge, there are important aspects of the COMPLEX tag to consider. Most notably, it helps when perceiving the overall quantity of presumptive marine invertebrate species awaiting verification and eventual consolidation and description. Failing to recognize this considerable amount of hidden diversity may be just as detrimental for bioassessment and monitoring as the MIS-ID or AMBIG cases (
A number of marine invertebrates with cosmopolitan or wide distributions are being discovered to be complexes comprising several units with narrower or restricted distributions (e.g.,
Species and records tagged with SHARE are by far the lowest proportion globally and within each taxonomic group. SHARE tags are associated with situations of low interspecific divergence coupled with incomplete sorting and haplotype sharing, as well as hybridization and introgression. As a result, rather than a reference library issue arising from flaws in the barcoding procedure, these indicate situations where the COI barcode sequences are unable to differentiate species based on values of genetic distances. They can, however, be used in situations of fully sorted and well-established species with records in the same BIN, sometimes separated by very low genetic distances. Either way, these results indicate that the occurrence of SHARE cases is minimal and can be promptly identified, or, in the latter case, circumvented through the accumulation of records into the libraries and refinement of the BIN assignment for that particular group. Gastropoda was again the group with the highest incidence of SHARE tags, reinforcing the perception that greater research effort is needed for taxonomic clarification of marine members of this group. As an example, Littorina saxatilis (BOLD:AAG1552) was found to share barcodes with L. compressa and L. arcana. Previous studies using various mitochondrial markers (NADH1, tRNApro, NADH6 and partial cytochrome b by Doellmann et al. 2011, COI by
The BIN discordance report generated in BOLD is a highly valuable validation tool that readily highlights uncertain cases in need of careful examination. However, concordant BINs are not exempt from misidentification. This applies especially to poorly represented BINs whose barcodes come from a single project, and were thus probably identified by one person, or from multiple projects in which BOLD users held no taxonomic information for their specimens and relied solely on the existing information in BOLD; if that information was erroneous initially, the error could have been propagated into their projects. In addition, singletons are very difficult, if not impossible, to verify. As the hackathon data included about 30–40% singletons for each taxonomic group investigated, it is possible that a larger proportion of the current marine data might need to be tagged with one of the four labels discussed above.
The challenges found in evaluating barcode data, particularly marine barcode data, point to the need for better practices when generating, analyzing, and publishing barcode data. BIN discordances owing to synonyms might be avoided with greater synchronization between WoRMS and BOLD. Interim species names, whether derived from original BOLD records or data mined from GenBank and accounting for a substantial proportion of BIN discordances, would benefit from being checked using BOLD IDS on a regular basis and updated if matches are found (although difficulties of taxonomic updates for GenBank-mined data have been already mentioned).
The one-day hackathon and the following months of annotation work contributed significantly to the curation of the BOLD DNA barcode reference libraries for key marine invertebrate groups. Although numerous significant taxonomic groups were omitted from analyses, it was still a massive undertaking that required the individual review and annotation of a large number of records and species. Despite the significant efforts, the hackathon only provided a snapshot of BOLD marine data from June 2019. Records that were flagged or tagged during the hackathon would ideally be cleared in a short period of time, by a coordinated effort by BOLD data owners together with the BOLD team, allowing them to be included in reliable and trustworthy barcode libraries.
Ideally, this kind of event should be repeated on a regular basis, in tandem with the addition of new entries to reference libraries. However, one clear lesson from this exercise is that the immense effort required to complete the task should not be underestimated, and that it could hardly be repeated in the same format.
Indeed, a much more practical approach is needed in future endeavors, and this pilot exercise provided some possible solutions to substantially simplify the review procedure. For instance, recent applications such as BAGS (
Therefore, whereas ML- and AI-type approaches may considerably reduce the number of records requiring review, turning hackathon-like initiatives into practical and feasible commitments, human-mediated verification will ultimately still be needed, hopefully for only a minor set of records. In this regard, DNA barcode reference libraries are no different from other biodiversity data, and, ideally, strategies for data curation through community involvement, similar to the community of editors curating taxonomic data on WoRMS, could be used as inspiration and transposed to the DNA barcoding practice.
The hackathon was organized with financial support from the European Union COST Action DNAqua-Net (CA 15219 https://dnaqua.net/) in the scope of the 8th International Barcode of Life Conference in Trondheim, Norway on 16 June 2019. DNAqua-Net is acknowledged for the funding provided and the local conference organizers for all the logistical support that ensured a successful event. Tyler Elliot and the rest of the BOLD team are acknowledged for their help with data queries and analytics. The authors also thank the hackathon participants for vibrant discussions during and after the event: Berry van der Hoorn, Katrine Konsghavn, Guy Paz, Mouna Rifi, Malin Strand, Anne Helene Tandberg, Adam Wall, and Endre Willassen. Marcos A. L. Teixeira was supported by a PhD grant from the Portuguese Foundation for Science and Technology (FCT I.P.) co-financed by ESF (SFRH/BD/131527/2017). Financial support granted by FCT to Sofia Duarte (CEECIND/00667/2017) and to Pedro E. Vieira (project NIS-DNA, PTDC/BIA-BMA/29754/2017) is also acknowledged. Sanna Majaneva was financially supported by the Norwegian Taxonomy Initiative (project no. 70184235). The authors thank the five reviewers who provided valuable input into the earlier version of the manuscript.
Figure S1 and Tables S1–S10
Data type: Image and tables (in .zip archive)
Explanation note: Figure S1. Number of species tagged with AMBIG within each order in the Gastropoda. Table S1. Amphipoda. Table S2. Bivalvia. Table S3. Gastropods1. Table S4. Gastropods2. Table S5. Gastropods3. Table S6. Gastropods4. Table S7. Gastropods5. Table S8. Crustacea. Table S9. Echinodermata. Table S10. Polychaeta.