Research Article
Corresponding authors: Audrey Bourret (audrey.bourret@dfo-mpo.gc.ca), Geneviève J. Parent (genevieve.parent@dfo-mpo.gc.ca)
Academic editor: Florian Leese
© 2023 Audrey Bourret, Claude Nozères, Eric Parent, Geneviève J. Parent.
This is an open access article distributed under the terms of the CC0 Public Domain Dedication.
Citation:
Bourret A, Nozères C, Parent E, Parent GJ (2023) Maximizing the reliability and the number of species assignments in metabarcoding studies using a curated regional library and a public repository. Metabarcoding and Metagenomics 7: e98539. https://doi.org/10.3897/mbmg.7.98539
Biodiversity assessments relying on DNA have increased rapidly over the last decade. However, the reliability of taxonomic assignments in metabarcoding studies is variable and affected by the reference databases and the assignment methods used. Species-level assignments are usually considered reliable when made with regional libraries but unreliable when made with public repositories. In this study, we aimed to test this assumption for metazoan species detected in the Gulf of St. Lawrence in the Northwest Atlantic. We first created a regional library (GSL-rl) by data mining COI barcode sequences from BOLD, and included a reliability ranking system for species assignments. We then 1) estimated the accuracy and precision of the public repository NCBI-nt for species assignments using sequences from the regional library and 2) compared the detection and reliability of species assignments of a metabarcoding dataset using either NCBI-nt or the regional library and popular assignment methods. With NCBI-nt and sequences from the regional library, the BLAST-LCA (least common ancestor) method was the most precise method for species assignments, but the accuracy was higher with the BLAST-TopHit method (>80% over all taxa, between 70% and 90% amongst taxonomic groups). With the metabarcoding dataset, the reliability of species assignments was greater using GSL-rl than NCBI-nt. However, we also observed that the total number of reliable species assignments could be maximized using both GSL-rl and NCBI-nt with different optimized assignment methods. The use of a two-step approach for species assignments, i.e., using a regional library and a public repository, could improve the reliability and the number of detected species in metabarcoding studies.
classifier, cytochrome C oxidase I, GenBank, marine species, metagenomics, reference sequence library
Biodiversity assessments and monitoring using DNA have increased rapidly over the last decade given the high potential of this non-intrusive approach to uncover biodiversity efficiently with limited effort (
Several public repositories exist and can be used as reference databases to provide taxonomic assignments in metabarcoding studies. The public National Center for Biotechnology Information Nucleotide database (NCBI-nt, including the GenBank database) is the largest sequence repository and is widely used in eDNA metabarcoding studies (
Alternatively, curated regional libraries have been shown to reduce errors in species assignments (
Another source of variability in species assignments is the bioinformatic software and pipelines used in metabarcoding studies. Recently, studies have started to evaluate the accuracy of taxonomic assignments using various bioinformatic methods (
This study aimed to estimate the accuracy and precision of species assignments using the public repository NCBI-nt and to contrast the reliability of using NCBI-nt and a regional library for species assignments of a metabarcoding dataset, with popular taxonomic assignment methods (Fig.
Schematic representation of the three main steps of this study. A Creation of a regional library for metazoans from the Gulf of St. Lawrence (GSL-rl). Sequences were selected from BOLD and curated through multiple filtering and auditing steps (see Fig.
The creation of a curated regional library for the Gulf of St. Lawrence (hereafter GSL regional library: GSL-rl) was done through multiple rounds of data mining on BOLD for marine metazoan species (i.e., vertebrate and invertebrate) and revisions based on quality and similarity of sequences (Fig.
We created the GSL-rl to identify molecular operational taxonomic units (MOTUs) at the species level. Each species in the GSL-rl was ranked based on sequence availability and similarity (Fig.
We used the curated sequences from the GSL-rl to evaluate the accuracy and precision of species assignments using NCBI-nt (Fig.
We evaluated two performance parameters, i.e., accuracy and precision, for species assignments using NCBI-nt. To compute these parameters, we classified each taxonomic assignment (following
• A true positive (TP), or accurate species assignment, if the sequence was assigned to the correct species, e.g., an Ammodytes hexapterus sequence correctly identified as Ammodytes hexapterus.
• A false positive (FP), or inaccurate species assignment, if the sequence was assigned to an incorrect species, e.g., an Ammodytes hexapterus sequence incorrectly identified as Ammodytes marinus.
• A false negative (FN) if the assignment was at a taxonomical level higher than species, no matter if the assignment was correct or not, e.g., an Ammodytes hexapterus sequence classified as Ammodytes sp. This is equivalent to an under-classification error (
The accuracy, reflecting the proportion of accurate assignments at the species level, was defined as TP / (TP + FP + FN), whereas the precision was defined as TP / (TP + FP).
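The classification rules and the two performance parameters defined above can be illustrated with a minimal Python sketch. It is not the authors' script: the function names, the tuple layout and the toy records (including the Gadus morhua entry) are hypothetical and only serve to show how TP, FP and FN counts translate into accuracy and precision. The sketch also assumes that anything not assigned at the species level, for whatever reason, is counted as a false negative.

```python
from collections import Counter


def classify_assignment(true_species, assigned_taxon, assigned_rank):
    """Label one assignment as 'TP', 'FP' or 'FN' (hypothetical helper)."""
    if assigned_rank != "species":
        # Assignment above the species level: FN, whether correct or not.
        return "FN"
    return "TP" if assigned_taxon == true_species else "FP"


def accuracy(tp, fp, fn):
    """Accuracy as defined in the text: TP / (TP + FP + FN)."""
    total = tp + fp + fn
    return tp / total if total else float("nan")


def precision(tp, fp):
    """Precision as defined in the text: TP / (TP + FP)."""
    total = tp + fp
    return tp / total if total else float("nan")


# Toy records (illustrative only): (true species, assigned taxon, assigned rank)
assignments = [
    ("Ammodytes hexapterus", "Ammodytes hexapterus", "species"),  # TP
    ("Ammodytes hexapterus", "Ammodytes marinus", "species"),     # FP
    ("Ammodytes hexapterus", "Ammodytes", "genus"),               # FN
    ("Gadus morhua", "Gadus morhua", "species"),                  # TP
]

counts = Counter(classify_assignment(*record) for record in assignments)
acc = accuracy(counts["TP"], counts["FP"], counts["FN"])
prec = precision(counts["TP"], counts["FP"])
print(f"TP={counts['TP']} FP={counts['FP']} FN={counts['FN']}")
print(f"accuracy={acc:.3f} precision={prec:.3f}")
```

Because false negatives enter only the denominator of the accuracy, a conservative method that often stops at the genus level can have high precision and low accuracy at the same time.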
We compared the detection results from an eDNA metabarcoding dataset using GSL-rl and NCBI-nt and three assignment methods (Fig.
The three assignment methods compared were BLAST-LCA, BLAST-TopHit and IDtaxa. BLAST assignment methods were used as described in the previous section with both GSL-rl and NCBI-nt. NCBI-nt BLAST results were filtered to retain only metazoan detections and remove non-marine taxa (i.e., Homo sapiens, Arachnida, Insecta). IDtaxa is a classifier implemented within the DECIPHER R package (
We contrasted results obtained using GSL-rl and NCBI-nt with distinct ranking systems (Fig.
Raw sequence data from the metabarcoding dataset are available in the Sequence Read Archive (SRA) under the accession number PRJNA925571.
The data and scripts used in this manuscript are stored in the GitHub repository https://github.com/GenomicsMLI-DFO/GSL_COI_ref_library. The GSL-rl database (sequences, reliability ranking and trained dataset) can be found in the GitHub repository https://github.com/GenomicsMLI-DFO/MLI_GSL-rl.
The first version of GSL-rl comprised 1304 sequences covering 439 species (158 species from the phylum Chordata spanning 68 families; 281 species of invertebrates spanning 129 families and 9 phyla) and 11 other taxa at the genus level only (Vertebrates: 3 genera from 2 families; Invertebrates: 8 genera from 8 families and 4 phyla; Fig.
Classification of 651 marine metazoan species previously observed in the Gulf of St. Lawrence and included in GSL-rl, by phylum. Species reliability ranking is based on the availability of sequences for local species and on sequence similarity to closely related species in the Gulf of St. Lawrence.
We then provided a reliability ranking for each species within GSL-rl based on the completeness of sequences available (Fig.
The proportions of species assignments over all taxa were higher with the BLAST-TopHit method (range: 85.5–87.9%) than with the BLAST-LCA method (range: 47.6–71.0%) at all identity thresholds (Fig.
Taxonomic assignment results of sequences from GSL-rl using NCBI-nt and the BLAST-LCA and BLAST-TopHit methods. Proportions of accurate (true positive, TP) and inaccurate (false positive, FP) species assignments are presented A for all taxonomic groups at the three identity thresholds (95%, 97%, 99%) and B by taxonomic group at the 97% threshold.
The accuracy, or proportion of accurate species assignments, was higher with the BLAST-TopHit method compared to the BLAST-LCA method, over all taxa and in each taxonomic group at all identity thresholds (Fig.
The precision was greater for the BLAST-LCA method compared to the BLAST-TopHit method over all taxa at all thresholds (BLAST-LCA range: 95.7–96.9%, BLAST-TopHit range: 93.8–94.4%; Fig.
We used an eDNA metabarcoding dataset to compare the number and the reliability of species assigned using GSL-rl and NCBI-nt with three assignment methods. The five possible combinations of repository/library and assignment methods were GSL-rl and NCBI-nt with BLAST-LCA (1, 2), GSL-rl and NCBI-nt with BLAST-TopHit (3,4), and GSL-rl with IDtaxa (5; Fig.
Assignment results at the species level using a regional library (GSL-rl) or a public repository (NCBI-nt) and popular assignment methods. We used three assignment methods, namely IDtaxa (confidence levels: 40%, 50% and 60%), BLAST-LCA, and BLAST-TopHit (identity thresholds: 95%, 97% and 99%). A Detections for each species and B number of species assignments for each source of reference sequences and method; C Comparison of species rank for all the species assigned with the two sources. Species rank categories are based on sequence availability and sequence similarity to closely related species in the Gulf of St. Lawrence for GSL-rl and on the geographic plausibility for NCBI-nt.
Across all combinations, the highest and lowest numbers of species assigned were observed with NCBI-nt and BLAST-TopHit95 (66 species) and BLAST-LCA95 (44 species), respectively (Fig.
The assignment method with the maximum number of assigned species differed between GSL-rl and NCBI-nt. The maximum number of assigned species was 62 species with GSL-rl and IDtaxa40 and 66 species with NCBI-nt and BLAST-TopHit95 (Fig.
Large proportions of detected species were assigned exclusively with either GSL-rl or NCBI-nt. A total of 30 species (37.5% of all species detected) were assigned only using GSL-rl (12 species) or NCBI-nt (18 species; Fig.
The GSL-rl provides explicit reliability rankings for 651 species observed within the Gulf of St. Lawrence. We used two simple, broad categories, “Reliable” and “Unreliable”, to characterize the robustness of species assignments in eDNA metabarcoding studies. The “Reliable” category represented the vast majority of species with reference sequences (68.8%, 302 species) in GSL-rl. Similar results were obtained for marine fish species from Portugal with the COI locus (73.5%, grade A,
The GSL-rl contains reference sequences for 439 species of the 651 targeted species of interest for conservation in the Gulf of St. Lawrence (i.e., 67.4%), with reference sequences available for a relatively large proportion of invertebrates (i.e., 59.1%). In Europe, marine invertebrates represent the taxonomic group with the lowest barcode coverage, and only 22.1% have one or more sequences available (
The GSL-rl could also improve species assignments in eDNA metabarcoding studies of the Northwest Atlantic and the Arctic Oceans compared to large public databases. The Gulf of St. Lawrence is a transitional marine region where temperate southern species may occur alongside boreal and arctic species (
We estimated two performance parameters for metazoan species assignment using NCBI-nt, and observed large variations in results of performance parameters with the two assignment methods tested. While the BLAST-LCA method provided overall higher precision in species assignments, the accuracy was greater with BLAST-TopHit, an observation in line with a previous study (
Previous studies have shown that assignment methods can affect taxa detected in metabarcoding studies (
The accuracy and precision estimated here with sequences from GSL-rl will likely differ by the time this article is read, owing to the continuous growth of the public repository NCBI-nt. The publication of new sequences of low quality or with incorrect species identification can create unexpected ambiguities in species assignments as public repositories grow (
The method with the maximum number of species assigned to the metabarcoding dataset differed between GSL-rl and NCBI-nt. The IDtaxa40 assignment method provided the highest number of species assigned using GSL-rl. Sequence composition strategies for species assignments, such as IDtaxa and RDP, had contrasting performance results in previous benchmarking studies (
More than a third of the species assigned to the metabarcoding dataset (n = 33 out of 80) were exclusive to either GSL-rl or NCBI-nt. For GSL-rl, the exclusion of non-indigenous species or mislabeled sequences increased the number of species assigned, in line with previous studies showing improved species assignments with regional libraries (
Comparing the ranking categories of NCBI-nt and GSL-rl revealed an important improvement in reliability with our annotated regional library (Fig.
Our results showed that the use of a regional library increases both the reliability and number of species detected in an eDNA metabarcoding dataset. Yet, some species likely present in the Gulf of St. Lawrence were only detected with NCBI-nt, as discussed in the previous section. The growth of GSL-rl will increase the number of species that can be detected using the regional library, but unexpected species, such as new invasive species or species that have recently expanded their distribution, could remain undetected (
Combining the strengths of a regional library with those of public repositories in a two-step approach is consequently the optimal solution to maximize the reliability and number of species assigned in metabarcoding studies. Taxonomic assignments should first be performed with a regional library, ideally including a reliability ranking system as in the GSL-rl, to maximize the confidence in species assignments. We then strongly advise contrasting species assignment results from a regional library with those obtained using a public repository to increase the number of species detections (see also
We also encourage further benchmarking studies for the selection of optimal methods based on a broader comparison of assignment methods and the development of training sets for machine-learning methods. The choice between a more conservative (e.g., BLAST-LCA) or less conservative (e.g., BLAST-TopHit) approach for species assignments should also reflect the study objectives. Our study compared a limited number of assignment methods. We selected methods that are often used in eDNA metabarcoding studies and that also perform relatively well in benchmarking assignment studies (
We thank Grégoire Cortial and Jade Larivière for their input at the early stages of this study. We also thank Nick Jeffery, Christopher Hempel and two anonymous reviewers for helpful comments on previous versions of the manuscript. We thank Yanick Gendreau and Sandra Velasquez from the Coastal environmental baseline program and Geneviève Faille and Geneviève Côté from the Banc-des-Américains Marine Protected Area for eDNA sampling and the initial list of marine faunal species of interest.
Creation of Gulf of St. Lawrence regional library (GSL-rl) and creation of an eDNA metabarcoding dataset
Data type: 7z archive
Explanation note: List of species retrieved in BOLD under different names. List of taxa BIN number removed. Steps to obtain, filter and select publicly available sequences. Metadata of the GSL-rl, including ranking systems. [csv file]. BINs shared by two or more taxa. Taxa sharing more than one BIN within the GSL-rl. Characteristic of the curated regional library of the Gulf of St. Lawrence (GSL-rl) version 1.0. Sampling locations for the metabarcoding dataset in the St. Lawrence. Schematic representation of the metabarcoding bioinformatics pipeline.