Research Article |
Corresponding author: Filipe O. Costa ( fcosta@bio.uminho.pt ) Academic editor: Kelly D. Goodwin
© 2024 João T. Fontes, Kazutaka Katoh, Rui Pires, Pedro Soares, Filipe O. Costa.
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Fontes JT, Katoh K, Pires R, Soares P, Costa FO (2024) Benchmarking the discrimination power of commonly used markers and amplicons in marine fish (e)DNA (meta)barcoding. Metabarcoding and Metagenomics 8: e128646. https://doi.org/10.3897/mbmg.8.128646
|
Environmental DNA (eDNA) metabarcoding is revolutionising the study of aquatic ecosystems, enabling high-throughput analysis of biodiversity with minimal disturbance. Despite its potential to support fisheries management, species identification and downstream analysis reliability are hindered by the lack of standardisation in DNA fragment choice. This study compares the species discrimination power of three markers used in marine fish (e)DNA (meta)barcoding – 12S rRNA, 16S rRNA and cytochrome oxidase subunit I (COI) – as well as two amplicons for each. We analysed sequences from NCBI GenBank for 10 orders of Actinopterygii (ray-finned fishes), including mitochondrial genomes. We assessed species discrimination by determining the percentage of monophyletic species in Neighbour-Joining trees and calculating congeneric divergences for two datasets: one with genomic regions extracted from mitochondrial genomes (771 species) and another with independent sequences for each region (3,879 species). Amongst (meta)barcoding amplicons in the mitochondrial genomes’ dataset, the COI Folmer and Leray-Lobo regions had the highest discriminatory power, with 89.2% and 87.0% monophyletic species, respectively, while the 12S Teleo region had the lowest at 71.6%. Conversely, using independent sequences of these amplicons, Folmer and Leray-Lobo had the lowest percentages of monophyletic species, at 64.8% and 63.5%, respectively, while Actinopterygii 16S (Ac16S) had the highest at 83.0%. Species discrimination is influenced by the marker’s evolutionary rate, fragment length, target fish order and the quality of reference sequence data. We recommend considering species discriminatory power differences for amplicon selection, especially for species-level identifications. We advise a standard multi-marker approach under certain scenarios, particularly when the presence of close congeneric species is expected.
Actinopterygii, environmental DNA, fisheries management, mitochondrial genome, molecular markers, species identification, taxonomic resolution
The study of aquatic ecosystems is on the verge of a revolution due to emerging molecular tools, namely (e)DNA (meta)barcoding. With minimal disturbance to the environment, eDNA metabarcoding makes use of the genetic material that is naturally shed by organisms to detect species using a high-throughput approach (
Fundamentally, the bioinformatics in taxonomic assignment during (e)DNA metabarcoding involves comparing high-throughput sequencing (HTS) reads of short-sized amplicons with reference sequences to identify species from an environmental or bulk sample (
Being the standard marker for DNA barcoding, COI is the backbone of animal species identification using molecular approaches (
Both 12S and 16S have reported technical merits that make them useful for species detection (
Challenges for this type of analysis include the fragmented nature of DNA sequence data in public repositories, particularly for 12S and 16S, since there is no standardised region established for these markers, unlike the COI barcode region (
In this study, our goal was to conduct a comprehensive assessment of the capacity of commonly used markers in fish (e)DNA (meta)barcoding — 12S, 16S and COI — and six of their respective amplicons, to discriminate between species across 10 different orders of commercially and ecologically relevant marine Actinopterygii, thereby benchmarking their level of taxonomic resolution for (e)DNA-based studies. We utilised the concept of monophyly as the basis for determining species discriminatory power as it simplifies the quantification of taxonomic resolution by examining the congruence of sequences within specific clades.
Our analyses were conducted using two distinct datasets comprising publicly available sequences from NCBI GenBank: a) a dataset of sequences for the three markers extracted from mitochondrial genomes, thereby allowing direct comparison of discrimination ability between markers and amplicons for the same individuals and b) a dataset including independent sequences of the three full-gene markers, as well as their respective amplicons, allowing a comprehensive analysis of the discrimination capacity, based on the extant data. In addition, we calculated and compared each fragment’s congeneric divergence to further grasp the potential reasons behind differences in discrimination capability.
This comprehensive evaluation highlights the critical need for standardisation in the selection and application of (e)DNA metabarcoding markers in marine fish studies. Establishing a standardised approach to marker choice, including their demonstrated taxonomic resolution and discrimination capability as criteria, will ensure the reliability and comparability of biodiversity assessments across different marine ecosystems. By benchmarking the effectiveness of these markers and their respective amplicons, our study supports the development of consensus guidelines that can enhance the accuracy of species-level identifications.
We mined 12S, 16S and COI sequences for Actinopterygii from NCBI GenBank (
Using R (
To diminish the impact of indels, particularly in the cases of 12S and 16S regions and to obtain more reliable multiple sequence alignments, we made subsets of each marker’s datasets into five groups of fishes. Each group comprised a pair of closely-related orders that were well-represented, meaning they had a substantial number of sequences, species and sequences per species. These pairings were selected according to the phylogeny reconstructed by
Using the datasets obtained for each group of marine fishes and each marker, we extracted the following genomic regions (Fig.
To extract the regions from the mined sequences, we used a MAFFT (
For consistency, we used query sequences from records containing all target regions, hereby focusing on complete or nearly complete mitochondrial genomes. For each pair of fish orders, we used one query sequence from each of the better-represented families in that group’s mitochondrial genome datasets (i.e. families representing at least 10% of the dataset).
A custom Python script identified the primer annealing regions for the 12S and 16S amplicons to obtain the query sequences. For the full COI, 12S and 16S gene regions, which are readily annotated in the GenBank mitochondrial genome records, we downloaded sequences directly from the respective records’ pages. For the Folmer region, since it usually lacks indels, we used ClustalW (
To determine and compare species-level discrimination power between regions, we solely used GenBank records where all target regions in Fig.
We realigned the sequences using MAFFT and proceeded to construct Neighbour-Joining trees in MEGA 11, with 1,000 bootstrap replicates. To ensure consistency and estimate reliable trees for shorter regions (e.g. Teleo, MiFish-U or Berry), we used the uncorrected pairwise distance as the substitution model, with pairwise deletion for missing sites.
After obtaining the trees, we used the MonoPhy (
For each genus, we used the msa R package (
We then performed the same analysis on the complete datasets, using sequences locally aligned for each independent genomic region. Due to computational restraints imposed by larger datasets (e.g. the Folmer region dataset for PS comprised over 15,000 sequences), we kept only two duplicate haplotypes per species for tree construction. We kept two duplicates instead of one to allow singletons the possibility of becoming intruders for species with only one haplotype, but more than one sequence. For calculating congeneric divergence values, we removed all duplicate haplotypes within each species. Preliminary analyses without duplicate removal showed no differences in results.
For the complete datasets, since they are not directly comparable between genomic regions, we calculated the percentage of monophyletic species and identified the reasons for non-monophyly: i) the percentage of non-monophyletic species sharing a haplotype with at least one sequence from another species, ii) the percentage of non-monophyletic species with at least one intruder, iii) the percentage of non-monophyletic species with at least one outlier (i.e. a sequence from a species clustering outside its species’ clade) and iv) the percentage of non-monophyletic species with both an intruder and an outlier. These percentages were calculated excluding singletons, as they cannot have intruders or outliers, though they can become either. The R and Python scripts used in this study are available at https://github.com/tadeu95/Species_Discrimination. Additional methodology details are provided in Suppl. material
The number of sequences and species that were locally aligned to the reference records mined from GenBank are represented in Table
Number of sequences and species locally aligned using MAFFT for the five groups of marine fishes, for each genomic region analysed.
Region | Sequences | Species |
---|---|---|
Berry | 10,482 | 2,061 |
Ac16S | 4,078 | 1,521 |
Full 16S | 2,845 | 1,069 |
Teleo | 4,136 | 1,423 |
MiFish-U | 7,780 | 2,117 |
Full 12S | 3,127 | 1,104 |
Leray-Lobo | 49,757 | 3,314 |
Folmer | 48,331 | 3,310 |
Full COI | 4,168 | 1,064 |
The dataset of mitochondrial genomes comprised 2,094 sequences across 771 species. The percentage of monophyletic species for the overall dataset, as well as for each group of fishes and genomic region, is shown in Fig.
Bar plots displaying the overall and per group percentage of monophyletic species obtained in the analysis using mitochondrial genomes. For each genomic region, CA datasets had 403 sequences and 114 species, GK datasets had 282 sequences and 118 species, PS datasets had 774 sequences and 281 species, PC datasets had 304 sequences and 133 species and SS datasets had 331 sequences and 125 species.
Box plots displaying the overall and per group average congeneric pairwise distance for each genus, calculated in the analysis using mitochondrial genomes. For each genomic region, CA had 344 sequences, 84 species and 23 genera; GK had 178 sequences, 75 species and 27 genera; PS had 564 sequences, 183 species and 43 genera; PC had 204 sequences, 87 species and 30 genera; and SS had 232 sequences, 76 species and 23 genera.
Scatter plots displaying the pairwise congeneric divergence values for six pairs of genomic regions. Each data point represents the pairwise distance between two sequences from different species within the same genus, derived from the same mitochondrial genomes for each pair of regions.
Number of haplotypes observed in the mitochondrial genome dataset for each of the nine genomic regions analysed. Each dataset comprised 2,094 mitogenomes for 771 species.
Region | No. of Haplotypes |
---|---|
Berry | 903 |
Ac16S | 914 |
Full 16S | 1,239 |
Teleo | 720 |
MiFish-U | 820 |
Full 12S | 1,074 |
Leray-Lobo | 981 |
Folmer | 1,096 |
Full COI | 1,276 |
The overall percentage of monophyletic species obtained for the complete datasets of each genomic region and the proportion of reasons behind non-monophyly, are represented in Table
Number of sequences, species, average number of sequences per species (SPS), percentage of monophyletic species, percentage of non-monophyletic species that share a haplotype with another species, percentage of non-monophyletic species with at least one intruder sequence, percentage of non-monophyletic species with at least one outlier sequence and percentage of non-monophyletic species with at least one intruder and one outlier sequence.
Region | Sequences | Species | Average SPS | % Monophyletic | % Non-monophyletic sharing haplotype | % Non-monophyletic with intruder | % Non-monophyletic with outlier | % Non-monophyletic with intruder and outlier |
---|---|---|---|---|---|---|---|---|
Berry | 5,752 | 2,061 | 2.8 | 70.7 | 66.6 | 39.0 | 44.8 | 16.2 |
Ac16S | 3,064 | 1,521 | 2.0 | 83.0 | 48.3 | 48.9 | 42.0 | 9.1 |
Full16S | 2,561 | 1,069 | 2.4 | 90.7 | 24.2 | 54.2 | 41.0 | 4.8 |
Teleo | 2,647 | 1,423 | 1.9 | 77.1 | 70.3 | 35.1 | 56.8 | 8.1 |
MiFish-U | 4,976 | 2,117 | 2.4 | 75.2 | 67.7 | 45.9 | 45.7 | 8.4 |
Full12S | 2,623 | 1,104 | 2.4 | 86.9 | 33.8 | 57.9 | 35.5 | 6.6 |
Leray-Lobo | 26,275 | 3,314 | 7.9 | 63.5 | 49.6 | 38.6 | 37.9 | 23.5 |
Folmer | 34,261 | 3,310 | 10.4 | 64.8 | 28.2 | 38.3 | 38.7 | 23.0 |
FullCOI | 3,886 | 1,064 | 3.7 | 89.5 | 16.1 | 51.0 | 45.1 | 3.9 |
Overall congeneric distance metrics for the datasets comprising independent sequences can be observed in Table
Maximum average congeneric divergence (%), mean of the average congeneric divergence (%), standard error (SE), number of pairwise comparisons, number of species, number of genera and number of sequences for all genomic regions analysed.
Region | % Max. | % Mean | SE | Pairwise comparisons | Species | Genera | Sequences |
---|---|---|---|---|---|---|---|
Berry | 31.1 | 8.8 | 0.4 | 59,727 | 1,632 | 359 | 3,459 |
Ac16S | 32.1 | 6.8 | 0.3 | 10,486 | 1,109 | 287 | 1,667 |
Full16S | 24.5 | 6.5 | 0.4 | 7,382 | 725 | 202 | 1,332 |
Teleo | 36.9 | 11.8 | 0.6 | 5,797 | 1,019 | 252 | 1,289 |
MiFish-U | 25.1 | 6.8 | 0.3 | 22,704 | 1,702 | 366 | 2,827 |
Full12S | 19.4 | 5.3 | 0.3 | 7,145 | 759 | 199 | 1,359 |
Leray-Lobo | 24.4 | 11.2 | 0.3 | 1,072,424 | 2,858 | 495 | 17,658 |
Folmer | 22.9 | 10.3 | 0.3 | 2,472,090 | 2,857 | 497 | 25,217 |
FullCOI | 23.0 | 9.5 | 0.4 | 206,573 | 764 | 200 | 2,479 |
As long as substantial efforts are made towards the harmonisation of protocols and the completeness and curation of reference libraries, (e)DNA metabarcoding is likely to become one of the main drivers behind biodiversity assessment (
The COI barcode regions (i.e. Leray-Lobo and Folmer regions) provided the most comprehensive coverage in terms of both the number of species and sequences locally aligned (Table
Since our analysis of species discriminatory power was conducted using mitochondrial genomes, we were able to compare it without the biases introduced by potential differences in reference sequence quality between markers. This is because, for each genomic region, we compared sequences obtained from the same individual organisms. Therefore, even if occasional misidentifications might be present, they did not disrupt the discriminatory power comparison between markers and amplicons. At most, misidentifications could potentially diminish the percentage of monophyletic species across all nine genomic regions for a specific group of fishes. In fact, across the nine genomic regions analysed, SS was the group of fishes with the lowest percentage of monophyletic species, while, in contrast, all regions in GK had at least 80% of monophyletic species (Fig.
The two longest fragments, full 16S and full COI, were able to discriminate overall the highest proportion of species, in four out of the five groups of fishes. As fragment length increases, sequences are more likely to include a larger number of informative sites that enable species-level discrimination. Similarly, amongst the 12S and 16S amplicons analysed, the longer 16S amplicons had a higher percentage of monophyletic species than the shorter 12S amplicons in four out of the five groups (GK, PS, PC and SS). While, generally, longer fragments are more capable of discriminating at the species level (
Higher mean and median values of average congeneric divergence were expected in the case of COI (Fig.
The scatter plots in Fig.
It has been shown that synonymous substitutions are much more prevalent than non-synonymous substitutions for COI in fishes (
While there is a recognisable difference in discriminatory power between genetic markers and amplicons, other crucial factors should be contemplated for amplicon selection. For instance, technical aspects, such as annealing temperature during PCR or the number of PCR replicates, can influence species detection from environmental samples (
Overall, Teleo had the lowest percentage of monophyletic species (especially in PS) presumably because it comprises the shortest analysed fragment, with its length averaging below 70 base pairs (Fig.
Considering the merits and demerits of different markers, certain scenarios may arise where the employment of more than one marker can become necessary. The decision on whether to use a multi-marker approach is ultimately dependent on the nature and objectives of a metabarcoding study, as well as the target ecosystem and sampling locations. For instance, if achieving species-level identifications is not critical for the study’s objectives, the usage of just one marker, even if it provides lower discriminatory power, can be effective for ichthyofauna surveys where genus or family-level identifications are sufficient (Thekiso et al. 2023
Conversely, in areas with high ichthyofauna diversity, including numerous congeneric species (
Although utilising several markers is a good starting point when aiming for species-level identifications, there were cases where no marker was able to fully discriminate certain species. For instance, two Clupeiform species, Alosa alosa and Alosa fallax, were not fully discriminated by any of the genomic regions. These two species are known to hybridise extensively (
For the analysis using each genomic region’s independent sequences, Table
It is evident that, as the number of sequences and species increases, there is a reduction in the proportion of monophyletic species, even if the marker itself is usually capable of discriminating at the species-level. Table
The datasets of shorter fragments such as Berry, Teleo and MiFish-U have a higher percentage of non-monophyletic species sharing a haplotype with another species (66.6%, 70.3% and 67.7%, respectively), further corroborating their lower discriminatory power. During taxonomic assignment in (e)DNA metabarcoding, one ambiguous sequence can be enough to obtain a false positive or false negative. Therefore, if a metabarcoding-based ichthyofauna survey targets closely-related congeneric species, the standard approach should be to use several fragments to allow the downstream rectification of potential ambiguities. In addition to the species discriminatory power, it is essential to consider the importance of auditing and curating reference sequence databases to obtain accurate taxonomic assignments. Although the reference libraries for COI have been subject to many auditing and curation efforts (
Our comprehensive analysis of the species discrimination ability of the most common markers employed in marine fish (e)DNA metabarcoding helped benchmark the limitations and differences amongst them. Considering how different amplicons and markers yield different results both in terms of reference database completeness and discriminatory power, one should be careful not to overestimate the current effectiveness of (e)DNA metabarcoding, especially when using only one amplicon or marker under scenarios of high ichthyofauna diversity. While gaps in reference sequence data coverage between different fragments can be mitigated in the future, differences in species discriminatory power have inherent and immutable biological reasons and may lead to systematic loss of species-level data in (e)DNA-based studies. Hence, we warn of the relevance of considering species discrimination ability as one of the main criteria when selecting amplicons for marine fish metabarcoding studies.
Even though COI remains the standard genetic marker for animal species identification, with its regions having the highest discriminatory power, it is important to recognise the advantages and disadvantages of each amplicon since no genomic region is universally optimal. Thus, careful consideration is fundamental, as each region has its own set of limitations and affinities and no single marker is perfectly suited to all studies or ecosystems. The choice of amplicon must be tailored to the specific requirements of the research question and the ecological context.
Therefore, to maximise species identification accuracy and minimise the risk of false positives and negatives, we advocate the use of multiple target regions in (e)DNA metabarcoding studies aiming for species-level identifications of closely-related marine fish taxa, while carefully integrating and curating the information obtained. A multi-marker approach will help resolve the ambiguities that may arise from certain genomic regions not having enough discriminatory power, while compensating for the technical limitations of certain amplicons. Especially for the potential use in fishery stock assessment, where precise discrimination amongst closely-related species is of extreme importance, we recommend the use of multiple amplicons as the standard approach to harness the full potential of (e)DNA metabarcoding.
The authors have declared that no competing interests exist.
No ethical statement was reported.
This work was funded by the project “A-FISH-DNA-SCAN: Cutting-edge DNA-based approaches for improved monitoring and management of fisheries resources along Magellan-Elcano’s Atlantic route” (CIRCNA/BRB/0156/2019; http://doi.org/10.54499/CIRCNA/BRB/0156/2019) funded by the Portuguese Foundation of Science and Technology (FCT, I.P.), by the CBMA “Contrato-Programa” (UIDB/04050/2020; https://doi.org/10.54499/UIDB/04050/2020) and by the ARNET “Contrato-Programa”(LA/P/0069/2020; https://doi.org/10.54499/LA/P/0069/2020), funded by national funds through FCT I.P. J.T.F. (UI/BD/150910/2021; https://doi.org/10.54499/UI/BD/150910/2021) is supported by the Collaboration Protocol for Financing the Multiannual Research Grants Plan for Doctoral Students with financial support from FCT I.P. and the European Social Fund under the Northern Regional Operational Program – Norte2020. K.K. is supported by JSPS KAKENHI, Grant Number JP20K06767.
Conceptualisation, J.T.F., P.S. and F.O.C.; methodology, J.T.F, K.K., P.S. and F.O.C; software, J.T.F., K.K. and R.P., validation, J.T.F, K.K. and R.P., formal analysis, J.T.F, writing – original draft, J.T.F, P.S. and F.O.C, writing – review and editing, J.T.F, K.K., R.P., P.S. and F.O.C.
João T. Fontes https://orcid.org/0000-0002-8766-4779
Kazutaka Katoh https://orcid.org/0000-0003-4133-8393
Pedro Soares https://orcid.org/0000-0002-2807-690X
Filipe O. Costa https://orcid.org/0000-0001-5398-3942
All of the data that support the findings of this study are available in the main text or Supplementary Information.
Strings of search motives to query the GenBank accessions annotated descriptions and titles
Data type: pdf
Additional methodology details for conducting species discrimination power analysis using sequences from NCBI GenBank
Data type: pdf
GenBank accession numbers for records utilized in the study
Data type: csv
Number of sequences and species locally aligned for each genomic region
Data type: xlsx
Average congeneric divergence values using mitochondrial genomes
Data type: xlsx
Average congeneric divergence values using independent sequences
Data type: xlsx