Emerging Technique
Corresponding author: Kristen M. Westfall (westfall.kristenm@gmail.com). Academic editor: Tiina Laamanen
© 2024 Kristen M. Westfall, Gregory A. C. Singer, Muneesh Kaushal, Scott R. Gilmore, Nicole Fahner, Mehrdad Hajibabaei, Cathryn L. Abbott.
This is an open access article distributed under the terms of the CC0 Public Domain Dedication.
Citation:
Westfall KM, Singer GAC, Kaushal M, Gilmore SR, Fahner N, Hajibabaei M, Abbott CL (2024) NAMERS: a purpose-built reference DNA sequence database to support applied eDNA metabarcoding. Metabarcoding and Metagenomics 8: e125095. https://doi.org/10.3897/mbmg.8.125095
Applied eDNA metabarcoding is increasingly being considered as a tool to inform management decisions, regulations, or policy development. Because these downstream considerations are coming to the forefront of eDNA applications, optimizing workflow elements is essential to increasing standardization, efficiency, and competency of metabarcoding results. Reference DNA sequences are critical workflow elements that currently lack consistent approaches to generating, curating, or publishing. We present a complete mitochondrial genome and nuclear ribosomal DNA cistron reference DNA sequence library for 92% of the freshwater fish species of British Columbia, Canada. This resource is published as the Novel Applied eDNA Metabarcoding Reference Sequences (NAMERS) repository (https://namers.ca), a user-friendly and interactive website for specialists and non-specialists alike to explore and generate custom reference libraries for taxa and genes of interest. We demonstrate the power of NAMERS for optimization of applied eDNA metabarcoding study design by analyzing the number of primer mismatches and species resolution power of existing metabarcoding markers. NAMERS demonstrates that high quality curated genomic information is within a reasonable reach to meet the increasing demand for actionable eDNA metabarcoding applications. The framework used here incorporating the pillars of accuracy, completeness and accessibility can be applied for new iterations of other reference sequence databases to bring DNA-based monitoring into a new era.
Biodiversity, environmental DNA, fish, freshwater, species at risk, standardization
The need has never been greater for information-rich biomonitoring data to assess environmental impacts, monitor rare species, chart ecosystem trajectories, and evaluate remediation and conservation efforts, all to ultimately maximize desired outcomes. Considerable attention is currently focused on translating the science of eDNA metabarcoding into practices that benefit humankind (
Environmental DNA metabarcoding results that directly inform a management or regulatory decision require a higher level of confidence than results used in research and development to advance the science. Confidence in eDNA results is elevated by implementing strict quality criteria and standards throughout the workflow. Defining these criteria and standards becomes critical as eDNA gains popularity among government organizations that aim to use this technology to inform policy, regulation, or management actions. Here we adopt the term applied eDNA metabarcoding to distinguish between end uses of eDNA results: applied means that results may be used to inform management decisions, regulation, policy, etc. Just as
DNA barcoding emerged as a powerful tool for genetic species identification before the naissance of eDNA metabarcoding almost a decade later (
There are three foundational pillars of reference DNA sequence databases that underpin their use in applied eDNA metabarcoding and increase end user confidence in results: quality, completeness, and accessibility. The unique needs related to quality and completeness are false positives and false negatives (
Here we present the development of a mitogenome (and ribosomal DNA cistron) reference sequence database and web portal that combines these three pillars into a single framework, representing a level of functionality that is not achieved by any single existing repository. NAMERS (Novel Applied eDNA Metabarcoding Reference Sequences) is a database of whole mitogenomes and nuclear ribosomal (nr) DNA cistrons for freshwater fish in British Columbia (BC), Canada. The majority of sequences in NAMERS come from vouchered specimens with permanent records in public institutions. These records carry a minimum set of standard biodiversity metadata terms, including geo-referenced collection sites, dates, and taxonomic identification. NAMERS contains mitogenomes and nrDNA for 92% of all freshwater fish species in BC, thus offering high genetic and taxonomic completeness. Lastly, the NAMERS database is available through a user-friendly web portal that combines functionality with ease of use and requires no bioinformatics experience. Users can view multiple sequence alignments for taxa and genes of their choice and download customized reference DNA sequence data with a few clicks. This level of accessibility for specialist and non-specialist end users is not readily achieved by other repositories.
The NAMERS framework is based on the following as foundational premises of applied eDNA metabarcoding: (i) primer design is a key factor determining success (
The framework presented here fills an innovation gap between the existing state of most reference DNA sequence databases and what is needed for managers and other end users when considering the downstream applications of eDNA metabarcoding results. This framework is currently presented as a proof-of-concept at the regional scale. Although there are immediate benefits for the management of freshwater fish in BC, the value of this framework goes well beyond this scale and we promote its use at the national and international levels. A framework like this has not thus far been implemented at larger scales due to the lack of organization and long-term funding. Environmental DNA is quickly gaining momentum with releases of the US’s National Aquatic Environmental DNA Strategy (
To maintain consistently high quality of information in the NAMERS database, we aimed to satisfy three criteria for all species: (1) available museum-catalogued voucher specimen; (2) minimum voucher specimen metadata consisting of collection site name, geographic location, sampling date, and the name and affiliation of who did the morphological identification; and (3) genetic species identity verification using COI barcode sequences.
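The three criteria can be read as a simple per-specimen checklist. The following is a minimal sketch of such a check, not the procedure used to build NAMERS; the `VoucherRecord` class, its field names, and the example values are hypothetical and do not reflect the actual NAMERS schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VoucherRecord:
    """Hypothetical specimen record; field names are illustrative only."""
    species: str
    catalogue_number: Optional[str]        # criterion 1: museum-catalogued voucher
    site_name: Optional[str]               # criterion 2: minimum voucher metadata
    latitude: Optional[float]
    longitude: Optional[float]
    sampling_date: Optional[str]
    identified_by: Optional[str]
    identifier_affiliation: Optional[str]
    coi_verified: bool                     # criterion 3: COI barcode verification


def unmet_criteria(rec: VoucherRecord) -> list[int]:
    """Return the numbers (1 to 3) of the criteria that the record fails."""
    failed = []
    if not rec.catalogue_number:
        failed.append(1)
    metadata = [rec.site_name, rec.latitude, rec.longitude,
                rec.sampling_date, rec.identified_by, rec.identifier_affiliation]
    if any(value is None or value == "" for value in metadata):
        failed.append(2)
    if not rec.coi_verified:
        failed.append(3)
    return failed


# Placeholder values: a specimen with full metadata but no catalogue number.
example = VoucherRecord("Salvelinus confluentus", None, "Example Creek",
                        50.0, -120.0, "2020-07-01", "A. Person",
                        "Example Museum", True)
print(unmet_criteria(example))
```

Returning the list of failed criteria, rather than a single pass or fail flag, makes it easy to report exceptions per species.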
British Columbia (BC) has approximately 92 (75 native and 17 invasive) fish taxa that use freshwater for all or part of their life cycle, including significant geographical variants or subspecies (
In most cases we obtained DNA extractions from vouchered specimens but also obtained frozen/ethanol preserved tissue from several museums. For 18 taxa with Institution = PBS in Suppl. material
Total genomic DNA was extracted from fin or muscle tissue using the Qiagen DNeasy Blood and Tissue kit and quantified using Quant-iT™ PicoGreen Assay (ThermoFisher). Input amounts normalized to 10 ng were used to build Illumina DNA libraries, which were sequenced on a NovaSeq SP flow cell (2 × 250 bp) at a target sequencing depth of 5 million reads per sample. For the subset of samples with mitochondrial genome and nuclear rDNA cistron coverage below 20-fold after the first run, the same library was sequenced using another Illumina NovaSeq SP flow cell (2 × 250 bp kit) to increase read depth by 1 to 14M reads per sample. For the subset of samples for which mitochondrial genome assembly was not possible after the first run, the original genomic DNA was used in a secondary independent Illumina DNA library preparation with minor modifications for low DNA input. This was then sequenced using the Illumina NovaSeq SP flow cell (2 × 250 bp kit) at a target sequencing depth of 10 million reads per sample.
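The re-sequencing logic described above can be summarized as a small decision rule. This is a sketch under stated assumptions: the function and enum names are hypothetical, and the rule treats coverage below 20-fold for either target (mitogenome or nrDNA cistron) as the trigger for a second run, which is one possible reading of the text.

```python
from enum import Enum

MIN_FOLD_COVERAGE = 20.0  # coverage threshold stated in the text


class NextStep(Enum):
    ACCEPT = "accept the assembly"
    RESEQUENCE_LIBRARY = "sequence the same library again"
    REBUILD_LIBRARY = "prepare a new low-input library from the original DNA"


def triage(mitogenome_assembled: bool, mito_cov: float, nrdna_cov: float) -> NextStep:
    """Suggest a follow-up action for one sample after the first sequencing run."""
    if not mitogenome_assembled:
        # No mitogenome assembly was possible: build an independent library.
        return NextStep.REBUILD_LIBRARY
    if min(mito_cov, nrdna_cov) < MIN_FOLD_COVERAGE:
        # Assumption of this sketch: either target below threshold triggers a rerun.
        return NextStep.RESEQUENCE_LIBRARY
    return NextStep.ACCEPT


print(triage(True, 12.0, 300.0).value)
```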
Raw sequencing data were demultiplexed and trimmed of indices using Illumina’s bcl2fastq (version 2.20.0.422) software. For each sample, trimmomatic (version 0.39) (
To ensure traceability of sequence data to physical voucher specimens and minimize the likelihood of sequencing misidentified material, tissues to be sequenced were predominantly sourced from museum collections, as follows: the Royal Ontario Museum (n = 45); the University of British Columbia’s Beaty Biodiversity Museum (n = 13); and the University of Washington Burke Museum Ichthyology Collection (n = 5). Exceptions to this included 9 tissue samples from the Beaty Biodiversity Museum, collected and identified by fish collection director Dr. Eric B. Taylor (E. Taylor, pers. comm.); six of which have voucher specimens that are not catalogued and four of which had no voucher specimen (both bull trout lineages, Salvelinus confluentus; lake trout, Salvelinus namaycush; and longnose sucker, Catostomus catostomus). Tissue from the inconnu (Stenodus leucichthys), collected by the Teslin Tlingit Council, also does not have a whole voucher specimen but has a tissue voucher housed at the Pacific Biological Station (Nanaimo, BC). The remainder of voucher specimens were collected by Fisheries and Oceans Canada (n = 19) and are currently being catalogued at the Royal British Columbia Museum. All voucher information is included in Suppl. material
To verify concordance between morphological taxonomy and molecular taxonomy, whole COI sequences from each mitogenome were manually aligned and inspected with DNA barcodes produced by
The exception was lampreys (Petromyzontidae), which are not in the Canadian Freshwater Fish Barcode Database (
NAMERS sequences and associated metadata were deposited in GenBank (under the BC Freshwater Fish Genome Project) and in a newly developed, purpose-built online mitogenome and nuclear rDNA cistron sequence data portal specifically for applied eDNA metabarcoding (https://namers.ca). Specific functionalities of the portal are summarized in Results.
Assessments of amplification and taxonomic resolution efficiencies of genetic markers are critical for sound applied eDNA metabarcoding study design as both are key determinants of success (
Information on markers assessed for primer mismatches and species level resolution in 82 freshwater fish species. ¹Markers presented at the Family level in Fig.
Gene | Primer Name | Reference | Forward Sequence (5’-3’) | Reverse Sequence (5’-3’) | Amplicon Range (bp) | Maximum mismatches for F/R primers
---|---|---|---|---|---|---
12S | Teleo¹ | | ACACCGCCCGTCACTCT | CTTCCGGTACACTTACCATG | 61–64 | 7/1
12S | Teleo2 | | AAACTCGTGCCAGCCACC | GGGTATCTAATCCCAGTTTG | 164–177 | 1/1
12S | MiFishU¹ | | GTCGGTAAAACTCGTGCCAGC | CATAGTGGGGTATCTAATCCCAGTTTG | 168–181 | 3/2
12S | AcMDB07¹ | | GCCTATATACCGCCGTCG | GTACACTTACCATGTTACGACTT | 241–282 | 1/1
12S | Am12S | | AGCCACCGCGGTTATACG | CAAGTCCTTTGGGTTTTAAGC | 237–253 | 1/3
12S | Ac12S | | ACTGGGATTAGATACCCCACTATG | GAGAGTGACGGGCGGTGT | 370–392 | 2/1
12S | 12S_V5 | | ACTGGGATTAGATACCCC | TAGAACAGGCTCCTCTAG | 89–107 | 1/1
16S | Ac16S | | CCTTTTGCATCATGATTTAGC | CAGGTGGCTGCTTTTAGGC | 321–341 | 2/5
16S | Shaw16S | | CGAGAAGACCCTWTGGAGCTTIAG | GGTCGCCCCAACCRAAG | 56–80 | 3/3
16S | Vert 16S¹ | | AGACGAGAAGACCCYTGGAGCTT | GATCCAACATCGAGGTCGTAA | 237–278 | 1/0
16S | L2513/H2714¹ | | GCCTGTTTACCAAAAACATCAC | CTCCATAGGGTCTTCTCGTCTT | 201–205 | 2/1
16S | Fish16SF-16S2R | | GACCCTATGGAGCTTTAGAC | CGCTGTTATCCCTADRGTAACT | 188–216 | 5/1
CO1 | SeaDNA-short¹ | | GGAGGCTTTGGMAAYTGRYT | GGGGGAAGAARYCARAARCT | 55 | 4/4
CO1 | LerayXT¹ | | GGWACWRGWTGRACWITITAYCCYCC* | TAIACYTCIGGRTGICCRAARAAYCA* | 313 | 4/0
CO1 | seaDNA-mid | | GGAGGCTTTGGMAAYTGRYT | TAGAGGRGGGTARACWGTYCA | 130 | 4/5
CO1 | Minibar | | TCCACTAATCACAARGATATTGGTAC | GAAAATCATAATGAAGGCATGAGC | 127 | 5/8
CYTB | Minamoto-fish¹ | | TTCCTAGCCATACAYTAYAC | GGTGGCKCCTCAGAAGGACATTTGKCCYCA | 235 | 4/8
CYTB | FishCBL/FishCBR | | TCCTTTTGAGGCGCTACAGT | GGAATGCGAAGAATCGTGTT | 90 | 9/6
CYTB | Fish2CBL/Fish2bCBR | | ACAACTTCACCCCTGCAAAC | GATGGCGTAGGCAAACAAGA | N/A | 6/6
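Several primers in the table contain degenerate bases (for example W, R, Y, M, K, D) and inosine (I), so counting mismatches against a reference sequence needs to honour IUPAC ambiguity codes. The following is a minimal sketch of one way to do this, not the method used for the analysis reported here. The function names are hypothetical, inosine is assumed to pair with any base, and an ambiguous base in the template is counted as a mismatch.

```python
# IUPAC nucleotide codes mapped to the bases they can represent.
# "I" (inosine) is treated as matching any base: an assumption of this sketch.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT", "I": "ACGT",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVNI", "TGCAYRSWMKVHDBNI")


def reverse_complement(seq: str) -> str:
    """Reverse-complement a sequence, including IUPAC ambiguity codes."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def count_mismatches(primer: str, site: str) -> int:
    """Count positions where the template base is not allowed by the primer code."""
    if len(primer) != len(site):
        raise ValueError("primer and binding site must have equal length")
    return sum(1 for p, s in zip(primer.upper(), site.upper()) if s not in IUPAC[p])


def best_site(primer: str, template: str) -> tuple[int, int]:
    """Return (position, mismatches) of the best ungapped primer placement."""
    n = len(primer)
    if len(template) < n:
        raise ValueError("template is shorter than the primer")
    scored = ((count_mismatches(primer, template[i:i + n]), i)
              for i in range(len(template) - n + 1))
    mismatches, position = min(scored)  # ties resolve to the leftmost position
    return position, mismatches


# Toy template containing the Teleo forward primer sequence from the table.
print(best_site("ACACCGCCCGTCACTCT", "TTTACACCGCCCGTCACTCTGGG"))
```

To score a reverse primer on the same strand, search for `reverse_complement(reverse_primer)` in the template. The sketch weights all positions equally; an analysis concerned with amplification efficiency might instead weight mismatches near the 3’ end more heavily.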
Mitogenomic data were generated here for an estimated 92.3% (85/92) of all freshwater fish taxa present in our target geographic area of BC, Canada, representing 49 genera and 19 families. Sequencing success rates were high: 82 of 85 taxa sequenced returned complete or near complete (missing one gene or a few partial genes) mitogenomes and a further three returned partial mitogenomes. Thus the final data set comprises complete or near complete mitogenome sequences for ~89% of all freshwater fish taxa in BC (82/92) and partial mitogenomes for an additional two species and one lineage. Mitogenome sequencing depth ranged from 1.4 to 2249.9 (median 101.2) and mitogenome length ranged from 14.198 to 18.141 kbp (median 16.634 kbp). All species for which the full mitogenome was constructed contained 13 protein-coding genes (COX1 – COX3, CYTB, ND1 – ND6, ND4L, ATP6, and ATP8), 22 tRNA genes, and two rRNA genes (small and large rRNA subunits). Full nrDNA cistrons containing 5.8S, 18S, and 28S regions were sequenced for 70 species, with sequencing depth ranging from 16.5 to 1340.8 (median 413.4). Full details on mitogenome and nrDNA data are in Suppl. materials
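The expected gene content listed above lends itself to an automated completeness check on annotated assemblies. The sketch below is illustrative only. The labels `12S` and `16S` for the small and large rRNA subunits, the function name, and the simplified rule that "near complete" means exactly one missing gene are assumptions of this example and not the classification procedure used for NAMERS.

```python
# Expected vertebrate mitogenome content as described in the text.
PROTEIN_CODING = {
    "COX1", "COX2", "COX3", "CYTB",
    "ND1", "ND2", "ND3", "ND4", "ND4L", "ND5", "ND6",
    "ATP6", "ATP8",
}
RRNA = {"12S", "16S"}   # small and large rRNA subunits (labels assumed)
EXPECTED_TRNA = 22


def classify_mitogenome(annotated_genes: set[str], n_trna: int) -> str:
    """Classify an annotation as complete, near complete, or partial."""
    missing_named = (PROTEIN_CODING | RRNA) - annotated_genes
    missing = len(missing_named) + max(0, EXPECTED_TRNA - n_trna)
    if missing == 0:
        return "complete"
    if missing == 1:
        # Simplification: the text also allows a few partial genes here.
        return "near complete"
    return "partial"


full = PROTEIN_CODING | RRNA
print(classify_mitogenome(full, 22))
print(classify_mitogenome(full - {"ND6"}, 22))
```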
The morphological taxonomy of each specimen in NAMERS was verified by genetic identification using the COI barcode region in almost all instances, with a few exceptions as follows. The candidate Umatilla dace (Cyprinidae; Rhinichthys umatilla) specimen was a misidentified sucker (Family Catostomidae), which is highly plausible given the difficulty of identifying juvenile fish, and was hence excluded. For lampreys, the COI and 12S genes were invariable within genera; however, the 16S gene differentiated Entosphenus species and the cytochrome b (CYTB) gene differentiated Lampetra species (excluding the Morrison Creek variant), by a single base in all cases. The ND4 gene had the highest genetic variation among lamprey species, with three base changes between the two Entosphenus species and two changes between the two Lampetra species (again excluding the Morrison Creek variant), suggesting that ND4 may be a candidate gene for species-specific markers in this family.
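The gene-by-gene comparison described for lampreys amounts to counting base differences between aligned sequences of two species and ranking genes by that count. The following is a minimal sketch of the idea. The function names are hypothetical and the short sequences are toy strings chosen only to mimic the reported pattern (an invariable gene, a single-base difference, and three differences); they are not lamprey data.

```python
def pairwise_differences(a: str, b: str) -> int:
    """Count differing columns between two aligned sequences (gaps count)."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    return sum(1 for x, y in zip(a.upper(), b.upper()) if x != y)


def rank_genes(alignments: dict[str, tuple[str, str]]) -> list[tuple[str, int]]:
    """Rank genes from most to fewest differences between two species."""
    scored = [(gene, pairwise_differences(a, b))
              for gene, (a, b) in alignments.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy aligned fragments for two hypothetical congeners.
toy = {
    "COI": ("ACGTACGT", "ACGTACGT"),
    "16S": ("ACGTACGT", "ACGTACGA"),
    "ND4": ("ACGTACGT", "TCGAACGA"),
}
print(rank_genes(toy))
```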
Whole mitogenomes, annotated mitochondrial genes, and annotated nuclear ribosomal genes are available to view and download in FASTA format on the new NAMERS portal. The main database page offers a table of 86 species grouped by increasing taxonomic levels. Users can highlight any taxonomic level to view available sequence data for all included taxa, from individual species to family, and can easily customize batches of particular genes or taxa for downloading in FASTA format. They can also highlight particular genes of interest or the complete mitogenome for automatic alignments (using MUSCLE,
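For readers who prefer to script the same kind of customization offline, a downloaded FASTA file can be subset by taxon and gene with a few lines of code. This is a sketch only and does not use any NAMERS interface. The header convention `>Genus_species|GENE` is an assumption made for illustration, and headers in files downloaded from the portal may be formatted differently.

```python
from typing import Iterable, Iterator


def parse_fasta(lines: Iterable[str]) -> Iterator[tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def select(records: Iterable[tuple[str, str]], taxa: set[str],
           genes: set[str]) -> Iterator[tuple[str, str]]:
    """Keep records whose assumed 'taxon|gene' header matches both filters."""
    for header, seq in records:
        taxon, _, gene = header.partition("|")
        if taxon in taxa and gene in genes:
            yield header, seq


def to_fasta(records: Iterable[tuple[str, str]], width: int = 60) -> str:
    """Format records as FASTA text with wrapped sequence lines."""
    out = []
    for header, seq in records:
        out.append(f">{header}")
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "".join(line + "\n" for line in out)


# Toy input with placeholder sequences.
text = ">Salvelinus_confluentus|COX1\nACGT\nAC\n>Salvelinus_namaycush|CYTB\nGGTT\n"
subset = select(parse_fasta(text.splitlines()), {"Salvelinus_confluentus"}, {"COX1"})
print(to_fasta(subset), end="")
```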
The number of primer mismatches and the proportion of species resolved for 19 published fish and vertebrate metabarcoding markers was assessed using the NAMERS database (Table
Number of primer mismatches (right panel) and proportion of species resolved (left panel) for 19 metabarcoding markers from four gene regions, generated using all species in NAMERS (n = 86). Forward primer mismatches are depicted by dark circles and reverse primer mismatches by light circles. Superscripts indicate the number of ambiguous bases in the forward and reverse primers, respectively. Markers with no superscripts have no ambiguous bases in either primer. The darker bar area shows species resolution when lampreys are included, and the entire bar shows species resolution when they are excluded.
Achieving high confidence species level resolution within a family will at times be more important than surveying across all families, as some applied eDNA metabarcoding surveys will focus on lower taxonomic groups only, depending on the specific survey aim. The number of primer mismatches and proportion of species resolved for families (n = 6) with more than five species (up to 63 species) are shown in Fig.
Family level plots of the number of primer mismatches, depicted by solid black symbols and the right-hand y-axis (forward primer = black triangle, reverse primer = black circle), and the proportion of species resolved by unique amplicons defined as a minimum of 1 bp difference including indels, depicted by coloured bars and the left hand y-axis. Plots are for all families in the NAMERS database with a minimum of five species and a subset of eight of the primers tested in Fig.
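The resolution measure used in these plots (a species is resolved when its amplicon differs from every other species' amplicon by at least 1 bp, indels included) can be computed by testing amplicons for exact uniqueness. The following is a minimal sketch assuming one amplicon per species with primer sequences already removed. The function name and the toy sequences are hypothetical.

```python
from collections import Counter


def proportion_resolved(amplicons: dict[str, str]) -> float:
    """Fraction of species whose amplicon is shared with no other species.

    Comparing unaligned sequences for exact equality captures both
    substitutions and indels, since any 1 bp change makes the strings differ.
    """
    if not amplicons:
        raise ValueError("at least one amplicon is required")
    normalized = {species: seq.upper().replace("-", "")
                  for species, seq in amplicons.items()}
    counts = Counter(normalized.values())
    resolved = [sp for sp, seq in normalized.items() if counts[seq] == 1]
    return len(resolved) / len(normalized)


# Toy example: species A and B share an amplicon, so only C and D are resolved.
toy = {"A": "ACGT", "B": "ACGT", "C": "ACGA", "D": "ACG"}
print(proportion_resolved(toy))
```

Restricting the input dictionary to the species of one family gives the family-level proportion for a marker.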
Current unprecedented rates of global change and biodiversity loss demand innovative and efficient tools for monitoring and managing ecosystems (
As defined earlier, the term applied eDNA metabarcoding encompasses unique quality- and confidence-related needs that come with translating this eDNA method into practical application. Here we introduced a framework combining three foundational concepts of quality, completeness, and accessibility, to improve reference sequence repositories to meet the unique needs of applied eDNA metabarcoding. Although NAMERS is a region-specific database exemplifying this framework, these concepts can be applied to new iterations of reference sequence databases around the world as this technology is increasingly integrated into management models. We advocate for establishing large-scale databases with long-term funding models that incorporate the foundational concepts described here. The regionally-focused NAMERS database may not include the taxonomic breadth for studies of anthropogenic-mediated introductions or climate-mediated shifts in species distributions. However, there are clear advantages of the three foundational pillars for reference DNA data demonstrated in NAMERS that are not present in any other single existing repository.
Species richness can be underestimated by indiscriminate application of metabarcoding markers without a full understanding of their specificity for target groups or level of species coverage within the reference library used for taxonomy assignments (
Taxonomically complete reference libraries like NAMERS also allow species resolution to be assessed as part of the survey design phase, which may be especially important when specific taxonomic groups are targeted. In our analyses, even though average species resolution was greater for the protein coding COI and CYTB genes as expected, the 12S and 16S genes had lower average rates of primer mismatch and would therefore likely recover more freshwater fish species when conducting taxonomically broad surveys. Family- or genus-level species resolution is likely to be a more common priority at multiple levels of government, even if these types of studies are perhaps less represented in the literature. These specific eDNA metabarcoding applications are often in the conservation and invasive species areas (
Reference DNA sequences are a challenging element of the eDNA metabarcoding workflow for which to satisfy quality control and assurance criteria for sufficient confidence in results (
Further, because eDNA metabarcoding tools will often be multi-marker and implemented at local and regional scales, rarely global ones, relying on single-gene repositories or massive sequence databases is impractical: generating custom libraries from these is laborious and requires specialized expertise. Traceability is a further limitation. For example, GenBank has only patchy availability of geo-referenced vouchers because these are not a requirement for submission, so the traceability of specimen identity established in NAMERS is not easily achievable in GenBank.
User friendly platforms and bioinformatic pipelines have been generally missing from eDNA metabarcoding research and development, yet with the potential global reach for this technology, these elements are going to become increasingly valuable for specialists and non-specialists alike. The web platform developed for NAMERS showcases several functionalities and accessibility features that set it above other leading reference DNA databases. We acknowledge some elements of the existing layout may not be scalable for larger databases but suit the regional focus well. The data table provides an overview of the species in the database and the availability of genetic data. The alignment viewer is the most advanced part of the platform, where users can choose multiple custom taxa and genes (one gene at a time), view the multiple sequence alignments, and download their custom reference DNA data sets. Other leading databases, such as MIDORI2 (
It is no longer far-fetched to make whole mitogenomes the new standard for reference DNA sequences given genome skimming capabilities (
We thank the following people for contributions related to specimen collection and curation: Liane Stenhouse (DFO), Paul Grant (DFO), Nellie Gagné (DFO), Louise-Marie Roux (DFO), Mélanie Roy (DFO), Joy Wade (Fundy Aquaculture Services), Rick Taylor (UBC Beaty Biodiversity Museum), Jordan Rosenfeld (BC Ministry of Environment and Climate Change Strategy), Bob Hanner (University of Guelph), Daniel Heath (University of Windsor), Gavin Hanke (Royal BC Museum), Teslin Tlingit Council, Pascale Savage (Yukon Government), Caren Helbing (University of Victoria), Amelia Louden (Burke Museum), Louis Lopez (University of Victoria), Hoda Rajabi (eDNAtec), Emily Porter (eDNAtec), and Avery McCarthy (eDNAtec). The authors would also like to thank two anonymous reviewers for their valuable input.
The authors have declared that no competing interests exist.
No ethical statement was reported.
This research was funded by Genome BC, Project #SIP26-06.
CLA, KMW, and SRG conceived of the study and obtained project funding. KMW prepared samples for sequencing. MH, NF, and GACS managed sequencing and performed bioinformatics. MK and GACS built the website with input from all authors. KMW and CLA wrote the manuscript with input from all authors.
Kristen M. Westfall https://orcid.org/0000-0001-7524-7145
All processed genetic data are available from https://namers.ca and in GenBank (Accession Numbers in Suppl. materials
Full list of species in NAMERS with the following voucher data: collection site, collection year, institution where voucher is housed, and catalogue number of voucher
Data type: docx
Explanation note: Metadata, site information, catalogue ID.
Genbank Accession Numbers for mitogenomes and nrDNA
Data type: docx
Explanation note: Note that nrDNA grey cells are full genes, white cells with an Accession Number are partial genes, and white empty cells are missing genes. Note that N/A in the complete mitogenome column indicates it is not complete and Genbank Accession numbers for available genes for those species are in Suppl. material
Genbank Accession Numbers for partial mitochondrial genes where the complete mitogenome was not recovered, blanks indicate the gene was not recovered for that species
Data type: docx
Explanation note: Nuclear ribosomal gene Accession Numbers for these species are listed in Suppl. material