Software Description |
Corresponding author: Haris Zafeiropoulos ( haris-zaf@hcmr.gr ) Academic editor: Alexander Probst
© 2021 Haris Zafeiropoulos, Laura Gargan, Sanni Hintikka, Christina Pavloudi, Jens Carlsson.
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Zafeiropoulos H, Gargan L, Hintikka S, Pavloudi C, Carlsson J (2021) The Dark mAtteR iNvestigator (DARN) tool: getting to know the known unknowns in COI amplicon data. Metabarcoding and Metagenomics 5: e69657. https://doi.org/10.3897/mbmg.5.69657
|
The mitochondrial cytochrome C oxidase subunit I gene (COI) is commonly used in environmental DNA (eDNA) metabarcoding studies, especially for assessing metazoan diversity. Yet, a great number of COI operational taxonomic units (OTUs) or/and amplicon sequence variants (ASVs) retrieved from such studies do not get a taxonomic assignment with a reference sequence. To assess and investigate such sequences, we have developed the Dark mAtteR iNvestigator (DARN) software tool. For this purpose, a reference COI-oriented phylogenetic tree was built from 1,593 consensus sequences covering all the three domains of life. With respect to eukaryotes, consensus sequences at the family level were constructed from 183,330 sequences retrieved from the Midori reference 2 database, which represented 70% of the initial number of reference sequences. Similarly, sequences from 431 bacterial and 15 archaeal taxa at the family level (29% and 1% of the initial number of reference sequences respectively) were retrieved from the BOLD and the PFam databases. DARN makes use of this phylogenetic tree to investigate COI pre-processed sequences of amplicon samples to provide both a tabular and a graphical overview of their phylogenetic assignments. To evaluate DARN, both environmental and bulk metabarcoding samples from different aquatic environments using various primer sets were analysed. We demonstrate that a large proportion of non-target prokaryotic organisms, such as bacteria and archaea, are also amplified in eDNA samples and we suggest prokaryotic COI sequences to be included in the reference databases used for the taxonomy assignment to allow for further analyses of dark matter. DARN source code is available on GitHub at https://github.com/hariszaf/darn and as a Docker image at https://hub.docker.com/r/hariszaf/darn.
Docker, environmental DNA (eDNA), metabarcoding, mitochondrial cytochrome C oxidase subunit I, software tool, tree of life (tol), unassigned sequences
DARN is a software approach aiming to provide further insight into COI amplicon data of environmental samples. Building a COI-oriented reference phylogenetic tree is a challenging task especially considering the small number of microbial curated COI sequences deposited in reference databases; e.g. ~4,000 bacterial and ~150 archaeal sequences in BOLD. Inevitably, as more and more such sequences are collated, the DARN approach improves. To provide a more interactive way of communicating both our approach and our results, we strongly suggest the reader to visit this Google Collab notebook where the building of the reference COI phylogenetic tree is described step-by-step and also this GitHub pages site where our results are demonstrated. Our approach corroborates the known presence of microbial sequences in COI environmental sequencing samples and highlights the need for curated bacterial and archaeal COI sequences and their integration into reference databases (i.e., Midori, BOLD etc). We argue that DARN will benefit researchers as a quality control tool for their sequenced samples in terms of distinguishing eukaryotic from non-eukaryotic OTUs/ASVs, but also in terms of understanding the known unknowns. As the cover ratio of COI sequences of the known taxa increases, approaches such as the one used in this study, will also enable the identification/prediction of unknown unknowns.
DNA metabarcoding is a rapidly evolving method that is being more frequently employed in a range of fields, such as biodiversity, biomonitoring, molecular ecology and others (
The underlying idea of the method is to take advantage of genetic markers, i.e. marker loci, using primers anchored in conserved regions. These universal markers should have enough sequence variability to allow distinction among related taxa and be flanked by conserved regions allowing for universal or semi-universal primer design (
Nevertheless, there is not a truly “universal” genetic marker that is capable of being amplified for all species across different taxa (Kress et al. 2015). Different markers have been used for different taxonomic groups (
The mitochondrial cytochrome c oxidase subunit I (also called cox1 or/and COI) is a gene fragment of ~700 bp, widely used for metazoan diversity assessment. Here we present some of the reasons that microbial eukaryotes and prokaryotes are also amplified in such studies, raising the issue of the known unknown sequences.
COI is a fundamental part of the heme aa3-type mitochondrial cytochrome c oxidase complex: the terminal electron acceptor in the respiratory chain. Even if aa3-type Cox have been found in bacteria, there are also other cytochrome c oxidase (Cox) groups, such as the cbb3-type cytochrome c oxidases (cbb3-Cox) and the cytochrome ba3 (
Furthermore, the presence of highly divergent nuclear mitochondrial pseudogenes (numts) has been a widely known issue on the use of COI in barcoding and metabarcoding studies, leading to overestimates of the number of taxa present in a sample (
Thus, as
Even though there are various known issues (
The co-amplification of prokaryotes explained above, is a major reason for why many Operational Taxonomic Units (OTUs) and/or Amplicon Sequence Variants (ASVs) in eDNA metabarcoding studies cannot get taxonomy assignments when metazoan reference databases are used (c.f.
The aim of this study was to build a framework for extracting such non-target, potentially unassigned (or assigned with low confidence) sequences from COI environmental sequence samples, hereafter referred to as “dark matter” as per
More specifically, based on the previously described methodology by
Sequences for the COI region from all the three domains of life were retrieved from curated databases. Eukaryotic sequences were retrieved from the Midori reference 2 database (version: GB239) (
Number of sequences and taxonomic species per domain of life and resources. The (#) symbols stands for “number”.
Resources | bacteria | archaea | ||
---|---|---|---|---|
# of sequences | # of strains | # of sequences | # of strains | |
BOLD | 3,917 | 2,267 | 117 | 117 |
PFam-oriented | 9,154 | 4,532 | 217 | 115 |
Total unique entries | 11,421 | 6,798 | 334 | 201 |
Overview of the approach followed to build the COI reference tree of life. Sequences were retrieved from Midori 2 (eukaryotes) and BOLD (bacteria and archaea) repositories. Consensus sequences at the family level were built for each domain specific dataset. MAFFT and consensus sequences at the family level were built using the PhAT algorithm. The COI reference tree was finally built using IQ-TREE2. Noun project icons by Arthur Slain and A. Beale.
The large number of obtained sequences effectively prevents a phylogenetic tree construction encompassing their total number in terms of building a single phylogenetic tree covering all of the three domains of life (archaea, bacteria, eukaryota). Therefore, consensus representative sequences from each of the three datasets were constructed using the PhAT algorithm (
In the case of Eukaryotes, the alignment of the corresponding sequences would be impractically long because of their large number (~183K sequences). To address this challenge, a two-step procedure was followed; a sequence subset of 500 sequences (reference set) was selected and aligned and then used as a backbone for the alignment of all the remaining eukaryotic COI sequences. All sequences were considered reliable as they were retrieved from curated databases (Midori2 and BOLD). To build the reference set, a number (n) of the longest sequences from each of the various phyla were chosen, proportionally to the number (m) of sequences of each phylum (see Suppl. material
A total of 1,109 consensus sequences (70% of total consensus sequences) were built covering the eukaryotic taxa, while 463 (29%) bacterial and 21 (1%) archaeal consensus sequences were included. The per-domain, consensus sequences returned can be found under the consensus_seqs directory on the GitHub repository (see _consensus.fasta files). These sequences were then merged as a single dataset and aligned to build a reference MSA; this time MAFFT was set to return using the --globalpair algorithm and the --maxiterate parameter equal to 1,000. The MSA returned was then trimmed with the ClipKIT software package (
The reference tree was then built based on this trimmed MSA using the IQ-TREE2 software (
In the .iqtree file there are the branch support values; SH-aLRT support (%) / ultrafast bootstrap support (%).
A thorough description of all the implementation steps for building the reference tree is presented in this Google Collab Notebook. The computational resources of the IMBBC High Performance Computing system, called Zorba (
The COI reference tree was subsequently used to build and implement the Dark mAtteR iNvestigator (DARN) software tool. DARN uses a .fasta file with DNA sequences as input and returns an overview of sequence assignments per domain (eukaryotes, bacteria, archaea) after placing the query sequences of the sample on the branches of the reference tree. Sequences that are not assigned to a domain are grouped as “distant”. It is necessary for the input sequences to represent the proper strand of the locus, i.e. input reads should have forward orientation. Optionally, DARN invokes the orient module of the vsearch package (
The focal query sequences are aligned with respect to the reference MSA using the PaPaRa 2.0 algorithm (
To visualise the query sequence assignments, a two-step method was developed. First, DARN invokes the gappa examine assign tool which taxonomically assigns placed query sequences by making use of the likelihood weight ratio (LWR) that was assigned to this exact taxonomic path. In the DARN framework, by making use of the --per-query-results and --best-hit flags, the gappa assign software assigns the LWR of each placement of the query sequences to a taxonomic rank that was built based on the taxonomies included in the reference tree. The first flag ensures that the gappa assign tool will return a tabular file containing one assignment profile per input query while the latter will only return the assignment with the highest LWR. DARN automatically parses this output of gappa assign to build two input Krona profile files based on a) the LWR values of each query sequence and b) an adjustive approach where all the best hits get the same value in a binary approach (presence - absence). In the final_outcome directory that DARN creates, two .html files, one for each of the Krona plots; Krona plots are built using the ktImportText command of KronaTools (
DARN also runs the gappa assign tool with the --per-query-results flag only. This way, the user can have a thorough overview of each sample’s sequence assignments, as a sequence may be assigned to more than one branch of the reference tree, sometimes even to different domains. However, in cases with sequences assigned to multiple branches, the likelihood scores are most typically up to 100-fold to 1000-fold different.
DARN source code as well as all data sequences and scripts for building the reference phylogenetic tree are available on GitHub.
The inferred phylogenetic tree is shown in Figure
To examine whether the phylogenetic-based taxonomy assignment addresses a real-world issue, a local blast database was built using the total number of the consensus sequences retrieved. As expected, when the consensus sequences were blasted against this local blastdb, all were matched with their corresponding sequences. However, when a mock dataset was used to evaluate the two approaches (blastdb and the phylogenetic tree) none of the bacterial sequences were captured as bacteria after blastn against the local blastdb (see output file here). All bacterial sequences returned an incorrect eukaryotic assignment. Contrarily, when the phylogenetic tree was used, all the bacterial sequences were captured.
To evaluate DARN on the presence of dark matter we analysed a wide range of cases to show the ability of DARN to detect and estimate dark matter under various conditions. Both eDNA and bulk samples, from marine, lotic and lentic environments, were selected to reflect various combinations of primer and amplicon lengths, PCR protocols and bioinformatics analyses (Table
DARN outcome over the samples or set of samples. Assignment fractions of the sequences per domain per sample in the DARN results over the samples.
Sample(s) accession number | Envo type | Sample type | Primer set | Amplicon length (bp) | Preservation method | Annealing temperature | Bioinfo pipeline(s) | # of ASVs | ~ % of sequence assignments per domain (if PEMA, using d = 10) | |||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Eukaryotes | Bacteria | Archaea | distant | |||||||||
ERS6449795–ERS6449829 | marine | eDNA | multiplex: jgHCO2198 - jgLCO1490 and LoboF1 - LoboR1 | 658 | water stored at 4 °C / filtered within 24 hours | 60 °C × 35 cycles | QIIME2 - Dada2 | 13,376 | 11.0 | 88.0 | 0.02 | 0.003 |
PEMA (d = 10) | 39,454 | 25.0 | 75.0 | 0.1 | 0.4 | |||||||
ERS6463899–ERS6463901 | marine reef | eDNA | mlCOIintF - jgHCO2198 | 313 | filters stored with silica beads | 46 °C × 35 cycles | JAMP | 1,304 | 35.0 | 65.0 | 0.0 | 0.2 |
dada2 | ||||||||||||
PEAR | ||||||||||||
vsearch | ||||||||||||
DnoisE | ||||||||||||
PEMA (d = 10) | 11,545 | 46.0 | 50.0 | 1.0 | 3.0 | |||||||
ERS6463906–ERS6463911 | ||||||||||||
ERS6463913–ERS6463918 | ||||||||||||
ERS6463920–ERS6463922 | ||||||||||||
ERS6463744–ERS6463761 | marine mangrove | JAMP | 663 | 40.0 | 60.0 | - | 0.6 | |||||
dada2 | ||||||||||||
PEAR | ||||||||||||
vsearch | ||||||||||||
DnoisE | ||||||||||||
PEMA (d = 10) | 5,879 | 49.0 | 47.0 | 1.0 | 2.0 | |||||||
ERR3460466 | marine | bulk | mlCOIintF - jgHCO2198 | 313 | DMSO / -20 °C | 62 °C (-1 °C/cycle) × 16 cycles and 46 °C × 24 cycles | PEMA (d = 2) | 193 | 99.0 | 1.0 | - | - |
ERR3460467 | marine | bulk | ETOH / -20 °C | 74 | 97.0 | 0.0 | - | 3.0 | ||||
ERR3460470 | marine | eDNA | -20 °C | 184 | 71.0 | 28.0 | 0.0 | 1.0 | ||||
ERS6488992 | lentic | eDNA | fwhF2 - EPTDr2 | 142 | ATL-buffer | 60 °C × 6 cycles and 48 °C × 26 cycles | PEMA (d = 10) | 416 | 85.0 | 7.0 | 3.0 | 5.0 |
ERS6488993 | lentic | 315 | 99.2 | 0.4 | 0.4 | - | ||||||
ERS6488994 | lentic | 823 | 90.0 | 4.0 | 2.0 | 4.0 | ||||||
ERS6488995 | lotic | eDNA | BF3 - BR2 | 458 | ATL-buffer | 50 °C × 35 cycles | PEMA (d = 10) | 1,940 | 64.0 | 34.0 | 2.0 | 0.3 |
More specifically, 57 marine, surface water, eDNA samples from Ireland were analysed through a. QIIME2 (
The number of sequences returned, using various bioinformatic analyses, ranged from circa 3k to 214k (Table
Significant proportions of non-eukaryote DARN assignments were observed in all marine eDNA samples (Table
By making use of a COI – oriented reference phylogenetic tree built from 1,593 consensus sequences, to phylogenetically place sequences from COI metabarcoding samples onto it, the surmise for including bacteria, algae, fungi etc. (
As publicly available bacterial COI sequences are far too few to represent the bacterial and archaeal diversity, their reliable taxonomic identification is not currently possible. This way, bacterial, i.e. non-target, sequences that were amplified during the library preparation have at least the possibility of a taxonomy assignment. Our implementations using DARN indicate that it is essential both for global reference databases (e.g., BOLD, Midori etc) and custom reference databases which are commonly used, to also include non-eukaryotic sequences.
While our approach specifically addressed the COI gene, DARN can be adapted to analyse any locus fragment. For instance, metabarcoding of environmental samples for the 12S rRNA mitochondrial region is often employed to assess fish biodiversity (
The approaches implemented in DARN can benefit both bulk and eDNA metabarcoding studies, by allowing quality control and further investigation of the unassigned OTUs/ASVs. The approach is also adaptable to other markers than COI. Moreover, the approach presented here allows researchers to better understand the known unknowns and shed light on the dark matter of their metabarcoding sequence data.
License: GNU GPLv3. For third-party components separate licenses apply.
H.Z. → Conceptualization, Methodology, Software, Validation, Investigation, Resources, Writing – Original draft
L.G. → Methodology, Validation, Investigation
S.H. → Validation, Investigation
C.P. → Validation, Investigation, Writing – Original draft
J.C. → Conceptualization, Investigation, Resources, Writing – Original draft
GitHub repo: https://github.com/hariszaf/darn
DockerHub repo: https://hub.docker.com/r/hariszaf/darn
The sequence data that support the findings of this study are available in the European Nucleotide Archive (ENA) with the following study accession numbers:
Marine samples from Ireland: PRJEB45030
Marine samples from Honduras: PRJEB45038
Marine ARMS samples: PRJEB33796
Lake and riverine samples from Norway: PRJEB45246
This research was supported in part through computational resources provided by IMBBC (Institute of Marine Biology, Biotechnology and Aquaculture) of the HCMR (Hellenic Centre for Marine Research). Funding for establishing the IMBBC HPC has been received by the MARBIGEN (EU Regpot) project, LifeWatchGreece RI and the CMBR (Centre for the study and sustainable exploitation of Marine Biological Resources) RI.
In addition, sampling and sequencing of the Marine ARMS samples was funded by the infrastructure programs ASSEMBLE Plus (grant no. 730984) and the European
Marine Biological Resource Centre, EMBRC. Both programs establish and maintain the core ARMS-MBON network (http://arms-mbon.eu/) and provide services and consultation for deployment, sample processing, sequencing, data management, and analysis. The Honduran samples were partly funded by the Irish Research Council (grant no. GOIPG/2018/326), supported by Operation Wallace Ltd, UK. This research was in part funded by the Ecostructure project (part-funded by the European Regional Development Fund through the Ireland Wales Cooperation Programme 2014–2020).
We would also like to thank Dr Evangelos Pafilis (email: pafilis@hcmr.gr) for providing us access to the IMBBC HPC infrastructure, Dr Frode Fossøy (email: frode.fossoy@nina.no) for providing us environmental, lake and riverine sequence samples from Norway and the reviewers who provided thorough and fruitful comments to the manuscript.
Table S1
Data type: excel table
Explanation note: Number of sequences per phylum in the Eukaryotes domain dataset and the corresponding number of the longest sequences used in the 500 sequences subset (reference set) used as a backbone for the complete alignment.
Table S2
Data type: excel table
Explanation note: Number of sequences in each DARN experiment before and after the sequence orientation step.
Figure S1
Data type: png. image
Explanation note: Hylogenetic tree of the consensus sequences retrieved showing the distribution of the eukaryotic phyla.
Figure S2
Data type: png. image
Explanation note: Placements of the consensus sequences used to build the COI reference phylogenetic tree for the DARN tool, onto the phylogenetic tree (stroke width for the branches of the tree is 5). The color coding represents the placements per branch, with a range from zero (blue) to a maximum of 2 (blue). The 1 leaf – 1 placement relationship, as well as the maximum of 2 placements in the color coding bar, indicate the proper placement of each consensus sequence to its corresponding branch.