DNA metabarcoding of Neotropical ichthyoplankton: Enabling high accuracy with lower cost

Knowledge of ichthyoplankton dynamics is extremely important for conservation management as it can provide information about preferential spawning sites, reproductive period, migratory routes and recruitment success, which can be used to guide management and conservation efforts. However, identification of the eggs and larvae of Neotropical freshwater fish is a difficult task. DNA barcodes have emerged as an alternative and highly accurate approach for species identification, but DNA barcoding can be time-consuming and costly. To solve this problem, we aimed to develop a simple protocol based on DNA metabarcoding, to investigate whether it is possible to detect and quantify all species present in a pool of organisms. To do this, 230 larvae were cut in half, one half was sequenced by the Sanger technique and the other half was used to compose six arrays with a pool of larvae that were sequenced using a next-generation technique (NGS). The results of the Sanger sequencing allowed the identification of almost all larvae at species level, and the results from NGS showed high accuracy in species detection, ranging from 83% to 100%, with an average of 95% in all samples. No false positives were detected. The frequency of organisms in the two methods was positively correlated (Pearson), with low variation among species. In conclusion, this protocol represents a considerable advance in ichthyoplankton studies, allowing a rapid, cost-effective, quali-quantitative approach that improves the accuracy of identification.


Introduction
Studies concerning ichthyoplankton can provide novel information about preferential spawning sites, reproductive period, migratory routes, and recruitment success (Baumgartner et al. 2004, Bialetzki et al. 2005, Reynalte-Tataje et al. 2012). These are extremely important, as they provide useful knowledge for conservation management, such as the delimitation of protected areas (de Silva et al. 2015;. However, morphological identification of eggs and larvae have several limitations, because, in the early stages, the morphological diagnos-tic characters are not yet fully formed (Baumgartner et al. 2004, Reynalte-Tataje et al. 2012. Such limitations restrict efforts to gather information about the life history strategies of most species, including endangered species, thus hampering proposals for targeted management actions.
The use of DNA barcodes in ichthyoplankton improved the quality of identifications, both for eggs and larvae, and emerged as an effective alternative for traditional identification methods (Becker et al. 2015, Frantine-Silva et al. 2015. Nevertheless, despite its high accuracy, when applied in extensive studies with thousands of eggs and larvae, DNA barcoding can be time-consum-ing and costly, because each organism has to be individually processed and sequenced (Evans et al. 2016, Elbrecht and Leese 2017, Mariac et al. 2018.
To solve this problem, DNA metabarcoding, based on next generation sequencing (NGS) could offer an alternative approach, allowing a reduction of labor time as well as costs. Although Neotropical ichthyofauna metabarcoding studies are scarce (Maggia et al. 2017, Mariac et al. 2018, Sales et al. 2019, this technique has been utilized in several ecological studies on ichthyoplankton and also to identify gut contents (Cruaud et al. 2017, Guillerault et al. 2017, Maggia et al. 2017, Divoll et al. 2018, Barbato et al. 2019.
Here we describe a novel DNA metabarcoding protocol for ichthyoplankton using NGS technology and test the accuracy of this protocol in identifying all species present in an array. We also test whether the proposed approach is valid for estimating the relative frequency of species present in a sample.

Field sampling
The larvae used in this experiment come from another study using DNA barcoding for species identification. Ichthyoplankton sampling was carried out in the Mogi-Guaçu river, an important tributary of the left bank of the Grande River, Upper Paraná River Basin. Samples were taken from two sites: 1: 21°35.441'S, 47°57.244'W (Dec/16 and Jan/17) and site 2: 21°30.168'S, 48°2.534'W (Dec/15, Nov/16, Dec/16, and Jan/17) (Table 1), located in the northeast region of the state of São Paulo, Brazil. The sites are about 15 km apart from each other in a straight line.
Collecting took place with a conical-cylindrical net with a mesh size of 0.5 mm. Samples were immediately fixed in 96% ethanol. In the laboratory, eggs and larvae were separated from other materials present in the samples (e.g., leaves, sediment, and sticks) under a stereomicroscope and kept in 96% ethanol until the molecular procedure.

Sampling and experiment setup
In this study, we used 230 larvae distributed randomly across six arrays, with the number of organisms per array ranging from 36 to 44. However, each sample contained larvae of only one site and month ( Table 1). Each larva was cut into halves; one half was subject to gDNA extraction and subsequent Sanger sequencing, while the other half was pooled and homogenized with other larvae from the same sample for combined DNA extraction and subsequent metabarcoding using next-generation sequencing (NGS) (Figure 1).

Sanger sequencing
DNA extraction for Sanger sequencing was made following the protocol proposed by Ivanova et al. (2006). Cy-tochrome C Oxidase Subunit I (COI) primers FishF1 and FishR1 (Ward et al. 2005) were selected because these have a good species recovery in the hydrographic basin where the study occurred. PCR conditions were: 1 min at 96 °C followed by 35 cycles of 10 s at 96 °C, 5 s at 50 °C and 4 min at 60 °C. The PCR product was sequenced using a Big Dye 3.1 Terminator kit (Applied Biosystems) in an ABI Prism 3130 (Applied Biosystems), generating two complementary sequences of 654 bp.

DNA metabarcoding
DNeasy Blood and Tissue Kit (Qiagen) was used for DNA extraction from bulk samples following the manufacturer's protocol. Two PCR runs were performed: the first to amplify the COI region, using the same primer pair used for the Sanger sequencing coupled with Illumina MiSeq adapter, and the second to attach a multiplex identifier (MID).

Data analyses
For sequences generated with the Sanger method we used the Geneious Pro v4.8.5 software (Kearse et al. 2012), where forward and reversed files were combined to assemble a consensus sequence. All these consensus sequences were then exported as fasta files to be matched against the same database comprising the Illumina reads. Metabarcoding data were analyzed following a bioinformatic protocol described in the Suppl. material 1. As the overlapping is not possible due to the length of PCR product (654 bp), only the forward file was used (Cruaud et al. 2017). This file was selected by checking the sequencing quality with FASTQC (Andrews 2010) (Suppl. material 2). Next, a quality trimming with Trimmomatic (Bolger et al. 2014) was performed, and we filtered out reads with a minimum quality of Q20 in at least the first 250 nt. We also cut the sequences up to the length of 250 nucleotides since the quality of final region was low (Suppl. material 2). For this trimming we used the Fastx-Trimmer software (http://hannonlab.cshl.edu/fastx_toolkit/).
After this step, we used the OBITools toolkit version 1.01 22 (http://metabarcoding.org/obitools; Boyer et al. 2016) for DNA metabarcoding analysis. We first used the "obiuniq" command to dereplicate the sequences, to reduce the size file and computation time. In this step, all of the identical DNA sequences were grouped into a single sequence, scoring their abundance for comparison with the Sanger data. Subsequently, we used the "obigrep" command to remove sequences with less than 50 reads.
We constructed a custom database by downloading sequences from Project -FUPR Fishes from Upper Parana River, Brazil, present in the BOLD System (Ratnasingham and Hebert 2007), since this project was executed in the same river basin as the current study. Also, we downloaded sequences from RefSeq (https://www.ncbi.nlm. nih.gov/refseq/), using keywords "fish", "fishes", and "COI", and then filtered the results only for "bony fishes". For the first step of identification we used the BLASTN software (Altschul et al. 1990) to indicate which sequences had the greatest similarity to those obtained via Sanger and metabarcoding.
As a final filtering step, all sequences generated in the present study (NGS + Sanger) were aligned with those sequences present in a custom database that had similarity greater than 95% with one of the sequencing methods, using the Muscle (Edgar 2004) aligner implemented in Geneious 4.8.5 (Kearse et al. 2012). Next, we calculated the genetic distances among and within the species, using the Kimura-2-Parameter (K2P) distance model (Kimura 1980). To provide a graphic representation of the patterning of divergence among species, a neighbor-joining (NJ) dendrogram of K2P distances was created using the MEGA v 7.0 software (Kumar et al. 2016) for each sample. Those sequences (NGS and Sanger) that presented divergence greater than 2% in the K2P analysis compared to the sequences downloaded from the BOLD System database were removed from the analysis of abundance (Pereira et al. 2013).
To validate the efficacy of metabarcoding, statistical analyses were conducted to evaluate whether this method has a quali-quantitative resolution. To achieve our qualitative aim, from the genetic distance (K2P) and neighbor-joining (NJ) dendrogram, we estimated richness (S = number of species) per sample in each method. Then, the richness between methods was compared to estimate the rate of species detection by metabarcoding, using the Sanger data as a reference.
To achieve our quantitative analysis, as the exact number of larvae per sample was known, the relative frequency per species (or taxon) was estimated for the Sanger data. This frequency compared with the value of the relative frequency (number of reads assigned to a taxon or species in relation to the total) obtained metabarcoding. This comparison was made per sample individually (e.g., Sanger sample 1 × NGS 1 sample). After this, we applied a Pearson correlation to test if the frequencies of Sanger and NGS showed a significant positive correlation and a permutational analysis of variance (PERMANOVA) to evaluate significant differences between relative abundances of Sanger and NGS data.

DNA sequencing
Of the 230 larvae used in this experiment (Suppl. material deposited in: https://doi.org/10.6084/m9. figshare.6726956) 226 were retained for the analysis with Sanger sequencing. Four larvae were removed, due to low quality in Sanger sequencing and genetic divergence above 2% in the alignment with those downloaded sequences. Results from the Sanger sequencing showed a richness of 29 taxa, belonging to 12 families and three orders, varying from six species in sample S2 to 13 species in sample S1 ( Table 2). Only three taxa were assigned to family or subfamily level.
NGS yielded 2,511,656 reads in the eight samples (Suppl. material deposited in: https://doi.org/10.6084/m9. figshare.6726956) and, after the bioinformatic analysis, 633,224 reads were kept. In total, 28 taxa were identified with metabarcoding NGS out of the 29 taxa identified by the Sanger sequencing. Only Astyanax schubarti was not identified by metabarcoding in any of the samples ( Table  2).

Species detection (qualitative approach)
No false positives were observed in this study (Table 2); all the taxa retrieved by metabarcoding were also found in Sanger sequencing data. However, in four out of the six samples, one taxon was not detected by NGS: S1 and S4 -Astyanax schubarti, S2 -Megaleporinus obtusidens, S5 -Astyanax lacustris. In the other two samples, all taxa detected by Sanger sequencing were also detected by NGS (S3 and S6). The taxa detection rate was higher than 83% in all samples, with 100% of detection in four samples and higher than 90% in three others (mean, 95%).

Relative abundance (quantitative approach)
The correlation between Sanger and NGS methods showed few differences in all samples. We detected a considerable difference between the two methods in only some cases: Rhinelepis aspera, Cheirodontinae sp.1, and Astyanax lacustris, in sample 1 and Prochilodus lineatus and Astyanax lacustris in sample 4. Nevertheless, the samples showed non-significant differences between the relative abundances of the two methods in PERMANO-VA analysis (Pseudo-F = 0.39293; P(perm) = 0.838), denoting that, despite some deviations in the values, the quantification of organisms with the NGS method can be considered reliable here (Table 3; Figure 2). Also, in the same way, all samples showed higher values of Pearson's correlation in relative frequency between Sanger and NGS methods, varying from 0.776 (sample S1) to 0.998 (sample S6) (Table 3; Figure 2).

Species identification
Despite some failures in species detection, we can consider that the use of NGS is functional for COI-based ichthyoplankton species detection, with an average detection rate higher than 95%, which is similar to previous findings that compared Sanger and NGS (Cruaud et al. 2017). Also, when compared to morphology-based methods of egg and larvae identification, detection rates were even more satisfactory, as more than 50% of the larvae caught in natural populations cannot be identified due to the absence of diagnostic characters (newly hatched larvae) or damage due to fixation process (Bialetzki et al. 2005;da Silva et al. 2015;. With our present approach, most individuals were identified to the species level, demonstrating that even us-ing only the forward reads, the information was sufficient for species identification. Only three taxa were assigned to family or subfamily level only, probably because they are cryptic species, not described or with sequences not deposited in any database. This observation reinforces the idea that more sequences of fish species must be deposited in databases, especially in river basins with rich fish fauna as the Upper Paraná River. Another positive aspect is that, compared with DNA barcoding (Arroyave and Stiassny 2014, Becker et al. 2015, Frantine-Silva et al. 2015, our method can trigger significant advances in the quality of ecological studies in fish, as observed in some recent studies with metabarcoding (Guillerault et al. 2017, Maggia et al. 2017, Mariac et al. 2018, Barbato et al. 2019) and, mainly, to ensure significant reduction in labor time and cost per specimen reaching 80% of reduction (Shokralla et al. 2015). Despite some differences in obtaining DNA samples, both environmental DNA (eDNA) and DNA metabarcoding from bulk samples are subject to false negatives and false positives (Ficetola Gentile et al. 2015Divoll et al. 2018). However, in the present study, only false negatives were observed. False negatives can lead to losses in mitigating measures of threatened species, due to the non-detection of target species, which might lead to underestimation of populations and non-implementation of mitigation measures, causing injury to these species (Ficetola Gentile et al. 2015Divoll et al. 2018). False negatives are mainly those related to inhibition issues, primer bias, PCR or sequencing errors, and bioinformatics analysis (Ficetola Gentile et al. 2015Divoll et al. 2018).

Relative abundance
Quali-and quantitative studies involving metabarcoding and eDNA have increased significantly in recent years (Hänfling et al. 2016;Valentini et al. 2016;Evans and Lamberti 2017;Yamamoto et al. 2017;Nakagawa et al. 2018;Mariac et al. 2018). However, the reliable quantification of organisms remains a problem (Evans et al. 2016;Kelly et al. 2017;Hansen et al. 2018;Mariac et al. 2018). eDNA studies have shown a positive correlation between species-specific DNA concentrations in biomass and abundance in eDNA (Takahara et al. 2012;Maruyama et al. 2014;Doi et al. 2015;Evans et al. 2016). In this way, by focusing on developing a low-cost yet efficient protocol, here we show that PCR-based methods can be used to determine relative abundance in studies of eggs and larvae.

Conclusions
Our results clearly show that the NGS approach on bulk samples can detect ichthyoplankton diversity at the species level, and also gives good estimates of relative abundance of the larvae, thus allowing reliable environmental monitoring at reduced cost and labor (Hansen et al. 2018). These results are extremely important since a great part of ichthyoplankton remains undefined with the current technology, impairing management actions. By applying metabarcoding protocols, the identification of all sampled organisms is possible, even those that occur in low density, such as endangered species. Considering ecological aspects, the use of metabarcoding in ichthyoplankton studies to identify fish larvae and eggs can provide several advances, such as precise determination of spawning sites of migratory and/or threatened species, and early detection of invasive species, which can help and guide management actions (Almeida et al. 2018).