Research Article |
Corresponding author: Lidia Garrido-Sanz ( lidia.garrido.sanz@gmail.com ) Academic editor: Xin Zhou
© 2020 Lidia Garrido-Sanz, Miquel Àngel Senar, Josep Piñol.
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Garrido-Sanz L, Senar MÀ, Piñol J (2020) Estimation of the relative abundance of species in artificial mixtures of insects using low-coverage shotgun metagenomics. Metabarcoding and Metagenomics 4: e48281. https://doi.org/10.3897/mbmg.4.48281
|
Amplicon metabarcoding is an established technique to analyse the taxonomic composition of communities of organisms using high-throughput DNA sequencing, but there are doubts about its ability to quantify the relative proportions of the species, as opposed to the species list. Here, we bypass the enrichment step and avoid the PCR-bias, by directly sequencing the extracted DNA using shotgun metagenomics. This approach is common practice in prokaryotes, but not in eukaryotes, because of the low number of sequenced genomes of eukaryotic species. We tested the metagenomics approach using insect species whose genome is already sequenced and assembled to an advanced degree. We shotgun-sequenced, at low-coverage, 18 species of insects in 22 single-species and 6 mixed-species libraries and mapped the reads against 110 reference genomes of insects. We used the single-species libraries to calibrate the process of assignation of reads to species and the libraries created from species mixtures to evaluate the ability of the method to quantify the relative species abundance. Our results showed that the shotgun metagenomic method is easily able to set apart closely-related insect species, like four species of Drosophila included in the artificial libraries. However, to avoid the counting of rare misclassified reads in samples, it was necessary to use a rather stringent detection limit of 0.001, so species with a lower relative abundance are ignored. We also identified that approximately half the raw reads were informative for taxonomic purposes. Finally, using the mixed-species libraries, we showed that it was feasible to quantify with confidence the relative abundance of individual species in the mixtures.
Eukaryotes, Metazoa, genome skimming, PCR-free, mock sample, sequenced genomes
Metabarcoding is a technique used to quantify species abundance in natural communities using high-throughput DNA sequencing. There is usually a PCR step to enrich the DNA of a certain genomic region before sequencing, in what is normally termed amplicon metabarcoding. This technique is well-established and used in many ecological settings and with different groups of organisms (
There is some consensus in literature that, if the PCR step could be avoided, then the metabarcoding process would be much more quantitative (
In eukaryotes, shotgun metagenomics has been mainly applied using chloroplasts and mitochondrial genomes (
In this paper, we imagine a world in which the complete genomes of all the species are known. We simulate this future world by preparing a reference database of insect species whose genome is already assembled to an advanced degree and available at the NCBI RefSeq repository. We shotgun-sequenced DNA from some of these species in low-coverage single-species libraries, prepared without any PCR step and develop the bioinformatic algorithms necessary to go from raw reads to species assignation. Subsequently, we apply our approach to known mixtures of insect DNA to see if the method produces a quantitative estimate of the insect species present.
This exercise is a preliminary test of the difficulties likely to be faced in the future when an important number of complete genomes becomes available. In particular, we address here the following questions: (1) Is the metagenomic method useful to set apart closely related insect species; (2) What is the proportion of reads that is truly informative for species identification; (3) How many reads are necessary to achieve a reasonable level of confidence to provide quantitative estimates of the relative species abundance?
We considered all insect species whose genome was sequenced, assembled and available at the NCBI RefSeq Database on 2 August 2018. In total, 115 representative genomes of insect species were downloaded; of those genomes, five were removed for different reasons (Suppl. material
From this group of 110 species, we selected 18 species for low-coverage genome sequencing (Table
Summary table of the species in single-species libraries; the first run (run #1) was performed in September 2016 and the second (run #2) in July 2018.
Run | Library | Code | Species | Order | Family | Cultured/Wild | Origin (Country) | Number of raw reads (single-end) | Number of reads after QC and mapping step | Genome coverage |
1 | 1 | PM | Papilio machaon | Lepidoptera | Papilionidae | Wild | Spain | 217,260 | 208,832 | 0.117 |
1 | 2 | DV | Drosophila virilis | Diptera | Drosophilidae | Cultured | Spain | 2,357,451 | 2,088,898 | 1.717 |
1 | 3 | DMe | Drosophila melanogaster | Diptera | Drosophilidae | Cultured | Spain | 1,141,884 | 1,094,403 | 1.195 |
1 | 4 | DMo | Drosophila mojavensis | Diptera | Drosophilidae | Cultured | Spain | 834,212 | 787,688 | 0.646 |
1 | 5 | BO | Bactrocera oleae | Diptera | Tephritidae | Wild | Spain | 290,498 | 279,835 | 0.093 |
1 | 6 | LH | Linepithema humile | Hymenoptera | Formicidae | Wild | Spain | 711,171 | 683,311 | 0.486 |
1 | 7 | AE | Acromyrmex echinatior | Hymenoptera | Formicidae | Cultured | Denmark | 116,597 | 110,086 | 0.059 |
1 | 8 | BT | Bombus terrestris | Hymenoptera | Apidae | Wild | Spain | 997,469 | 972,727 | 0.603 |
1 | 9 | AM | Apis mellifera | Hymenoptera | Apidae | Wild | Spain | 631,194 | 607,965 | 0.378 |
1 | 10 | AP | Acyrthosiphon pisum | Hemiptera | Aphididae | Cultured | USA | 342,344 | 282,940 | 0.092 |
2 | 1 | ACo | Atta colombica | Hymenoptera | Formicidae | Cultured | Denmark | 1,636,355 | 1,607,703 | 0.845 |
2 | 2 | BTa | Bemisia tabaci | Hemiptera | Aleyrodidae | Cultured | Spain | 1,256,606 | 1,170,057 | 0.307 |
2 | 3 | CL | Cimex lectularius | Hemiptera | Cimicidae | Wild | Spain | 1,753,361 | 1,703,285 | 0.515 |
2 | 4 | DMe | Drosophila melanogaster | Diptera | Drosophilidae | Cultured | Spain | 1,454,804 | 1,428,616 | 1.523 |
2 | 5 | DMo | Drosophila mojavensis | Diptera | Drosophilidae | Cultured | Spain | 898,735 | 820,019 | 0.697 |
2 | 6 | DV | Drosophila virilis | Diptera | Drosophilidae | Cultured | Spain | 668,442 | 619,733 | 0.488 |
2 | 7 | DSu | Drosophila suzukii | Diptera | Drosophilidae | Cultured | Spain | 1,255,192 | 1,178,528 | 0.811 |
2 | 8 | LH | Linepithema humile | Hymenoptera | Formicidae | Wild | Spain | 1,082,202 | 1,047,744 | 0.742 |
2 | 9 | PXy | Plutella xylostella | Lepidoptera | Plutellidae | Wild | Spain | 2,125,062 | 1,913,733 | 0.813 |
2 | 10 | SI | Solenopsis invicta | Hymenoptera | Formicidae | Wild | Argentina | 1,830,687 | 1,772,743 | 0.695 |
2 | 11 | VE | Vollenhovia emeryi | Hymenoptera | Formicidae | Wild | Japan | 1,743,917 | 1,679,101 | 0.912 |
2 | 12 | WA | Wasmannia auropunctata | Hymenoptera | Formicidae | Wild | Spain | 1,667,606 | 1,613,371 | 0.772 |
We prepared two kinds of libraries: 22 libraries with DNA of a single-species and 6 libraries with a mixture of DNA of several species at known relative concentrations. The single-species libraries were used to calibrate the bioinformatic pipeline of assignment of reads to species; the mixed-species libraries were used to test the ability of the calibrated method to estimate the relative abundance of individual species in mixtures. All libraries were prepared using the TruSeq DNA PCR-Free LT Kit of Illumina following the manufacturer’s instructions (Ref. 15037063).
All libraries were sequenced using an Illumina MiSeq with the 2x150 chemistry in three different runs, two runs for single-species (Table
Relative abundance of the species in mixed-species libraries (in October 2017).
Library | Acromyrmex echinatior | Acyrthosiphon pisum | Apis mellifera | Bactrocera oleae | Bombus terrestris | Drosophila melanogaster | Drosophila mojavensis | Linepithema humile | Papilio machaon | Number of raw reads (single-end) | Number of reads after QC and mapping steps |
1 | 0.5010 | 0.0078 | 0.0626 | 0.2505 | 0.1252 | 0.0157 | 0.0313 | 0.0039 | 0.0020 | 1,897,302 | 1,842,838 |
2 | 0.2505 | 0.0020 | 0.1252 | 0.5010 | 0.0626 | 0.0313 | 0.0157 | 0.0078 | 0.0039 | 1,674,754 | 1,597,601 |
3 | 0.3039 | 0.0389 | 0.1088 | 0.2158 | 0.1532 | 0.0548 | 0.0772 | 0.0276 | 0.0196 | 1,887,006 | 1,829,291 |
4 | 0.2158 | 0.0196 | 0.1532 | 0.3039 | 0.1088 | 0.0772 | 0.0548 | 0.0389 | 0.0276 | 1,557,348 | 1,511,536 |
5 | 0.2127 | 0.0747 | 0.1261 | 0.1787 | 0.1501 | 0.0890 | 0.1059 | 0.0628 | 1,767,384 | 1,709,991 | |
6 | 0.1787 | 0.0628 | 0.1501 | 0.2127 | 0.1261 | 0.1059 | 0.0890 | 0.0747 | 1,344,467 | 1,264,242 |
The target concentration of each species in the mixtures was calculated using a geometric law of parameter k (
For the sequences generated in the current study, we assessed the quality of raw reads with FastQC v0.11.7 (
We then aligned each read to all reference genomes individually using BWA 0.7.15-r1140 (
In general, one-read maps into several reference genomes, so an algorithm is needed to decide between alternative assignations of a read. In metabarcoding and metagenomics studies, reads are commonly assigned to taxa using the lowest common ancestor algorithm (LCA) (MEGAN:
Flow diagram of the computational pipeline used in this study. At the top, input data and, below it, the steps and tools needed for the identification procedure.
Flow diagram of the γ–δ algorithm. Only the two highest mapping ratios to two reference genomes of a single read are required. In the figure, it is assumed that the highest mapped ratio A1 belongs to the reference genome i.
A = nm / nt (1)
Even though a read r can be assigned to many references, the γ–δ algorithm only needs the two highest mapping ratios. Let A1 and A2 be the highest and second highest mapping ratios of r. We assume that A1 corresponds to the mapping ratio of r to the reference genome of species i. Then the assignation algorithm works in the following way (Figure
If A1 < γ, then r is non-informative (because it does not map well enough to any species)
If A1 ≥ γ and A2 ≥ δ, then r is non-informative (because it maps too well to two different species)
If A1 ≥ γ and A2 < δ, then r is informative and it is assigned to species i (because it maps well enough in one species and not in any other one).
As the best values of γ and δ are unknown, we used the single-species libraries to find the best combination of γ and δ. The reads were divided into a training set (75% of the reads chosen randomly) to find the best γ–δ and a test set (the remaining 25% of reads) to independently calculate the goodness of fit of the model. The tested values of γ and δ were all the combinations of γ = {0.99, 0.98, 0.97} and δ = {0.98, 0.97, 0.96} where γ > δ.
The above algorithm produced a list of species assigned to each read of a library. In single-species libraries, ideally, all reads should belong to the same species (from now on, the focal species). However, detection of additional species could occur for several reasons, such as contamination from the lab, sequencing errors and even tag jumping between multiplexed libraries (
In real samples, contaminants are hard to detect, but in our single-species libraries, they are not. If a read corresponds to a species handled simultaneously in the lab, but not sequenced, then it is probably a genuine contamination problem and could be removed from the list of recovered species.
The other kinds of wrongly assigned species likely produce a very low number of reads. The only way to deal with them is to set a detection limit (ε), so the species with a proportion of reads lower than ε are ignored. Here, we present results using the detection limits of 10-2, 10-3 and 10-4.
With the single-species libraries, we used three different criteria to decide which values of γ, δ and ε provided best results. The most important one was that the number of species reported had to be one for single-species libraries. In addition, we wanted to maximise the proportion of reads assigned to the focal species and the proportion of informative reads (assigned to any species).
In practical terms, we first fixed the ε parameter. Next, we compared the different γ–δ combinations using the PERMANOVA test (
The final output of the above analysis is a combination of values of γ, δ and ε that were best for the single-species libraries analysed in this study. The goodness of fit of this set of parameters was evaluated using the test set, i.e. the remaining 25% of reads were not used for the calibration.
As will be shown in the results, using the best values of γ, δ and ε, we still found in the single species libraries some reads that were wrongly assigned to non-focal species. To explore the identity of all these misidentified reads, we blasted them (or a subset of 100 reads when the total number of misclassified reads was higher) with megablast (
Mixed-species mock samples were processed following the same computational pipeline as outlined above (Figures
As can be seen from the results, we obtained a good quantitative estimation of species abundance in all mixed-species libraries. However, from a practical point of view, it would be interesting to investigate if sequencing depth can be reduced and a robust quantitative estimation of the relative species abundance maintained. Thus, more libraries could be multiplexed in a single run and so reduce the overall cost. To evaluate this possibility, we ran the same computational pipeline as before (using the chosen parameters γ, δ and ε), but randomly reducing the number of reads to a proportion of 0.1, 0.01 and 0.001 of the original ones. Each simulation was repeated 100 times, using a different random set of reads. Afterwards, we estimated the number and relative abundance of the recovered species in each rarefied sample and calculated the Pearson correlation between the actual and the estimated relative abundance of the species.
All statistical analyses were performed using R 3.4.2 (
The 22 libraries prepared from DNA of single-species of insects (Table
The most important characteristic of these libraries is that the number of species recovered, in theory, must be one. With these libraries, we parameterised two aspects of metagenomic species assignment: first, what is the appropriate detection limit (ε) for removal of spurious species (i.e. cut-off for minimum proportion of reads for species to be retained) and second, which are the best values of γ and δ.
For the detection limit ε, we describe in detail the process followed for the analysis of the first run of single-species libraries using the values of γ = 0.99 and δ = 0.98. The rest of the single-species libraries and all the other γ–δ combinations produced relatively similar results and are provided as Supplementary material (Suppl. material
After the application of the γ–δ algorithm, there were 19.6 ± 8.0 reference genomes (species) per library (Table
Proportion of reads assigned to species, in parentheses, for each one of the 10 single-species libraries included in the first sequencing run. The species in each library are shown in the header column. Species assignments are divided in each column into blocks: A, species with abundance higher than ε = 0.01; B, with abundance between ε = 0.01 and ε = 0.001; C, with abundance between ε = 0.001 and ε = 0.0001; D, with abundance below ε = 0.0001; E, potential contaminants. Codes of the species as in Suppl. material
Criteria | Lib. 1 – PM | Lib. 2 – DV | Lib. 3 – DMe | Lib. 4 – DMo | Lib. 5 – BO | Lib. 6 – LH | Lib. 7 – AE | Lib. 8 – BT | Lib. 9 – AM | Lib. 10 – AP |
A: above ε = 0.01 | PM (0.98249) | DV (0.9958) | DMe (0.99812) | DMo (0.99627) | BO (0.93349) | LH (0.99877) | AE (0.99862) | BT (0.99777) | AM (0.99627) | AP (0.99557) |
B: from ε = 0.01 to 0.001 | LC (0.00259) | AF (0.00226) | ||||||||
C: from ε = 0.001 to ε = 0.0001 | DSi, DSe, DS, DSu, LC | DAr, DEl, DB, LC, DO, DEu, DF, DBi | VE | BI, AF | ACer, AD | MPe | ||||
D: below ε = 0.0001 | NVi, PP | CQ, NVi, MP, OT, CS, CL, BL, PXy, BI, DAr, DF, DSe, DSu, DT, MDe, DO, NLu | DNo, DY, CL, DW, CQ, DEr, NVi, MP, DF, DO, NLu, DEu | DSi, DT, DSe, DNa, DH, DR, SL, DK, DSu, OT, PXy, DAn | NVi, MDe, LC | BT, CF, SI, CC, TZ |
WA, DNo, BI, TS | SI, LC, CCal | EM, DBi | DN, MS, F |
E: potential contaminants; Bold: species handled in the lab and sequenced; Italic: species handled in the lab but not sequenced | DV (0.01359) | BM (0.00078) | BM (0.00046) | BM (0.00036) | CCap (0.0655) | AE (0.0006) | BM (0.00055) | BM (0.00078) | BM (0.00062) | DMe (0.00288) |
BM (0.0021) | TCa (0.00025) | DV (0.00031) | DV (0.00008) | BM (0.00043) | BM (0.00035) | LH (0.00019) | DV (0.00025) | DMe (0.00014) | BM (0.00071) | |
DMe (0.00057) | DMe (0.00012) | LH (0.00004) | DMe (0.00004) | BT (0.00015) | DV (0.00005) | DV (0.00018) | DMe (0.00024) | DV (0.00006) | LH (0.00012) | |
BT (0.00047) | PM (0.00011) | BO (0.00004) | LH (0.00001) | DV (0.00012) | DMe (0.00005) | DMe (0.00005) | DMo (0.00008) | BT (0.00004) | DV (0.00011) | |
DMo (0.0002) | LH (0.00008) | BT (0.00003) | BT (0.00001) | LH (0.00012) | AP (0.00003) | DMo (0.00005) | LH (0.00004) | BO (0.00004) | BT (0.00009) | |
AE (0.00013) | DMo (0.00003) | DMo (0.00003) | PM (0.00001) | DMe (0.00006) | AM (0.00003) | BT (0.00003) | BO (0.00004) | LH (0.00003) | DMo (0.00009) | |
LH (0.0001) | BT (0.00002) | PM (0.00002) | AM (0.00001) | PM (0.00002) | DMo (0.00002) | TCo (0.00003) | AM (0.00002) | DMo (0.00002) | AM (0.00003) | |
AM (0.00006) | AE (0.00001) | AE (0.00001) | BO (<0.00001) | DMo (<0.00001) | BO (<0.00001) | BO (0.00001) | AP (0.00002) | PM (0.00002) | BO (0.00003) | |
AP (0.00006) | AM (0.00001) | AM (0.00001) | AE (<0.00001) | PM (<0.00001) | PM (0.00001) | AP (0.00001) | PM (0.00003) | |||
CCap (0.00006) | BO (0.00001) | AP (0.00001) | AP (<0.00001) | AE (0.00001) | CCap (<0.00001) | AE (0.00001) | ||||
BO (0.00003) | AP (<0.00001) | TCa (<0.00001) | CCap (<0.00001) | CCap (<0.00001) | ||||||
CCap (<0.00001) | CCap (<0.00001) | |||||||||
Total number of species | 14 | 31 | 30 | 32 | 12 | 15 | 14 | 17 | 16 | 15 |
Some of these additional species were handled in the same lab, but finally were not sequenced because of their poor quality or for other reasons. These species included Ceratitis capitata (Diptera: Trephitidae), Bombyx mori (Lepidoptera: Bombycidae) and Tribolium castaneum (Coleoptera: Tenebrionidae) in the first single-species run and B. mori again in the second run. Thus, they can legitimately appear in the species list because of lab contamination or tag-jumping. If we eliminate these species (Table
The next step is the removal of the species below a certain detection limit. If the species having a relative proportion below ε = 0.0001 were discarded (Table
The exploration of the misidentified reads in Table
The final step is to decide which of the tested γ–δ combinations provided better results. First, the number of identified species was closer to 1 for γ = 0.99 than for γ = 0.98 or γ = 0.97 (Figure
Summary boxplots of the 22 single-species libraries used to search the best combination of parameters γ and δ; in all cases, a detection limit of ε = 0.001 was used and contaminant species were discarded. (A) Number of identified species in the library. (B) Proportion of the assigned reads allocated to the right species. (C) Proportion of the total reads (after trimming and mapping) that were informative (i.e. assigned to any species). A different letter at the top of the figures indicates significant differences amongst γ–δ combinations.
The replicated single-species libraries (four species that were analysed in two separate runs) produced remarkably similar results. For example, both libraries of D. virilis had L. cuprina at a relative concentration higher than ε = 0.001; in the other three libraries, only the focal species was recovered above a value of ε = 0.001 (see Suppl. material
We tested the quality of the adjusted parameter set ε = 0.001, γ = 0.99 and δ = 0.98 obtained with the training set, using the remaining 25% of reads (i.e. the test set). The results were not statistically different between the test set and the training set for any of the analysed variables (Suppl. material
The six libraries prepared from DNA of multiple species of insects (Table
As in the single-species libraries, in the mixed-species libraries, there were also contaminants handled in the laboratory, but not sequenced. To the already mentioned C. capitata, B. mori and T. castaneum, we must add D. virilis (that was not sequenced in any mixed-species library) and L. humile (not sequenced in libraries 5 and 6). As we did before, we eliminated all these species as genuine contaminants (Table
Relative proportion of assigned reads to species, in parentheses, for each one of the 6 mixed-species libraries of Table
Criteria | Lib. 1 | Lib. 2 | Lib. 3 | Lib. 4 | Lib. 5 | Lib. 6 |
A: above ε = 0.01 | AE (0.64971) | AE (0.37924) | AE (0.44486) | AE (0.30977) | AE (0.32342) | AE (0.27894) |
BO (0.17167) | BO (0.37805) | BO (0.16181) | BO (0.2496) | BO (0.1536) | BO (0.18874) | |
BT (0.06582) | AM (0.12349) | AM (0.10563) | AM (0.15797) | AM (0.13569) | AM (0.16409) | |
AM (0.05033) | BT (0.03834) | BT (0.08595) | BT (0.06737) | DMo (0.12864) | DMo (0.10473) | |
DMo (0.02998) | DMe (0.02339) | DMo (0.0789) | DMo (0.05985) | BT (0.09288) | DMe (0.08641) | |
DMe (0.01052) | DMo (0.01707) | DMe (0.03823) | DMe (0.05742) | DMe (0.0694) | BT (0.082) | |
LH (0.0354) | LH (0.0536) | AP (0.06842) | AP (0.06055) | |||
AP (0.03213) | AP (0.01873) | PM (0.01576) | PM (0.01866) | |||
B: from ε = 0.01 to ε = 0.001 | AP (0.00577) | LH (0.00997) | PM (0.00434) | PM (0.00697) | ||
LH (0.00245) | AP (0.0017) | |||||
C: from ε = 0.001 to 0.0001 | PM (0.00082) | PM (0.0009) | AF (0.00024) | AF (0.00034) | AF (0.00029) | AF (0.00036) |
VE (0.00013) | AF (0.00026) | VE (0.00011) | ||||
AF (0.00012) | VE (0.0001) | |||||
D: below ε = 0.0001 | WA, BI, DAr, TCo, ACer, DB, TS, DEl, LC, DO, AD, DBi, TZ, DEu, MDe, CCal, DSi, ACep, CC, DN, DF, DK,DSe, EM, MP, ACo, BD, BL, CL, DR, DSu, NVi, PH, ZC | WA, ACer, BI, LC, AD, DAr, DB, TS, TCo, MDe, DEl, EM, NVi, DO, DSi, DS, DEu, ACo, BL, TZ, ACep, ZC, CQ, DBi, DSe, MP, BD, APl, CF, DT, DW, DY, HL, SI, SC | DAr, WA, DEl, DB, LC, BI, DO, ACer, DBi, MP, TCo, DSi, AD, TS, DEu, DF, MDe, DS, DSe, DT, ACo, TZ, BD, NVi, ACep, SI, DR, BL, DW, CCal, DN, DSu, DNa, DNo, BA, LD, NL, SL | VE, WA, ACer, DAr, DEl, DB, BI, DO, LC, AD, DSi, TS, MP, TCo, DEu, NVi, DF, DSe, MDe, DSu, DY, DBi, DS, ACo, ACep, DNo, CL, PXy, SF, DT, TZ, SI, BL, DW, DN, DNa, CQ, CF, DK, PP | DAr, DEl, VE, DB, DO, WA, LC, BI, ACer, MP, DSi, DF, DBi, DEu, AD, TS, TCo, NVi, DS, TZ, DSe, DW, NL, EM, MDe, DSu, SF, DNa, CQ, PP, ACep, DNo, CL, DT, DN, DK, CCal, BA, LD, APl, CC, DC, DEr | DAr, VE, DEl, ACer, DB, DO, LC, WA, BI, DEu, MP, DSi, DF, DS, DBi, AD, MDe, TCo, DSu, TS, DSe, EM, DNo, NVi, NL, DNa, ACep, DT, ACo, DH, TZ, CQ, PP, CL, DN, ZC, HL, AGa, Dan, MS, PR |
E: potential contaminants | CCap (0.01185) | CCap (0.02603) | CCap (0.01131) | CCap (0.01743) | CCap (0.01064) | CCap (0.01323) |
BM (0.0005) | BM (0.00112) | BM (0.00051) | BM (0.00044) | BM (0.00055) | BM (0.00155) | |
TCa (< 0.00001) | TCa (< 0.00001) | LH (< 0.00001) | LH (< 0.00001) | |||
DV (< 0.00001) | DV (< 0.00001) | |||||
Total number of species | 47 | 49 | 53 | 52 | 55 | 54 |
Scatter plots between the expected (i.e. as the mixtures were prepared in the lab; Table
We run the entire pipeline on a server with 2 Intel Xeon E5-2620 v3 processors with 6 cores each, which allowed a maximum of 24 threads, thanks to their hyper-threading technology. The total processing time varied between 54 min (library #6, 1.3 raw million reads) and 1 h 16 min (library #1, 1.9 raw million reads), most of it (89%) consumed by the mapping of the reads into the reference genomes and very little (3–4%) by the γ–δ algorithm (see Suppl. material
When only a proportion of 0.1 or even 0.01 of the initial reads was used, the number of recovered species was the same in libraries 3–6 as when all reads were used (Figures
Effect of the rarefaction of reads on the number of species detected (above ε = 0.001 and without contaminants) in the six mixed-species libraries (A to F correspond to libraries 1 to 6). The x axis indicates the proportion of reads used (when 1, all reads were used, so there is only one value); in the rest of the values, 100 random repetitions were conducted using the indicated proportion of reads. The horizontal dashed line of each plot indicates the actual number of species in the mixture.
Effect of the rarefaction of reads on the correlation coefficient r between the expected and the recovered relative abundance of the species in the six mixed-species libraries (A to F correspond to libraries 1 to 6). The x axis indicates the proportion of reads used (when 1, all reads were used, so there is only one value); in the rest of the values, 100 random repetitions were conducted using the indicated proportion of reads. The horizontal dashed line of each plot indicates the critical value of r, above which measured r is statistically significant at p < 0.05.
Metagenomics is a technology devised to obtain both taxonomic and functional gene information for entire communities of organisms (
Our data, collected from single insect specimens, produced assignments to ca. 20 species; similarly, in each mixed-species library (8–9 species), ca. 50 species were recovered, so most of the listed species were spurious (Tables
There are two possible causes for contaminant DNA. The first one is physical contamination in the preparation of the libraries in the lab; the second one is the index-hopping effect during the sequencing reaction (
Three species were handled simultaneously in the lab, but not sequenced. All these species appear in most libraries, generally in a proportion lower than 0.001 (Tables
Another possibility for inter-sample contamination is the worrisome tag-jumping effect (
In addition to the contaminants, many other species appeared in the lists of both the single and mixed-species libraries (Tables
The only way to eliminate these species from the species list of each library is to set a threshold for the relative abundance of the species, i.e. an analytical detection limit. A detection limit of ε = 0.001 eliminated all the unwanted species in all but 3 of the 28 artificial libraries (Tables
In the mixed-species libraries, the detection limit of ε = 0.001 removed all spurious species, with no exceptions. However, in libraries 1 and 2, DNA of Papilio machaon was used to prepare the mixtures at a low concentration (Table
We do not think that the presence of spurious species in our artificial libraries is specific to the way that we handled the DNA in the lab or to our species assignation algorithm. The problem is probably more general, but it is only exposed when artificial samples are analysed, especially in those consisting of only one species. Other researchers have found similar results using prokaryotes (
The main objective of using metagenomics for the quantification of the species abundance and hence of this study, was to overcome the PCR-biases of amplicon metabarcoding. Here, we showed that the metagenomics approach completely fulfilled this objective, whereas in amplicon metabarcoding, the quantification of the abundances of the species is sometimes good (
Our mixed-species libraries comprised ca. 1.7 million reads each, but the rarefaction experiment showed that, even with 100 times less reads (ca. 17000 per library), the quantification would still be good (Figure
However, it is important to remember that not all biases are corrected by shotgun metagenomics. Here, we began the process using extracted DNA, so all the biases in the generation of eDNA sequences (i.e. digestion rates in dietary studies or DNA degradation in the soil or in the water, or in the DNA extraction) are not accounted for. In particular, the same amount of biomass does not always render the same amount of DNA (
In metagenomics, there are, basically, two methods to assign reads to species; the assembly-based and the read-based approaches (
Consequently, we used here the read-based approach that assigns a species to every read by mapping it into a reference genome. As a mapper, we used BWA, but other possibilities would probably be good choices too (e.g. Bowtie2, MagicBlast, GEM;
The metagenomics approach presented here for eukaryotic species will not be a realistic option until the number of sequenced genomes is a substantial fraction of the total biodiversity. Today, the metagenomic method for taxonomic purposes is used mostly with genomes of organelles instead of whole genomes, because the number of sequenced organelle genomes is much higher than the number of whole genomes (e.g. today there are roughly, in the NCBI RefSeq database, 14 times more mitogenomes than whole genomes of insects). In addition, the number of sequenced organelle genomes is increasing quickly with new easier and faster methods, based on next generation sequencing and de novo assembly (
Mito-metagenomics has proved to be better than amplicon metabarcoding for quantification purposes (
It is also fair to question about the computational problems that would pose a future with huge reference databases when the genomes of most species are sequenced. Perhaps our implementation of the method could be so computationally costly that it would be inapplicable in practice. In our opinion, the method is perfectly manageable today in a modest computer server and will remain so in the foreseeable future. In the reported experiments, the maximum processing time of the entire pipeline per mixed-species library was of 1.3 hours, most of it being devoted to the mapping of reads into the reference genomes. This mapping of reads into genomes is a problem that fits into the category known as “embarrassingly parallel” applications (
In addition, the pipeline could eventually be modified in several ways to further reduce the execution time. (1) Selection of reference genomes in the database: when processing a sample, there is no need to compare the reads with all the animal genomes (or plant or fungi) in the world: if the interest is in insects, then only the genomes of insects known to occur in a certain geographical region should become the reference database. Thus, even in a future with the genomes of all species already sequenced, the number of genomes of interest will never be of millions, but of 103 to 105 genomes at most. (2) Filtering of the reference genomes database: we showed here that only approximately half of the reads were informative. The non-informative reads probably belong to certain regions of the genome that, when identified, could be filtered out from the reference database with appropriate programmes. (3) Elimination of non-informative reads in running time: in the γ–δ algorithm, a read that maps better than δ in two different genomes is declared non-informative. Once that read is detected, the mapping of it against the remaining reference genomes is not necessary anymore and finally (4) It is reasonable to assume that the power of the computers will continue to increase in the future as it has done in the past (
According to our results, the low-coverage shotgun metagenomic method is perfectly capable to set apart closely related insect species, like the four species of the genus Drosophila that we included in the artificial libraries. We also saw that, despite the risk that some reads were not in the reference databases that we used (reads of commensal or parasites species; parts of the genome not yet sequenced) or that some reads were very similar in more than one reference genome, we achieved a reasonable proportion (ca. 0.50) of truly informative reads. By using mixtures, we showed that it is possible with this technique to quantify with confidence the relative abundance of individual species in the mixtures and that, with much less sequencing depth than the one used here, it was possible to obtain comparable results (ca. 17000 reads in mixtures of ca. 10 species). Finally, a word of caution. The “dream” of getting an eDNA sample, sequencing it, mapping it against a growing DNA database and obtaining the species names and relative abundance of all species in the mixture that we tried to simulate in this study, is not without hurdles. The main one is obviously the low number and quality of eukaryote genomes sequenced so far, but also the impossibility of identifying, with confidence, species below a certain detection limit and the need to improve the algorithms in a future with huge genome databases and increased sequencing depth.
The data underpinning the analysis reported in this paper are deposited in the Dryad data Repository at https://doi.org/10.5061/dryad.t1g1jwsz7. The entire pipeline and the γ–δ algorithm implementation are available at GitHub: https://github.com/LidiaGS/g-d_algorithm.
The experiment was conceived by JP and designed by all authors. Bioinformatic analyses were performed by LGS and supervised by the other two authors. All authors contributed to the writing of the manuscript.
We are very grateful to several entomologists who kindly provided specimens for sequencing: Xavier Espadaler, Nicolás Pérez, Alfredo Ruíz, Francesc Mestres, Aleix Valls, Francisco Beitia, Carlos Hernández-Castellano, Joan Josep Ibañez, Carlos Pradera, Rasmus S. Larsen, Jacobus J. Boomsma, Luis Calcaterra and Misato O. Miyakawa. We also thank Anna Barceló and Roger Lahoz of the Genomics facilities of the UAB for extracting and sequencing the DNA. The comments of Bruce E. Deagle, Alfried P. Vogler, Yiyuan Li and Xin Zhou on earlier versions of the manuscript are also much appreciated. Financial help was provided by the grants TIN2017‐84553‐C2‐1‐R (Spanish Government) and AGAUR 2017 SGR 1001 (Generalitat de Catalunya).
Figure showing the comparison of the training and test datasets
Data type: Boxplot
Additional information of the reference genomes
Data type: Excel table
Relative species abundances for single-species libraries of the first run using different γ–δ combinations
Data type: Excel table
Relative species abundances for single-species libraries of the second run using different γ–δ combinations
Data type: Excel table
Relative species abundances for mixed-species libraries using the optimised γ–δ combination
Data type: Excel table
Relative species abundances of the four single-species libraries sequenced twice
Data type: Excel table
Assignment of the wrong identified species by “megablast” using nt NCBI database
Data type: Excel table
Execution times of each tool in the complete pipeline for each mixed-species library
Data type: Excel table