Research Article |
Corresponding author: Mingxin Liu ( mingxinl@pku.edu.cn ) Academic editor: Florian Leese
© 2023 Mingxin Liu, Christopher P. Burridge, Laurence J. Clarke, Susan C. Baker, Gregory J. Jordan.
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Liu M, Burridge CP, Clarke LJ, Baker SC, Jordan GJ (2023) Does phylogeny explain bias in quantitative DNA metabarcoding? Metabarcoding and Metagenomics 7: e101266. https://doi.org/10.3897/mbmg.7.101266
|
Estimating species biomass or abundance from the number of high-throughput sequencing (HTS) reads is an aspirational goal for DNA metabarcoding, yet studies have found varied correlations. Performance varies depending on the gene marker and taxonomic group and, in part, may be related to primer-template mismatches, which are likely to exhibit phylogenetic signals. In this study, we compared commonly used fragments of two gene markers for beetles, the mitochondrial cytochrome c oxidase subunit I (COI) and 16S ribosomal RNA (16S), which have similar lengths, but different propensity for primer-template mismatches. We tested whether primer-template mismatches influence the relationship between species biomass and HTS read abundance and whether the effect of mismatches was explained by phylogeny. A significant correlation between species biomass and HTS read abundance existed for 16S, but not for COI, which had more primer-template mismatches. Models incorporating the effects of mismatch type or number improved the estimation of species biomass from HTS read abundance for COI and strong phylogenetic signals were identified. Researchers seeking to quantify biomass from metabarcoding studies should consider the effect of primer-template mismatches for the taxonomic group of interest and, for beetles, 16S appears a good candidate. Phylogenetic correction can also improve biomass estimation when using gene markers with higher primer mismatching.
Coleoptera, high-throughput sequencing, PCR amplification, phylogenetic signal, species biomass
Community composition is conventionally measured by identifying organisms using phenotypic features. However, this approach can be constrained by the availability of time and taxonomic expertise, especially for speciose groups, such as arthropods (
The causes of quantitative biases in metabarcoding are diverse and complex. An important source of bias during quantitative DNA metabarcoding is the oligonucleotide primer mismatch to binding sites during PCR amplification (
Quantification correction factors have been proposed to address amplification bias. For example, order-specific correction factors were developed using mock arthropod samples and the correlations between corrected HTS read abundance and input DNA increased from 0.09 to 0.82 (
Given that the evolution of primer site mismatches amongst species likely followed similar phylogenetic trajectories to the species themselves, knowledge of phylogenetic relationships has potential value for improving biomass estimation from HTS reads. In other words, closely-related taxa are more likely to share primer-template mismatches and deviation from an overall biomass-HTS read abundance relationship. Such phylogenetic signal has been identified in other sources of quantitative metabarcoding bias, such as gene copy number (
In this study, we tested whether primer-template mismatches influenced the relationship between biomass and HTS reads and whether the relationship could be explained by phylogeny. We used DNA fragments from two mitochondrial gene markers, the COI and 16S ribosomal RNA (16S) genes. These fragments had similar lengths (157 bp and 124–165 bp, respectively), but COI had more primer-template mismatches than 16S (
We re-analysed the high-throughput sequencing datasets in
Primer-template mismatches were scored using existing reference sequences. The total number of primer-template mismatches is likely to oversimplify models estimating species biomass with HTS read abundance (
Variables used in modelling HTS read abundance, based on species biomass and primer-template mismatches.
Term | Description |
---|---|
Dependent variable | |
HTS read abundance | Average sum of HTS reads for each species |
Independent variables | |
Species biomass | Average weight (in grams; to 4 decimal places) of specimens for a species across samples |
Mismatch number | Count of primer-template mismatches |
Mismatch position | Sum of mismatch scores based on their positions relative to the 3’ primer end† |
Mismatch type | Sum of mismatch scores based on their type† |
Deviation of predicted HTS read abundance | Residual in the linear model log10(HTS read abundance) ~ log10(species biomass) |
We retrieved original species biomass and HTS read abundance from each of the 24 samples. Then, our analysis was based on these data to model the relationship between them along with the information of primer-template mismatches. For example, we would have multiple data for a given species if they were present in multiple samples. Data were normalised and expressed as relative proportions by dividing the total HTS read abundance and total sample biomass by the original HTS read abundance and species biomass for each of the 24 samples. We have provided the original species abundance and their HTS read abundance across 24 samples in Suppl. material
Alternative linear models of relative proportion of HTS read abundance, based on the relative proportion of species biomass and/or primer-template mismatches for COI and 16S datasets. The best model as determined with lowest AICc value is highlighted in bold.
Marker | Model | Variable | Adj. R2 | P | ΔAICc |
---|---|---|---|---|---|
COI | #1 | Relative proportion of species biomass | 0.30 | 0.55 | 0 |
Mismatch position | 0.04* | ||||
Mismatch type | < 0.001*** | ||||
#2 | Mismatch type | 0.24 | < 0.001*** | 1.92 | |
#3 | Relative proportion of species biomass | 0.25 | 0.19 | 2.36 | |
Mismatch type | < 0.001*** | ||||
#4 | Relative proportion of species biomass | 0.18 | 0.12 | 7.85 | |
Mismatch number | < 0.01** | ||||
#5 | Mismatch number | 0.16 | < 0.01** | 8.04 | |
#6 | Relative proportion of species biomass | 0.02 | 0.16 | 17.05 | |
#7 | Relative proportion of species biomass | 0.01 | 0.12 | 18.52 | |
Mismatch position | 0.37 | ||||
#8 | Mismatch position | -0.01 | 0.59 | 18.83 | |
16S | #1 | Relative proportion of species biomass | 0.42 | < 0.001*** | 0 |
Mismatch position | 0.11 | ||||
Mismatch type | 0.02* | ||||
#2 | Relative proportion of species biomass | 0.41 | < 0.001*** | 0.46 | |
Mismatch type | 0.08 | ||||
#3 | Relative proportion of species biomass | 0.39 | < 0.001*** | 1.46 | |
#4 | Relative proportion of species biomass | 0.39 | < 0.001*** | 3.41 | |
Mismatch number | 0.65 | ||||
#5 | Relative proportion of species biomass | 0.39 | < 0.001*** | 3.61 | |
Mismatch position | 0.91 | ||||
#6 | Mismatch type | 0.02 | 0.08 | 51.07 | |
#7 | Mismatch position | -0.01 | 0.68 | 54.13 | |
#8 | Mismatch number | -0.01 | 0.88 | 54.29 |
In order to examine the phylogenetic effect of primer-template mismatches, we constructed a phylogenetic tree with reference DNA sequences of partial COI and 16S genes. Two species from the order Neuroptera (Apochrysa matsumurae, GenBank Accession: NC_015095 and Ascaloptynx appendiculatus, GenBank Accession: NC_011277) were selected as outgroups, based on
The total number of mismatches in the forward and reverse COI primer ranged from 0–7 across their 22 ZOTUs (mean ± SD = 3.1 ± 2.1), whereas there were only 0–3 mismatches for the 16S primer pair across their 25 ZOTUs (0.8 ± 1.2, Fig.
Linear regression models showing the direct relationship between relative proportion of HTS read abundance ~ relative proportion of species biomass for (A) COI and (B) 16S. The model fit was improved in each case by including terms relating to primer-template mismatches (Table
For 16S, HTS read abundance was significantly predicted by species biomass alone (Model 3, R2 = 0.39, P < 0.001, Table
The phylogenetic mixed models showed a strong signal for the COI dataset (posterior mean and 95% credible intervals for ĥ2 = 0.64 (0.31–0.87)) and, hence, the tendency for similar correlation between relative HTS read abundance and relative species biomass amongst closely-related species. However, such phylogenetic signal was not observed for the 16S dataset (ĥ2 = 0.37 (0–0.89)). The mismatch type and mismatch number both showed strong and significant phylogenetic signals for COI (all codons: λ = 0.79 and 0.72, P < 0.001; first and second codons together: λ = 0.85 and 0.87, P < 0.05;), but not for 16S (Table
Phylogenetic signal (Pagel’s λ) of predictor variables for the COI dataset. “Residuals” are derived from the model of relative proportion of HTS read abundance ~ relative proportion of species biomass. Variables ranked by Pagel’s λ from highest to lowest. A higher Pagel’s λ suggests stronger phylogenetic signal (more similar traits in more closely related taxa).
Variable | λ | P |
---|---|---|
Mismatch number (first and second codons) | 0.87 | < 0.05* |
Mismatch type (first and second codons) | 0.85 | < 0.05* |
Mismatch position (first and second codons) | 0.00 | 1.00 |
Mismatch number (all codons) | 0.72 | < 0.001*** |
Mismatch type (all codons) | 0.79 | < 0.001*** |
Mismatch position (all codons) | 0.04 | 0.74 |
Residuals | 0.17 | 0.25 |
Relative proportion of HTS read abundance | 0.14 | 0.33 |
Relative proportion of species biomass | 0.00 | 1.00 |
Correspondence between the model residuals of relative proportion of HTS read abundance ~ relative proportion of species biomass (left hand side) and the number of primer-template mismatches (right hand side) on phylogenetic trees for (A) COI and (B) 16S datasets. Species with lower number of primer-template mismatches are likely to have higher PCR amplification efficiency and produce relatively more HTS reads.
To date, attempts to quantify species’ biomass and abundance from HTS read abundance have had limited success (
Our study showed the potential of phylogenetic approaches to improve biomass quantification with COI. The number and type of COI primer-template mismatches exhibited significant phylogenetic signals which may be stronger with a denser sampling of related taxa and improved reference sequence databases. The number of mismatches was also associated with the model residuals of relative proportion of HTS read abundance ~ relative proportion of species biomass. This raises the possibility of phylogenetically corrected biomass estimation for gene markers like COI with more primer-template mismatches. The lack of a significant phylogenetic signal in either mismatch position or the model residuals of relative proportion of HTS read abundance ~ relative proportion of species biomass may be due to the relatively small number (n = 22 or 25) of species considered (a subset of data due to the quality of the reference sequence collection). We expect the phylogenetic signal to be stronger for larger datasets across distinct clades (
It has been suggested that considering only the total number of primer-template mismatches oversimplifies their effects on amplification efficacy (
Our study did not explore other factors, such as annealing temperature and cycle number during PCR amplification and lower annealing temperatures, in particular, have improved biomass quantification during metabarcoding (
Multiple mismatches are common in primers and COI primers with high degeneracy have become popular for arthropod metabarcoding (
Overall, our study demonstrates a phylogenetic basis of quantification biases from DNA metabarcoding and suggests an opportunity for correcting such biases based on phylogeny. Taxon-specific correction factors can be derived from mock samples by fitting a regression line for the correlation of input species biomass and recovered HTS read abundance while accounting for phylogeny (
This project was funded by the Australian Research Council Linkage Grant (LP140100075) with partner organisations Sustainable Timber Tasmania and VicForests, a Holsworth Wildlife Research Endowment for fieldwork and a student research grant for laboratory work from the Forest Practices Authority. ML was also supported by the China Scholarship Council (CSC No. 201608360109). We would like to thank Lynne Forster (UTAS) for confirming beetle species identifications, Adam Smolenski and Sharee McCammon (UTAS) for helping with HTS preparation.
No conflict of interest was declared.
No ethical statement was reported.
This project was funded by the Australian Research Council Linkage Grant (LP140100075) with partner organisations Sustainable Timber Tasmania and VicForests, a Holsworth Wildlife Research Endowment for fieldwork and a student research grant for laboratory work from the Forest Practices Authority. ML was also supported by the China Scholarship Council (CSC No. 201608360109).
Mingxin Liu: Conceptualisation, Formal analysis, Investigation, Writing - original draft & editing. Christopher P. Burridge: Funding acquisition, Formal analysis, Writing - review & editing. Laurence J. Clarke: Formal analysis, Writing - review & editing. Susan C. Baker: Funding acquisition, Formal analysis, Writing - review & editing. Gregory J. Jordan: Conceptualisation, Funding acquisition, Formal analysis, Writing - review & editing.
Mingxin Liu https://orcid.org/0000-0003-0436-4058
Christopher P. Burridge https://orcid.org/0000-0002-8185-6091
Laurence J. Clarke https://orcid.org/0000-0002-0844-4453
Susan C. Baker https://orcid.org/0000-0002-7593-0267
Gregory J. Jordan https://orcid.org/0000-0002-6033-2766
The demultiplexed paired-end HTS reads (fastq files) used for this study are available in the NCBI Short Sequence Archive (BioProjects: PRJNA612410 and PRJNA656915). All analysis code can be accessed via https://github.com/mingxinliu/Quantitative-Metabarcoding.
The bioinformatic pipeline of processing metabarcoding data
Data type: docx. file
Composition and biomass of 24 studied samples, and HTS read abundance of studied species
Data type: tables (xlsx. file)
Explanation note: table S1: The measurements of individual number, species mean biomass and total biomass across 24 samples; table S2: HTS read abundance of 22 and 25 species studied in COI and 16S dataset across 24 samples
Scoring scheme for mismatch between DNA template and primers
Data type: tables (docx. file)
Explanation note: table S3: Scoring scheme of mismatch types from