Research Article |
Corresponding author: Satoshi Nagai ( snagai@affrc.go.jp ) Academic editor: Alexander Weigand
© 2022 Satoshi Nagai, Sirje Sildever, Noriko Nishi, Satoshi Tazawa, Leila Basti, Takanori Kobayashi, Yoshizumi Ishino.
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Nagai S, Sildever S, Nishi N, Tazawa S, Basti L, Kobayashi T, Ishino Y (2022) Comparing PCR-generated artifacts of different polymerases for improved accuracy of DNA metabarcoding. Metabarcoding and Metagenomics 6: e77704. https://doi.org/10.3897/mbmg.6.77704
|
Accuracy of PCR amplification is vital for obtaining reliable amplicon-sequencing results by metabarcoding. Here, we performed a comparative analysis of error profiles in the PCR products by 14 different PCR kits using a mock eukaryotic community DNA sample mimicking metabarcoding analysis. To prepare a mock eukaryotic community from the marine environment, equal amounts of plasmid DNA from 40 microalgal species were mixed and used for amplicon-sequencing by a high-throughput sequencing approach. To compare the differences in PCR kits used for this experiment, we focused on the following seven parameters: 1) Quality, 2) Chimera, 3) Blast top hit accuracy, 4) Deletion, 5) Insertion, 6) Base substitution and 7) Amplification bias amongst species. The results showed statistically significant differences (p < 0.05) for all of the seven parameters depending on the PCR kits used. These differences may result from the different DNA polymerases included in each kit, although the result can also be influenced by PCR reaction conditions. Simultaneous analysis of several parameters suggested that kits containing KOD plus Neo (TOYOBO) and HotStart Taq DNA polymerase (BiONEER, CA, US) at the annealing temperature of 65 °C displayed better results in terms of parameters associated with chimeras, top hit similarity and deletions.
chimeric sequences, DNA polymerase, high throughput sequencing, metabarcoding, PCR-generated artifacts, 18S rRNA gene
DNA polymerases are widely used for DNA manipulation in vitro, including DNA cloning, sequencing, labelling, mutagenesis and other purposes. The fundamental ability of DNA polymerases to synthesise a deoxyribonucleotide chain is conserved. However, the more specific properties, including thermostability, fidelity (proofreading activity), processivity (persistence of sequential nucleotide polymerisation) and specificity (proportion of non-specific amplification) vary depending on the polymerase. For example, thermostability can range from < 4 to 1380 min at 95 °C amongst DNA polymerases (
Processivity indicates the number of nucleotides added to the DNA sequence during a single binding event (
The metabarcoding approach using universal primers in Illumina sequencers has become a popular method for environmental DNA analysis in aquatic ecosystems, owing to the development of high-throughput sequencing (HTS) technologies (
The resulting HTS data is influenced by various factors from DNA extraction to bioinformatics data analysis (
Therefore, we examined the effect of the selected PCR kits containing polymerase on the accuracy of amplicon-sequence reads of the 18S rRNA gene using the HTS-based technology. A mock sample of a eukaryote community was prepared by equally pooling plasmid DNAs from 40 microalgal species. The frequency of artifacts was compared amongst 14 PCR kits containing polymerase in the following seven parameters, i.e. frequencies of: 1) Quality; 2) Chimera; 3) Blast top hit accuracy; 4) Deletion; 5) Insertion; 6) Base substitution; and 7) Amplification bias amongst species. The results displayed statistically significant differences in the amplicon sequences from different PCR kits. Various studies can utilise metabarcoding analysis and the results generated through the experiments reported here will contribute to the planning of amplicon-based HTS studies and evaluation of the obtained data in terms of PCR kit (polymerase) associated bias.
HTS high-throughput sequencing;
OTUs operational taxonomic units.
Clonal strains of 40 microalgal species were isolated from plankton blooms in several localities from Japan (Suppl. material
To carry out metabarcoding analysis using the MiSeq 250PE platform (Illumina, USA), the same universal primer pair targeting 18S rRNA gene V4 hypervariable regions (around 415 bp in length;
In this study, fourteen commercially available DNA kits containing polymerase (Table
Overview of PCR kits compared in this study based on manufacturer`s information, numbering follows Figure
Nr. | PCR kit containing polymerase | Manufacturer | Proofreading ability | Proofreading compared to Taq | Hot start |
---|---|---|---|---|---|
1. | KOD-Plus-Ver.2 | TOYOBO | Y | 80 | Y |
2–4. | KOD-Plus-Neo | TOYOBO | Y | 80 | Y |
5. | KOD FX Neo | TOYOBO | Y | 80 | Y |
6. | Go Taq Green Master Mix | Promega | N | Y | |
7. | Type-it Microsatellite PCR Kit | QIAGEN | N | Y | |
8. | Q5 High-fidelity DNA Polymerase | NEW ENGLAND BioLabs | Y | 280 | Y |
9. | One Taq Hot Start DNA Polymerase | NEW ENGLAND BioLabs | Ya | 5b | Y |
10. | KAPA2GRobust HotStart Redy Mix with dye (2X) | KAPA Biosystems | N | Y | |
11. | KAPA HiFi HotStart Ready Mix | KAPA Biosystems | Y | 100 | Y |
12. | Premix Ex Taq Hot Start Version | TaKaRa | Yc | 4.5 | Y |
13. | Top DNA Polymerase | BiONEER | N | Y | |
14. | Pfu DNA Polymerase | BiONEER | Y | 30d | Y |
15–17. | HotStart Taq DNA Polymerase | BiONEER | N | Y | |
18. | Platinum SuperFi PCR Master Mix | Invitrogen | Y | 300 | Y |
The second PCR step was performed with a primer set of 5’-AATGATACGGCGACCACCGAGATCTACAC- 8 bp index -ACACTCTTTCCCTACACGACGC (forward) and 5’-CAAGCA GAAGACGGCATACGAGAT- 8 bp index -GTGACTGGAGTTCAGACGTGTG (reverse). The eight base segments represent dual-index sequences used to recognise each sample; the 5’ end-sequences are adapters that allow the final product to bind or hybridise to short oligonucleotides on the surface of the Illumina flow cell and the 3’ end-sequences are priming sites for the MiSeq sequencing. The purified first-round PCR product was diluted 5 times with TE buffer and used as a template for the second-round PCR. The second-round PCR was carried out with the same condition as the first-round PCR, except the reaction volume of 50 μl containing 2.0 μl of the diluted PCR product. Eight cycles of the PCR with the annealing temperature of 59 °C were set for all PCR reactions tested in this study (Suppl. material
When performing high throughput sequencing (HTS), a spike-in of the control library (Illumina standard product: PhiX V3 Control Library) was mixed with the pooled DNA library to improve the data quality of low diversity samples, such as single PCR amplicons. DNA concentrations of the pooled library and the PhiX DNA were adjusted to 4 nM using the buffer EB (10 mM Tris-HCl, pH 8.5) mixed at a ratio of 7:3.5 μl. The 4 nM library was denatured with 5 μl of fresh 0.1 N NaOH. Using the HT1 buffer (provided by the Illumina MiSeq v. 2 Reagent kit for 2× 250 bp PE), the denatured library (10 μl; 2 nM) was diluted to a final concentration of 12 pM for sequencing on the MiSeq platform.
Nucleotide sequences were demultiplexed depending on the 5’-multiplex identifier (MID) tag and primer sequences using the default format in MiSeq. The sequences containing palindrome clips longer than 30 bp and homopolymer longer than 9 bp were trimmed from the sequences at both ends. The 3’ tails with an average quality score of less than 30 at the end of the last 25 bp window were also trimmed from each sequence. The 5’ and 3’ tails with an average quality score of less than 20 at the end of the last window were also trimmed from each sequence. Sequences longer than 250 bp were truncated to 250 bp by trimming the 3’ tails. The trimmed sequences shorter than 200 bp were filtered out. The demultiplexing and trimming were performed using Trimmomatic version 0.35 (http://www.usadellab.org/cms/?page=trimmomatic). The remaining sequences were merged into paired reads using Usearch version 8.0.1517 (http://www.drive5.com/usearch/). In addition, singletons were removed. Subsequently, sequences were aligned using Clustal Omega v. 1.2.0. (http://www.clustal.org/omega/). Multiple sequences were aligned with each other and only sequences that were contained in more than 75% of the read positions were extracted. Filtering and a part of the multiple alignment process were performed using the screen.seqs and filter.seqs commands in Mothur, as described in the Miseq SOP (http://www.mothur.org./wiki/MiSeq_SOP) (
A subset of the nucleotide database consisting of 40 sequences obtained by sub-cloning of the 18S rRNA gene V4 region from 40 microalgal strains was prepared for a BLAST search. The BLAST search was conducted with NCBI BLAST+ 2.2.30+ (
The effect of fourteen PCR kits containing polymerase on the accuracy for amplification of 18S-rRNA gene in a mock sample of 40 microalgal species was examined in terms of performance in the following seven parameters: 1) Quality; 2) Chimera; 3) Blast top-hit accuracy; 4) Deletion; 5) Insertion; 6) Base substitution; and 7) Amplification bias amongst species. The calculation of each of the parameters is described in Table
Overview of the seven parameters used for evaluation of PCR kits based on frequencies, PE: paired-end, SD: standard deviation.
Order | Parameter | Definition |
---|---|---|
A | Quality | Read numbers, passed quality check (the number of reads after merge assemble of PE reads following quality trimming but before chimera check) / raw read numbers in each sample |
B | Chimera | Read numbers after Chimera check / read numbers before Chimera check in each sample. |
C | Blast top hit accuracy | Numbers of unique hit-sequences to the reference sequences of 40 species / input sequence-numbers in Blast search (= read numbers after Chimera check) in each sample |
D | Deletion | Deletion = 1- ((the number of deleted base in OTU1 × the read count of OTU1 + the number of deleted base in OTU2 × the read count of OTU2 + • • • + number of deleted base in OTU40 × read count of OTU40) / (the alignment length of OTU1 × the read count of OTU1 + the alignment length of OTU2 × the read count of OTU2 + • • • + the alignment length of OTU40 × the read count of OTU40))*103 |
E | Insertion | Insertion = 1- ((the number of inserted base in OTU1 × the read count of OTU1 + the number of inserted bases in OTU2 × the read count of OTU2 + • • • + number of inserted base in OTU40 × read count of OTU40) / (the alignment length of OTU1 × the read count of OTU1 + the alignment length of OTU2 × the read count of OTU2 + • • • + the alignment length of OTU40 × the read count of OTU40))*103 |
F | Base substitution | Base substitution = 1- ((the number of base substitution in OTU1 × the read count of OTU1 + the number of base substitution in OTU2 × the read count of OTU2 + • • • + the number of base substitution in OTU40 × read count of OTU40) / (the alignment length of OTU1 × the read count of OTU1 + the alignment length of OTU2 × the read count of OTU2 + • • • + the alignment length of OTU40 × the read count of OTU40))*103 |
G | Amplification bias among species | 1-normalized SD of numbers of unique hit-sequences to the reference sequences of 40 species for 40 species |
The following information: query ID, query length, bit score, top-hit name, top-hit identity, alignment length, query alignment length, query alignment sequence, reference alignment sequence and homology information between the query and reference sequences in the alignment (homology_string) were contained in a text file output. Numbers of top-hit sequences to the reference sequences of 40 species for each query ID were counted in the triplicate samples by count.seqs command in Mothur and merged with the information obtained from Blast XML by merge_tophit_count.pl (provided as Suppl. material
awk -F”\t” ‘!/;/&&NR!=1{ OFS=”\t”;x[1]=gsub(“-”,”-”,$8);x[2]=gsub(“-”,”-”,$9) ;x[3]= gsub(/\s/,” “,$10);print $1 ,x[1],x[2],x[3]; }’ result_merge.txt
Namely, to detect Deletion, Insertion and Base substitution, the number of the hyphen ‘-‘ in the query alignment sequence, the number of the hyphen ‘-‘ in the reference alignment sequence and the number of space between the query and reference sequences in the alignment were counted respectively and the information was saved as an output in a text file. For Amplification bias amongst species, the normalised standard deviation (SD) of numbers of unique hit sequences in 40 species were calculated for each PCR condition in Excel.
To evaluate whether there are statistically significant differences amongst the results from different PCR kits, based on the seven parameters, one-way ANOVA or Kruskal-Wallis rank-sum tests were conducted in R (
To display performances of PCR kits containing DNA polymerases amongst the investigated parameters intuitively, radar charts were employed. Radar charts represent the integrated performance evaluation of seven parameters in each PCR kit containing polymerase (14 kits) and three different annealing temperature conditions for KOD-Plus-Neo and HotStart Taq DNA Polymerase. The values on the radar charts range from 0–1; if the value is closer to 1, the PCR kit has a better performance in terms of the particular parameter.
All of the seven parameters used for the comparison of the sequence data from different PCR kits displayed statistically significant differences (p < 0.05; Table
Comparison between different polymerases based on the parameters evaluated.
Parameter | Saphiro-Wilk | Levene's test | Flinger/Killeen | ANOVA | Kruskal-Wallis |
---|---|---|---|---|---|
(W/p value) | (F/p value) | (X2/p) | (F/p value) | (X2/p value) | |
Quality | 0.98/0.71 | 1.64/0.99 | 3.17/0.002 | ||
Chimera | 0.85/6.645e-06 | 9.31/0.93 | 52.52/1.702e-05 | ||
Blast top hit accuracy | 0.44/ 4.818e-13 | 13.23/0.72 | 52.34/1.815e-05 | ||
Deletion | 0.82/ 1.323e-06 | 12.43/0.77 | 52.181/1.925e-05 | ||
Insertion | 0.91/0.001 | 8.10/0.96 | 52.10/1.98e-05 | ||
Base substitution | 0.93/0.003 | 12.40/0.77 | 52.39/1.782e-05 | ||
Amplification bias | 0.57/2.848e-11 | 5.35/0.99 | 50.68/3.313e-05 |
The average performance of different PCR kits containing a polymerase, based on the seven parameters: 1) Quality, 2) Chimera, 3) BLAST top-hit accuracy, 4) Deletion, 5) Insertion, 6) Base substitution and 7) Amplification bias amongst the species. PCR kits: 1: KOD-Plus-Ver.2; 2: KOD-Plus-Neo; 3: KOD-Plus-Neo (61 °C); 4: KOD-Plus-Neo (65 °C); 5: KOD FX Neo; 6: Go Taq Green Master Mix; 7: Type-it Microsatellite PCR Kit; 8: Q5 High-fidelity DNA Polymerase; 9: One Taq Hot Start DNA Polymerase; 10: KAPA2GRobust HotStart Ready Mix with dye (2X); 11: KAPA HiFi HotStart Ready Mix; 12: Premix Ex Taq Hot Start Version; 13: Top DNA Polymerase; 14: Pfu DNA Polymerase; 15: HotStart Taq DNA Polymerase; 16: HotStart Taq DNA Polymerase (61 °C); 17: HotStart Taq DNA Polymerase (65 °C); 18: Platinum SuperFi PCR Master Mix. The numbers shown in the parentheses indicate the annealing temperature in the first-round PCR, otherwise, the first-round PCR was performed at the annealing temperature of 56 °C.
The performance of reads detected against the unique reference sequences by BLAST search, parameter 3) Blast top-hit accuracy, displayed similarly high performance amongst the majority of the PCR kits, with significantly lower values obtained by KAPA2G Robust HotStart Ready Mix with dye (2X), KAPA HiFi HotStart Ready Mix, Premix Ex Taq Hot Start Version and Top DNA Polymerase (nr. 10- 13; Figs
In terms of deletions and insertions, the variability amongst the different PCR kits was low (Fig.
Amongst all the parameters evaluated, Premix Ex Taq Hot Start Version displayed significant differences with other PCR kits in four parameters (nr. 12; Fig.
When considering different parameters together, the majority of PCR kits displayed high performance in two parameters, which were mainly associated with the deletions and top-hit similarities (Fig.
Overview on the performance of the PCR kits/conditions compared amongst seven parameters: 1) Quality, 2) Chimera, 3) Top-hits accuracy (= BLAST top-hit accuracy), 4) Deletion, 5) Insertion, 6) Base substitution and 7) Species bias (= Amplification bias amongst the species). Higher values indicate higher performance in a specific parameter.
We detected significant differences amongst PCR kits for each of the compared parameters when using the mock community sample of marine microeukaryotes. The highest number of significant differences amongst kits was detected in association with BLAST top-hit accuracy, chimeras, insertions and base substitutions, whereas the lowest number of significant differences was detected in terms of the quality of sequences. The low number of significant differences in terms of quality may potentially be explained by similar sequencing depth due to equal concentrations of the pooled samples (
When analysing the different parameters in more detail, the first parameter displaying a high number of significant differences amongst kits was associated with chimera formation. It varied notably amongst the PCR kits and was the highest in the samples amplified by Q5 High-fidelity DNA Polymerase and Platinum SuperFi PCR Master Mix (Fig.
Previous studies investigating chimera formation and different polymerases report that reducing the amount of DNA template and the number of PCR cycles used (Oh et al., n.d.;
The second parameter displaying a high number of significant differences between the PCR kits was BLAST top-hit similarity. The significant differences were mainly associated with four kits (KAPA2G, KAPA HiFi, Premix Ex Taq and Top DNA polymerase). Interestingly, from those PCR kits, Premix Ex Taq and KAPA2G also displayed lower results in some other parameters, for example, in terms of chimeras and base subtitutions (KAPA2G) and in terms of chimeras, deletions and insertions (Premix Ex Taq). For one of the PCR kits (KAPA2G), the high proportion of errors may be explained by the lack of proofreading ability, whereas the second kit (Premix Ex Taq) has a proofreading ability. Thus, some other parameters or their combination might have a stronger influence on obtaining the correct sequences than the proofreading ability (
The parameters associated with base substitutions also varied notably amongst the PCR kits. As those errors may also arise from sequencing (
Polymerase errors can be influenced by various parameters, such as template DNA, for example, complexity of the gene targeted (single allele or several alleles), presence of repetitive sequences, secondary structure (
Interestingly, the amplification bias amongst species was generally low between the polymerases, except for Q5. The bias results from the high standard deviation between the species counts and should theoretically be similar for all the polymerases as the same mock community was analysed. Stochasticity of the PCR process has been shown to induce skewed sequence representation (
The majority of the kits tested in this study resulted in high values in terms of top-hit accuracy. This coincides with the results of a previous study finding no influence of polymerase choice on the species occurrence data in metabarcoding studies (
The purpose of this study was to compare PCR kits containing polymerases to minimise PCR-based amplification artifacts in environmental DNA analysis, based on HTS and metabarcoding. Statistically significant differences amongst PCR kits were detected for all the parameters. The kits displaying significant variability as well as the best results for each parameter varied. The results of the comprehensive analysis visualised by radar charts suggested that KOD plus Neo and HotStart Taq DNA PCR kits with the annealing temperature of 65 °C displayed better performances in the highest number of parameters associated with chimeras, top hit similarity, and deletions. Especially, the higher annealing temperatures reduced chimera formation. However, as the outcome may also be influenced by the template diversity (
The authors thank R. Kubota for her assistance with the molecular work. This work was funded by “Technological developments for characterization of harmful plankton in the seawater”, Ministry of Agriculture, Forestry and Fisheries, Japan (16808839) [SN]; JST/JICA, Science and Technology Research Partnership for Sustainable Development (JPMJSA1705) [SN]; Japan Society for the Promotion of Science Short-term Postdoctoral Fellowship (PE18028) [SN, SS]; European Regional Development Fund and the programme Mobilitas Plus (MOBTP160) [SS]; Estonian Research Council grant (PSG735) [SS].
Table S1–S3
Data type: accession numbers
Explanation note: Table S1. Information source and accession numbers on the nuclear ribosomal RNA gene (18S rRNA) of 40 species used for the mock sample. ND: no data. Table S2. Composition of PCR mixtures and PCR conditions. PCR amplification was done by using 14 PCR kits, for selected kits 3 different annealing temperatures were tested. Table S3. P-values resulting from posthoc comparisons of differences among PCR kits containing polymerase based on the seven parameters (A-G). Statistically significant p-values after Benjamini-Hochberg correction are marked with bold and in italics.
merge_tophit_count.pl
Data type: Blast XML
same_tophit_count_merge.pl
Data type: Blast XML
blastxml_parser
Data type: Blast XML