Metabarcoding and Metagenomics :
Software Description
Corresponding author: Jan-Niklas Macher (jan.macher@uni-due.de)
Academic editor: Masaki Miya
Received: 14 Nov 2017 | Accepted: 13 Dec 2017 | Published: 15 Dec 2017
© 2017 Jan-Niklas Macher, Till-Hendrik Macher, Florian Leese
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Macher J, Macher T, Leese F (2017) Combining NCBI and BOLD databases for OTU assignment in metabarcoding and metagenomic datasets: The BOLD_NCBI _Merger. Metabarcoding and Metagenomics 1: e22262. https://doi.org/10.3897/mbmg.1.22262
Metabarcoding and metagenomic approaches are becoming routine techniques for use in biodiversity assessment and in ecological studies. The assignment of taxonomic information to millions of sequences obtained via high-throughput sequencing is challenging, as many DNA reference libraries lack information on certain taxonomic groups and can contain erroneous sequences. Combining different reference databases is therefore a promising approach for maximising taxonomic coverage and reliability of results.
The “BOLD_NCBI_Merger” bash script is introduced, which combines sequence data obtained from the National Center for Biotechnology Information (NCBI) GenBank and the Barcode of Life Database (BOLD) and prepares them for taxonomic assignment with the software MEGAN.
Biodiversity, High-throughput sequencing, Operational taxonomic unit, software, MEGAN, script, taxonomic assignment
High-throughput biodiversity assessment techniques such as metagenomics (
Studies have shown that both databases can be used to successfully identify metazoan taxa (
Here, the “BOLD_NCBI_Merger” is introduced: a bash script that builds databases containing sequence data from both BOLD and NCBI GenBank. In the tutorial accompanying the script (Suppl. material
Prior to analyses BLAST+ (v. 2.6), vsearch (
The bash script “BOLD_NCBI_Merger” first concatenates the files downloaded from BOLD and those downloaded from NCBI. COI sequences are then extracted from the downloaded BOLD fasta file. COI is the most widely used gene for barcoding of metazoan taxa, and most sequences deposited in the BOLD database are COI sequences. However, a few sequences of other markers (e.g. 18S rRNA) are also deposited in BOLD. These are sometimes downloaded together with the COI sequences and must be removed for the script to work properly. The headers of both the BOLD and NCBI files are formatted so that vsearch can dereplicate the sequences without cutting the header. vsearch is then used to dereplicate the sequences, which prevents over-representation of sequences in the final database. In the next step, the headers are formatted so that MEGAN can identify species names. A local BLAST database is built from the concatenated BOLD and NCBI dataset. Finally, a metabarcoding or metagenomics dataset is searched against this database with BLAST. The resulting txt file can be imported into MEGAN, from which the taxonomic assignments can subsequently be exported.
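The steps described above can be illustrated with a minimal Python sketch. It is not the code of the BOLD_NCBI_Merger script, which is written in bash and relies on vsearch and BLAST+. The sketch assumes that BOLD fasta headers follow the layout "ID|taxon|marker|accession" and that the header formatting needed before dereplication amounts to replacing whitespace with underscores; both are assumptions for illustration. All function names are hypothetical, and the pure-Python dereplication merely stands in for what vsearch does.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def is_coi(header):
    """Assumed BOLD header layout: ID|taxon|marker|accession."""
    fields = header.split("|")
    return len(fields) >= 3 and fields[2].upper().startswith("COI")


def safe_header(header):
    """Replace whitespace so the full header survives as a single token."""
    return "_".join(header.split())


def dereplicate(records):
    """Keep the first header seen for each distinct sequence (case-insensitive)."""
    unique = {}
    for header, seq in records:
        unique.setdefault(seq.upper(), header)
    return [(header, seq) for seq, header in unique.items()]


def merge_references(bold_text, ncbi_text):
    """Filter BOLD records to COI, append NCBI records, then dereplicate."""
    bold = [(h, s) for h, s in parse_fasta(bold_text) if is_coi(h)]
    ncbi = list(parse_fasta(ncbi_text))
    merged = [(safe_header(h), s) for h, s in bold + ncbi]
    return dereplicate(merged)


def tool_commands(merged_fasta, derep_fasta, db_name, query_fasta, blast_out):
    """Command lines for the external tools; file names are placeholders."""
    return [
        ["vsearch", "--derep_fulllength", merged_fasta, "--output", derep_fasta],
        ["makeblastdb", "-in", derep_fasta, "-dbtype", "nucl", "-out", db_name],
        ["blastn", "-query", query_fasta, "-db", db_name, "-out", blast_out],
    ]
```

Each list returned by tool_commands could be passed to subprocess.run on a system where vsearch and BLAST+ are installed. The step that formats headers for MEGAN is deliberately left out, because its exact rules are documented only in the tutorial.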
The detailed tutorial, including all commands, can be found in supplementary material 1. The package containing the script used for processing and preparing sequence files can be found in supplementary material 2. Sequence data for the tutorial can be obtained from BOLD and NCBI GenBank. All Trichoptera sequences (used here as an example) can be downloaded as one fasta file from BOLD via the Public Data Portal (http://www.barcodinglife.org/index.php/Public_BINSearch?searchtype=records; search term: “Trichoptera”, “Public Data”). All Trichoptera sequences from GenBank can be downloaded from the nucleotide database (https://www.ncbi.nlm.nih.gov/nucleotide/; search term: "Trichoptera AND (COI OR CO1 OR COX1 OR COXI)"; sequence length (e.g.): 1-1000 bp) and saved on a local hard drive. Sequences of markers other than COI can be processed as long as the data format is the same as for the COI data.
For ease of use, a small dataset (Trichoptera, COI barcoding region) was used for this tutorial. For real analyses, however, a larger reference database containing as many taxa as possible should be used to obtain reliable results and to prevent erroneous assignments (
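The GenBank search term given above can also be assembled programmatically, for example when the same query is needed for several taxa. The following Python sketch is only an illustration: the function name genbank_query is hypothetical, and expressing the length restriction with the Entrez field tag [SLEN] is an assumption, since the tutorial sets the sequence length separately from the search term.

```python
def genbank_query(taxon, gene_aliases, min_len=None, max_len=None):
    """Build a nucleotide search term of the form 'taxon AND (alias OR ...)'."""
    genes = " OR ".join(gene_aliases)
    query = f"{taxon} AND ({genes})"
    # Assumption: the length filter is written as an Entrez range on [SLEN].
    if min_len is not None and max_len is not None:
        query += f" AND {min_len}:{max_len}[SLEN]"
    return query


# Search term for the Trichoptera example, restricted to 1-1000 bp.
term = genbank_query("Trichoptera", ["COI", "CO1", "COX1", "COXI"], 1, 1000)
```

Listing several aliases for the gene name matters because GenBank records are annotated inconsistently, which is why the search term in the text combines COI, CO1, COX1 and COXI.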
Combining NCBI and BOLD databases for OTU assignment in metabarcoding and metagenomic datasets: The BOLD_NCBI _Merger
Metabarcoding, metagenomics and bioinformatics
The script was developed in the context of the European Cooperation in Science and Technology (COST) Action DNAqua-Net (CA15219).
Conceived and designed the study: JNM; Wrote the script: JNM, THM; Analysed the data: JNM, THM, FL; Wrote the paper: JNM, THM, FL
The supplementary material contains the BOLD_NCBI_Merger script, the required folder structure and the tutorial explaining how to use the script.