Metabarcoding and Metagenomics : Software Description
Combining NCBI and BOLD databases for OTU assignment in metabarcoding and metagenomic datasets: The BOLD_NCBI_Merger
Jan-Niklas Macher, Till-Hendrik Macher, Florian Leese
‡ Aquatic Ecosystem Research, University of Duisburg-Essen, Essen, Germany
Open Access

Abstract

Background

Metabarcoding and metagenomic approaches are becoming routine techniques for use in biodiversity assessment and in ecological studies. The assignment of taxonomic information to millions of sequences obtained via high-throughput sequencing is challenging, as many DNA reference libraries lack information on certain taxonomic groups and can contain erroneous sequences. Combining different reference databases is therefore a promising approach for maximising taxonomic coverage and reliability of results.

New information

The “BOLD_NCBI_Merger” bash script is introduced, which combines sequence data obtained from the National Center for Biotechnology Information (NCBI) GenBank and the Barcode of Life Database (BOLD) and prepares them for taxonomic assignment with the software MEGAN.

Keywords

Biodiversity, High-throughput sequencing, Operational taxonomic unit, software, MEGAN, script, taxonomic assignment

Introduction

Background

High-throughput biodiversity assessment techniques such as metagenomics (Yu et al. 2012) and metabarcoding (Taberlet et al. 2012) produce millions of sequences in a short amount of time. These techniques are becoming standard in many fields of research (Deiner et al. 2015, Choo et al. 2017, Macher et al. 2017), as well as in applied settings (Elbrecht et al. 2017). One of the challenges in analysing millions of DNA sequences is the assignment of the obtained Operational Taxonomic Units (OTUs) to taxonomic names. Taxonomic information is often needed, especially in ecological studies and for the assessment of ecosystem health, which is largely based on knowledge of species’ ecological traits (Gayraud et al. 2003, Hering et al. 2006). Several existing databases contain millions of DNA reference sequences, which can be used to assign taxonomic names to OTUs (Santamaria et al. 2012). However, these databases are often specialised, each containing mostly data for certain genetic markers (e.g. rRNA: SILVA, Quast et al. 2012) or selected taxonomic groups (e.g. fungi: UNITE, Kõljalg et al. 2005). Two of the largest reference databases are the Barcode of Life Database (BOLD, Ratnasingham and Hebert 2007), which contains mostly cytochrome c oxidase I (COI) sequences of Metazoa, and the National Center for Biotechnology Information (NCBI) GenBank database (Benson et al. 2012), which contains reference sequences of taxa from all domains of life. Sequence data are available for download via websites and/or command line applications and can be used for taxonomic assignment via different tools. This is a standard approach in metabarcoding and metagenomic studies, as it is not feasible to identify millions of sequences one by one. For the identification of sequences from metabarcoding studies targeting metazoan taxa, the BOLD Identification API (http://www.boldsystems.org/index.php/resources/api?type=idengine) is often used (e.g. Prosser et al. 2017, Kranzfelder et al. 2015). BLAST+ (Camacho et al. 2009) searches against the NCBI GenBank are often used for the identification of non-metazoan sequences obtained through metagenomic approaches (Hasan et al. 2014, Shi et al. 2013), as well as for confirming results of searches against the BOLD database (Kranzfelder et al. 2015, Elbrecht and Leese 2015). Web tools and APIs that access databases remotely tend to be slow, making the identification of millions of sequences and OTUs a time-consuming task. In addition, the BOLD database is restricted in scope and does not contain all sequences that are deposited in the NCBI GenBank, owing to its focus on genetic barcodes of metazoan taxa of a certain length (several hundred base pairs). On the other hand, the reliability of information in the curated BOLD database is expected to be higher than that in the NCBI database, although errors do occur (e.g. Lis et al. 2016). Conversely, the NCBI GenBank does not include all sequences available in the BOLD database, as not all scientists submit their sequences to both databases.

Studies have shown that both databases can be used to successfully identify metazoan taxa (Sonet et al. 2013), but uncertainties remain. Metagenome sequencing studies and metabarcoding studies using degenerate primers are known to produce data not only from the targeted microbial or metazoan taxa, but from across the tree of life (Capra et al. 2016, Macher and Leese 2017, Horton et al. 2017). For such studies, taxonomic assignment with the BOLD database alone will result in a loss of information, as many non-metazoan taxa cannot be identified. Using only the NCBI GenBank can circumvent this problem, but at the cost of losing information on metazoan taxa and of lowered accuracy. Combining information from both databases therefore improves the speed of identification, the reliability of results and the taxonomic coverage. However, although this is theoretically possible, studies currently do not combine databases directly in order to improve the speed and accuracy of analyses. This might be partly due to the large amount of data that needs to be downloaded on to a local hard drive and to the reformatting needed to make the data compatible, which requires basic bioinformatic skills. Several tools for analyses and taxonomic assignment of sequences downloaded from reference databases are available and could theoretically be used with combined databases, e.g. RDP Classifier (Wang et al. 2007), KRAKEN (Wood and Salzberg 2014), SPINGO (Allard et al. 2015) and MEGAN (Huson et al. 2007).

The “BOLD_NCBI_Merger”, a bash script that builds databases containing sequence data from both BOLD and NCBI GenBank, is introduced here. The tutorial accompanying the script (Suppl. material 1) explains how to download the data and prepare them for analyses in the MEGAN software. The resulting database can also be used with software other than MEGAN. MEGAN implements a lowest common ancestor (LCA) approach for taxonomic assignment of sequences and was originally developed for analyses of metagenomic datasets (Huson et al. 2007), but the LCA approach can also be used for taxonomic assignment of sequences obtained through metabarcoding (Hänfling et al. 2016, Horton et al. 2017).
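The idea behind an LCA assignment can be illustrated with a minimal sketch. It is not MEGAN's implementation, which works on its own taxonomy and applies additional filters to the BLAST hits. The sketch assumes that each hit of a query sequence has already been translated into a lineage string with ranks separated by ";" (a hypothetical input format), and the function name lca is likewise hypothetical. The query is assigned to the deepest rank path that all hits share.

```shell
#!/usr/bin/env bash
# lca: read one lineage per line from standard input (ranks separated by ';')
# and print the longest rank path shared by all lineages.
# Illustrative only; the input format is an assumption, not MEGAN's.
lca() {
    awk -F';' '
        NR == 1 {
            # the first lineage defines the initial candidate path
            n = NF
            for (i = 1; i <= NF; i++) { path[i] = $i }
            next
        }
        {
            # shorten the candidate path at the first rank that differs
            for (i = 1; i <= n && i <= NF; i++) {
                if ($i != path[i]) { break }
            }
            n = i - 1
        }
        END {
            out = ""
            for (i = 1; i <= n; i++) { out = out (i > 1 ? ";" : "") path[i] }
            print out
        }
    '
}

# Usage: printf '%s\n' "Insecta;Trichoptera;FamilyA" "Insecta;Trichoptera;FamilyB" | lca
```

If the hits of a sequence belong to different families of the same order, the sequence is assigned at order level only, which is why an incomplete reference database leads to coarser or misleading assignments.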

Technical specification

Prior to analyses, BLAST+ (v. 2.6), vsearch (Rognes et al. 2016) and MEGAN need to be installed. All analyses described in the tutorial were conducted on a Mac running OS X Yosemite 10.10.5.
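Whether the command line tools are available can be checked before running the script. The following is a minimal sketch and not part of the published script. The helper name missing_tools is hypothetical; makeblastdb, blastn and vsearch are the executable names of standard BLAST+ and vsearch installations. MEGAN is a graphical application whose launcher name depends on the installation, so it is not checked here.

```shell
#!/usr/bin/env bash
# missing_tools: print the name of every tool in the argument list that
# cannot be found on the PATH (hypothetical helper, for illustration only).
missing_tools() {
    local tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
        fi
    done
}

# Usage: list whatever still needs to be installed.
# missing_tools makeblastdb blastn vsearch
```

An empty output means that all listed tools were found.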

The bash script “BOLD_NCBI_Merger” first concatenates the files downloaded from BOLD and from NCBI, respectively. Then, COI sequences are extracted from the downloaded BOLD fasta file. COI is the most widely used gene for barcoding of metazoan taxa and most sequences deposited in the BOLD database are COI sequences. However, a few sequences of other markers (e.g. 18S rRNA) are also deposited in BOLD. These are sometimes downloaded together with the COI sequences and need to be removed in order for the script to work properly. Headers of both the BOLD and NCBI files are formatted so that vsearch can dereplicate the sequences without truncating the header. Then, vsearch is used to dereplicate the sequences in order to prevent over-representation of sequences in the final database. In the next step, the headers are formatted so that MEGAN can identify species names. A local BLAST database is built from the concatenated BOLD and NCBI dataset. Finally, a BLAST search against the database is performed with a metabarcoding or metagenomics dataset. The resulting txt file can be imported into MEGAN, from which the taxonomic assignments can subsequently be exported.
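The steps described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the published script, and the commands in the tutorial take precedence. It assumes that BOLD fasta headers have the layout ">processid|taxon|marker|accession", and it replaces spaces in headers with underscores as one possible way of keeping vsearch from cutting labels at the first space. All function and file names (keep_coi, underscore_headers, build_and_search, merged.fasta, merged_db and so on) are hypothetical. The step that reformats headers for MEGAN is left out because it depends on the header layout described in the tutorial.

```shell
#!/usr/bin/env bash
# keep_coi: keep only fasta records whose third '|'-separated header field
# starts with "COI" (assumed BOLD layout: >processid|taxon|marker|accession).
keep_coi() {
    awk -F'|' '/^>/ { keep = ($3 ~ /^COI/) } keep' "$@"
}

# underscore_headers: replace spaces in header lines with underscores so
# that the full label survives tools that cut labels at the first space.
underscore_headers() {
    sed '/^>/ s/ /_/g' "$@"
}

# build_and_search BOLD_FASTA NCBI_FASTA QUERY_FASTA
# Requires vsearch and BLAST+; writes its output to the working directory.
build_and_search() {
    local bold=$1 ncbi=$2 query=$3

    # 1. extract COI records from BOLD, append the NCBI records, fix headers
    keep_coi "$bold" | cat - "$ncbi" | underscore_headers > merged.fasta

    # 2. dereplicate identical sequences
    vsearch --derep_fulllength merged.fasta --output merged_derep.fasta

    # 3. build a local nucleotide BLAST database
    makeblastdb -in merged_derep.fasta -dbtype nucl -out merged_db

    # 4. search the query OTUs against the local database
    blastn -query "$query" -db merged_db -out blast_results.txt
}

# Usage (file names are placeholders):
# build_and_search bold_trichoptera.fasta ncbi_trichoptera.fasta otus.fasta
```

The two filter functions only rely on awk and sed, so they can be tried on a handful of records before the larger downloads are processed.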

The detailed tutorial including all commands can be found in supplementary material 1. The package including the script used for processing and preparing sequence files can be found in supplementary material 2. Sequence data for the tutorial can be obtained from BOLD and NCBI GenBank. All Trichoptera sequences (used here as an example) can be downloaded as one fasta file from BOLD via the Public Data Portal (http://www.barcodinglife.org/index.php/Public_BINSearch?searchtype=records; search term: “Trichoptera”, “Public Data”). All Trichoptera sequences from GenBank can be downloaded from the nucleotide database (https://www.ncbi.nlm.nih.gov/nucleotide/; search term: "Trichoptera AND (COI OR CO1 OR COX1 OR COXI)"; sequence length (e.g.): 1-1000 bp) and saved on a local hard drive. Sequences of markers other than COI can be processed as long as the data format is the same as for the COI data.
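After downloading, the content of the BOLD file can be inspected before it is processed. The following sketch tallies the records per marker. It assumes the BOLD header layout ">processid|taxon|marker|accession", and the function name marker_counts is hypothetical.

```shell
#!/usr/bin/env bash
# marker_counts: print "marker<TAB>count" for every marker named in the
# third '|'-separated header field of a BOLD fasta file (assumed layout).
marker_counts() {
    awk -F'|' '/^>/ { n[$3]++ } END { for (m in n) print m "\t" n[m] }' "$@" |
        LC_ALL=C sort
}

# Usage (file name is a placeholder):
# marker_counts bold_trichoptera.fasta
```

Markers other than the target marker that show up in this list are the records that need to be removed before the database is built.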

For ease of use, a dataset containing only a few sequences (Trichoptera, COI barcoding region) was used for this tutorial. For reliable results in real analyses, however, a larger reference database containing as many taxa as possible should be used in order to prevent erroneous assignments (Porter et al. 2014, Garcia-Etxebarria et al. 2014, Ueno et al. 2014). In-depth studies comparing different software for taxonomic assignment and different combinations of databases should be conducted in order to quantify the benefits and possible pitfalls of combining data from several databases. The approach of assigning taxonomy to OTUs by using local databases also has limitations. As the created database is stored on a local hard drive, it does not receive automated updates and will become outdated. Thus, the database needs to be updated on a regular basis. This requires some effort, since several gigabytes of data need to be downloaded from the NCBI and BOLD databases, a process which can take several hours. Processing large amounts of data locally also requires machines powerful enough to complete the task within a reasonable amount of time. Still, combining databases will be worth the effort for many studies targeting diverse biological communities, as taxonomic assignment is fast and reliable once the local databases have been constructed and the information gained can help improve results.
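Because a local database does not update itself, a simple age check can serve as a reminder to rebuild it. The sketch below is an illustration only; the function name db_is_stale, the file name and the 90-day limit are hypothetical and should be adapted to the study.

```shell
#!/usr/bin/env bash
# db_is_stale FILE DAYS: succeed (exit status 0) if FILE was last modified
# more than DAYS days ago, as evaluated by find's -mtime test.
db_is_stale() {
    local file=$1 days=$2
    [ -n "$(find "$file" -prune -mtime +"$days" 2>/dev/null)" ]
}

# Usage (file name and limit are placeholders):
# if db_is_stale merged_derep.fasta 90; then
#     echo "Reference database is older than 90 days; consider rebuilding it."
# fi
```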

Project description

Title: 

Combining NCBI and BOLD databases for OTU assignment in metabarcoding and metagenomic datasets: The BOLD_NCBI_Merger

Study area description: 

Metabarcoding, metagenomics and bioinformatics

Web location (URIs)

Technical specification

Platform: 
Unix
Programming language: 
Bash
Operating system: 
Linux, macOS

Usage rights

Use license: 
Open Data Commons Attribution License

Acknowledgements

The script was developed in the context of the European Cooperation in Science and Technology (COST) Action DNAqua-Net (CA15219).

Author contributions

Conceived and designed the study: JNM; Wrote the script: JNM, THM; Analysed the data: JNM, THM, FL; Wrote the paper: JNM, THM, FL

References

Supplementary material

Suppl. material 1: BOLD_NCBI_Merger script & tutorial 
Authors:  Jan-Niklas Macher, Till-Hendrik Macher, Florian Leese
Data type:  BOLD_NCBI_Merger script & tutorial
Brief description: 

The supplementary material contains the BOLD_NCBI_Merger script, the required folder structure and the tutorial explaining how to use the script.
