BOLDigger – a Python package to identify and organise sequences with the Barcode of Life Data systems

DNA metabarcoding workflows produce hundreds to tens of thousands of Operational Taxonomic Units (OTUs) or Exact Sequence Variants (ESVs) per analysis. In most workflows, these sequences need a taxonomic assignment, which is typically made using publicly available databases. Especially, yet not exclusively, for Eumetazoan metabarcoding, the Barcode of Life Data system (BOLD) is the most comprehensive and curated reference barcode database and, therefore, typically the first choice for taxonomic assignment. While an application programming interface (API) exists to query data in large batches, it returns no information on the many important unpublished records. The alternative approach, using the BOLD identification engine on the website, provides full access, yet it is restricted to 100 sequences at once. We developed a small, platform-independent software package with a graphical user interface (GUI), BOLDigger, which aims to solve this problem by automating the process of sending successive requests of up to 100 sequences without surpassing the capacities of BOLD. BOLDigger can be used to download the results of the identification engine, as well as metadata for the obtained hits. Three different methods are implemented for selecting the best-fitting hit, including a new approach that combines similarity thresholds with the metadata information.


Introduction
DNA metabarcoding is a cost- and time-effective method to assess species diversity of bulk or environmental samples (Taberlet et al. 2012; Yu et al. 2012; Elbrecht and Steinke 2019). DNA metabarcoding datasets often consist of hundreds or even thousands of Operational Taxonomic Units (OTUs) or Exact Sequence Variants (ESVs), which need to be queried against databases to assign taxonomy. The Barcode of Life Data System (BOLD) offers such a database with more than 7 million reference sequences (barcodes) for the primary barcode sequence in the animal kingdom, the mitochondrial cytochrome c oxidase I gene fragment (COI) (Ratnasingham and Hebert 2007). The database also supports plant reference barcodes, with about 500,000 sequences of the ribulose bisphosphate carboxylase and maturase K genes (rbcL & matK), and fungal reference barcodes, with about 150,000 sequences of the Internal Transcribed Spacer region (ITS).
The BOLD Identification System (IDS) can be used to identify an unknown query sequence via the website or the provided (fast) API by tracing and returning the nearest neighbours to the query sequence from a global alignment of all reference sequences (Ratnasingham and Hebert 2007). The identification engine of the website is limited to 100 sequences at once. The faster API has a different downside: it only provides access to published COI records, whereas the website also provides private and early release data, which represent about 50% of all records on BOLD (Weigand et al. 2019). Even though these records are less trustworthy, since the underlying data are not accessible, they still hold valuable information that can be used for sequences that lack publicly available reference data. While it is assumed that, with growing data, the IDS will deliver a definite species-level hit for a given sequence (Ratnasingham and Hebert 2007), this is still very often not the case. For example, Weigand et al. (2019) showed that, of the 4504 freshwater macroinvertebrates used for routine monitoring in Europe, only about 65% of the species are represented by at least one barcode, showing that there are still large gaps to fill, even for important groups like freshwater macroinvertebrates. Therefore, BOLD applies conservative rules, relying solely on sequence similarity, to return a so-called top-hit and gives access to all available information about the chosen reference sequence (Ratnasingham and Hebert 2007). Most often, the chosen top-hit is simply the first hit of the first 99 nearest neighbours, even if there are other records with a similarity above 99%. To avoid this, a threshold-based approach that uses thresholds for different taxonomic ranks and also considers the metadata was implemented in BOLDigger.
Sequence similarity thresholds are used for taxonomic assignment across all domains of life (Hebert et al. 2003; Venter et al. 2004; Fazekas et al. 2008). Although they have been criticised as not applicable to all taxonomic groups and amplicon lengths (Mahé et al. 2015) or as differing between taxonomic groups (Kvist 2016; Meyer and Paulay 2005), they have strong empirical support, especially at species level, for large groups such as birds, fish and several insect orders (Hebert et al. 2003; Virgilio et al. 2010; Ward et al. 2005). For genus, family, order and higher ranks, the sequence similarities differ between taxonomic groups, due to different evolutionary histories and mutation rates. However, by using conservative threshold values for the different taxonomic levels, false positives can effectively be prevented, at the cost of taxonomic resolution (Ratnasingham and Hebert 2007). More comprehensive reference databases can reduce this trade-off.
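The idea of rank-specific similarity thresholds can be illustrated with a minimal Python sketch. The threshold values, the dictionary layout of a hit and the function name assign_by_threshold are assumptions made for this illustration only; they are not taken from the text and do not describe how BOLDigger itself is implemented.

```python
# Illustrative similarity thresholds (percent) per taxonomic rank.
# ASSUMPTION: these values are placeholders chosen for this sketch.
THRESHOLDS = [
    (98.0, "species"),
    (95.0, "genus"),
    (90.0, "family"),
    (85.0, "order"),
]

# Ranks ordered from coarse to fine.
RANKS = ["phylum", "class", "order", "family", "genus", "species"]


def assign_by_threshold(hit, thresholds=THRESHOLDS, fallback="class"):
    """Truncate a hit's taxonomy to the finest rank its similarity supports.

    `hit` is a hypothetical dict with a "similarity" value (percent)
    and one entry per taxonomic rank.
    """
    supported = fallback
    for minimum, rank in thresholds:
        if hit["similarity"] >= minimum:
            supported = rank
            break
    # Keep every rank down to and including the supported one.
    keep = RANKS[: RANKS.index(supported) + 1]
    return {rank: hit.get(rank) for rank in keep}


example_hit = {
    "similarity": 96.4,
    "phylum": "Arthropoda",
    "class": "Insecta",
    "order": "Ephemeroptera",
    "family": "Baetidae",
    "genus": "Baetis",
    "species": "Baetis rhodani",
}
assignment = assign_by_threshold(example_hit)
```

With the placeholder thresholds above, a hit with 96.4% similarity is reported down to genus only, which shows how a conservative threshold trades taxonomic resolution for fewer false species-level assignments.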
The presented Python package BOLDigger aims to act as an interface for species identification and for downloading and organising additional data. As a platform-independent, open-source tool, it can be used to collect IDS results from BOLD, including private and early release data. It also provides the user with additional data for all public references in the dataset and implements a safer way to determine the top-hit by combining a threshold-based approach with the additional information provided by BOLD. To improve user-friendliness, BOLDigger comes with a GUI (Fig. 1).

Package description
The Python package BOLDigger (version 1.1.5) is available from the Python Package Index (PyPI) at https://pypi.org/project/boldigger/. It can be installed using the Python package installer (pip) with the command pip install boldigger. If both Python 2 and Python 3 are installed on the operating system, the correct version of pip has to be used (pip3 install boldigger). All operating systems (Windows, Linux and macOS) are supported, as long as Python 3 is installed. After installation, BOLDigger can be started with the command boldigger from the command line. Updates can be downloaded and installed with the command pip install --upgrade boldigger. Further information about installation, the current version and troubleshooting is provided via the GitHub repository page (https://github.com/DominikBuchner/BOLDigger).
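To confirm which version pip has installed, the package metadata can be queried from Python. The following is a minimal sketch using only the standard library (importlib.metadata, available from Python 3.8); the helper name installed_version is hypothetical and not part of BOLDigger.

```python
from importlib import metadata


def installed_version(package="boldigger"):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
if version is None:
    print("boldigger is not installed in this Python environment")
else:
    print(f"boldigger version {version} is installed")
```

If the function reports that the package is missing although pip install boldigger succeeded, pip most likely installed it for a different Python interpreter than the one running the check.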
BOLDigger comes with a GUI for easy operation (Fig. 1). All output is saved to the selected output folder. Since a login is required to use the IDS for more than one sequence, BOLDigger requires the user data of a BOLD account. BOLDigger can query all three databases of BOLD by using the "BOLD identification engine" command. The batch size controls the number of sequences queried at once (e.g. with a batch size of 100, a FASTA file containing 1000 sequences results in 10 successive requests). All results are saved to an Excel file. This file can be used to download additional data with the "Search for additional data" command; the additional data are added to the same file. The "Add a list of top hits" command adds a list of top-hits, determined with one of the different methods, to a new worksheet of the result file. For a detailed description of the different methods, please consult the GitHub repository (https://github.com/DominikBuchner/BOLDigger).
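The relationship between batch size and the number of requests can be illustrated with a short Python sketch that splits FASTA records into batches. It does not contact BOLD, and the function names read_fasta and batches are assumptions for this illustration; the text does not describe how BOLDigger handles batching internally.

```python
def read_fasta(lines):
    """Yield (header, sequence) tuples from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def batches(records, batch_size=100):
    """Group records into lists of at most `batch_size` items."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # The last batch may be smaller than batch_size.
        yield batch


# Synthetic input standing in for a FASTA file with 1000 sequences.
fasta_lines = []
for i in range(1000):
    fasta_lines.append(f">seq{i}")
    fasta_lines.append("ACGTACGT")

request_batches = list(batches(read_fasta(fasta_lines), batch_size=100))
print(len(request_batches))  # number of successive requests that would be needed
```

With real data, the list of lines would be replaced by an open file handle, since read_fasta accepts any iterable of lines.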

Conclusions
BOLDigger is a platform-independent GUI software package that allows users to query metabarcoding data against the BOLD sequence database in a simple fashion. It facilitates data analysis and provides alternative approaches for the assignment of the best hit.

Project description
Title: BOLDigger – a Python package to identify and organise sequences with the Barcode of Life Data systems

Author contributions
Conceived and designed the study: DB; Wrote the Python package: DB; Wrote the paper: DB, FL