Alignment-free classification of COI DNA barcode data with the Python package Alfie

Characterization of biodiversity from environmental DNA samples and bulk metabarcoding data is hampered by off-target sequences that can confound conclusions about a taxonomic group of interest. Existing methods for isolation of target sequences rely on alignment to existing reference barcodes, but this can bias results against novel genetic variants. Effectively parsing targeted DNA barcode data from off-target noise improves the quality of biodiversity estimates and biological conclusions by limiting subsequent analyses to a relevant subset of available data. Here, we present Alfie, a Python package for the alignment-free classification of cytochrome c oxidase subunit I (COI) DNA barcode sequences to taxonomic kingdoms. The package determines k -mer frequencies of DNA sequences, and the frequencies serve as input for a neural network classifier that was trained and tested using ~58,000 publicly available COI sequences. The classifier was designed and optimized through a series of tests that allowed for the optimal set of DNA k -mer features and optimal machine learning algorithm to be selected. The neural network classifier rapidly assigns COI sequences of varying lengths to kingdoms with greater than 99% accuracy and is shown to generalize effectively and make accurate predictions about data from previously unseen taxonomic classes. The package contains an application programming interface that allows the Alfie package’s functionality to be extended to different DNA sequence classification tasks to suit a user’s need, including classification of different genes and barcodes, and classification to different taxonomic levels. Alfie is free and publicly available through GitHub ( https://github.com/CNuge/alfie) and the Python package index (https://pypi.org/project/alfie/ ).


Introduction
Biodiversity is declining across the globe. Millions of species face the threat of extinction, and ecosystems are being irreversibly altered due to loss of biomass and changes in species composition (Barnosky et al. 2011;Ceballos et al. 2015). To maintain the health of ecosystems and curb biodiversity loss, informed conservation and management practices are required. Achievement of conservation goals is limited by a lack of fundamental information about species composition for many of the world's ecosystems. It is therefore imperative that technological solutions are developed to enable the accurate and efficient characterization of the world's biodiversity, so that existing species can be catalogued, and informed conservation strategies can be developed to protect the planet's ecosystems.
The field of DNA barcoding offers a technological solution to the problem of taxonomically classifying organismal specimens (Hebert et al. 2003). Instead of relying on laborious and error-prone phenotypic classifications, sequence diversity within standardized gene regions is used to enable both specimen identification and species discovery (Hebert et al. 2003;Ratnasingham and Hebert 2007;Hubert and Hanner 2015). The field has advanced from the barcoding of single specimens to the bulk analysis of samples (Hajibabaei et al. 2011(Hajibabaei et al. , 2016Taberlet et al. 2012a, b;Cristescu 2014), known as metabarcoding, as well as multi-marker (Stefanni et al. 2018) and metagenomics approaches (Cuvelier et al. 2010). These methods have been applied in environmental biomonitoring, where multiple species are identified at once through the collection of environmental DNA (eDNA) (Taberlet et al. 2012a). Despite the widespread adoption of these techniques, a fundamental problem persists: the accurate and repeatable characterization of biodiversity from eDNA and bulk-sample metabarcoding data is difficult, and conclusions drawn from analyses are strongly affected by methodological decisions (Clare et al. 2016;Braukmann et al. 2019).
Environmental biomonitoring often aims to answer ecological questions through the targeted examination of a taxonomic group of interest. DNA barcodes from a group of focus are targeted using group-specific PCR primers for one or more selected marker genes in the PCR amplification step that precedes high-throughput sequencing (Braukmann et al. 2019;Wilson et al. 2019). Some commonly used primers are overly general, which results in the amplification of non-target barcodes, introducing noise into data and confounding efforts to characterize true species composition for targeted taxonomic groups (Brandon-Mong et al. 2015;Zinger et al. 2019). The characterization of biodiversity can be further confounded by intra-group PCR bias, where the over representation of certain taxa within the target group can result in other taxa being overlooked due to poorer amplification and sequencing coverage (Elbrecht and Leese 2015).
Shotgun sequencing of eDNA overcomes the primer issues of eDNA metabarcoding but also produces substantial sequencing noise and sequences from non-standardized genomic regions (Stat et al. 2017;Wilson et al. 2019). A trade-off therefore exists; shotgun sequencing overcomes the amplification bias associated with PCR, but the majority of shotgun sequencing outputs cannot be assigned even high-level taxonomic classifications with confidence (Stat et al. 2017;Singer et al. 2020). Despite present technical limitations, eDNA shotgun sequencing and other next-generation biomonitoring techniques are seeing increased adoption thanks to their potential to characterize biodiversity more broadly (Makiola et al. 2020). Within this next generation of biomonitoring methodologies, tools leveraging machine-learning algorithms and available data will be essential to overcoming the limitations associated with existing methods (Cordier et al. 2019).
The detection of the presence and abundance of species from a specific group is hampered by off-target barcodes that are amplified and sequenced in metabarcoding analysis. Traditionally, the characterization of biodiversity via metabarcoding samples was dependent on the alignment of sequences against a pre-defined set of reference barcodes via methods such as BLAST (Altschul et al. 1990). This method of isolating sequences of interest was limited by the pairwise comparison of novel sequences to all reference entries, which is inefficient when query or reference datasets are large, and is biased against novel genetic variants not present in the reference set. More efficient means of classifying metabarcoding sequences have been developed (Wang et al. 2007;Bengtsson et al. 2011;Weitschek et al. 2014;Bengtsson-Palme et al. 2015), such as Metaxa2, which relies on the comparison of query sequences against pre-trained hidden Markov models (HMMs), which serve as probabilistic representations of sequences from different taxonomic groups (Bengtsson-Palme et al. 2015). This allows for more efficient classification of sequences, as query sequences do not need to be aligned to each reference sequence individually and only need to be evaluated against a much smaller set of HMMs (each of which represents a multiple sequence alignment of numerous reference sequences). This reduces the number of pairwise sequence comparisons required for taxonomic assignment.
Alignment-free methods have been widely applied in biological sequence annotation and classification problems (Abnousi et al. 2016;Zielezinski et al. 2019;Cordier et al. 2018). Alignment-free comparison is defined as any method of quantifying sequence similarity that does not produce an alignment; these methods are generally less computationally intensive and can be as effective as conventional alignments (Bonham-Carter et al. 2014;Zielezinski et al. 2019). To compare sequences without alignment, features must be extracted from sequences in order to characterize their structure. One common set of alignment-free features is DNA k-mer counts, where the number of occurrences of fixed-length DNA words of length k are quantified (Vinga and Almeida 2003;Crusoe et al. 2015). These features can be used as inputs for machine learning models trained to predict classifications such as the taxonomic designation associated with sequences (Solis-Reyes et al. 2018). Machine learning models that operate on k-mer input features have previously been applied in DNA barcode sequence classification and other predictive tasks (Kuksa and Pavlovic 2009;Langenkämper et al. 2014;Ainsworth et al. 2017;Cordier et al. 2017). The application of these tools is often limited to specific taxonomic classification tasks (Kuksa and Pavlovic 2009), or they rely on user-provided sets of sequence data for model training (Langenkämper et al. 2014).
The goals of this study were to develop a high-level alignment-free taxonomic classification tool for metabarcoding and environmental DNA marker gene data. This tool was initially designed for the kingdom-level classification of barcode sequences from the most common animal barcode, a region of the mitochondrial cytochrome c oxidase subunit I (COI) gene. To achieve this, we explored different feature sets (k-mer sizes) and machine learning algorithms to determine the optimal machine learning architecture for alignment-free barcode classification. To make the tool accessible to other researchers, we developed the Python package Alfie. Within Alfie, we also developed an application programming interface (API) to facilitate the construction and testing of customized alignment-free classifiers for any barcode, gene, or taxonom-ic group of interest. Alfie is free and publicly available through GitHub (https://github.com/CNuge/alfie) and the Python package index (https://pypi.org/project/alfie/).

Data acquisition
The Barcode of Life Data system (BOLD) (Ratnasingham and Hebert 2007) was queried to obtain all publicly available sequences for the DNA barcode: cytochrome c oxidase subunit I (COI) (https://github.com/CNuge/data-alfie). Sequences were filtered to ensure a minimum length of 300 base pairs (bp). The five kingdom-level classifications used by the BOLD database (Animal, Bacteria and Archaea, Fungi, Plant, Protist) were maintained and utilized as the labels in subsequent classifier development. As a result of BOLD's mandate to catalogue animal biodiversity, the database displays a significant sampling bias towards the animal kingdom. To ensure that models could be trained effectively and not be biased towards animal classification, down sampling of the animal data was performed to ensure more even representation of sequences among kingdoms. Stratified sampling of animal sequences was performed to obtain a representative subsample of 0.2% of the total set of sequences available (sequences were sampled proportionally on the taxonomic level: class; a sample size of 0.2% was chosen as this yielded a set of animal sequences roughly equal to the kingdom with the second highest number of available COI barcodes, plants) (Table 1). To train models to be robust to variable data quality and barcode sequence coverage, each individual barcode sequence was randomly subsampled, with a 200-600 base pair subsection of the complete barcode being retained at random and subsequently utilized in model training and testing.
Prior to splitting the data into a train and test set, a validation set was created to provide a stringent test of the final models' ability to make external predictions. From each kingdom, a complete taxonomic class was withheld to create the validation set and simulate rare or previously unseen sequences that the classification algorithms saw no examples of during training. The class withheld from each kingdom was chosen manually, with selection being based on the distribution of barcodes across the taxonomic classes of the given kingdom. Barcode distribution was variable across kingdoms, so no suitable rule-based selection method was found. Classes with intermediate representation levels within their kingdom were chosen to provide good sample sizes for subsequent classification tests without grossly detracting from the size of available training data. For the protist kingdom, two classes were selected for inclusion in the validation set due to small intra-class barcode counts. The composition of the final validation set is described in Table 2. After the validation set was withheld, the remaining data were randomly split into a train and test (stratified split on level: kingdom), with 80% of the data comprising the training set, and the other 20% being withheld as the test set composed of a taxonomically diverse set of sequences (Table 2; Suppl. material 1: File S1).

Feature set evaluation -k-mer size
Following the train-test split, different sets of alignment-free features were generated, and the accuracy of kingdom-level classifications by the resulting models was tested. For barcode sequences in the training set, k-mer frequencies were generated for values of k from 1 to 6.
K-mer frequencies (count of a given k-mer divided by the total number of k-mers counted in a given barcode) were used as model inputs, so as to standardize the scale of input values and also ensure the models were robust to input sequences of different lengths. For each k-mer feature set, deep neural networks with five hidden neuron layers were trained and evaluated through 5-fold cross validation (neural networks implemented using the package Tensorflow Version 2.1.0, Abadi et al. 2016). The choice of deep neural network-based classifiers with five hidden neuron layers was based on exploratory data analysis and preliminary model construction that showed this architecture to produce effective classifiers. The number of neurons in the hidden layers of the neu- Table 1. The numbers of COI barcode sequences obtained from BOLD for each kingdom and the number of sequences retained within different data sets used in development of the Alfie package. The raw barcode counts represent the complete set of publicly available sequences for the given kingdom. The 'Barcodes utilized' column is the total number of sequences used in the analysis for the given kingdoms after filtering based on minimum sequence length and down sampling to decrease imbalanced representation of the different kingdoms. The breakdown of these sequences between the train, test, and validation data sets is also shown.  ral network were adjusted according to the size of the input feature set (Table 3). The 5-fold loss and accuracy metrics for the neural networks with different k-mer inputs were compared via a one-factor analysis of variance (ANOVA) to determine if there were significant differences in classification accuracy for different feature sets (k-mer sizes) and to select an optimal value of k for further model testing.

Algorithm evaluation
After selection of the optimal k-mer size, five different machine learning models were fit using the training set and optimized through a grid search of hyperparameters. Five classification algorithms were utilized: k nearest neighbour (KNN), support vector machine (SVM), random forest (RF), extreme gradient boosting (XGB), and deep neural network (DNN). All models were deployed using the Python programming language (Version 3.7.4). The KNN, SVM, and RF models were implemented using the package scikit-learn (Version 0.21.3, Pedregosa et al. 2011), the XGB model was implemented using the package XGBoost (Version 0.90, Chen and Guestrin 2016), and the DNN was implemented using the package Tensorflow (Version 2.1.0, Abadi et al. 2016). In order to select optimal hyperparameters and optimize performance, for each algorithm a grid search was performed using scikit-learn's GridSearchCV function to train a series of models on the training data set using 5-fold cross validation (Suppl. material 2: File S2). Optimal hyperparameters were selected based on the highest classification accuracy. For the DNN, a custom grid search script was used, with 5-fold cross validation and several potential values for each of the models' respective hyperparameters (Suppl. material 3: File S3). Following the selection of optimal hyperparameter sets through the grid searches, a final version of each model was trained using the optimal set of hyperparameters and the complete training data set. Final trained models were then used to make predictions for the previously withheld test and validation sets (Tables 1, 2). Predicted classifications were compared to true values to determine the model with the highest classification accuracy. A single optimal alignment-free kingdom-level classifier was selected for inclusion in the Alfie package based on the accuracy of predictions made on the test and validation data, and confusion matrices were prepared to examine rates of misclassification and taxonomic bias. Several secondary classifier characteristics were also considered to ensure model reusability. Specifically, the file size of the trained models and the time required to make predictions were quantified to ensure that the package's memory and time requirements were not prohibitive. The Alfie package was then constructed to allow for the model to be reused in external analyses.

K-mer size
The cross-validation accuracy scores for the different neural networks and corresponding k-mer feature sets were compared to determine an optimal k-mer feature size. The results showed that the accuracy of models improved with increasing k-mer feature size, with diminishing improvements beyond k = 3 (Table 3; Figure 1). A one-factor ANOVA revealed the differences to be significant (p < 2e-16, F statistic = 318.3, DF 1,2 = 5, 24), and a subsequent Tukey's HSD test showed the accuracy of both k = 1 and k = 2 to differ significantly from all larger values of k but no significant differences in the performance of pairwise comparisons between k 3-6. A final k value of 4 was selected for subsequent tests, due to the insignificant differences between the values of k = 3 to k = 6 and the conservative choice to select a k-mer size one larger than the apparent minimal effective feature set. Table 3. The architectures of the neural networks tested in conjunction with the different k-mer feature sets. For each k-mer feature set and corresponding neural network, the average loss and accuracy scores from 5-fold cross validation on the training data (Table 1) are presented. Each neural network was comprised of a dense input layer (neuron number = number of unique k-mers, or 4 k ), five hidden layers of neurons (neuron counts for each layer given in table), and a dense output layer (neuron size equal to number of classes). The input and hidden layers utilized a rectified linear unit (relu) activation function (Agarap 2018), and the hidden layers had dropout rates of 0.3. The final output layer utilized a softmax activation function, and the models were trained using an Adam optimizer (Kingma and Ba 2014), minimizing sparse categorical cross entropy.   (Table 1). Each dot represents an accuracy score for one of the individual folds in the cross-validation corresponding to the given k-mer feature set.

Training and validation
For each of the machine learning algorithms, a grid search was used to obtain an optimal hyperparameter set (Suppl. material 3:File S3). Final models were trained using the complete training data set and then used to make predictions for the test and validation sets (Tables 1, 3). Performance on the test data (withheld barcodes from taxonomic groups otherwise represented in the training data) was strong for all models, with the lowest classification accuracy exceeding 98% (RF) and all other models exceeding 99.5% accuracy (Table 3). All models made less accurate kingdom-level predictions on the validation data (barcodes from taxonomic classes that were completely withheld during training) ( Table 4). The accuracy of predictions on the validation data was variable across models. On the validation data, the accuracy score of the RF model was 0.861, and accuracy for the KNN model was 0.927, indicating poorer generalization for these methods to previously unseen data. Each of the DNN, SVM, and XGB models had accuracy > 97% on the validation data, and the most accurate model was the DNN (0.976 on the validation data, Table 4).

Final model
The DNN (operating on 4-mer input features) was selected as the final default kingdom-level classification model for the Alfie package. The DNN provided the highest accuracy on the validation data, as well as high accuracy on the test dataset. Examination of confusion matrices for the test (Table 5) and validation (Table 6) data showed a relatively even distribution of errors across the kingdoms, and no evidence of bias among the classes. These results indicated that the model was not likely to be over fit to the training data and that it was able to generalize effectively and make predictions about data from previously unseen taxonomic classes. This generalizability of the model to rare or unseen taxa is an important feature that indicates the Alfie package can likely be used effectively in the analysis of under-studied environments where uncharacterized biodiversity is more likely to be present. The 4-mer DNN's high accuracy on the test and validation data indicated that the model can effectively capture an alignment-free taxonomic signal. The model was ro-bust to sequences of variable lengths that spanned various subsections of the COI barcode region (variable start and stop positions in the COI barcode region, as opposed to primer-standardized sub-regions). This indicates that the alignment-free classification by Alfie is an effective method for processing DNA barcoding and metabarcoding (specific subsections of the barcode region in a given study) data, and it may potentially even be applied in the future in analysis of metagenomics data (non-standardized fragments from shotgun sequencing).

Alignment-free model framework
The design and testing of the Alfie package presented here focuses on high-level (kingdom) classification for the most common animal barcode, COI. However, the Alfie package provides a robust framework that a user can easily apply to produce and test alignment-free classification tools for any taxonomic distinction, DNA barcode, or combination thereof (Suppl. material 4: File S4). As a kingdom-level classifier, Alfie acts as an effective data filter, allowing the barcode sequences from a kingdom of interest to be separated from the large amount of off-target noise common in metabarcoding or metagenomics data. The alignment-free methods can be reapplied to further home in on taxonomic targets; for example, using publicly available data (https://github.com/CNuge/data-alfie), a binary classifier can be trained and subsequently deployed with Alfie to allow for any taxonomic group Table 4. The accuracy scores for the predictions made by the five different machine learning models (trained on 4-mer frequency features and the complete training data set (Table 1)). Accuracy on the test (Table 1) and validation (  Table 5. Confusion matrix for predictions on the test set (Table 1) by the final model selected for inclusion in the Alfie package (4-mer DNN, test accuracy = 0.996, validation accuracy = 0.976). The row labels are the true classifications of the sequences (as reported by BOLD) and the column labels are the classifications made by the Alfie package. For example, the value in the fifth column of the first row (4) is the number of true animal sequences that were incorrectly classified as protist sequences by the model, while the value in the first column of the first row (4537)   of interest to be separated from a complete set of COI metabarcoding sequences. Using other publicly available data (i.e. Pruesse et al. 2007;Banchi et al. 2020), the same custom model construction and training tools in Alfie can be used to construct binary or multiclass alignment-free classification tools for other DNA barcodes or genes. Although the Alfie package is an effective alignment-free classification framework at high taxonomic levels, traditional alignments are likely more effective for lower-level classification tasks (i.e. classification to genus or species level). The k-mer frequency method used by Alfie is not likely to be effective for resolving differences between closely related species with more subtle genetic differences than those seen at higher taxonomic levels. Similarly, for taxonomic groups with few representatives and no closely related outgroups, available training data may be scant, providing a limitation in training of DNNs or other machine learning models which rely on abundant training data. The integration of alignment-based and alignment-free methods for biological sequence classification has been shown to leverage the strengths of the individual approaches to yield an efficient and accurate classification method (Borozan et al. 2015). This hybrid methodology has been utilized within other DNA classification tools such as Metaxa2, which utilizes a fast HMM-based model to conduct high-level filtration of sequences and then relies on traditional alignment for finer-scale classification of sequences (Bengtsson-Palme et al. 2015).
A similar hybrid approach using the Alfie package for filtration of sequences and subsequent alignment of sequences for a group of interest can narrow the scope of the application of alignment methods and thereby improve both analysis speed and accuracy. The Alfie package's API allows a user to extend the package to other classification tasks, as functionality is not limited to pre-defined default models or datasets (Suppl. material 4: File S4). The alignment-free model construction framework of Alfie can allow for multiple models to be trained with relative ease and applied in conjunction with one another to isolate barcode sequences of interest from large and messy inputs such as metagenomics data. We propose that effective models could likely be trained and applied to: (a) separate sequences from key mitochondrial genes from other sequences, (b) assign sequences to a barcode or gene of origin, (c) conduct kingdom-level classification for different barcode genes, and (d) conduct classification at intermediate taxonomic levels (i.e. for phylum, class, or order assignment, when sufficient training data are available). Based upon the taxonomic signal we detected, we project that this could likely be accomplished using the same 4-mer frequency data and would allow for messy inputs to be filtered and categorized. Processing of metagenomics data in this manner would allow subsequent alignment effort to be more strategically targeted, improving analysis speed and accuracy. Future comparative study and benchmarking of the Alfie package relative to existing alignment and model-based classifica-tion tools can reveal which methods (or combinations of methods) are most effective in different circumstances, such as classification for: metagenomics data, different barcodes, different taxonomic levels, or when differing amounts of reference sequences are available.

Conclusions
We have developed and tested the Python package Alfie, which extracts k-mer features and uses a neural network to make kingdom-level classifications of COI DNA barcode fragments with greater than 99% accuracy. The Alfie package can therefore be used to separate barcode data for a kingdom of interest from off-target noise, narrowing the scope of subsequent analyses to only relevant data. The model is robust to full-length barcodes and short sequence fragments and is therefore an effective classifier for use in both barcode and metabarcoding analyses. The Alfie package can be incorporated into broader analyses pipelines (Elbrecht et al. 2018;Cordier et al. 2019) and paired with tools that conduct quality control (Callahan et al. 2016;Nugent et al. 2020) and taxonomic annotation (Altschul et al. 1990;Wang et al. 2007) to characterize biodiversity from large and complex data sets. The default model of Alfie is limited to kingdom-level classification for the most common animal barcode, COI. Researchers may expand upon this narrow scope to fit custom research needs by using the training module of Alfie. This allows Alfie to be applied in different taxonomic classification tasks or for the classification of data from different DNA barcodes (where labelled training data are available). The generalized and customized nature of the Alfie package will allow for it to adapt along with the field of biodiversity genomics. As metagenomics becomes more prevalent, the Alfie package can be expanded with additional default models for tasks such as the isolation of mitochondrial DNA or sequences from specific mitochondrial genes from large, messy shotgun sequencing datasets.
Supplementary material 3 File S3 -The parameters utilized in the grid search for each of the five machine learning algorithms tested in the design of the Alfie package Authors: Cameron M. Nugent, Sarah J. Adamowicz Data type: source code Copyright notice: This dataset is made available under the Open Database License (http://opendatacommons.org/licenses/odbl/1.0/). The Open Database License (ODbL) is a license agreement intended to allow users to freely share, modify, and use this Dataset while maintaining this same freedom for others, provided that the original source and author(s) are credited. Link: https://doi.org/10.3897/mbmg.4.55815.suppl3 Supplementary material 4 File S4 -Jupyter notebook with tutorial demonstrating how to apply the Alfie classifier in the Python programming language, and how to train custom alignment-free classifiers using the Alfie training module Authors: Cameron M. Nugent, Sarah J. Adamowicz Data type: source code Copyright notice: This dataset is made available under the Open Database License (http://opendatacommons.org/licenses/odbl/1.0/). The Open Database License (ODbL) is a license agreement intended to allow users to freely share, modify, and use this Dataset while maintaining this same freedom for others, provided that the original source and author(s) are credited.