Software Description
Corresponding author: Cameron M. Nugent (nugentc@uoguelph.ca)
Academic editor: Florian Leese
© 2020 Cameron M. Nugent, Sarah J. Adamowicz.
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Nugent CM, Adamowicz SJ (2020) Alignment-free classification of COI DNA barcode data with the Python package Alfie. Metabarcoding and Metagenomics 4: e55815. https://doi.org/10.3897/mbmg.4.55815
Characterization of biodiversity from environmental DNA samples and bulk metabarcoding data is hampered by off-target sequences that can confound conclusions about a taxonomic group of interest. Existing methods for isolation of target sequences rely on alignment to existing reference barcodes, but this can bias results against novel genetic variants. Effectively parsing targeted DNA barcode data from off-target noise improves the quality of biodiversity estimates and biological conclusions by limiting subsequent analyses to a relevant subset of available data. Here, we present Alfie, a Python package for the alignment-free classification of cytochrome c oxidase subunit I (COI) DNA barcode sequences to taxonomic kingdoms. The package determines k-mer frequencies of DNA sequences, and the frequencies serve as input for a neural network classifier that was trained and tested using ~58,000 publicly available COI sequences. The classifier was designed and optimized through a series of tests that allowed for the optimal set of DNA k-mer features and optimal machine learning algorithm to be selected. The neural network classifier rapidly assigns COI sequences of varying lengths to kingdoms with greater than 99% accuracy and is shown to generalize effectively and make accurate predictions about data from previously unseen taxonomic classes. The package contains an application programming interface that allows the Alfie package’s functionality to be extended to different DNA sequence classification tasks to suit a user’s need, including classification of different genes and barcodes, and classification to different taxonomic levels. Alfie is free and publicly available through GitHub (https://github.com/CNuge/alfie) and the Python package index (https://pypi.org/project/alfie/).
COI, DNA barcoding, eDNA, environmental DNA, machine learning, metabarcoding, neural network
Biodiversity is declining across the globe. Millions of species face the threat of extinction, and ecosystems are being irreversibly altered due to loss of biomass and changes in species composition (
The field of DNA barcoding offers a technological solution to the problem of taxonomically classifying organismal specimens (
Environmental biomonitoring often aims to answer ecological questions through the targeted examination of a taxonomic group of interest. DNA barcodes from a group of focus are targeted using group-specific PCR primers for one or more selected marker genes in the PCR amplification step that precedes high-throughput sequencing (
Shotgun sequencing of eDNA overcomes the primer issues of eDNA metabarcoding but also produces substantial sequencing noise and sequences from non-standardized genomic regions (
The detection of the presence and abundance of species from a specific group is hampered by off-target barcodes that are amplified and sequenced in metabarcoding analysis. Traditionally, the characterization of biodiversity via metabarcoding samples was dependent on the alignment of sequences against a pre-defined set of reference barcodes via methods such as BLAST (
Alignment-free methods have been widely applied in biological sequence annotation and classification problems (
The goal of this study was to develop a high-level, alignment-free taxonomic classification tool for metabarcoding and environmental DNA marker gene data. The tool was initially designed for the kingdom-level classification of sequences from the most common animal barcode, a region of the mitochondrial cytochrome c oxidase subunit I (COI) gene. To achieve this, we explored different feature sets (k-mer sizes) and machine learning algorithms to determine the optimal machine learning architecture for alignment-free barcode classification. To make the tool accessible to other researchers, we developed the Python package Alfie. Within Alfie, we also developed an application programming interface (API) to facilitate the construction and testing of customized alignment-free classifiers for any barcode, gene, or taxonomic group of interest. Alfie is free and publicly available through GitHub (https://github.com/CNuge/alfie) and the Python package index (https://pypi.org/project/alfie/).
The Barcode of Life Data system (BOLD) (
Prior to splitting the data into training and test sets, a validation set was created to provide a stringent test of the final models’ ability to make external predictions. From each kingdom, a complete taxonomic class was withheld to create the validation set, simulating rare or previously unseen sequences of which the classification algorithms saw no examples during training. The class withheld from each kingdom was chosen manually, based on the distribution of barcodes across the taxonomic classes of the given kingdom. Barcode distribution varied across kingdoms, so no suitable rule-based selection method was found. Classes with intermediate representation levels within their kingdom were chosen to provide adequate sample sizes for subsequent classification tests without substantially reducing the size of the available training data. For the protist kingdom, two classes were selected for inclusion in the validation set because of their small intra-class barcode counts. The composition of the final validation set is described in Table
The numbers of COI barcode sequences obtained from BOLD for each kingdom and the number of sequences retained within different data sets used in development of the Alfie package. The raw barcode counts represent the complete set of publicly available sequences for the given kingdom. The ‘Barcodes utilized’ column is the total number of sequences used in the analysis for the given kingdoms after filtering based on minimum sequence length and down sampling to decrease imbalanced representation of the different kingdoms. The breakdown of these sequences between the train, test, and validation data sets is also shown.
Kingdom | Raw barcode count | Barcodes utilized | Train data set size | Test data set size | Validation data set size (see Table |
---|---|---|---|---|---|
Animal | 1,137,552 | 23,493 | 18,189 | 4,547 | 757 |
Bacteria and Archaea | 5,565 | 5,547 | 4,380 | 1,095 | 72 |
Fungi | 1,407 | 1,368 | 1,038 | 260 | 70 |
Plant | 22,638 | 22,599 | 18,017 | 4,505 | 77 |
Protist | 5,029 | 5,026 | 4,014 | 1,003 | 9 |
Total | 1,172,191 | 58,033 | 45,638 | 11,410 | 985 |
The taxonomic breakdown of the validation data set. For each kingdom, a taxonomic class with a near-average number of sequences in the kingdom’s whole data set was chosen for exclusion from the training set and inclusion in the validation data set. The names of the taxonomic classes and the numbers of barcode sequences withheld from training and testing for subsequent validation are shown.
Kingdom | Withheld class | Sequence count |
---|---|---|
Animal | Diplopoda | 757 |
Bacteria and Archaea | Flavobacteria | 72 |
Fungi | Leotiomycetes | 70 |
Plant | Liliopsida | 77 |
Protist | Heterotrichea and Colpodea | 9 |
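
The hold-out procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the record layout (dictionaries with `kingdom` and `class` keys), the function name `holdout_split`, and the toy data are all hypothetical, and the 20% test fraction is an assumption inferred from the train and test sizes in the table of data set sizes rather than a value stated in the text.

```python
import random
from collections import defaultdict


def holdout_split(records, withheld, test_fraction=0.2, seed=0):
    """Split records into train, test, and validation lists.

    records  : list of dicts with 'kingdom' and 'class' keys (hypothetical layout).
    withheld : dict mapping kingdom -> set of class names reserved for validation.
    test_fraction is an assumed value, not one given in the source text.
    """
    validation = []
    by_kingdom = defaultdict(list)
    for record in records:
        if record["class"] in withheld.get(record["kingdom"], set()):
            validation.append(record)  # whole class is never seen in training
        else:
            by_kingdom[record["kingdom"]].append(record)

    rng = random.Random(seed)
    train, test = [], []
    for kingdom in sorted(by_kingdom):
        # Split within each kingdom so both sets keep the kingdom proportions.
        group = by_kingdom[kingdom][:]
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test, validation


# Toy data for illustration only.
records = (
    [{"kingdom": "Animal", "class": "Insecta"}] * 10
    + [{"kingdom": "Animal", "class": "Diplopoda"}] * 3
    + [{"kingdom": "Plant", "class": "Magnoliopsida"}] * 5
    + [{"kingdom": "Plant", "class": "Liliopsida"}] * 2
)
withheld = {"Animal": {"Diplopoda"}, "Plant": {"Liliopsida"}}
train, test, validation = holdout_split(records, withheld)
print(len(train), len(test), len(validation))
```

Because the withheld classes are removed before the train-test split, accuracy on the validation set measures how well a model generalizes to a taxonomic class it has never encountered, which is a harder task than scoring well on the test set.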
Following the train-test split, different sets of alignment-free features were generated, and the accuracy of kingdom-level classifications by the resulting models was tested. For barcode sequences in the training set, k-mer frequencies were generated for values of k from 1 to 6.
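As a minimal sketch of this feature-generation step, the function below counts overlapping k-mers in a sequence and divides each count by the total number of k-mers counted, so that sequences of different lengths yield values on the same scale. The function name `kmer_frequencies` is hypothetical and this is not the Alfie implementation; in particular, skipping windows that contain ambiguous bases such as N is an assumption made here for illustration.

```python
from itertools import product


def kmer_frequencies(sequence, k):
    """Return a dict mapping every possible DNA k-mer to its frequency.

    Frequency = count of the k-mer / total number of k-mers counted.
    Windows with characters other than A, C, G, T are skipped (an assumption).
    """
    seq = sequence.upper()
    # Enumerate all 4**k k-mers up front so the feature order is fixed.
    counts = {"".join(p): 0 for p in product("ACGT", repeat=k)}
    total = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:
            counts[kmer] += 1
            total += 1
    if total == 0:
        return dict.fromkeys(counts, 0.0)
    return {kmer: n / total for kmer, n in counts.items()}


freqs = kmer_frequencies("ACGTACGTNA", k=2)
print(len(freqs))   # number of 2-mer features
print(freqs["AC"])  # frequency of the 2-mer AC
```

Every sequence maps to a vector of the same length for a given k, regardless of how long the sequence is, which is what lets a fixed-size model accept both full-length barcodes and short fragments.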
K-mer frequencies (count of a given k-mer divided by the total number of k-mers counted in a given barcode) were used as model inputs, so as to standardize the scale of input values and also ensure the models were robust to input sequences of different lengths. For each k-mer feature set, deep neural networks with five hidden neuron layers were trained and evaluated through 5-fold cross validation (neural networks implemented using the package Tensorflow Version 2.1.0,
The architectures of the neural networks tested in conjunction with the different k-mer feature sets. For each k-mer feature set and corresponding neural network, the average loss and accuracy scores from 5-fold cross validation on the training data (Table
K-mer size | NN hidden layers sizes | Average accuracy | Average loss |
---|---|---|---|
1 | [4,64,128,32,16] | 0.684 | 0.899 |
2 | [16,64,128,64,16] | 0.935 | 0.216 |
3 | [64,128,64,32,16] | 0.993 | 0.038 |
4 | [256,128,64,32,16] | 0.994 | 0.033 |
5 | [1024,512,256,64,16] | 0.995 | 0.047 |
6 | [2080,1040,520,260,130] | 0.997 | 0.023 |
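
To put the k-mer sizes in the table above in context, the short calculation below shows how quickly the feature space grows: there are 4^k possible DNA k-mers, while a sequence of length L contains only L - k + 1 overlapping windows. The 658 bp sequence length is an illustrative value chosen here, not a figure taken from the text, and the function name `feature_space` is hypothetical.

```python
def feature_space(k, sequence_length):
    """Return (number of possible k-mers, number of k-mer windows in the sequence)."""
    return 4 ** k, sequence_length - k + 1


ILLUSTRATIVE_LENGTH = 658  # assumed sequence length, for illustration only

for k in range(1, 7):
    n_features, n_windows = feature_space(k, ILLUSTRATIVE_LENGTH)
    print(f"k={k}: {n_features} possible k-mers, {n_windows} windows, "
          f"{n_windows / n_features:.2f} windows per feature")
```

For small k, every feature is observed many times per sequence; as k grows, the number of features overtakes the number of windows and the frequency vector becomes sparse, which is one plausible reason larger k-mer sizes carry a higher computational cost for a limited gain.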
After selection of the optimal k-mer size, five classification algorithms were fit to the training set and optimized through a grid search of hyperparameters: k nearest neighbour (KNN), support vector machine (SVM), random forest (RF), extreme gradient boosting (XGB), and deep neural network (DNN). All models were implemented in the Python programming language (Version 3.7.4). The KNN, SVM, and RF models were implemented using the package scikit-learn (Version 0.21.3,
Following the selection of optimal hyperparameter sets through the grid searches, a final version of each model was trained using the optimal set of hyperparameters and the complete training data set. Final trained models were then used to make predictions for the previously withheld test and validation sets (Tables
The cross-validation accuracy scores for the different neural networks and corresponding k-mer feature sets were compared to determine an optimal k-mer feature size. The results showed that the accuracy of models improved with increasing k-mer feature size, with diminishing improvements beyond k = 3 (Table
Boxplot of the 5-fold cross validation accuracy results for the training of models of different k-mer feature sets and corresponding neural network architectures on the training data (Table
For each of the machine learning algorithms, a grid search was used to obtain an optimal hyperparameter set (Suppl. material
The accuracy scores for the predictions made by the five different machine learning models (trained on 4-mer frequency features and the complete training data set (Table
Algorithm | Test accuracy | Validation accuracy |
---|---|---|
Deep neural network (DNN) | 0.996 | 0.976 |
Support vector machine (SVM) | 0.996 | 0.974 |
K nearest neighbour (KNN) | 0.997 | 0.927 |
Random forest (RF) | 0.983 | 0.861 |
Extreme gradient boosting (XGB) | 0.998 | 0.972 |
The DNN (operating on 4-mer input features) was selected as the final default kingdom-level classification model for the Alfie package. The DNN provided the highest accuracy on the validation data, as well as high accuracy on the test dataset. Examination of confusion matrices for the test (Table
Confusion matrix for predictions on the test set (Table
True kingdom (rows) / predicted kingdom (columns) | Animal | Bacteria and Archaea | Fungi | Plant | Protist |
---|---|---|---|---|---|
Animal | 4537 | 0 | 1 | 5 | 4 |
Bacteria and Archaea | 0 | 1094 | 0 | 1 | 0 |
Fungi | 6 | 4 | 240 | 9 | 1 |
Plant | 0 | 1 | 1 | 4500 | 3 |
Protist | 0 | 1 | 0 | 4 | 998 |
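
As a worked check on the matrix above, overall accuracy is the sum of the diagonal divided by the sum of all cells, and the per-kingdom rate is each diagonal cell divided by its row total. The sketch below assumes that rows hold the true kingdoms, which is consistent with the row totals matching the test set sizes reported for each kingdom; the function names are hypothetical.

```python
labels = ["Animal", "Bacteria and Archaea", "Fungi", "Plant", "Protist"]

# Test set confusion matrix, copied from the table above.
matrix = [
    [4537, 0, 1, 5, 4],
    [0, 1094, 0, 1, 0],
    [6, 4, 240, 9, 1],
    [0, 1, 1, 4500, 3],
    [0, 1, 0, 4, 998],
]


def overall_accuracy(m):
    """Fraction of all sequences that fall on the diagonal."""
    correct = sum(m[i][i] for i in range(len(m)))
    return correct / sum(sum(row) for row in m)


def per_row_rate(m):
    """Fraction of each row (assumed true kingdom) that was classified correctly."""
    return [m[i][i] / sum(m[i]) for i in range(len(m))]


print(f"overall accuracy: {overall_accuracy(matrix):.3f}")
for label, rate in zip(labels, per_row_rate(matrix)):
    print(f"{label}: {rate:.3f}")
```

The overall figure reproduces the 0.996 test accuracy reported for the DNN, while the per-row rates show that the errors are not evenly spread: fungi, the kingdom with the fewest training sequences, has the lowest rate.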
Confusion matrix for predictions on the validation set (Table
True kingdom (rows) / predicted kingdom (columns) | Animal | Bacteria and Archaea | Fungi | Plant | Protist |
---|---|---|---|---|---|
Animal | 744 | 0 | 0 | 2 | 0 |
Bacteria and Archaea | 0 | 59 | 0 | 6 | 7 |
Fungi | 1 | 1 | 65 | 3 | 0 |
Plant | 0 | 0 | 0 | 77 | 0 |
Protist | 2 | 1 | 0 | 1 | 5 |
The design and testing of the Alfie package presented here focuses on high-level (kingdom) classification for the most common animal barcode, COI. However, the Alfie package provides a robust framework that a user can easily apply to produce and test alignment-free classification tools for any taxonomic distinction, DNA barcode, or combination thereof (Suppl. material
Although the Alfie package is an effective alignment-free classification framework at high taxonomic levels, traditional alignments are likely more effective for lower-level classification tasks (i.e. classification to genus or species level). The k-mer frequency method used by Alfie is not likely to be effective for resolving differences between closely related species with more subtle genetic differences than those seen at higher taxonomic levels. Similarly, for taxonomic groups with few representatives and no closely related outgroups, available training data may be scant, providing a limitation in training of DNNs or other machine learning models which rely on abundant training data. The integration of alignment-based and alignment-free methods for biological sequence classification has been shown to leverage the strengths of the individual approaches to yield an efficient and accurate classification method (
A similar hybrid approach using the Alfie package for filtration of sequences and subsequent alignment of sequences for a group of interest can narrow the scope of the application of alignment methods and thereby improve both analysis speed and accuracy. The Alfie package’s API allows a user to extend the package to other classification tasks, as functionality is not limited to pre-defined default models or datasets (Suppl. material
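The filtration step of such a hybrid workflow can be sketched as follows. This is an illustration of the idea only: it does not call the Alfie API, the function names are hypothetical, and `placeholder_predictor` is an arbitrary stand-in rule with no biological validity that exists solely so the example runs. In practice, a trained kingdom-level classifier would take its place.

```python
def filter_by_kingdom(records, predict_kingdom, target="Animal"):
    """Partition records by whether the predicted kingdom equals `target`.

    records         : list of dicts with 'name' and 'sequence' keys (hypothetical layout).
    predict_kingdom : any callable mapping a sequence string to a kingdom label.
    """
    kept, rejected = [], []
    for record in records:
        if predict_kingdom(record["sequence"]) == target:
            kept.append(record)
        else:
            rejected.append(record)
    return kept, rejected


def placeholder_predictor(sequence):
    """Arbitrary stand-in rule, NOT a real classifier: label by A+T fraction."""
    at_count = sum(base in "AT" for base in sequence.upper())
    return "Animal" if at_count / max(len(sequence), 1) > 0.6 else "Other"


reads = [
    {"name": "read_1", "sequence": "ATTATAATTGGA"},
    {"name": "read_2", "sequence": "GCCGGCGCATGC"},
]
on_target, off_target = filter_by_kingdom(reads, placeholder_predictor)
print([r["name"] for r in on_target])
```

Only the sequences in `on_target` would then be passed to an alignment-based method, so the slower alignment step runs on a smaller and more relevant subset of the data.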
We have developed and tested the Python package Alfie, which extracts k-mer features and uses a neural network to make kingdom-level classifications of COI DNA barcode fragments with greater than 99% accuracy. The Alfie package can therefore be used to separate barcode data for a kingdom of interest from off-target noise, narrowing the scope of subsequent analyses to only relevant data. The model handles both full-length barcodes and short sequence fragments and is therefore an effective classifier for use in both barcode and metabarcoding analyses. The Alfie package can be incorporated into broader analysis pipelines (
Thank you to Tyler A. Elliott for assisting in the acquisition of data from the BOLD database. Thank you to Christopher A. Hempel for helpful discussions during the initial conceptualization and design of the Alfie package. Thank you to Christopher A. Hempel, Rami Baghdan, and Nora Samhadaneh for feedback on the initial draft of the manuscript.
Funding for this research was obtained from grants in Bioinformatics and Computational Biology from Genome Canada through Ontario Genomics and from the Ontario Ministry of Economic Development, Job Creation and Trade. Funders played no role in study design or decision to publish. This research was enabled in part by resources provided by Compute Canada (www.computecanada.ca).
File S1 – Training, test, and validation data sets used in model training and analysis
Data type: source code
File S2 – Python script for custom grid search of hyperparameters for optimization of the neural network
Data type: source code
File S3 – The parameters utilized in the grid search for each of the five machine learning algorithms tested in the design of the Alfie package
Data type: source code
File S4 – Jupyter notebook with tutorial demonstrating how to apply the Alfie classifier in the Python programming language, and how to train custom alignment-free classifiers using the Alfie training module
Data type: source code