Assessment of species gaps in DNA barcode libraries of non-indigenous species (NIS) occurring in European coastal regions

DNA metabarcoding has the capacity to bolster current biodiversity assessment techniques, including the early detection and monitoring of non-indigenous species (NIS). However, the success of this approach depends greatly on the availability, taxonomic coverage and reliability of reference sequences in genetic databases, whose deficiencies can compromise species identifications at the taxonomic assignment step. In this study we assessed gaps in the availability of DNA sequence data from four barcode markers (COI, 18S, rbcL and matK) for NIS occurring in European marine and coastal environments. NIS checklists were based on the EASIN and AquaNIS databases. The highest coverage was found for COI for Animalia and rbcL for Plantae (up to 63%, for both) and for 18S for Chromista (up to 51%); coverage increased greatly when only high impact species were taken into account (up to 82 to 89%). The results show that different markers are unevenly represented in genetic databases, implying that the parallel use of more than one marker can be complementary and may greatly increase NIS identification rates through DNA-based tools. Furthermore, based on the COI marker, approximately 30% of the species had maximum intra-specific distances higher than 3%, suggesting that many NIS may have undescribed or cryptic diversity. Although completing the gaps in reference libraries is essential to make the most of the potential of DNA-based tools, a careful compilation, verification and annotation of available sequences is fundamental to assemble large, curated and reliable reference libraries that support rigorous species identifications.


Introduction
Marine and coastal habitats are among the most important, but also the most threatened, ecosystems in the world, providing invaluable services for human well-being, such as provisioning, supporting, regulating and cultural/aesthetic services (Solan et al. 2004; Rilov and Crooks 2009). Along with climate change, habitat destruction, overexploitation and pollution, the spread of non-indigenous species (NIS) to areas outside their natural range is among the five most important direct drivers of biodiversity loss in European coastal regions (MEA 2005). Due to its position as a centre of international trade over centuries, Europe has a large number and diversity of well-established NIS in marine and coastal habitats (Keller et al. 2011; Katsanevakis et al. 2013a, b, 2014; Tsiamis et al. 2019). Many of these species are, or can become, invasive, displacing and out-competing native species and causing severe ecological changes that threaten ecosystem integrity (Molnar et al. 2008; Rilov and Crooks 2009; Simberloff et al. 2013). Impacts include alterations of community structure, biodiversity loss, habitat modification, harm to human health and economic losses (Keller et al. 2011).
While morphology-based identification of taxa has largely established the current status of NIS occurring in coastal environments in Europe (Keller et al. 2011; Katsanevakis et al. 2013a, 2014; Tsiamis et al. 2019), this process is expertise-demanding, laborious and time consuming. It is also hardly applicable to some poorly studied communities, such as interstitial fauna, which may have been transported over large distances in ships' ballast water or sediments (Carlton 1999; Ojaveer et al. 2014; Shang et al. 2019; Shaw et al. 2019). Particularly in aquatic systems, accurate identification and detection of NIS may be prevented by the presence of life stages not amenable to morphological identification (e.g. eggs, propagules, planktonic larvae, juveniles), or because organisms are not large and distinctive (e.g. meiofauna, microalgae, zooplankton, protists) (Pochon et al. 2015; Zaiko et al. 2015a, 2015b, 2015c, 2016; Pagenkopp Lohan et al. 2016, 2017) or occur in low abundances (Darling and Blum 2007). In the case of NIS, the accuracy of species identifications is paramount, since incorrect identifications can lead to biased outcomes and to action against harmless species or inaction against problematic ones (Briski et al. 2016; Viard et al. 2019).
DNA-based methods, such as DNA barcoding (Hebert et al. 2003) and DNA metabarcoding (Hajibabaei 2012; Cristescu 2014), offer great promise for reliable species identifications in invasion ecology, having considerable potential to overcome some of the above-mentioned challenges and to improve the monitoring of NIS in marine and coastal ecosystems (Briski et al. 2011, 2016; Zaiko et al. 2015a, 2015b, 2015c; Abad et al. 2016; Miralles et al. 2016, 2018; Holman et al. 2019; Wood et al. 2019; Rey et al. 2020). In particular, DNA metabarcoding, which combines amplicon barcoding with high-throughput sequencing, may have a number of benefits over traditional methods, including the simultaneous processing of a large number of samples and the simultaneous identification of multiple taxa from various types of environmental samples (Hajibabaei 2012; Taberlet et al. 2012; Shokralla et al. 2012; Cristescu 2014), increased sensitivity and specificity, often revealing hidden diversity (Lindeque et al. 2013; Viard et al. 2019), as well as greater time and cost effectiveness (Briski et al. 2011, 2016; Pochon et al. 2015; Brown et al. 2016; von Ammon et al. 2018; Holman et al. 2019; Rey et al. 2020). In addition, a species may be detected at early developmental stages and before its dispersal and impact become readily apparent and irreversible (Pochon et al. 2015; Holman et al. 2019).
Efficient and accurate species identifications through DNA barcoding or DNA metabarcoding depend on reliable reference sequence libraries of known taxa (Briski et al. 2016; Miralles et al. 2016, 2018; Viard et al. 2019; Weigand et al. 2019). The unavailability or under-representation of some taxonomic groups in genetic databases may lead to biased results in biodiversity assessments through DNA-based tools and restrict the resolution and detection capacity of NIS at the taxonomic assignment step (Pochon et al. 2015; Briski et al. 2016; Chain et al. 2016; Zaiko et al. 2016; Lacoursière-Roussel et al. 2018; von Ammon et al. 2018; Rey et al. 2020). [...] (Leese et al. 2016; Hering et al. 2018; Pawlowski et al. 2018; Weigand et al. 2019). However, no recent attempt has been made to assess the gaps in NIS sequences in publicly accessible databases (i.e. the number of species missing barcode sequences). To the best of our knowledge, the most recent complete gap-analysis was conducted in 2016 (Briski et al. 2016), and although a more recent one was performed in 2019 (Ardura 2019), it targeted only high-impact Arthropoda and Mollusca species.
In the current study, the gaps for the most commonly used barcode markers in DNA-based studies of Animalia (COI and 18S), Chromista (COI, 18S and rbcL) and Plantae (COI, rbcL and matK) were analysed in the genetic databases GenBank and the Barcode of Life Data System (BOLD). The analysis focused on NIS occurring in European coastal regions, using updated lists retrieved from the European Alien Species Information Network (EASIN) (Katsanevakis et al. 2012) and the Information System on Aquatic Non-indigenous and Cryptogenic species (AquaNIS). This provides a current appraisal of which NIS occurring in European marine and coastal regions are missing DNA barcodes, and will allow researchers to develop actions to fill these gaps, in order to make the most of the potential of NIS identification through DNA-based tools. Actions that develop innovative tools for biodiversity monitoring are essential for the effective management of biological invasions and the development of mitigation strategies to deal with increasing globalization and environmental change.

Species checklists
The lists of non-indigenous species (NIS) occurring in European marine and coastal regions were retrieved on 23 October 2019 from the two most important databases compiling information on non-indigenous species occurring in Europe: the European Alien Species Information Network (EASIN) (https://easin.jrc.ec.europa.eu/easin) (Katsanevakis et al. 2012) and the Information System on Aquatic Non-indigenous and Cryptogenic species (AquaNIS) (http://www.corpi.ku.lt/databases/index.php/aquanis/). The EASIN catalogue, built by the European Commission's Joint Research Centre (JRC), is based on an inventory of reported alien species in Europe that was produced by reviewing and standardizing existing information from 43 online databases and selected offline sources, which cover the terrestrial and aquatic environments, with 34 of the databases reporting NIS in the marine environment (Katsanevakis et al. 2012). AquaNIS is an advanced information system that deals in particular with aquatic NIS introduced to marine, brackish and coastal freshwater environments of Europe and adjacent regions. From the AquaNIS list we retrieved 1,172 species using as search criteria the "Recipient region" and the following sub-criteria: "Ocean": Atlantic + Arctic; "Ocean Region": NE Atlantic + Arctic; "LME": 20. Barents Sea + 21. Norwegian Sea + 22. North Sea + 23. Baltic Sea + 24. Celtic-Biscay Shelf + 25. Iberian Coastal + 26. Mediterranean Sea + 59. Iceland Shelf + 60. Faroe Plateau + 62. Black Sea + A1. Macaronesia. From the EASIN list we retrieved 1,566 species using the following criteria: "Environment": Marine + Oligohaline; "Impact": All (High + Low/Unknown); "Species status": Alien + Cryptogenic + Unknown; "Taxonomy": Animalia + Chromista + Plantae; and "Pathways": Contaminant + Corridor + Escape + Release + Stowaway + Not assessed + Other + Unknown.
The taxonomic classification and name validation of the NIS compiled in the lists were made through the World Register of Marine Species (WoRMS) database (www.marinespecies.org) and AlgaeBase (https://www.algaebase.org/). Both databases adopted Cavalier-Smith's taxonomic classification system (Cavalier-Smith 1981). In this classification system, Chromista were established to include all chromophyte algae (those with chlorophyll c, not b) considered to have evolved by symbiogenetic enslavement of another eukaryote (a red alga), as well as heterotrophic protists that descended from them by loss of photosynthesis or entire plastids (Cavalier-Smith 1981; Ruggiero et al. 2015), which in our lists include the phyla Bigyra, Cercozoa, Ciliophora, Cryptophyta, Foraminifera, Haptophyta, Myzozoa and Ochrophyta. All records without species-level identifications were removed from the lists. Initially, to conduct the gap-analysis, alternate representations of the species names were maintained in the lists. Later, to simplify the display of the results, all replicated records were removed and the numbers of sequence hits were merged under the accepted names. The final list included species belonging to three kingdoms: Animalia, Chromista and Plantae. Bacteria and fungi were excluded from the AquaNIS list, because these taxa typically have uncertain status as non-indigenous or native. Birds and mammals were also excluded from the EASIN list.
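The merging of sequence hits from alternate name representations into accepted names can be illustrated with a minimal sketch. It is written in Python for illustration only and is not the authors' R pipeline; the species names, the synonym table and the function name `merge_hits` are hypothetical, and in practice the mapping would come from a WoRMS or AlgaeBase lookup.

```python
from collections import defaultdict

# Hypothetical synonym table: name variant -> accepted name.
# In a real analysis this mapping would be built from WoRMS/AlgaeBase.
ACCEPTED_NAME = {
    "Examplea prima": "Examplea prima",
    "Oldgenus prima": "Examplea prima",   # synonym of the accepted name
    "Examplea secunda": "Examplea secunda",
}


def merge_hits(hits_by_name, accepted_name):
    """Sum sequence hits over all name variants of each accepted species."""
    merged = defaultdict(int)
    for name, hits in hits_by_name.items():
        # Names absent from the synonym table are kept unchanged.
        merged[accepted_name.get(name, name)] += hits
    return dict(merged)


# Invented hit counts, searched under every name variant.
hits = {"Examplea prima": 35, "Oldgenus prima": 120, "Examplea secunda": 0}
merged = merge_hits(hits, ACCEPTED_NAME)
print(merged)
```

Searching under every name variant first and merging afterwards avoids missing sequences that were deposited under a synonym.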

Data-mining, processing and analyses
For each species in the lists, and within each taxonomic group (i.e. Animalia, Chromista and Plantae), the number of sequences available in the Barcode of Life Data System (BOLD) (www.barcodinglife.org) (Ratnasingham and Hebert 2007) was assessed using the package "bold" (Chamberlain 2019) implemented in R 3.6.0 (R Core Team 2019; www.r-project.org). The number of sequences available in GenBank (www.ncbi.nlm.nih.gov/Genbank) was retrieved using the package "rentrez" (Winter 2017). Only public records were retrieved, because this information is available to all users and details on private data (e.g. genetic marker, sequence quality) cannot be easily accessed. The following markers were searched for each group: Animalia - COI and 18S; Chromista - COI, 18S and rbcL; Plantae - COI, rbcL and matK. The terms used to filter the sequences from BOLD were: for COI - "COI-5P"; for 18S - "18S" or "18Sa"; for rbcL - "rbcL" or "rbcLa"; and for matK - "matK". In GenBank, the terms used for the search were (suggested by GenBank for the studied loci): for COI -"COI [ ". Only sequences with more than 500 base pairs were considered, since this is the minimum length required for a sequence to meet Barcode Compliance standards (Ratnasingham and Hebert 2007) and has also been used in earlier gap-analysis studies of European aquatic invertebrates (Weigand et al. 2019).
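The marker and length filter described above can be sketched as follows. This is an illustrative Python sketch rather than the authors' R code: the record layout (`species`, `marker_code`, `sequence`) and the function names are assumptions, as is the decision to ignore alignment gap characters when measuring length. The BOLD marker codes and the threshold of more than 500 base pairs are taken from the text.

```python
MIN_LENGTH = 500  # the text requires more than 500 base pairs

# BOLD marker codes listed in the text, grouped by target marker.
BOLD_MARKER_CODES = {
    "COI": {"COI-5P"},
    "18S": {"18S", "18Sa"},
    "rbcL": {"rbcL", "rbcLa"},
    "matK": {"matK"},
}


def sequence_length(sequence):
    """Number of bases, ignoring gap characters (an assumption of this sketch)."""
    return len(sequence.replace("-", ""))


def count_compliant(records, marker):
    """Count, per species, the records of `marker` longer than MIN_LENGTH."""
    codes = BOLD_MARKER_CODES[marker]
    counts = {}
    for rec in records:
        long_enough = sequence_length(rec["sequence"]) > MIN_LENGTH
        if rec["marker_code"] in codes and long_enough:
            counts[rec["species"]] = counts.get(rec["species"], 0) + 1
    return counts


# Invented records with hypothetical species names.
records = [
    {"species": "Examplea prima", "marker_code": "COI-5P", "sequence": "A" * 658},
    {"species": "Examplea prima", "marker_code": "COI-5P", "sequence": "A" * 320},
    {"species": "Examplea secunda", "marker_code": "COI-5P",
     "sequence": "A" * 480 + "-" * 100},  # only 480 bases once gaps are ignored
    {"species": "Examplea secunda", "marker_code": "rbcLa", "sequence": "C" * 552},
]
coi_counts = count_compliant(records, "COI")
rbcl_counts = count_compliant(records, "rbcL")
print(coi_counts, rbcl_counts)
```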
The number of Barcode Index Numbers (BINs; a proxy for molecular operational taxonomic units, MOTUs) (Ratnasingham and Hebert 2013) for each species within each group was retrieved from BOLD, based on the COI marker. Records co-occurring in both databases were detected through the presence of the tag "Mined from GenBank, NCBI" in BOLD records and/or the availability of a GenBank accession number, which indicates that those BOLD records were mined from, or deposited in, GenBank. A species was considered to be successfully barcoded for each marker if it had at least one compliant sequence in one of the searched databases. All the details of the bioinformatic pipeline, such as the scripts used for each taxonomic group and the markers searched, can be consulted at https://github.com/pedroemanuelvieira/NIS_Europe_GapAnalysis.
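The two rules in this paragraph, flagging BOLD records that co-occur in GenBank and scoring a species as barcoded, can be sketched as below. This is a Python illustration, not the authors' R scripts. The field names `institution_storing` and `genbank_accession` and the function names are assumptions of this sketch, which also assumes that every shared record is already included in the GenBank count.

```python
GENBANK_TAG = "Mined from GenBank, NCBI"  # tag quoted in the text


def is_shared_with_genbank(bold_record):
    """True if a BOLD record carries the GenBank tag or a GenBank accession."""
    return (bold_record.get("institution_storing") == GENBANK_TAG
            or bool(bold_record.get("genbank_accession")))


def summarise_species(bold_records, n_genbank_records):
    """Summarise one species and one marker across the two databases."""
    shared = sum(1 for rec in bold_records if is_shared_with_genbank(rec))
    return {
        "bold_only": len(bold_records) - shared,
        "shared": shared,
        # Barcoded: at least one compliant sequence in at least one database.
        "barcoded": (len(bold_records) + n_genbank_records) >= 1,
    }


# Invented BOLD records for one hypothetical species.
bold_records = [
    {"institution_storing": GENBANK_TAG, "genbank_accession": ""},
    {"institution_storing": "Hypothetical Museum", "genbank_accession": "XX000001"},
    {"institution_storing": "Hypothetical Museum", "genbank_accession": ""},
]
summary = summarise_species(bold_records, n_genbank_records=5)
print(summary)
```

The inputs are assumed to have already passed the marker and length filters, so every record counted here is a compliant sequence.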

Taxonomic composition of the lists
After removal of the records with taxonomic ranks higher than species level and of replicated records, the final AquaNIS list had 1,120 species and the final EASIN checklist had 1,554 species (Fig. 1). The taxa in both lists belonged to three kingdoms (data in parentheses correspond to % in the AquaNIS and EASIN lists, respectively): i) Animalia (66 and 76%), ii) Chromista (17 and 14%) and iii) Plantae (17 and 10%), comprising 28 phyla ( Chromista and 43 Plantae) (Suppl. material 1: Table S1 and Fig. S1). In the EASIN list, 1,294 species have the status of "alien", 174 species of "cryptogenic" and 86 species of "questionable" (Suppl. material 1: Table S1). In addition, 148 of the 1,554 species (approximately 10% of the total number of species in the list) are classified as high impact species, with 118 species belonging to Animalia, 17 to Chromista and 13 to Plantae (Fig. 1; Suppl. material 1: Tables S1, S3).

Gap analysis
For all analysed taxonomic groups (Animalia, Chromista and Plantae), a higher number of records was found in GenBank than in public BOLD (Table 1). Considering the presence of at least one barcode sequence of at least one marker in at least one genetic database, a total barcode coverage between 58 and 68% and between 50 and 63% was found for the AquaNIS and EASIN lists, respectively (Table 1). However, coverage varied considerably among the different taxonomic groups and barcode markers (Table 1). The highest coverage was found for Animalia for the COI marker in both lists (63 and 51%, for AquaNIS and EASIN, respectively), for Chromista for the 18S marker in the AquaNIS list (51%) and for Plantae for the rbcL marker in both lists (62 and 63%, for the AquaNIS and EASIN lists, respectively) (Table 1). In addition, in particular for Animalia and the 18S marker, the percentage of species represented by a single barcode record in the databases (singletons) was relatively high (38 and 40% for the AquaNIS and EASIN lists, respectively).
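As a minimal illustration of how such coverage and singleton percentages can be derived from per-species sequence counts, consider the Python sketch below. It is not the authors' R pipeline; the species names, counts and function names are invented, and the use of barcoded species as the denominator of the singleton percentage is an assumption of this sketch.

```python
def coverage_percent(counts_by_species, all_species):
    """Percentage of checklist species with at least one compliant sequence."""
    barcoded = [s for s in all_species if counts_by_species.get(s, 0) >= 1]
    return 100 * len(barcoded) / len(all_species)


def singleton_percent(counts_by_species, all_species):
    """Percentage of barcoded species represented by exactly one record.

    The denominator (barcoded species only) is an assumption of this sketch.
    """
    barcoded = [s for s in all_species if counts_by_species.get(s, 0) >= 1]
    if not barcoded:
        return 0.0
    singletons = [s for s in barcoded if counts_by_species[s] == 1]
    return 100 * len(singletons) / len(barcoded)


# Invented checklist of four species and their COI record counts.
checklist = ["Species A", "Species B", "Species C", "Species D"]
coi_counts = {"Species A": 3, "Species B": 1, "Species C": 0}

coverage = coverage_percent(coi_counts, checklist)
singletons = singleton_percent(coi_counts, checklist)
print(coverage, singletons)
```

Here two of the four species are barcoded, and one of those two is a singleton, so both percentages are 50.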
For Animalia, in both lists, the phyla with the highest number of total records, taking into account all searched markers in both genetic databases, were Arthropoda (10,863 and 10,148), Chordata (12,478 and 11,808) and Mollusca (7,146 and 6,045, for the AquaNIS and EASIN lists, respectively) (Suppl. material 1: Tables S5, S6). In general, a higher coverage was found for the COI marker than for the 18S marker in both lists (Fig. 2A, B; Table 1), with the exception of Annelida (only for AquaNIS), Ctenophora, Platyhelminthes and Porifera, where a higher coverage was found for 18S (Fig. 2A, B). In the AquaNIS list, and within Animalia, most phyla had a barcode coverage higher than 50% for the COI marker, with the exception of Annelida (41%), Bryozoa (35%), Platyhelminthes (18%) and Porifera (33%), while no barcodes at all were found for Entoprocta (Fig. 2A). For 18S, a barcode coverage near to or higher than 50% was found for Annelida (46%), Arthropoda (49%), Cnidaria (58%), Ctenophora (83%) and Porifera (47%) (Fig. 2A). On the other hand, for the EASIN list, most of the phyla had a barcode coverage lower than 50%, with the exception of Arthropoda (52%), Chaetognatha (50%), Chordata (89%), Echinodermata (67%) and Nematoda (75%), for COI, and Ctenophora (80%) and Nematoda (75%), for 18S (Fig. 2B).
For Chromista, in both lists, Ochrophyta was the phylum with the highest number of total records, taking into account all searched markers in both genetic databases (2,188 and 1,983, for the AquaNIS and EASIN lists, respectively) (Suppl. material 1: Tables S5, S6). The barcode coverage among the different markers differed depending on the target phylum (Fig. 2C, D), except for Haptophyta, for which a barcode coverage of 50% and 100% was found for the three searched markers (COI, 18S, rbcL), in the AquaNIS and EASIN lists, respectively. For COI, the barcode coverage was always lower than 50% for all remaining analysed phyla, while no COI sequences were found for Bigyra, Cercozoa, Cryptophyta and Foraminifera, in both lists (Fig. 2C, D). Cryptophyta was also not represented by any 18S or rbcL sequence in BOLD and GenBank, for both lists, but it is represented in both lists by only one NIS. The 18S was the best represented marker in both lists, in particular for Cercozoa (50 and 60%), Ciliophora (46 and 42%), Myzozoa (45 and 42%) and Ochrophyta (58 and 48%, for AquaNIS and EASIN, respectively), while Ochrophyta, unlike the other phyla, were better represented by rbcL sequences (59 and 58%, for AquaNIS and EASIN, respectively) (Fig. 2C, D).
For Plantae, in both lists, Rhodophyta was the phylum with the highest number of total records in both genetic databases, taking into account all markers (2,362 and 1,931, for the AquaNIS and EASIN lists, respectively) (Suppl. material 1: Tables S5, S6). As for Chromista, the barcode coverage differed among the different markers and target phyla (Fig. 2E, F). A better barcode coverage was generally found for the rbcL marker, for all four analysed phyla (equal to or higher than 60%), in both lists (Fig. 2E, F), while COI sequences were found for Chlorophyta, Rhodophyta and Tracheophyta in the AquaNIS list (25 to 43%), but only for Chlorophyta and Rhodophyta in the EASIN list (23 and 42%, respectively). Sequences of matK were found exclusively for Charophyta and Tracheophyta (100 and 75%, respectively, for the AquaNIS list, and 100% for both phyla in the EASIN list) (Fig. 2E, F).

Gap-analysis for high impact species
Considering only the high impact species from the EASIN list, the gap was much lower for all analysed groups and barcode markers than for the full lists (Fig. 3; Table 2). Considering the presence of at least one barcode sequence of at least one marker in at least one genetic database, a total barcode coverage between 82 and 93% was found for the high impact species (Table 2). In general, coverage was higher than 50% for all analysed groups and barcode markers, with the exception of rbcL for Chromista (35%) and matK for Plantae (8%) (Table 2). For Animalia, the highest number of total records, considering all searched markers in both genetic databases, was found for Arthropoda, Mollusca and Chordata (2,797 to 4,595) (Suppl. material 1: Table S7). At the phylum level, a barcode coverage of 100% was found for Ctenophora, Echinodermata and Nematoda, for both markers, and also for Chordata and Cnidaria, for COI (Fig. 3A). For Chromista and Plantae, the highest numbers of total records were found for Myzozoa (209) and Rhodophyta (210), respectively (Suppl. material 1: Table S7). Within Chromista, a barcode coverage of 100% was found for Haptophyta, for all analysed markers, and for Myzozoa for 18S (Fig. 3B), while within Plantae it was found for Tracheophyta, for both rbcL and matK (Fig. 3C).

Discussion
Our study brings to the forefront two main considerations: first, reference libraries still lack representative sequences for many NIS, with extreme cases in some groups, and second, some NIS can be categorised as possible cryptic species. Both may critically impair the current capability for NIS detection and monitoring using molecular tools. Although the gaps (i.e. NIS still missing barcode sequences) were similar in both lists, the proportion of missing barcodes clearly differed among taxonomic groups and the barcode markers searched. In both lists the gap was highest for Chromista. In these lists, Chromista include Foraminifera, Myzozoa and Ochrophyta as dominant phyla, which can harbour very small species, such as small protists and diatoms, for which obtaining voucher specimens to generate sequences for deposition in genetic databases may be challenging. It has been reported that smaller organisms may have greater invasion opportunities in coastal ecosystems (Ruiz et al. 2000; Pagenkopp Lohan et al. 2016, 2017), but can be hard to detect using traditional morphological approaches (Pagenkopp Lohan et al. 2016, 2017). Thus, DNA-based tools are essential for their early detection and accurate identification in recipient ecosystems, and filling the gaps in barcode reference libraries is essential for Chromista. On the other hand, we found a lower gap for Animalia and Plantae. The gaps in BOLD and GenBank were recently analysed for the taxa frequently used in the Water Framework Directive (WFD) and the Marine Strategy Framework Directive (MSFD), under the scope of the COST Action DNAqua-Net (Weigand et al. 2019), and the authors also found that barcode coverage varied strongly among taxonomic groups. In general, groups that were actively targeted in barcode projects were well represented in the barcode libraries, while others had fewer records. Our results support this trend.
Under the scope of the public project "WG1.8 Marine Bio-Surveillance" deposited in BOLD, 12 and 17 projects were dedicated to Animalia and Plantae, respectively, with a total of 1,516 sequences, while only 4 projects were dedicated to Chromista, comprising only 105 sequences. In both lists, the phyla with the highest number of records in the two searched genetic databases include a high number of species having a high impact on the environment or species with high economic value (i.e. Chordata, Arthropoda, Mollusca, Ochrophyta, Rhodophyta). These species are generally the focus of a greater number of studies and thus may show a higher rate of sequence deposition in genetic databases (Briski et al. 2011, 2016; Pyšek et al. 2008; Trebitz et al. 2015; Ardura 2019). In fact, the top ten species with the highest number of sequence records included either high impact species, such as Callinectes sapidus and Anguillicoloides crassus, or species with high economic value, such as Mytilus trossulus, Prionace glauca, Cyprinus carpio and Oncorhynchus mykiss.
Our results differed somewhat from those obtained in a previous report where the gaps in BOLD and GenBank were analysed for aquatic NIS compiled from the literature (Briski et al. 2016). By 2016, 76% of the species in the list compiled by Briski and colleagues for aquatic NIS (n=1,383) had at least one sequence of the 6 searched markers in BOLD or GenBank. In addition, the authors predicted that, if the rate of sequence deposition in both genetic databases followed a linear trend, all aquatic NIS in their list would be sequenced by 2030. In our study, completion still seems some way off, with only 65% of the species in the AquaNIS list and 55% in the EASIN list having at least one of the searched barcode markers in BOLD or GenBank. These disparities probably originated from different compliance criteria and from differences between the species lists used in the analyses, which in the case of Briski and colleagues (2016) consisted of a list of NIS occurring at a worldwide scale. In addition, only barcode sequences longer than 500 bp were considered in the current gap analysis, while Briski et al. (2016) did not mention whether any length filter was applied to their sequence search. New NIS and new introductions into different recipient regions are reported every year, and NIS status can also change (from unknown status to cryptogenic or alien), suggesting that this analysis needs to be repeated periodically. Fortunately, there are now dedicated databases that are constantly updated, such as EASIN and AquaNIS (Katsanevakis et al. 2012; Olenin et al. 2014), which greatly facilitates this task. In addition, the R-based bioinformatic pipeline developed in our study to retrieve the information for each marker from the two genetic databases will make it possible to perform this task effectively and in an automated way whenever needed (i.e. every time that significant updates are made to the lists).
As mentioned above, for each taxonomic group, the gap clearly differed among the barcode markers searched. For Animalia, most phyla were well represented with COI sequences in GenBank and BOLD, but Annelida, Ctenophora, Platyhelminthes and Porifera were better represented with 18S sequences. Within Chromista most phyla were better represented with 18S, but Ochrophyta, which includes brown algae and diatoms, was an exception to this pattern, with the barcode coverage being greatest for rbcL. For Plantae, most phyla were better represented with rbcL sequences. Thus, the simultaneous use of more than one marker can be complementary and may greatly increase NIS identification rates through DNA-based tools. Recent studies have highlighted the advantage of using both 18S and COI markers for invasive species detection: the 18S for detecting a much broader range of taxa and the COI for discriminating between many metazoan species (Borrell et al. 2017; von Ammon et al. 2018; Stefanni et al. 2018; Holman et al. 2019; Wood et al. 2019; Rey et al. 2020). In addition, the concomitant use of rbcL and COI allowed the detection of diatoms and green and yellow algae in the ballast water of a vessel crossing the Atlantic Ocean, which would otherwise have remained highly underestimated if communities had been targeted only with COI (Zaiko et al. 2015b).
Approximately 37% of the species displayed more than one BIN, and many of these species displayed mean and maximum intraspecific distances higher than 3%, suggesting that many NIS may display hidden or cryptic diversity, which may further complicate taxonomic assignment using DNA-based tools (Viard et al. 2019). In addition, many species were represented by singletons in the genetic databases, thereby preventing detection of possible intraspecific variability or cryptic diversity. At the moment, at least to our knowledge, no dedicated reference sequence database exists for NIS. Ideally, as also suggested by the great proportion of species displaying multiple BINs and high intraspecific distances in the current study, such a reference database should cover the full range of species in the target ecosystem, with a balanced representation of specimens across each species' distribution range in both native and recipient locations, to account for possible regional variability in the targeted barcode genes. In addition, database incompleteness can be partly overcome by the addition of DNA sequences for local species. Abad et al. (2016) were able to double the success of the taxonomic assignment of plankton species in the estuary of Bilbao (Spain) by generating DNA barcodes for local species before conducting a metabarcoding-based study.
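To make the 3% criterion concrete, the sketch below computes the maximum pairwise distance among the sequences of one species. It is an illustrative Python sketch under stated assumptions: the sequences are already aligned to equal length, and the uncorrected p-distance is used, whereas the distance model behind the values reported here is not specified in this text. The function names and example sequences are invented.

```python
from itertools import combinations

THRESHOLD = 0.03  # 3% maximum intraspecific distance, as used in the text


def p_distance(seq_a, seq_b):
    """Uncorrected p-distance over sites where both sequences have A, C, G or T."""
    compared = differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in "ACGT" and b in "ACGT":  # skip gaps and ambiguous bases
            compared += 1
            differing += a != b
    if compared == 0:
        raise ValueError("no comparable sites between the two sequences")
    return differing / compared


def max_intraspecific_distance(aligned_seqs):
    """Largest pairwise p-distance, or None for fewer than two sequences."""
    if len(aligned_seqs) < 2:
        return None  # singletons carry no information on intraspecific variation
    return max(p_distance(a, b) for a, b in combinations(aligned_seqs, 2))


# Invented 100 bp alignment: the second sequence differs at 4 sites.
base = "ACGT" * 25
variant = "TTTA" + base[4:]
max_dist = max_intraspecific_distance([base, variant, base])
print(max_dist, max_dist > THRESHOLD)
```

Returning `None` for a single sequence mirrors the point made above: for singletons, intraspecific variability cannot be assessed at all.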
A closer look at the list of barcoded species with attributed BINs, in particular for COI and Animalia, indicated that many of them displayed discordant BINs (i.e. different species sharing the same BIN), possibly due to incorrect taxonomic assignments of numerous species that have been repeatedly used in databases without proper validation. A careful inspection of these BINs would be needed in order to check for potential artefacts such as misidentifications, incomplete taxonomy or sequences that were deposited under different synonyms. Incorrect species identifications could either artificially inflate or depress the number of NIS in an ecosystem, and lead to misdirecting limited resources against harmless species or to inaction against problematic ones (Bax et al. 2001; Simberloff 2009). Lacoursière-Roussel et al. (2018) identified Acartia tonsa through eDNA metabarcoding in water samples collected at two Canadian ports; this potential invader has been previously recorded in the ecoregions of ports connected to Churchill. However, the currently available COI sequences for A. tonsa form several distinct clades, some of which cluster with A. hudsonica, which raised the possibility that the eDNA sequences assigned to A. tonsa may belong to the native A. hudsonica. Very recently, by examining public databases, Viard et al. (2019) also found sequences of Botrylloides diegensis erroneously assigned to B. leachii. This observation has major implications, as the introduced B. diegensis can be misidentified as a putatively native species. Unfortunately, these database errors can be frequent, as also suggested by the high proportion of discordant BINs found in the current study, and can delay the implementation of DNA metabarcoding in NIS surveillance in coastal ecosystems.
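The two BIN-based checks discussed here, discordant BINs and species with more than one BIN, can be expressed in a few lines. The Python sketch below is an illustration only; the BIN identifiers, species names and function names are invented, and the input is assumed to be a list of (BIN, species) pairs taken from COI records.

```python
from collections import defaultdict


def discordant_bins(records):
    """Return the BINs shared by more than one species name."""
    species_by_bin = defaultdict(set)
    for bin_id, species in records:
        species_by_bin[bin_id].add(species)
    return {b: sorted(s) for b, s in species_by_bin.items() if len(s) > 1}


def multi_bin_species(records):
    """Return the species whose records fall into more than one BIN."""
    bins_by_species = defaultdict(set)
    for bin_id, species in records:
        bins_by_species[species].add(bin_id)
    return {s: sorted(b) for s, b in bins_by_species.items() if len(b) > 1}


# Invented (BIN, species) pairs; identifiers and names are hypothetical.
records = [
    ("BOLD:AAA0001", "Species A"),
    ("BOLD:AAA0001", "Species B"),  # two names in one BIN: discordant
    ("BOLD:AAA0002", "Species A"),  # Species A spans two BINs
    ("BOLD:AAA0003", "Species C"),
]
discordant = discordant_bins(records)
multi_bin = multi_bin_species(records)
print(discordant)
print(multi_bin)
```

A flagged BIN is only a candidate for inspection: as noted above, it may reflect a misidentification, incomplete taxonomy or a synonym rather than a genuine biological signal.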

Final remarks
Although completing the gaps in reference libraries is essential to make the most of the potential of DNA-based tools in NIS surveillance in coastal ecosystems, correct species attribution (by morphology-based methods) and proper management of sequence deposition and voucher storage are vital to preserve correct connections between morphological and molecular data (Briski et al. 2016). This can be particularly challenging for small-sized species that lack unambiguous morphological traits for taxonomic diagnosis, such as some groups within Chromista (e.g. Myzozoa), for which a higher gap was found in genetic databases. In addition, a careful compilation, verification and annotation of each database record is fundamental to assemble large, curated and reliable reference libraries that support rigorous species identifications through DNA-based tools (Viard et al. 2019; Weigand et al. 2019; Fontes et al. 2020; Leite et al. 2020). This need is particularly acute for the phylogenetically diverse NIS, for which the available data are highly dispersed and need to be compiled and verified. Once this need is met, the adoption of DNA-based tools for accurate NIS detection and monitoring in marine and coastal ecosystems will very likely accelerate.
(NIS) in coastal ecosystems based on high-throughput sequencing tools" (PTDC/BIA-BMA/29754/2017). We are also grateful to two reviewers for comments and suggestions that improved the manuscript.