Phytool, a ShinyApp to homogenise taxonomy of freshwater microalgae from DNA barcodes and microscopic observations

Methods for biomonitoring of freshwater phytoplankton are evolving rapidly with eDNA-based methods, which are highly complementary to microscopy. Metabarcoding approaches have become more common in recent years, with a continuous increase in the amount of data generated. Depending on the researchers and the way they assign barcodes to species (bioinformatic pipelines and molecular reference databases), the taxonomic assignment obtained for HTS DNA reads may vary. This is also true for traditional taxonomic studies by microscopy, where classification and taxonomy are regularly adjusted. Because the resulting taxonomies are not homogeneous, gap-analyses and comparisons between studies become even more challenging and the curation processes needed to find potential consensus names are time-consuming. Here, we present a web-based application (Phytool), developed with ShinyApp (Rstudio), that aims to make the harmonisation of taxonomy easier and more efficient, using a complete and up-to-date taxonomy reference database for freshwater microalgae. Phytool allows users to homogenise and update freshwater phytoplankton taxonomical names from sequence files and data tables uploaded directly in the application. It also gathers barcodes from curated references in a user-friendly way that makes it possible to search for specific organisms. All the data provided are downloadable, with the possibility to apply filters in order to select only the required taxa and fields (e.g. specific taxonomic ranks). The main goal is to make the connection between microscopy, molecular biology and taxonomy accessible to a broad range of users through different ready-to-use functions. This study estimates that only 25% of species of freshwater phytoplankton in Phytobs are associated with a barcode. We plead for an increased effort to enrich reference databases by coupling taxonomy and molecular methods. Phytool should make this crucial work more efficient.
The application is available at https://caninuzzo.shinyapps.io/phytool_v1/


I. Introduction
Freshwater phytoplankton constitutes a key element in water biomonitoring and surveys are required by water policies (European Commission 2000). Different biomonitoring indexes based on phytoplankton (Kaiblinger et al. 2009) have been developed over the last decades (e.g. Rimet 2012; Stevenson 2014; Laplace-Treyture et al. 2016). These indexes, mostly based on the abundance of phytoplankton species and their ecological profiles, aim to provide an assessment of the ecological status of water bodies by using pressure-impact relationships, in particular with nutrient concentrations (Birk et al. 2012). To assess and identify microalgae in water bodies, standardised protocols consist of microscopic counts on sedimented water samples (Utermöhl 1958; CEN 2006). These methods are time-consuming and require high-level taxonomic experts. Over the past decade, marker gene sequencing (metabarcoding) from environmental DNA has been shown to be an effective tool for biomonitoring applications targeting both micro- and macroscopic life (Pace 1997; Taberlet et al. 2012; Baird and Hajibabaei 2012; Creer et al. 2016; Deiner et al. 2017; Hering et al. 2018). Indeed, assessing phytoplankton through its DNA (and, more precisely, through specific DNA regions called 'barcodes', Hebert et al. 2003) is both cost- and time-effective and does not require a specialist in taxonomy (skills in bioinformatics are required, but are easily accessible nowadays). Today, the implementation of this DNA-based approach faces a divide between, on the one hand, people using microscopy, who can be reluctant to move to molecular techniques and, on the other hand, people using molecular approaches, who have no expertise in microalgal taxonomy. Both techniques have their advantages and pitfalls for assessing and identifying microalgae, and this point will not be discussed in this paper.
Instead, this paper describes the ShinyApp "Phytool", which was created to connect both approaches by making it easier to compare the data resulting from these two methods.
The rapid increase in the use of molecular techniques comes along with a growing number of new DNA sequences (i.e. DNA barcodes) that are generally made available online (e.g. National Center for Biotechnology Information, NCBI https://www.ncbi.nlm.nih.gov/) in libraries (e.g. GenBank, Sayers et al. 2019). Only some of these DNA sequences benefit from expert taxonomical curation and can be found in curated reference libraries of DNA barcodes. In the case of phytoplankton species, some curated reference libraries are available, for example PhytoRef (del Campo et al. 2018) and µgreen-db (Djemiel et al. 2020). More specific reference libraries exist for microalgae, such as Diat.barcode, an open-access curated barcode library for diatoms (Rimet et al. 2019). On the other hand, more general reference libraries are also available (e.g. PR2, Guillou et al. 2013; SILVA, Quast et al. 2013; BOLD, Ratnasingham and Hebert 2007), which do not focus on phytoplankton taxa, but include part of them. The diversity of reference libraries, in combination with the rapid evolution in taxonomic names, often leads to conflicts amongst the different names used for the same species. Moreover, each reference library uses its own taxonomic nomenclature, making comparisons between them even more difficult. Establishing taxonomically-homogeneous lists of taxa observed by microscopy and through metabarcoding with these reference libraries is thus challenging. For example, Micractinium pusillum (Fresenius, 1858) is described in "Das Phytoplankton des Susswassers" (Huber-Pestalozzi et al. 1983), a reference book still used today for freshwater phytoplankton microscopic identifications, as a species belonging to the class Chlorophyceae (Wille 1884), the order Chlorococcales (Marchand, 1895) and the family Micractiniaceae (G.M. Smith 1950).
However, on AlgaeBase (Guiry and Guiry 2021), which is an online reference for microalgae, this species belongs (to date) to the class Trebouxiophyceae (Friedl, 1995), the order Chlorellales (Bold & M.J. Wynne, 1978) and the family Chlorellaceae (Brunnthaler, 1913). Thus, the higher-rank classification of this species has changed over time, and this phenomenon is not rare at all for freshwater phytoplankton. That is why taxonomic homogenisation is required to perform comparisons between data coming from molecular techniques and microscopy, but also to be able to compare the occurrences of taxa across the different existing DNA barcode libraries.
The proposed application, Phytool, is an innovative tool that enables users to homogenise taxonomic names collected from different types of files: DNA sequences in FASTA format (fulfilling some conditions, see §III.2.2.1 in Results section) for molecular biologists and simple dataframes for microscopists or taxonomists. Phytool uses the up-to-date taxonomy of freshwater microalgae as proposed in Phytobs (Laplace-Treyture et al. 2017), software designed to help people who perform microscopic counts of freshwater phytoplankton in the framework of lake or river monitoring. The Phytobs taxonomy is based on AlgaeBase and the most recent publications, with the last update made in May 2021. AlgaeBase is considered to provide the most complete and up-to-date taxonomy available for microalgae and was, thus, chosen as reference for the taxonomical homogenisation process. Phytool gathers different DNA barcodes, namely for the first release: rRNA16S; rRNA18S; rRNA23S. These barcodes, available for freshwater microalgae, were gathered from the curated databases cited above. These genetic markers were selected on the basis of investigations (e.g. literature review, reference libraries completion) and in silico tests made in the framework of a project funded by the OFB (Office Français de la Biodiversité). An in-depth investigation was done on different genetic markers to test their ability to target easily (i.e. primers available and their universality) and efficiently (high resolution) the whole diversity of freshwater phytoplankton communities in routine protocols.
Finally, Phytool scripts are open-access and gathered in a user-friendly ShinyApp interface so that a broad range of users can run analyses easily.

II.1. Application development and main instructions
Phytool is a Shiny Web Application, built with Rstudio (v.1.3.959), using several R packages, including BiocManager. A user-friendly interface enables users to navigate easily through the different functionalities of Phytool. A complete tutorial is available in video format, providing more details and instructions to facilitate Phytool use. This tutorial can also be found directly in Phytool (see "Help" buttons or "About Phytool" tab). The different tab pages in the Phytool navigation bar and their functioning are discussed in more detail in the following paragraphs.

II.2. Taxonomic homogenisation process
Within the Phytool application, the tab "Homogenise taxonomy" allows users to upload files from their computer in order to homogenise and update the taxonomical names they contain. The input files can be FASTA files with DNA sequences (.fasta only) or data tables (.txt; .csv); more details about the requirements for each file type are provided in the corresponding section dealing with Phytool functionalities (see Results section). The reference used for the taxonomic homogenisation process is the data table displayed on the main page of the application ("Taxonomic browser" tab). Briefly, the process works as follows: in the uploaded file, the R algorithm looks for the pattern corresponding to both genus and species names in each row. If the pattern is present in the reference database, then the higher-rank taxonomy is replaced (if it differs from the one in the input file) or added (if absent from the input file) with the matching taxonomy from the reference list. The current binomial names ('Genus species') can also be changed if they are not considered the 'currently accepted names': for instance, if a more recent denomination exists (the name has evolved through time) or if the name is unaccepted (i.e. nom. inval.; nom. illeg.; nom. rej.) and can be changed into an accepted name. If the provided 'Genus species' is not found in the Phytool reference list, then the name and its associated taxonomic ranks remain unchanged. Checkboxes then allow users: (1) to keep (or not) the 'old' taxonomic names when an update occurs; (2) to keep (or not) only the taxa matching the reference list during the homogenisation process (i.e. present in Phytobs, thus selecting freshwater phytoplankton taxa only). An additional file (logfile) is also created and can be downloaded at the end of the process. It tracks the following events: 'Genus not found'; 'Genus_species not found' and 'Current accepted name change' (see Figure 1 for details).
The diagram in Figure 1 sums up the working process of the taxonomic homogenisation.
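The lookup logic described in this section can be illustrated with a minimal sketch. It is written in Python for brevity and is not the R code used in Phytool. The two dictionaries are hypothetical stand-ins for the Phytobs-based reference table: the accepted entry reuses the Micractinium pusillum example from the Introduction, and the synonym "Exemplum vetus" is an invented placeholder. The order in which the log messages are produced is also an assumption.

```python
# Hypothetical stand-in for the reference table ("Taxonomic browser" tab).
TAXONOMY = {
    "Micractinium pusillum": {
        "class": "Trebouxiophyceae",
        "order": "Chlorellales",
        "family": "Chlorellaceae",
    },
}
# Invented placeholder synonym pointing to a currently accepted name.
SYNONYMS = {"Exemplum vetus": "Micractinium pusillum"}


def homogenise(name, ranks=None):
    """Return (name, ranks, log) for one 'Genus species' entry."""
    ranks = dict(ranks or {})
    log = []
    parts = name.split()
    genus = parts[0] if parts else ""
    known_genera = {n.split()[0] for n in list(TAXONOMY) + list(SYNONYMS)}

    if genus not in known_genera:
        log.append("Genus not found")
        return name, ranks, log  # unknown taxon: left unchanged
    if name not in TAXONOMY and name not in SYNONYMS:
        log.append("Genus_species not found")
        return name, ranks, log  # unknown species: left unchanged

    if name in SYNONYMS:
        name = SYNONYMS[name]  # switch to the currently accepted name
        log.append("Current accepted name change")
    ranks.update(TAXONOMY[name])  # replace or add the higher ranks
    return name, ranks, log
```

For instance, calling `homogenise("Micractinium pusillum", {"class": "Chlorophyceae"})` in this sketch replaces the class with the one held in the reference table and adds the missing order and family, while an unknown name is returned untouched with a log entry.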

II.3. Phytool barcode library: data origin and curation process
To date, the molecular data added in Phytool v.1.0 come from curated reference barcoding libraries only. These are represented in Table 1.
After being downloaded from the web, the collected sequences (FASTA format) were re-arranged (on Linux terminal) in order to be comparable (identical FASTA format with same taxonomical ranks). A curation process (schematic shown in Figure 2) has been applied to each database to avoid conflicts (i.e. different sequences associated with one species) and redundant taxa (i.e. a taxon with several identical sequences). For the same reason, if the barcode was covered by several different databases, the curation process was then applied to the full database (merging of the different libraries used for the target barcode). The sequences collected without conflicts were made available in the three Phytool rRNA barcode libraries; the others were not implemented in Phytool and were kept for further investigation (see Perspectives and Conclusion).
Figure 1. Step by step diagram of the taxonomic homogenisation process, from pattern recognition in the input files to creation of the output files.
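The two curation rules stated in this section (removing redundant copies and setting conflicting taxa aside) can be sketched as follows. This is an illustrative Python reading of the text, not the pipeline shown in Figure 2; the function name `curate` and the input layout of (taxon, sequence) pairs are assumptions, as is the case-insensitive comparison of sequences.

```python
from collections import defaultdict


def curate(records):
    """Split (taxon, sequence) pairs into kept and rejected taxa.

    kept maps a taxon to its single sequence; rejected maps a taxon
    to the sorted list of its conflicting sequences.
    """
    by_taxon = defaultdict(set)
    for taxon, sequence in records:
        # A set collapses identical sequences, i.e. redundant entries.
        by_taxon[taxon].add(sequence.upper())

    kept, rejected = {}, {}
    for taxon, sequences in by_taxon.items():
        if len(sequences) == 1:
            kept[taxon] = next(iter(sequences))
        else:
            # Several different sequences for one taxon: a conflict,
            # set aside for later manual investigation.
            rejected[taxon] = sorted(sequences)
    return kept, rejected
```

Applying the same function to the merged records of several libraries mirrors the statement that curation was repeated on the full database when a barcode was covered by more than one source.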

III.1. General overview
The Phytool application is available online at the following address: https://caninuzzo.shinyapps.io/phytool_v1/. It allows free access with a user-friendly interface (its functioning is explained in more detail below). The number of taxa per phylum and per barcode (16S, 18S, 23S) gathered in Phytool is summarised in Figure 3; the number of taxa per phylum available in Phytobs is also given. As a reminder, the number of sequences available in Phytool barcode libraries results from the curation process (detailed previously in the "Methodology" section). Table 2 gives a summary of the number of sequences kept after this curation process. All the numbers supplied here (Table 2; Figure 3; etc.) are specific to the first version of Phytool and, thus, likely to change with the next updates of the application. As shown by the pie chart (Figure 3), many taxa present in Phytobs suffer from a lack of barcode representation. This is particularly true, for instance, for the Bacillariophyta phylum, in which more than 2000 taxa are registered in Phytobs and only 302 taxa have a 16S barcode (1137 and 36 for 18S and 23S, respectively). Another point to notice is the huge proportion of barcodes that cannot be assigned at species level (i.e. unidentified species "sp."), as shown in the stacked barplot (Figure 3). This explains why only about 25% of the species contained in Phytobs have an associated barcode.

III.2. Interactive functionalities
Sections below describe the interactive functionalities of Phytool application that are available through different tabs.

III.2.1. Taxonomic browser
The "Taxonomy browser" tab enables users to display and download the different species registered in Phytool and to check whether DNA barcodes are available in the reference barcode libraries. An interactive table enables users to choose amongst different fields: the taxonomic ranks of the species, their potential synonyms (i.e. other species names that are no longer accepted and refer to the current accepted 'Genus_species') and the different barcodes implemented in Phytool (SSU16S; SSU18S and LSU23S). Ticking a checkbox on the left panel displays the associated column in the table; it is also possible to select rows by clicking directly on them within the table (click again to remove the selection).
The different fields are searchable in order to target species or lineages easily; finding a pattern within the complete table is also possible through search input at the top-right of the table. Finally, the download buttons on the left panel allow the download of the complete table (with current fields selected) or the download of only the current selection (fields selected and rows selected). The second option is possible only if at least one row is selected (it renders the button clickable).

III.2.2. Homogenise taxonomy
The "Homogenise taxonomy" tab is a key functionality of Phytool which allows users to homogenise (and update) the taxonomy from personal files. This can be done on FASTA files with DNA sequences or on data tables with taxonomy. The homogenisation process is restricted to freshwater microalgae present in Phytobs (or related species). The tab contains a Help button, which provides guidelines through a video tutorial, two buttons to select the input file according to its format (FASTA or dataframe) and a submit button (which is disabled until a file is chosen). The input file must meet some prerequisites to enable the pattern recognition process. Those conditions depend on the type of data uploaded (see following subsections); however, whatever the input file selected, its size should not exceed 100 MB. If the prerequisites are not met and/or the input file contains issues, then the process will not work and an error message will be displayed.

III.2.2.1. Uploading DNA sequences
The sequence files should be in FASTA format, with each sequence on a single row (not spread over multiple rows, as is often the case for some formats of FASTA files). If this is not the case, the tool 'rearrange FASTA format', provided in Phytool ("Other tools" tab, more details in §III.2.4), can be used to convert the file into the appropriate format. The field delimiters in the identifier lines should be semicolons (";") or tabs ("\t"). Other kinds of delimiters are not accepted, but they can easily be replaced with the tool 'rearrange FASTA delimiters', also provided in Phytool ("Other tools" tab, more details in §III.2.4). Another essential point is to ensure that identifier lines end with the "Genus species" names. Finally, users need to pay attention to issues such as empty lines at the beginning/end of files or inappropriate lines in FASTA files, which will lead to errors when using the application.
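Users who want to check a file before uploading it could test these prerequisites with a few lines of code. The Python sketch below is only an illustration of the rules listed above and is not part of Phytool; the function names and the wording of the reported problems are hypothetical. It does not verify that the last field really is a binomial name, only that it is not empty.

```python
import re


def last_field(header):
    """Return the text after the last ';' or tab of an identifier line."""
    return re.split(r"[;\t]", header.lstrip(">"))[-1].strip()


def check_fasta(lines):
    """Return (line_number, problem) tuples for a FASTA file given as lines."""
    problems = []
    previous_was_header = False
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line.strip():
            problems.append((number, "empty line"))
            continue
        if line.startswith(">"):
            if not last_field(line):
                problems.append((number, "identifier does not end with a name"))
            previous_was_header = True
        else:
            if not previous_was_header:
                # A second sequence row for the same record, or a stray line.
                problems.append(
                    (number, "sequence line not directly after an identifier")
                )
            previous_was_header = False
    return problems
```

An empty list returned by `check_fasta` means that none of the problems covered by this sketch was detected; it does not guarantee that the application will accept the file.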

III.2.2.2. Uploading data tables
Prerequisites for data tables are less constraining than for FASTA files. The provided data table just needs to contain a field called "Genus_species" in the header, in which the algorithm will look for patterns. Field delimiters can be semicolons (";") or tabs ("\t"), and can be specified when uploading the file. The table needs to be in an acceptable format (i.e. readable as a data.frame in Rstudio).
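The two requirements for data tables (an accepted delimiter and a "Genus_species" column) can be checked with a short script before upload. The following Python sketch uses only the standard library; the function name `read_table` is hypothetical and the code is an illustration, not the reading routine used by the application.

```python
import csv
import io


def read_table(text, delimiter=";"):
    """Parse a delimited table and check the prerequisites described above."""
    if delimiter not in (";", "\t"):
        raise ValueError("delimiter must be ';' or a tab")
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    # The header must contain the field searched by the algorithm.
    if "Genus_species" not in (reader.fieldnames or []):
        raise ValueError('missing "Genus_species" column')
    return list(reader)
```

A table that passes this check yields one dictionary per row, with the species name available under the "Genus_species" key.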

III.2.2.3. Output files
After the taxonomic homogenisation has been processed, two download buttons appear: one to download the input file with homogenised taxonomy and the second to get a logfile of the process. Additional checkboxes let users choose the content of the output file (default: no checkboxes are selected), and the options can be combined to obtain the desired output format. Users can choose to keep homogenised taxa only; in that case, other taxa (i.e. not matching Phytobs) will not be included in the output file. In addition to the updated taxonomic name, it is also possible to keep the initial taxonomic name, which will be provided in an additional field (e.g. Genus_species). Whatever the choices made with checkboxes, the logfile remains the same and tracks information such as "Genus not found"; "Genus_species not found" and "Current accepted name".
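How the two checkboxes combine can be illustrated with a small Python sketch. The row layout (keys `old_name`, `new_name`, `matched`) and the output field `Old_Genus_species` are hypothetical names chosen for the example; the actual field names written by Phytool may differ.

```python
def build_output(rows, keep_matched_only=False, keep_old_name=False):
    """Assemble output records according to the two checkbox options.

    rows: dicts with 'old_name', 'new_name' and 'matched' (bool) keys.
    """
    output = []
    for row in rows:
        if keep_matched_only and not row["matched"]:
            continue  # taxon absent from the reference list is dropped
        record = {"Genus_species": row["new_name"]}
        if keep_old_name:
            # Extra field holding the name found in the input file.
            record["Old_Genus_species"] = row["old_name"]
        output.append(record)
    return output
```

With both options left at their default (unticked) value, every input row is returned with its updated name only.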

III.2.3. Barcode reference libraries
The "Barcode libraries" tab displays the three different barcode reference libraries with the barcodes gathered in Phytool, which are (as a reminder) prokaryotic and eukaryotic small subunit ribosomal RNA (16S and 18S, respectively) and prokaryotic large subunit ribosomal RNA (23S). After selecting one of the three barcode reference libraries, the functioning of the interactive table is similar to the "Taxonomy browser". Amongst the different selectable (and searchable) fields provided, the original barcode reference library in which the sequence was found is available, as well as its original id number. The different taxonomic ranks, homogenised with Phytool, the potential synonyms and the size of the sequences (in base pairs) are provided. Users can choose to download the complete database or just a selection.

III.2.4. Other tools
Two functions have been implemented within the 'Other tools' tab:
• The first one, "rearrange FASTA format", transforms a FASTA file in which each sequence is spread over multiple rows into another FASTA file in which each sequence occupies a single row. The input FASTA file (with sequences spread over multiple rows) needs to be uploaded (its size should not exceed 100 MB). Thereafter, the submit button becomes clickable, the process of rearrangement is launched and a download button appears to save the transformed FASTA file.
• The second one, "rearrange FASTA delimiters", modifies the delimiter present in the identifier lines (starting with ">") of a FASTA file. After the upload of the FASTA file, the original delimiter (to modify) and the new delimiter (desired) can be provided. To use this function, follow the recommendations given for "rearrange FASTA format".
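For readers who prefer to prepare their files offline, the behaviour of these two tools can be approximated as follows. This Python sketch is an illustration of the transformations described above, not the code running in Phytool; the function names are hypothetical, and dropping empty lines during unwrapping is a choice made for the example.

```python
def unwrap_fasta(lines):
    """Join wrapped sequence rows so each record is one header plus one row."""
    output, chunks = [], []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # empty lines are dropped (assumption of this sketch)
        if line.startswith(">"):
            if chunks:
                output.append("".join(chunks))
                chunks = []
            output.append(line)
        else:
            chunks.append(line)
    if chunks:
        output.append("".join(chunks))
    return output


def change_delimiter(lines, old, new):
    """Replace `old` by `new` in identifier lines only; sequences are untouched."""
    return [
        line.replace(old, new) if line.startswith(">") else line
        for line in lines
    ]
```

Running `unwrap_fasta` first and `change_delimiter` second, for example with `old="|"` and `new=";"`, gives a file laid out as required in the section on uploading DNA sequences, provided the identifier lines already end with the species name.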

IV. Perspectives and conclusion
The current application, described in this paper, is the first release of Phytool; it is an innovative tool that simplifies some routine and time-consuming computer tasks for people working on freshwater phytoplankton. It aims to provide a common base for users, allowing better comparability across different studies, no matter the methodology used. Moreover, it gathers barcodes from different reference libraries, which have undergone an additional curation and can be downloaded in the format desired by users. Finally, some functionalities are also provided to reformat DNA sequence files (FASTA), which can be useful, especially for non-programmers.
Although it has been thoroughly tested, some issues may still occur. In case of issues/bugs, we encourage users to report them as explained in the tab "About Phytool", in order to improve the application.
The next release will mainly focus on enriching the barcode reference libraries through manual curation of the sequences from reference libraries that were rejected in this first release of Phytool. New barcodes will be implemented in Phytool in the future and these will also be deposited in the NCBI library. The project in which the current application was developed focuses on the development of eDNA tools applicable to phytoplankton biomonitoring. We, therefore, selected specific barcodes within the two marker genes (rRNA16S and rRNA23S), allowing us to target the entire freshwater phytoplankton community. These barcodes will thus be enriched in the next releases of the application. Users who want to contribute to the enrichment process (for the same barcodes or other ones) are welcome to participate. The former versions will not be erased, but will remain accessible in order to preserve traceability (especially of the taxonomic updates, which evolve through time). New functionalities which are widely used in bioinformatics are expected to be implemented in the next releases, such as the possibility to conduct in silico PCR over a selection of sequences. Other ideas can be found in the "Future perspectives" tab, and ideas or suggestions from users are more than welcome, as Phytool is intended to be a participative web-based application to help people working on freshwater phytoplankton.