Ciliate SSU-rDNA reference alignments and trees for phylogenetic placements of metabarcoding data

Although ciliates are one of the most dominant microbial eukaryotic groups in many environments, there is a lack of updated global ciliate alignments and reference trees that can be used for phylogenetic placement methods to analyze environmental metabarcoding data. Here we fill this gap by providing reference alignments and trees for those ciliates taxa with available SSU-rDNA sequences derived from identified species. Each alignment contains 478 ciliate and six outgroup taxa, and they were made using different masking strategies for alignment positions (unmasked, masked and masked except the hypervariable V4 region). We constrained the monophyly of the major ciliate groups based on the recently updated classification of protists and based on phylogenomic data. Taxa of uncertain phylogenetic position were kept unconstrained, except for Mesodinium species that we constrained to form a clade with the Litostomatea. These ciliate reference alignments and trees can be used to perform taxonomic assignments of metabarcoding data, discover novel ciliate clades, estimate species richness, and overlay measured ecological parameters onto the phylogenetic placements.


Introduction
Phylogenetic placement is increasingly used to analyze environmental metabarcoding data. For example, it allows us to retain data that is deeply divergent from taxonomic reference databases (Mahé et al. 2017), perform taxonomic assignments (Bass et al. 2018;Jamy et al. 2020), and quantify the diversity distribution over a reference phylogeny (Barbera et al. 2021). Available phylogenetic placement programs include EPA-NG (Berger et al. 2011;Barbera et al. 2018), PPLACER (Matsen et al. 2010) and APP-SPAM (Blanke and Morgenstern 2020). These programs require a reference alignment and a reference tree. In ciliates, there is a recent reference alignment and tree for just the colpodean ciliates (Rajter et al. 2021) but recent global reference alignments and trees for all ciliates are lacking.
There are five global ciliate alignments and trees suitable for phylogenetic placement methods. All are based on the SSU-rDNA, containing the hypervariable V4 and V9 regions that are usually amplified in metabarcoding studies of ciliates. The first was published by Dunthorn et al. (2012b), which contained 175 ciliate sequences representing all 11 major ciliate clades, but outgroups were not included. The second was released a few years later in Dunthorn et al. (2014b), where the number of ciliate sequences was increased to 308, and there were two outgroups. The third and fourth are from Lynn (2008) and Gao et al. (2016), but the alignments and trees were not made publicly available. The fifth is from the EukRef-Ciliophora sequence database by Boscaro et al. (2018), which contained more than 1,200 ciliate sequences (including many unidentified environmental sequences). The identified sequences were not labelled based on the GenBank entries, and sequences with more than 97% similarity were discarded from the dataset. Since the publication of these reference alignments and trees, additional morphologically-identified sequences have been added to GenBank, and the ciliate classification was updated by Adl et al. (2019).
Here we provide updated ciliate reference alignments and trees designed to be used in phylogenetic placement analyses of environmental metabarcoding data. To do this, we: i) created SSU-rDNA dataset containing 478 ingroup and six outgroup sequences, ii) made three multiple sequence alignments with different masking of the ambiguously aligned nucleotide positions, and iii) constructed the maximum likelihood (ML) trees with constrained monophyly of main ciliate taxa based on the Adl et al. (2019) system.

Dataset with raw sequences
We enriched the ciliate nuclear SSU-rDNA alignment from Dunthorn et al. (2014) with subsequently-published ciliate sequences from GenBank (Clark et al. 2016), to create an updated dataset with 478 ciliate sequences and six outgroup sequences (Suppl. material 20: Table S1). This dataset contained only sequences from morphologically identified species; environmental sequences were not included, except Cariacothrix that only has environmental data in GenBank. Redundant sequences representing the same species were excluded, except two Phacodinium metchnikoffi populations were retained because this species creates an independent clade with an unknown position in the ciliate tree (Shin et al. 2000). We also included at least one sequence from each high-supported clade (bootstrap value ≥95%) found in trees from the recent ciliate class-specific phylogenetic studies (Suppl. material 1: File S1).

Alignments and masking
Multiple sequence alignments were built with MAFFT v7.471 (Katoh et al. 2019) using an iterative FFT-NS-i algorithm that always re-aligns the alignment after adding a new sequence (Katoh et al. 2002). Sequences were manually trimmed based on the universal eukaryotic SSU-rD-NA primers (Medlin et al. 1988) in BioEdit v7.2.5 (Hall 1999). Ambiguous alignment positions were masked using GBLOCKS v0.91b (Talavera and Castresana 2007) with the following block parameters: the minimum number of sequences for a flank position was set at 85%, the maximum number of contiguous non-conserved positions was fixed on eight, minimal block length was set on 10, and we allowed gap positions within the final blocks.
We created three multiple sequence alignments to compare the effects of various masking strategies (Suppl. material 21: Table S2). The first alignment had no masking of ambiguously aligned positions, containing extra phylogenetic information but possibly introducing misleading phylogenetic signal. The second alignment included masking for all ambiguously aligned positions, representing less informative but more biased-free alignment. And the third alignment also had masking for ambiguously aligned positions, except for the hyper-variable V4 region with a relatively high phylogenetic signal (Dunthorn et al. 2012a), representing a trade-off between losing phylogenetic informativeness and potentially biased results; the V4 region is typically used in metabarcoding studies of ciliates. All our reference alignments with corresponding trees can be found in the supplement (Suppl. material 2-Suppl. material 10).

Phylogenetic tree construction and constraints
Phylogenetic trees were inferred from all three alignments using RAXML v8.2.12 (Stamatakis 2014). The GTRGA-MMA model was preferably used as it's implemented in RAXML. To find a ML tree among all possible tree structures, we used the SPR tree-rearrangement. This rearrangement algorithm cuts a subpart of the tree and places it in another tree position, creating a new tree structure. We also used the constraint (-g) option in RAXML to force monophyly of all major ciliate subtaxa based on Adl et al. (2019). To compare the constraint's effect on the trees, we made an additional unconstrained tree from the first (unmasked) alignment (Suppl. material 11: File S11, Suppl. material 12: File S12, Suppl. material 13: File S13). For this tree, we performed 1,000 bootstrap replicates to find out the reliability of individual clades. Values <70% were considered low, 70-94% as moderate, and ≥95% as high (Hillis and Bull 1993). Trees were visualized and annotated in ITOL v5.5.1 (Letunic and Bork 2019).

Phylogenetic description of the reference trees
We forced the monophyly of all ciliate taxa based on the Adl et al. (2019) classification system in our constrained trees. To compare the effect of the constrained option on the overall topology, we also created an unconstrained tree from the unmasked alignment. The relationships between main ciliate taxa and higher grouping in our trees are described and compared with the literature below; the phylogeny within main taxa is described and discussed in Suppl. material 1: File S1.
All ciliate lineages clustered into two primary groups in all our trees. The first group, containing the Postciliodesmatophora clade, encompassed the Karyorelictea and Heterotrichea. This branching pattern is supported by previous morphological (Lynn 2008) and molecular (Gao and Katz 2014;Gentekaki et al. 2014) studies. Postciliodesmatophora branched off first and had moderate statistical support in our unconstrained tree, while the Intramacronucleata branched off the second but received only low statistical support. The position of Mesodiniidae outside of both main clades in the unconstrained tree is discussed in the incertae sedis part below.
The second group, containing the Intramacronucleata clade, encompassed the rest of the major ciliate taxa. Intramacronucleata was further splitting into two clades corresponding to SAL and CONthreeP groups. SAL contained the Spirotrichea, Armophorea (incl. Cariacotrichea, Muranotrichea, and Parablepharismea), and Litostomatea. Although SAL had only low statistical support in our unconstrained tree, its monophyly is highly supported in phylogenomic studies (Gentekaki et al. 2014;Lynn et al. 2018). Relationships inside SAL varied based on the alignment used. Litostomatea was sister to Armo-phorea+Spirotrichea in both unmasked alignments, while Armophorea was sister to Litostomatea+Spirotrichea in masked and V4 unmasked alignments. But the relationships within SAL are still poorly understood. Morphological and ontogenetic data supporting the Armopho-rea+Litostomatea grouping were presented in Vďačný et al. (2010). Statistical support from phylogenetic and phylogenomic molecular data is low (Lynn 2008;Gentekaki et al. 2014;Lynn et al. 2018).
The CONthreeP clade is the most diverse, containing six major ciliate groups: Colpodea, Oligohymenoporea, Nassophorea, Phyllopharyngea, Plagiopylea, Prostomatea (Adl et al. 2019). Though the monophyly of CONthreeP had only low statistical support in our unconstrained tree, it is consistently recognized in phylogenetic studies (Lynn 2008) and robustly supported in phylogenomic studies (Lynn et al. 2018). Regarding its intra-relationships, the late branching taxa were consistent in all our trees: Oligohymenophorea was sister to Prostomatea+Plagiopylea. This branching pattern was also recognized in other phylogenetic (e.g., Gao et al. 2016) and phylogenomic (e.g., Maurer-Alcalá et al. 2018) studies. But the relationships varied in the early-branching part of the CONthreeP subtree. Nassopho-rea+Phyllopharyngea branched off first, and Colpodea branched off the second in the constrained unmasked alignment. Colpodea+Nassophorea branched off first, and Phyllopharyngea branched off the second in masked and V4 unmasked alignments. Finally, Nassophorea branched off first, then Colpodea and Phyllopharyngea in the unconstrained tree.

Constrained monophyly of ciliate lineages
We constrained monophyly of the 11 main ciliate taxa based on the Adl et al. (2019) system that corresponds with a broad consensus from the recent phylogenetic and phylogenomic studies (Lynn 2008;Gentekaki et al. 2017;Lynn et al. 2018) as well as with the morphological view on ciliate phylogeny (Lynn 2003). The classification we used is summarized in Fig. 1, and all taxa with the corresponding classification are listed in Suppl. material 20: Table S1. Below, we discussed some of the taxa that are not strictly monophyletic in all SSU-rDNA phylogenies or based on the phylogenomic data.
One of these problematic taxa is the Nassophorea. Recent molecular studies have inferred this taxon as non-monophyletic, because its two subtaxa, Microthoracida and Nassulida, have not grouped together (Gong et al. 2009;Fan et al. 2014;Zhang et al. 2014;Pan et al. 2019). But as the definitive phylogenetic position of microthoracids and nassulids has yet to be resolved, we followed the Adl et al. (2019) system here. Thus, we forced Nassophorea to be monophyletic, containing Microthoracida and Nassulida together.
Spirotrichea represents a species-rich ciliate taxon with some taxonomically questionable members (Shin et al. 2000;Lynn and Strüder-Kypke 2002). One of these problematic spirotrichean taxa is the Licnophoridae, containing mostly symbionts of marine animals (Lynn 2008). Licnophoreans were assigned to Spirotrichea because they cluster in its vicinity in SSU-rDNA phylogenies, and because they also possess some typical spirotrichean morphologies (Lynn and Strüder-Kypke 2002). On the other hand, there are molecular data only for two species, and their phylogenetic position has not been thoroughly evaluated. For example, in our unconstrained tree the family Licnophoridae branched as sister to Armopho-rea+Spirotrichea, i.e. outside of the class Spirotrichea as in the other studies Gao and Katz 2014;Chen et al. 2015). Nevertheless, we follow (Adl et al. 2019) classification here, where licnophoreans are assigned within Spirotrichea, and we forced this grouping in all our constrained trees.
The taxon Armophorea is also non-monophyletic in SSU-rDNA phylogenies (da Silva Paiva et al. 2013;Li et al. 2017;Fernandes et al. 2018;Rotterová et al. 2018), as it contains family Caenomorphidae that is typically branching outside of Armophorea but always somewhere in the Spirotrichea+Armophorea+Litostomatea (SAL) group. In our unconstrained tree, caenomorphids branched as sister to the lineage Litostomatea, as in (Li et al. 2017) and (Rotterová et al. 2018). But as the phylogenetic position of caenomorphids is unstable based on the SSU-rDNA, we kept this family within the Armophorea in all our constrained trees.
The taxon Cariacotrichea was created based on the environmental sequences from the anoxic Cariaco Basin's deep-sea waters in Venezuela (Stoeck et al. 2003b;Edgcomb et al. 2011;Orsi et al. 2012). Cariacotrichea is sometimes referred to as a separate class. Still, as more data are needed for this group, most of the ciliate classification systems considered cariacotricheans as incertae sedis clustering somewhere in SAL (Adl et al. 2019). Two other newly established lineages in the vicinity of Armophrea are Parablepharismea and Muranotrichea (Rotterová et al. 2020). In our trees, all three lineages -Cariacotrichea, Parablepharismea, and Muranotricheaalways grouped together creating a sister clade to Armophorea as in Rotterová et al. (2020). For simplicity, we decided to label all these clades as Armophorea in our reference trees, which corresponds to the APM clade defined in Rotterová et al. (2020). For the detailed taxonomical information, please see Suppl. material 20: Table S1.
Although most of our classification followed Adl et al. (2019), an exception is the family Mesodiniidae (Mesodinium pulex and Myrionecta rubra) that we assigned to the taxon Litostomatea and not as an incertae sedis taxon. Mesodiniidae created a separate long-branched clade that clustered between all ciliates and outgroup in our unconstrained tree and other molecular studies (Dunthorn et al. 2014b;Gao et al. 2016). The problematic phylogenetic placement of mesodiniids is caused by the different primary structure of the mesodiniids' SSU-rDNA sequence compared to the typical ciliate SSU-rDNA sequence. But based on the phylogenomic data, Lasek-Nesselquist and Johnson (2019) discovered that mesodiniids have a close affinity to Litostomatea corresponding to the Lynn (2008) classification. They showed that the highly derived mesodiniids' SSU-rDNA sequence results from stolen prey nuclei dominating the cell expression. In light of this new evidence, we decided to constrain the Mesodiniidae to be in the Litostomatea.

Phylogenetic position of some ciliate incertae sedis taxa
Despite the recent advances in sequencing techniques and more molecular data available, the phylogenetic position of some ciliate taxa is still uncertain. We still wait for more data from these enigmatic lineages that we labelled as incertea sedis for now. We discuss below those taxa one by one and describe their evolutionary position in our reference trees compared to results from other phylogenetic studies.
Protocruziidae (Protocruzia adherens and Protocruzia contrax) represented a separate lineage within the Intramacronucleata in all our trees (Fig. 1). Protocruziidae have been formerly assigned to the lineage Spirotrichea (Lynn 2008), but its phylogenetic position is unstable in molecular studies and mainly creates independent clade within Intramacronucleata based on single-gene (Li et al. 2010) and also phylogenomic (Gentekaki et al. 2014) studies. This lineage is placed as incertae sedis within Intramacronucleata in the Adl et al. (2019) classification as well.
Phacodiniidia represented by Phacodinium metchnikoffi sequences from the Korean and China populations created a separate lineage clustered within SAL lineage as sister to Spirotrichea or as a sister to Litostomatea in our reference trees. Although Phacodiniidia represents a spirotrichean subgroup according to (Lynn 2008), only limited molecular data are available for this taxon, and its systematic positions have never been satisfactorily resolved. Therefore, this taxon is assigned as incertea sedis within the SAL group in Adl et al. (2019) and in our reference trees, too (Fig. 1).
Although Askenasia sp. had a stable phylogenetic position within the lineage Prostomatea in all our reference trees, this taxon represents incertea sedis in Adl et al. (2019). The phylogenetic position of Askenasia was thoroughly studied in (Liu et al. 2015), where this genus cluster close to Prostomatea, Plagiopylea, or Oligohymenophorea based on SSU-rDNA phylogeny. But considering morphology, Liu et al. (2015) assumed that Askenasia is indeed a prostomatean, as our reference trees also indicated (Fig. 1). The Paraspathidium apofuscum sequence was also placed within the lineage Prostomatea in all our reference trees despite this genus having the incertae sedis status in Adl et al. (2019).
Cyclotrichium cyclokaryon and Pseudotrachelocerca trepida formed a close relationship in all our trees, clustering always inside the CONthreeP group. Both taxa are also incertae sedis in Adl et al. (2019) within CON-threeP and grouping near Plagiopylea. Another incertae sedis taxon -Discotrichidae -was also placed within the CONthreeP group in all our reference trees. The family Discotrichidae was previously included in the Microthoracida (class Nassophorea) (Lynn 2008), but after inferring its SSU-rDNA phylogeny (Fan et al. 2014;Zhang et al. 2014), this taxon was assigned as incertae sedis with-in CONthreeP (Adl et al. 2019), as our reference trees showed as well (Fig. 1).

How to use our reference alignments and trees in the ciliate diversity research
Phylogenetic placement of metabarcoding data allows researchers to ask broader evolutionary and ecological questions. But without taxonomical expertise on specific taxa, interpretation of these placements is difficult. Here we offer suggestions on some of these interpretations for ciliates using our reference alignments and trees. First, the user can determine the taxonomic position of the investigated environmental sequences as during phylogenetic placement is each sequence attached to the reference tree. The exact phylogenetic positions for all placed sequences can be retrieved using the taxonomical assignment command in the GAPPA package (Czech et al. 2020). In this way, it is possible to taxonomically assign all the investigated environmental sequences, even those that are distantly related to known species and are challenging to assign using standard taxonomic assignment methods (Koski and Golding 2001). The comparison between standard taxonomical assignments and phylogenetic placements is discussed in more detail by Czech and Stamatakis (2019) Second, the user can discover novel ciliate clades. Despite a long history of the ciliate investigation, novel ciliate clades outside of the main ciliate taxa could be still found in understudied environments such as the anoxic deep-sea waters (Stoeck et al. 2003b;Orsi et al. 2012;Rotterová et al. 2020) or geographical regions such as subtropical and tropical areas (Fernandes et al. 2020). Novel environmental clades within the main ciliate taxa will likely be detected in our reference trees. Many environmental clades have been recovered within almost all main ciliate taxa using a global ciliate tree in Boscaro et al. (2018). Also, numerous unique ciliate environmental sequences have been reported in recent metabarcoding studies from marine and freshwaters habitats (Gimmler et al. 2016;Canals et al. 2020;Fernandes et al. 2020;Zhao 2021). As phylogenetic placements combined with reference alignments will determine the phylogenetic position and the close relatives of the novel environmental clades, molecular ecologists could use this information to assign ecological characteristics of the potential species in the given environmental clade. Based on the environmental sequences, it is possible to create specific phylogenetic probes for fluorescence in situ hybridization (FISH) and link morphological and molecular data in a further investigation (Stoeck et al. 2003a;Gimmler and Stoeck 2015).
Third, the user can estimate the total number of ciliate species and their subtaxa. This is possible by using the assign subcommand in the GAPPA package (Czech et al. 2020) that will taxonomically assign placed query sequences into the predefined taxonomy that needs to be provided as an input file. We prepared two different predefine taxonomy input files. The first one contains only assignment to the main ciliate taxa and the outgroup (Suppl. material 14: File S14). Using this input file, it is possible to determine the exact number of placed queries sequences inside and outside of Ciliophora as well as in each main ciliate taxon. The second input file (Suppl. material 15: File S15) includes more detailed system that is based on Adl et al. (2019) classification. Using this input file, it is possible to get detailed taxonomical resolution of the placed query sequences.
Ciliate species diversity has mostly been estimated using morphological differences to define species boundaries. Foissner et al. (2008) estimated that 83-89% of the ciliate diversity is undiscovered and assumed that the total number of ciliate species is between 27,000 and 40,000. On the other hand, Finlay et al. (1998) argued that global ciliate diversity is relatively low, consisting of about 3,000 species. Although recent molecular studies discriminate species based on percentual sequence similarity, these studies often mirror the results obtained from morphological investigations of the same samples (Stoeck et al. 2018;Forster et al. 2019;Pitsch et al. 2019). Molecular ecologists can therefore evaluate the morphological-based estimations of the ciliate species diversity using environmental HTS data combined with our alignments.
Fourth, the user can correlate phylogenetic placements with measured environmental parameters using the edge correlation command in GAPPA. In this way, it is possible to combine environmental parameters with phylogenetic placements and determine which environmental factors influence ciliate community composition.

Conclusions
As the phylogenetic placement methods are increasingly used in metabarcoding studies, here we provided global ciliate reference alignments and trees. We designed and formatted these reference alignments and trees specifically for the ciliate diversity research and phylogenetic placement demands as there are no such available datasets. These files are easily downloadable from the online supplement (Suppl. material 2-Suppl. material 10).