18S rDNA amplicon sequence data (V1–V3) of the Palmyra Atoll National Wildlife Refuge, Central Pacific

To address the global biodiversity crisis, standardized data that are rapidly obtainable through minimally invasive means are needed for documenting change and informing conservation within threatened and diverse systems, such as coral reefs. In this data paper, we describe 18S rRNA gene amplicon data (V1–V3 region) generated from samples collected to begin characterizing coral reef eukaryotic community composition at the Palmyra Atoll National Wildlife Refuge in the Central Pacific Ocean. Sixteen samples were obtained across four sample types: sediments from two sieved fractions (100–500 μm, n = 3; 500 μm-2 mm, n = 3) and sessile material scrapings (n = 3) from Autonomous Reef Monitoring Structures (ARMS) sampled in 2015, as well as seawater from 2012 (n = 7). After filtering and contaminant removal, 3,861 Amplicon Sequence Variants (ASVs) were produced from 1,062,238 reads. The rarefaction curves demonstrated adequate sampling depth, and communities grouped by sample type. The dominant orders across samples were polychaete worms (Eunicida), demosponges (Poecilosclerida), and bryozoans (Cheilostomatida). The ten most common orders in terms of relative abundance comprised ~60% of all sequences and 23% of ASVs, and included reef-building crustose coralline algae (CCA; Corallinophycidae) and stony corals (Scleractinia), two taxa associated with healthy reefs. Highlighting the need for further study, ~21% of the ASVs were identified as uncultured, incertae sedis , or not assigned to phylum or order. This data paper presents the first 18S rDNA survey at Palmyra Atoll and serves as a baseline for biodiversity assessment, monitoring, and conservation of this remote and pristine ecosystem.


Introduction
Tropical coral reefs are among the most biologically diverse, complex, and productive of ocean ecosystems (Knowlton et al. 2010). However, they are also highly endangered habitats, with one-third of reef-building corals under extinction threat (Carpenter et al. 2008;Knowlton et al. 2010). Reef systems are declining due to pollution, overexploitation, global climate change, bleaching, ocean acidification, changing storms, and other factors (Carpenter et al. 2008;Knowlton and Jackson 2008;Sandin et al. 2008). Yet, basic understanding of taxa being affected is limited due to a combination of collection biases and challenges, lack of taxonomic expertise, and labor-intensive morphological evaluations, leading to uncertain species identifications and biodiversity estimates (Plaisance et al. 2011;Leray and Knowlton 2016).
To address this gap, baseline biodiversity assessment using standardized methods to enable comparison among sites represents a fundamental first step Leray and Knowlton 2016;Lafferty et al. 2021). While surface-dwelling taxa such as fish and corals can be visually and morphologically assessed, microscopic organisms and the small and cryptic groups living within the reef framework are generally more readily surveyed using next-generation metabarcoding technologies (Knowlton et al. 2010;Plaisance et al. 2011;Leray and Knowlton 2016;Servis et al. 2020;Deiner et al. 2021). High-throughput community-wide metagenomic approaches provide orders of magnitude more data than morphological methods or traditional DNA barcoding of individuals, identifying multiple organisms simultaneously from living biomass or environmental DNA (eDNA) in water or sediment (Taberlet et al. 2018;Deiner et al. 2021). Although there are limitations due to incomplete reference databases and other factors, this analysis is minimally destructive, less reliant on taxonomic expertise, increasingly cost-effective, and capable of identifying elusive, rare, endangered, and invasive taxa. Furthermore, it is especially useful for sampling remote areas that are difficult or costly to access (Kelly et al. 2017), like the Palmyra Atoll National Wildlife Refuge (PANWR) (Fig. 1).
Targeted metabarcoding research at Palmyra has established baselines and patterns for fish (Lafferty et al. 2021), prokaryotes, and viruses Smriga et al. 2010;McDole et al. 2012). DNA barcoding studies there have uncovered crustacean and brachyuran richness from Pocillopora coral heads (Plaisance et al. 2009;Knowlton et al. 2010) and sampling units known as ARMS, or Autonomous Reef Monitoring Structures (Leray and Knowlton 2015;Ransome et al. 2017;Pearman et al. 2018;Nichols et al. 2021). ARMS are composed of PVC plates stacked in a tier of alternating open and semi-closed layers and affixed to the reef benthos, where they attract organisms by mimicking structural complexity (Ransome et al. 2017). ARMS capture poorly understood cryptofauna (e.g., bryozoans, polychaetes, sponges, tunicates) and sediments containing microscopic taxa, secretions, excretions, shed cells, and other sources of eDNA. As such, ARMS constitute a standardized, replicable method for assessing both the metazoan diversity of coral reefs and their broader associated eukaryotic communities, and are increasingly deployed globally (Leray and Knowlton 2015;Ransome et al. 2017;Pearman et al. 2018;Nichols et al. 2021).
The data set presented here constitutes the first broad eukaryotic biodiversity survey of coral reef systems at the Palmyra Atoll National Wildlife Refuge. Both seawater and ARMS samples were examined using 18S rDNA, a promising genetic marker offering advantages of broad amplification across eukaryotic kingdoms, a rapidly growing reference database, and wide use (Taberlet et al. 2018). In this work the V1-V3 segment, with demonstrated utility for uncovering a broad range of aquatic taxa (Ingala et al. 2021), was sequenced. Our analysis of preliminary data showed that V1-V3 captured groups of interest, including corals and algae, while also contributing information about a less frequently sequenced 18S gene region. The versatility of the 18S PCR primers represents a convenient tool to screen the entire eukaryotic domain, providing a valuable baseline assessment of eukaryotic life in this relatively undisturbed coral reef ecosystem, and a significant contribution to the global effort to characterize reef biodiversity.

Study site
Palmyra Atoll is located in the Central Pacific Ocean approximately 1,700 km southwest of Hawai'i, lying on the northwest end of the Northern Line Islands ( Fig. 1; 5°53'N; 162°5'W). The PANWR, currently uninhabited except for research and management staff, includes about 2.5 km 2 of land area and 155 km 2 of reefs, and comprises small islands and islets surrounding western, central, and eastern lagoons (Collen et al. 2009). While U.S. military activities in the 1940s increased the area and volume of the islands, resulting in sedimentation and decreased water level and flow in the lagoons, the surrounding shallow reef flats and submerged reef terraces to the east and west remained relatively undisturbed (Collen et al. 2009;Sterling et al. 2013).

ARMS collection and DNA extraction
Three sites around the atoll were selected for ARMS deployment (Fig. 1). In May 2012 NOAA researchers placed one ARMS unit on the forereef at each of these sites, at depths of 10.7-18.3 m. The monitoring structures were secured to the hard bottom reef habitat using stainless steel stakes and heavy-duty zip ties, where they remained for three years. Upon recovery in 2015, each ARMS was encapsulated with a mesh-lined crate, detached from the reef, brought to the surface, transferred to a disassembly tub, and transported to the NOAA ship where it was taken apart plate by plate.
Once disassembled, the seawater within the tub was sieved through adjoining 2 mm and 500 μm pans and an attachable 100 μm mesh net, creating two separate sediment size fractions (100-500 μm and 500 μm-2 mm). These fractions were stored in 95% ethanol within a -20 °C freezer. Back at a NOAA lab onshore, each sediment fraction underwent a decantation process to isolate organic material from sediment particles following Leray and Knowlton (2015). Resulting organic material was stored in 95% ethanol until being sent out for DNA extraction.
Plates were scoured using a paint scraper to remove the sessile material. Scrapings were transferred into a blender and homogenized for one minute. The blended homogenate was poured into a 40 μm hand net and drained with the aid of filtered seawater. Approximately 10 g subsamples were placed into 50 ml falcon tubes containing 95% ethanol and stored at -20 °C until being shipped for DNA extraction.
At the Sackler Institute for Comparative Genomics, four replicate extractions (~ 0.25 g each) were performed using the MOBIO PowerSoil DNA Isolation Kit on each of the ARMS sediment and sessile samples. Although extraction blanks and positive controls were not used, standard decontamination, disinfection, and sterilization practices stringently in place at this dedicated molecular laboratory, including use of PCR-free areas, were assiduously followed, and in silico contaminant removal was carried out as described below. The DNA eluates from the four replicates were pooled (combined and mixed into one tube) and stored at -20 °C prior to being sent out for sequencing.

Water collection and DNA extraction
Water samples (n = 7) were collected in May 2012 from seven forereef sites < 1 m above the benthos at depths ranging from 3.7-13.7 m (Fig. 1). Following established collection and processing protocols, at each site, 5 L of reef water were obtained and passed through a 20 μm Nitex, followed by a 0.22 μm Sterivex filter (2 filters, ~ 2.5 L each) (Haas et al. 2014). Once filtered, samples were stored at -20 °C and shipped to a dedicated eDNA lab at San Diego State University for DNA isolation following Haas et al. (2014).

DNA sequencing
Sixteen samples, 7 from water and 9 from the three ARMS (two sediment fractions and one sessile material scraping sample per unit) (Fig. 1, Table 1), were quantified using a Qubit Fluorometer according to manufacturer instructions (Thermo Scientific, USA) and then sent to MRDNA (www. mrdnalab.com, Molecular Research LP, Shallowater, TX, USA) for purification, PCR amplification, sequencing, and analysis using standard and previously described procedures (Dowd et al. 2008;Ingala et al. 2021). First, preliminary runs were conducted and examined, and once deemed successful for satisfactory amplification of organisms of interest (e.g. plankton, corals, and algae), all of the remaining samples were submitted for processing. The DNA underwent PCR amplification targeting a 272-bp segment from the V1-V3 region of the 18S rRNA gene using the primers Euk7F (Medlin et al. 1988) tagged with a unique 8-bp identifier barcode, and Euk570R (Weekers et al. 1994). Each of the ARMS samples had a replicate sequenced, for a total of 25 PCR reactions. These reactions were carried out in 20 μl total volume containing 10 μl of the QIAGEN HotStarTaq Plus Master Mix Kit (Qiagen, USA), 9 μl of water, and 1 μl of template (Qubit 1/200 concentration range: 5.1 -76.8 ng/ mL). The PCR program consisted of an initial denaturation of 3 min at 94 °C followed by 28 cycles of 30 s at 94 °C, 40 s at 53 °C, and 60 s at 72 °C, and a final extension of 5 min at 72 °C. PCR product quality was assessed by gel electrophoresis in 2% agarose gels, and amplification success and proportions for sample pooling were informed by the band length and intensity. The uniquely tagged PCR products were pooled together in equal proportions based on DNA concentrations and electrophoresis band characteristics, and purified using Ampure XP beads (Agencourt Bioscience, USA) with a ratio of beads to PCR products of 0.75× following Illumina guidelines. They were then ligated to Illumina adapters using the Illumina TruSeq protocol, and sequenced on an Illumina MiSeq platform using a paired-end 2 × 300bp procedure (Dowd et al. 2008;Naro-Maciel et al. 2020;Ingala et al. 2021).

Demultiplexing, sequence filtering, and quality control
Forward and reverse reads extracted from MRDNA's Fastq Processor (MRDNA 2021) were imported into the QIIME2 v. 2019.1 pipeline (Bolyen et al. 2019), demultiplexed as paired-end sequences, and examined for overall read quality. While the quality of forward reads was generally high, reverse reads varied across runs, and as a result, only forward reads were used. The DADA2 plugin in QIIME2 (Callahan et al. 2016) was employed in single-read mode to denoise and remove chimeric sequences (including discarding singleton sequences) and to identify amplicon sequence variants (ASVs), using the default value (2) for the maximum number of expected errors, a 230-bp read length, and trimming the first 30 bases. Default values were also used for removing chimeras and sequences of poor quality and for all other parameters (QIIME2 2021). All scripts used in this paper are publicly available at https://github.com/nerdbrained/ palmyra_edna.

Taxonomy assignment
Amplicon-specific reference databases were used to obtain more robust taxonomic classifications (Bokulich et al. 2018). A primary database was created with the RESCRIPt QIIME plugin (Robeson et al. 2021) using sequences from the curated SILVA 138 rRNA reference database (Quast et al. 2013). Reference sequences were removed if 5 or more ambiguous bases or homopolymer runs greater than 8-bp in length were detected. Remaining sequences were then filtered to include only eukaryotes. From this pool, amplicon-specific reads were extracted from the filtered SILVA 138 database using the forward and reverse sequencing primers, and any replicate segments present after filtering were removed. Finally, a naïve Bayes classifier was fit to the reference sequence database. Sequences were aligned and a phylogenetic tree was created using the denoised forward sequences with the align-to-tree-mafft-fasttree command. The Bayes classifier was used to assign a taxonomic identification to each ASV in the tree using the classify-sklearn command in QIIME2.
To improve taxonomic assignments, two alternate assignment methods were also used. For the first method, the steps above were repeated employing an alternate curated 18S database with a focus on planktonic sequences (pr2 v.4.14.0) (Guillou et al. 2013, del Campo et al. 2018. For the second method, a sequence search was conducted against the NCBI database (downloaded 1/27/22) using the blastn algorithm with default parameters in BLAST+ v.2.11.0. (Camacho et al. 2009). BLAST hits were then used to assign sequences to taxa using the LCA-assignment algorithm in MEGAN Community Edition v.6.2.17 (Huson et al. 2016). Assignments from these methods were compiled for all sequences that did not receive a designation at the order level using the SILVA database. If either alternate method provided an assignment at the order level, that assignment was substituted in for downstream analyses. If the pr2 and BLAST identifications conflicted, the assignment from the pr2 database was preferentially retained as this database is actively curated by taxonomic experts.

Contaminant identification and removal
The QIIME2 outputs were imported into R v.4.0.0 (R Core Team 2020) using the R package PHYLOSEQ (McMurdie and Holmes 2013). All R code employed in this work can be found at https://github.com/nerdbrained/ palmyra_edna. Potential contaminants were identified and subsequently removed from the dataset in silico using the R package DECONTAM (Davis et al. 2018). This option was selected because lab procedures were conducted prior to the widespread use of extraction blanks and sequencing of PCR negative controls in metabarcoding studies. The program's frequency method flagged taxa whose proportions were inversely correlated with sample DNA concentration across all samples. A conservative threshold value of p < 0.10 was employed to identify potential contaminants, and any taxa meeting this criterion were removed from the dataset. After contaminant removal, the Table 1. Amplicon Sequence Variant (ASV) diversity measurements for each sample type. Included are the total number of samples, sequences retained after filtering, and the total and mean numbers of ASVs detected per sample. Number of (unique) ASVs, Shannon-Weaver and Faith's phylogenetic diversity for each sample type, before and after rarefaction to the lowest combined number of reads per sample type (152,590), are given to the left and to the right of slashes respectively. ARMS duplicate samples were merged, as these were not true biological replicates (Table 1).

Community composition
To assess if the number of sequences was appropriate for accurately estimating community composition and taxonomic diversity, rarefaction curves for each sample were produced using the R package VEGAN v. 2.5.5 (McMurdie and Holmes 2013; Oksanen 2019). Variation in community composition among sample types and collection sites was analyzed using a Jaccard distance matrix based on presence/ absence within a permutational analysis of variance (PERMANOVA) with 999 permutations. Presence/ absence rather than abundance-based analyses are presented here, as sequence and naturally occurring organismal frequencies do not necessarily match. However, abundance-based analyses using Bray-Curtis distance gave similar results (data not shown). Before calculating Jaccard distance, sequence abundance data were converted to presence/absence information using the function vegan_stand in the R package QSRUTILS (Zhang et al. 2017). Homogeneity of dispersion among groups was tested using the betadisper function in VEGAN. Community composition was primarily visualized using principal coordinate analysis (PCoA).
The number and percent of taxa shared and unique to each sample type were visualized using quasiproportional Venn diagrams of ASVs in the R package NVENNR (Quesada 2020).

Taxonomic and phylogenetic diversity
Sequences were grouped by sample type, the total number of ASVs was calculated, and the number of taxa was estimated at the second-and fourth-highest taxonomic levels present in the SILVA138 database. These levels are roughly consistent with conventional "phylum" and "order" classifications, respectively, with the caveat that taxonomic ranks in SILVA are assigned to preserve roughly the same level of evolutionary divergence at a given rank across the tree (Yilmaz et al. 2014). As such, some groupings reported here are clades encompassing several related orders. The proportion of sequences from each sample type belonging to each identified order was calculated, and these proportions were summed across sample types to identify the ten most common orders in the dataset. The Shannon-Weaver diversity index and ACE richness estimate were calculated using VEGAN for individual samples, pooled sample types, and the collective dataset. Faith's phylogenetic diversity (defined as the sum of all branch lengths in the phylogenetic tree for all samples in a given group) was calculated at the sample type levels and for the entire dataset with the R package PICANTE (Kembel et al. 2010). A Kruskal-Wallis test was used to determine whether differences in diversity among sample types were larger than expected under a null hypothesis of no difference. To explore the effect of varying read number on diversity metrics, all statistics were then re-calculated after rarefying with replacement to the lowest read depth for the site-and sample type-level analyses. To determine how inclusion of unicellular eukaryotes influenced each metric, diversity metrics were also calculated for each sample without rarefaction using only ASVs assigned to metazoan phyla. Visual representations of the abundance of the top 10 orders and the top 25 orders were created using a bar plot and a heat map, respectively, generated in PHYLOSEQ.

Results and discussion
This project generated promising baseline data for characterizing overall eukaryotic diversity of the remote and relatively pristine PANWR reef community, and for contributing to the ongoing global assessment of reef biodiversity. A total of 1,610,301 raw sequences were generated, with on average 100,000 sequences per sample (range: 54,090-190,997). After denoising and sequence filtration, 1,113,657 (70.5%) sequences remained, resulting in 3,936 ASVs. As noted above, 75 ASVs (51,419 sequences) were removed as putative contaminants, leaving 1,062,238 sequences and 3,861 ASVs for analysis (Suppl. material 3: Table S1).
Rarefaction at the sample level (n = 16) indicated that sequencing depth was sufficient for characterizing biodiversity, as ASV accumulation curves reached an asymptote in each case (Fig. 2). In all instances, rarefied diversity statistics were highly similar to measures calculated without rarefaction, indicating that variation in read number among samples and sample types did not account for substantive differences in diversity (Table 1; Suppl. material 4: Table S2). Dispersion did not differ significantly among sample types (p > 0.05). Community composition varied among sample types (PERMANOVA; p = 0.001) but not among sites (PERMANOVA; p > 0.1). PCoA ordination showed clustering of ARMS sample types, with the two sediment fractions showing the greatest similarity to one another (Fig. 3). Only 0.1% of the ASVs were found within all sample types, and 88.4% were unique to a single sample type, with roughly 72.9% of the unique ASVs found within the water samples (Suppl. material 1: Fig. S1).
Taxonomic composition consisted of 73 different phyla and 261 orders (Suppl. material 5, 6: Tables S3, S4). The pr2 and BLAST identification methods provided orderlevel identifications for over 400 ASVs that were initially unidentified using the SILVA database (Suppl. material 6: Table S4). However, even after using multiple assignment methods, nearly 14% of the ASVs remained identified as either uncultured, incertae sedis, or were not assigned a phylum, and an additional 7.5% were not assigned to order. The unidentified ASVs represented a relatively low number of sequences (67,932 sequences, or 6.4% of the total sequences retained after filtering), and over 80% Atoll. Sample types include 100-500 μm sediment, 500 μm -2 mm sediment, sessile material scrapings, and reef water. Given the uneven sampling depths, for some comparative analyses samples were later rarefied to even sequencing depth. Numbers indicate sampling site as shown in Fig. 1. of these sequences were detected in water samples. This finding underscores the needs both for further research and more comprehensive reference databases. Among ASVs identified to order, few orders (mostly metazoan) had high relative abundance (proportion summed across sample types > 0.1), while many had lower relative abundance (with proportion summed across sample types < 0.1; Fig. 4; Suppl. material 3, 5, 6: Tables S1, S3, S4). For example, the top 10 orders by relative abundance were represented by 534,776 sequences and 784 ASVs, comprising ~50% of all sequences and ~20% of ASVs. However, given documented PCR amplification biases, estimates of relative abundance should be interpreted with care (Fonseca 2018).
The most common groups across samples were polychaete worms (order Eunicida), followed by demosponges (order Poecilosclerida) and bryozoans (order Cheilostomatida). Order Podocopida (ostracod crustaceans) was the fourth most frequent, the fifth was the dinoflagellate order Syndiniales, and the sixth was the Peracarida crustaceans (a group containing amphipods and isopods, ranked as an order in the SILVA 138 taxonomy but classified as a superorder in other taxonomies). The Corallinophycidae red algae, which contains the order Corallinales (crustose coralline algae, CCA), was the seventh most common. Order Scleractinia, or stony corals, was the eight most frequent, followed by another demosponge order (Dendroceratida) and Calanoida (a marine copepod group). Notably, sequences from two orders critical to healthy reefs (Corallinophycidae and Scleractinia) were detected in both water and sessile ARMS samples. Corallinophycid genera identified by eDNA included Amphiroa, Hydrolithon, Jania, Lithothamnion, Mesophyllum, Neogoniolithon, Porolithon, Sporolithon, Titanoderma, and the scleractinian coral genera Acropora, Favites, and Montipora were also detected (Suppl. Material 3, 5: Tables S1, S3). Coral symbiotic zooxanthellae (genus Symbiodinium), which are ejected when corals bleach, were also found, mainly in water but also in the sessile and coarse sediment fractions of the ARMS samples.
the different communities recovered from metabarcoding seawater and ARMS from the same location, metazoan diversity between multiple sample types, including combined ARMS biomass and seawater, has been compared previously and confirmed higher metazoan diversity in the combined ARMS biomass (Nichols et al. 2021). In our work, the higher overall diversity and uniqueness of seawater compared to the combined ARMS biomass is dependent on inclusion of unicellular eukaryotes, as dinoflagellates and other unicellular organisms made up the majority of the seawater sequences and ASVs. However, caution is warranted as the water samples were collected in 2012 and the ARMS samples in 2015.
Among the ARMS samples, the sessile material scraping and 100-500 μm sediment fractions tended to have higher diversity than the 500 μm-2 mm ones (Table 1). Sequences from the sessile fraction represented mostly colonial metazoans that make up the physical structure of the reef, with scleractinian corals, demosponges, bryozoans, and CCA all present. The coarser 500 μm-2 mm sample displayed the lowest overall diversity and the fewest unique ASVs. However, this sample type did contain many polychaete sequences (Eunicida) that were not found in the other ARMS fractions. As reported in other ARMS studies, the 100-500 um fraction showed higher diversity than the 500 um-2 mm sample (Leray and Knowlton 2015;Ransome et al. 2017;Pearman et al. 2018). The two sediment fractions both recovered small malacostracan crustaceans (Peracarida), and the fine sample also contained many sequences from demosponges and ostracod crustaceans not found in the coarser fraction.
Notably, all four of the sample types contained unique ASVs, suggesting that complete characterization of reef biodiversity requires metabarcoding of a range of different sample types over time. This provides a complementary picture of reef biodiversity that recovers many sessile components of reef structure and small metazoans that would otherwise be missed in water column sampling. In conclusion, this data set paves the way for a better understanding of eukaryotic biodiversity in this largely pristine reef system, as well as guidance for future studies that should pay careful attention to updated protocols. This includes careful use of replicates at each site, sequencing of extraction blanks and negative / positive PCR controls, and meticulous attention to potential errors in relative abundance estimates (Fonseca 2018;Taberlet et al. 2018;Minamoto et al. 2021). This work demonstrated the breadth of biodiversity accessible through 18S rDNA metabarcoding of different types of marine environmental and biomass samples, and also provided a valuable baseline from a relatively intact and pristine reef system to which other reefs may be compared.

Data availability
The 18S rDNA amplicon gene sequences from this work are posted on the NCBI Sequence Read Archive (SRA) under BioProject number PRJNA804389 (Table 2). Scripts and R code used for processing and analyzing data are available at https://github.com/nerdbrained/palmyra_ edna. All DNA extracts are stored in Forest Rohwer's lab or at NOAA.  Fig. 1), barcode sequence, and number of raw of reads per sample are depicted along with site and sample type information and the NCBI Sequence Read Archive (SRA) accession number within BioProject PRJNA804389. The linker primer sequence for all samples was AACCTGGTTGATCCTGCCAGT.

Conflicts of interest
The authors declare no competing interests.