Research Article
Corresponding author: Nathalie Peyrard (nathalie.peyrard@inrae.fr)
Academic editor: Gert-Jan Jeunen
© 2024 Marie-Josée Cros, Jean-Marc Frigerio, Nathalie Peyrard, Alain Franc.
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Cros M-J, Frigerio J-M, Peyrard N, Franc A (2024) Simple approaches for evaluation of OTU quality based on dissimilarity arrays. Metabarcoding and Metagenomics 8: e108649. https://doi.org/10.3897/mbmg.8.108649
An accurate and complete taxonomic description of the diversity present in an environmental sample is out of reach at this time. Instead, metabarcoding is used today and it is expected that OTUs represent a category relevant for biodiversity inventories on a molecular basis. However, artefacts in the production of OTUs can occur at different stages and may impact ecological conclusions. We propose to evaluate the quality of OTUs in a sample by characterising the deviation of each OTU’s dissimilarity array from that of an ideal OTU where all sequences are at distances smaller than the barcoding gap. We consider two deviations: the creation of composed OTUs, corresponding to the artificial merging of several OTUs and the creation of noisy OTUs that contain some sequences that are loosely associated with the core sequence of the OTUs and that do not form a compact subgroup. We propose a simple and automatic 2-step method that successively categorises the OTUs of a sample as composed or single and then identifies OTUs with noise amongst the single ones. The associated code is available at https://forgemia.inra.fr/alain.franc/otu_shape. We applied the method on 32 samples of diatoms from Arcachon Bay (France) that represent contrasted environmental conditions and we obtained good agreement with expert categorisation of OTUs. We suggest that single OTUs without noise can be used as such for further ecological studies. Composed OTUs should be post-treated with classical clustering or community detection tools. The quality of single OTUs with noise remains to be further tested via supplementary studies on a diversity of organisms.
Composed OTU, diatoms, metabarcoding, OTU with noise, support vector machine, stochastic block model
Exponential development of Next Generation Sequencing and High Throughput Sequencing has facilitated the mass production of barcodes in environmental samples with metabarcoding (
OTUs are building blocks of molecular-based inventories and there are various protocols for building them from sets of sequences in an environmental sample. Artefacts in the production of OTUs can occur at different stages (see, for example,
To characterise the notion of quality of an OTU, we refer to an ideal OTU (where all dissimilarities within an OTU are smaller than the barcoding gap) and we identify possible deviations from the theoretical pattern of the corresponding dissimilarities array. Deviations, when they exist, are not random. We study two deviations leading to composed OTUs and OTUs with noise. As defined above, composed OTUs are the artificial merging of several OTUs, as opposed to single OTUs. We propose a new way to identify composed OTUs. Unlike the breaking phase in
We apply the approach on a dataset of diatoms from Arcachon Bay, kindly made available by the Malabar project (
The material required as input of our method for identifying OTU types of a sample is the list of OTUs in the sample together with the array D of pairwise dissimilarities associated with each OTU. Rigorously, the mathematical object is a matrix. However, here and in the following, we use the term array, by reference to the operational implementation of the building of D, since in the code, it is a variable of type array.
We apply our method on a dataset of diatoms from Arcachon Bay. They represent a sampling of the diversity of photosynthetic protists, mainly diatoms, in Arcachon Bay, France. Samples are allocated equally amongst the four seasons (autumn, winter, spring, summer), four locations (Bouée 13, Comprian, Jacquets, Teychan) and two water columns (pelagic high tide and benthic). This yields 4 x 4 x 2 = 32 samples. Sample sizes range between 19,000 and 36,000 reads (Suppl. material
For each sample, pairwise dissimilarities between reads have been computed, after dereplication, with the Smith-Waterman local alignment score. Due to the size and number of fasta files, we have used the distributed version of disseq called mpidisseq (see https://gitlab.inria.fr/biodiversiton/disseq), run on the cluster CURTA of the mésocentre of Nouvelle-Aquitaine. Hence, an n by n dissimilarity array is attached to each sample composed of n reads.
The dissimilarity array of a sample is denoted D and the dissimilarity between reads i and j is denoted d (i, j) (term at row i and column j of D). In a second step, we computed OTUs from D for each sample. The numbers of reads and OTUs per sample are given in Suppl. material
An OTU is then defined as a connected component of G. The associated subgraph of G is denoted Gotu. It is connected by construction, but it is not always a clique since we can have three elements i,j,k such that d(i,j) ≤ g and d(j,k) ≤ g (therefore i, j and k are in the same connected component), but d(i,k) > g (the barcoding gap). It is the well-known chaining effect. For each sample, we extracted one dissimilarity array per OTU, denoted Dotu. Hence, we worked with 32 sets of dissimilarity arrays. We kept the OTUs with 15 reads or more only, because it would not be meaningful to try to identify groups in smaller OTUs.
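The construction of OTUs as connected components, and the chaining effect it allows, can be illustrated with a minimal sketch. It assumes that two reads are linked in G whenever their dissimilarity is at most the barcoding gap g; the function names and the toy array are hypothetical and are not taken from the authors' code.

```python
from collections import deque

def otus_from_dissimilarities(D, gap):
    """Return OTUs as connected components of the threshold graph.

    Assumption of this sketch: reads i and j are linked when D[i][j] <= gap.
    """
    n = len(D)
    visited = [False] * n
    otus = []
    for start in range(n):
        if visited[start]:
            continue
        # Breadth-first search collects every read reachable from `start`.
        visited[start] = True
        queue = deque([start])
        component = []
        while queue:
            i = queue.popleft()
            component.append(i)
            for j in range(n):
                if not visited[j] and D[i][j] <= gap:
                    visited[j] = True
                    queue.append(j)
        otus.append(sorted(component))
    return otus

def sub_array(D, members):
    """Extract the dissimilarity array of one OTU (rows and columns of its reads)."""
    return [[D[i][j] for j in members] for i in members]

# Toy array: reads 0-1 and 1-2 are close, reads 0-2 are not (chaining effect).
D = [
    [0, 2, 5, 9],
    [2, 0, 2, 9],
    [5, 2, 0, 9],
    [9, 9, 9, 0],
]
otus = otus_from_dissimilarities(D, gap=3)
print(otus)                      # [[0, 1, 2], [3]]
print(sub_array(D, otus[0]))     # contains the value 5, larger than the gap
```

Reads 0 and 2 end up in the same OTU although their dissimilarity exceeds the gap, so the corresponding subgraph is connected but is not a clique. A size filter (15 reads or more in the text) would be applied to the returned list.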
We checked that the OTUs obtained are very close to the outputs of SWARM. It is not surprising because our procedure relies on building connected components at a given threshold and this is known to be equivalent to hierarchical aggregative clustering with Single Linkage (
A reference database for the rbcL marker for diatoms is available (
A drawback of the above 3-step procedure for building the OTUs is that two reads can have a dissimilarity larger than g and still belong to the same OTU. Is there a way to define an OTU with the same procedure, but with the guarantee that all dissimilarities within an OTU are below the barcoding gap? In this case, Gotu is a clique. The answer is yes if we use a dissimilarity such that d(i,j) ≤ g and d(j,k) ≤ g implies that d(i,k) ≤ g (it means that the relationship defined by “i relates to j if and only if d (i, j) ≤ g” is transitive). This is possible if and only if d is a distance and is ultrametric. A distance d is said to be ultrametric if it fulfils the condition d (i, j) ≤ max(d (i, k), d (j, k)) for any read k, which is stronger than the classical triangular inequality. Dissimilarities computed as edit distances between two reads are not ultrametric and, therefore, the relationship defined as being at a distance below a barcoding gap is not transitive. On the contrary, the age of the Most Recent Common Ancestor (MRCA) between two reads is ultrametric. If D is built with the MRCA as distance and steps 1 to 3 above are applied, all connected components of G are cliques, and an OTU is a clique. Such an OTU is said to be “ideal”.
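The link between ultrametricity and transitivity of the relation "d (i, j) ≤ g" can be checked numerically on small arrays. The sketch below is only illustrative: the two 3 x 3 arrays are invented toy values and the function names are hypothetical.

```python
from itertools import permutations

def is_ultrametric(D, tol=1e-12):
    """Check d(i, j) <= max(d(i, k), d(j, k)) for all triples of distinct reads."""
    n = len(D)
    return all(
        D[i][j] <= max(D[i][k], D[j][k]) + tol
        for i, j, k in permutations(range(n), 3)
    )

def threshold_relation_is_transitive(D, gap):
    """Check that d(i, j) <= gap and d(j, k) <= gap imply d(i, k) <= gap."""
    n = len(D)
    return all(
        D[i][k] <= gap
        for i, j, k in permutations(range(n), 3)
        if D[i][j] <= gap and D[j][k] <= gap
    )

# Edit-distance-like toy array: 0-1 and 1-2 are close, 0-2 is far.
edit_like = [[0, 2, 4], [2, 0, 2], [4, 2, 0]]
# MRCA-age-like toy array: reads 0 and 1 coalesce at age 2, both join read 2 at age 6.
mrca_like = [[0, 2, 6], [2, 0, 6], [6, 6, 0]]

print(is_ultrametric(edit_like), threshold_relation_is_transitive(edit_like, gap=3))
print(is_ultrametric(mrca_like), threshold_relation_is_transitive(mrca_like, gap=3))
```

The first array violates the ultrametric inequality and the threshold relation is not transitive at g = 3; the second satisfies both, so its connected components at any threshold are cliques.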
Our hypothesis is that the observed deviations from this ideal OTU structure are not random, but are themselves structured. In what follows, using only the dissimilarity arrays, we describe two ways in which an OTU can diverge from being ideal: composed OTU and OTU with noise.
First, we define what is a single OTU. A single OTU is close to what would be a theoretically ideal OTU, where all dissimilarities in Dotu are smaller than the barcoding gap. There may be a few exceptions for some sequences, but we will deal with that in a second step, when defining OTU with noise. The corresponding graph Gotu is composed of a single large strongly connected entity with the possibility of some satellite nodes. Composed OTUs deviate from ideal and single OTU by the fact that they correspond to dissimilarity arrays with a structure of two or more blocks, with intra-block dissimilarities smaller than the barcoding gap and most of the inter-block dissimilarities larger. This leads to a graph Gotu with several entities, where the nodes in an entity are strongly connected and there are few connections between the entities. In graph theory, such a graph is said to have a community structure (
Examples of graph Gotu for three types of OTUs, from top to bottom: (i) an ideal OTU, which is single and a clique (each read has a dissimilarity smaller than the barcoding gap with all the other reads of the OTU); (ii) a single OTU with a large strongly connected core entity and some satellite nodes; (iii) a composed OTU, consisting of several entities with high intra-entity connection rates and low between-entity connection rates (and some satellite nodes as well).
It is well known that composed OTUs can be produced during the phase of clustering of the sample reads due to the above mentioned chaining effect. It usually corresponds to the grouping of reads from different species in the same OTU. We illustrate this chaining effect on a sample by comparing the species and the OTU that each sequence belongs to. In Fig.
Illustration of the chaining effect. Both figures display the same scatter plot of sample 180912_PM_PEL_B13 (high tide, pelagic, summer, Bouée 13), where one dot is a read (there are 37036 dots), with the first MDS component on the x axis and the second one on the y axis. The two plots differ by the way dots are coloured. In the left plot, dots are coloured according to the OTU they belong to. In the right plot, they are coloured according to the species they have been assigned to. Only the species and OTUs with the 12 largest sizes have been coloured; the remaining ones are coloured in grey (if not, many colours would have been indistinguishable).
Then, amongst single OTUs, we describe a second deviation from an ideal OTU: OTU with noise. In these, some reads are only loosely associated with the rest of the OTU and are too far from each other to form an entity by themselves: for such a sequence i, dissimilarities d (i, j) are below the barcoding gap for only a small number of sequences j. These sequences are far from the core sequences of the OTU and they do not form a second entity (as in a composed OTU) since they can be far from each other (see Fig.
In the following, we present a simple and automatic 2-step method that successively categorises the OTUs of a sample as composed or single and then identifies OTUs with noise amongst the single ones. We will show that composed and single with noise OTUs represent the majority of the OTUs in the different samples of our dataset.
In a first step, we propose an automatic unsupervised method for sorting the OTUs of a sample into two groups: single ones and composed ones. In a single OTU, most dissimilarities in Dotu will be smaller than the barcoding gap. For a composed OTU, there will be a significant proportion of dissimilarities larger than the gap (due to the inter-entity dissimilarities). This is the information we use to discriminate between single and composed OTUs. For a given OTU, we build Gotu from Dotu. We then define θ as the ratio between the number of missing edges in Gotu and the total number of possible edges. The number of missing edges corresponds to half the number of elements in Dotu that are larger than the barcoding gap (since Dotu is a symmetric array and each dissimilarity appears twice). It is equal to Σi < j δ (dotu (i, j) > g), where the sum is over all pairs (i, j) of rows and columns of Dotu with i < j. The function δ is equal to 1 if the condition is satisfied and 0 otherwise. The total number of possible edges in Gotu is equal to notu (notu - 1)/2, where notu is the number of reads in the OTU. Therefore, θ = 2 Σi < j δ (dotu (i, j) > g)/(notu (notu - 1)). Then, for single OTUs, θ will be small, because very few edges are missing. For composed OTUs, θ will be large. Indeed, let us take as an example an OTU with two balanced entities. There will be few missing edges within each entity, but many edges missing between both entities. If each entity has notu /2 sequences, there are (notu /2) x (notu /2) = notu² /4 possible edges between both entities and as many potential missing edges. Hence Σi < j δ (dotu (i, j) > g) ≈ notu² /4 while notu (notu - 1) ≈ notu². Finally, θ ≈ 2 (notu² /4)/notu² = 1/2.
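The ratio θ can be computed in a few lines from the dissimilarity array of an OTU. The following sketch follows the definition given above; the helper that builds a two-block toy array, and the chosen dissimilarity values, are assumptions made for illustration only.

```python
def missing_edge_ratio(D_otu, gap):
    """theta = (number of pairs with dissimilarity > gap) / (number of pairs)."""
    n = len(D_otu)
    if n < 2:
        raise ValueError("an OTU needs at least two reads")
    missing = sum(
        1 for i in range(n) for j in range(i + 1, n) if D_otu[i][j] > gap
    )
    return 2 * missing / (n * (n - 1))

def two_block_array(block_size, intra=1, inter=10):
    """Toy OTU with two balanced entities (hypothetical values)."""
    n = 2 * block_size
    return [
        [0 if i == j else (intra if (i < block_size) == (j < block_size) else inter)
         for j in range(n)]
        for i in range(n)
    ]

composed = two_block_array(50)            # 100 reads in two entities of 50
clique = two_block_array(50, inter=1)     # every pair below the gap

print(missing_edge_ratio(composed, gap=3))   # 50/99, close to 1/2
print(missing_edge_ratio(clique, gap=3))     # 0.0
```

With two entities of 50 reads, 2500 of the 4950 pairs are missing edges, which gives θ = 50/99 and matches the approximation θ ≈ 1/2.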
To sort the OTUs of a sample into composed and single ones, we use θ, which can be computed directly from D. We define a critical value θc as follows. We compute θ for each OTU and we build a smoothed version of the histogram of the θs using a Gaussian kernel (see Suppl. material
Principle of the method for sorting OTUs of a sample into composed and single ones. Example of a smoothed version of the histogram of θ values (ratio between the number of missing edges in Gotu over the total number of possible edges in the OTU): θc is the first local minimum after the first mode, OTUs with θ < θc are single, and OTUs with θ > θc are composed.
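The choice of θc as the first local minimum after the first mode of the smoothed histogram can be sketched as follows. This is only an illustration under stated assumptions: the bandwidth, the evaluation grid and the toy θ values are hypothetical, and the authors' implementation may smooth the histogram differently.

```python
import math

def smoothed_density(values, grid, bandwidth):
    """Gaussian kernel density estimate of `values`, evaluated on `grid`."""
    norm = len(values) * bandwidth * math.sqrt(2 * math.pi)
    return [
        sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values) / norm
        for x in grid
    ]

def critical_theta(thetas, bandwidth=0.05, n_grid=201):
    """First local minimum of the smoothed density after its first mode.

    Returns None when the density has no such minimum on [0, 1].
    """
    grid = [k / (n_grid - 1) for k in range(n_grid)]   # theta lies in [0, 1]
    density = smoothed_density(thetas, grid, bandwidth)
    k = 0
    # Climb to the first mode (first point where the density stops increasing).
    while k + 1 < n_grid and density[k + 1] >= density[k]:
        k += 1
    # Walk down to the first local minimum after that mode.
    while k + 1 < n_grid and density[k + 1] <= density[k]:
        k += 1
    if k == n_grid - 1:
        return None
    return grid[k]

# Toy theta values: eight small ones (single OTUs) and four around 1/2 (composed OTUs).
thetas = [0.02, 0.03, 0.05, 0.04, 0.06, 0.01, 0.03, 0.05, 0.45, 0.50, 0.55, 0.48]
theta_c = critical_theta(thetas)
single = [t for t in thetas if t < theta_c]
composed = [t for t in thetas if t > theta_c]
print(theta_c, len(single), len(composed))
```

The result depends on the bandwidth: too small a value creates spurious minima inside the first mode, too large a value can merge the two modes, in which case the function returns None.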
We focus now on OTUs identified as single. Later, we will discuss possible tools to split OTUs identified as being composed in order to obtain a clustering of the sample’s reads formed only of single OTUs. In order to determine if a single OTU contains noise reads or not, we propose a fully automatic supervised classification method whose input variables are features derived from the dissimilarity array Dotu. Namely, we use a linear Support Vector Machine (SVM) to discriminate between the two types of single OTUs. To derive the features, we estimate the parameters of a Stochastic Block Model (SBM,
In practice, we assigned an ‘expert’ label to each OTU of a training set, amongst ‘with noise’, ‘uncertain’ and ‘without noise’. To do this, we computed the normalised degree βseq of each read i of the OTU, defined as the percentage of dissimilarities smaller than the barcoding gap in the row corresponding to this read in the dissimilarity array Dotu: βseq (i) = 100 Σj ≠ i δ (d (i, j) ≤ g)/(notu - 1). If the minimum of βseq over the OTU reads is lower than 20%, the OTU is labelled as ‘with noise’; if it is larger than 70%, it is labelled as ‘without noise’; otherwise, the OTU is labelled as ‘uncertain’. Only OTUs labelled as ‘with noise’ or ‘without noise’ are used to learn the SVM. Note that this method could be directly envisaged as a candidate for identifying OTUs with noise. However, it is not fully automatic since it relies on two thresholds that were manually defined and some OTUs remain unclassified (‘uncertain’ type). We refer to it as the degree-based classifier below.
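The degree-based labelling can be written directly from the definition of βseq and the two thresholds (20% and 70%). The sketch below uses hypothetical function names and an invented toy OTU made of a core of ten reads plus one satellite read.

```python
def normalised_degrees(D_otu, gap):
    """Percentage of the other reads lying within the barcoding gap, for each read."""
    n = len(D_otu)
    return [
        100.0 * sum(1 for j in range(n) if j != i and D_otu[i][j] <= gap) / (n - 1)
        for i in range(n)
    ]

def degree_based_label(D_otu, gap, low=20.0, high=70.0):
    """Label an OTU from the minimum normalised degree over its reads."""
    beta_min = min(normalised_degrees(D_otu, gap))
    if beta_min < low:
        return "with noise"
    if beta_min > high:
        return "without noise"
    return "uncertain"

# Toy OTU (hypothetical values): reads 0..9 form a core, read 10 is a satellite
# that is close to read 0 only.
n = 11
noisy = [[0 if i == j else 1 for j in range(n)] for i in range(n)]
for j in range(1, 10):
    noisy[10][j] = noisy[j][10] = 10
clique = [[0 if i == j else 1 for j in range(n)] for i in range(n)]

print(degree_based_label(noisy, gap=3))    # with noise
print(degree_based_label(clique, gap=3))   # without noise
```

The satellite read is within the gap of one read out of ten (βseq = 10%), so the minimum falls below the 20% threshold.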
We summarise here the succession of steps to perform when using our method to identify composed, single with noise and single without noise OTUs of a set of samples. We define two sets of samples:
The identification of each OTU as composed/single is an unsupervised classification which is done sample by sample. The identification of a single OTU as with/without noise is done OTU by OTU with an SVM classifier which is learned on the training set T. Knowing that, here are the steps for typing all samples in S:
All the steps in the above procedure are elementary and can be written in any language (such as Python or R). We provide in a Figshare project and a GitLab project (see Section Data Accessibility) a set of programmes which assemble them in a given way and which we used for producing our results. Other solutions are possible and equivalent. The GitLab project provides documentation of the programmes and a tutorial on the assemblage we propose, run on a subset of the complete dataset (all samples) to save time and memory. It gives some guidelines for users who wish to use the programmes on their own datasets.
An expert classification can be built, based on visual inspection, in order to validate the output of our identification method. However, we did not build it on the whole dataset since it would require a visual inspection of 2529 dissimilarity arrays. We built it only for one location. We chose the Teychan location, since this location will also be used as a training set for the identification of OTUs with noise in a second step. The Teychan dataset, therefore, refers to the set of the eight samples located at Teychan (two samples per season: one for benthic and one for pelagic). It is composed of 654 OTUs.
Here, we first present the validation of the method for identifying composed OTUs, on the samples located at Teychan. Then, on the whole dataset, we tested the existence of a link between OTU type and OTU size and we analysed the assignation pattern of composed OTUs.
For the Teychan dataset, an expert classification of each OTU of each sample into one of the three categories - composed, single or uncertain – has been built, based on an expert procedure which works as follows. First, heatmaps of the dissimilarity array of each OTU were drawn, with reads ordered according to the leaves of a dendrogram (Aggregative Hierarchical Clustering with Ward criteria,
The contingency table built from the 654 OTUs of the Teychan dataset (Table
Comparison of the expert classification and the automatic classification of the OTUs into the composed and single categories, for the Teychan dataset.
Expert (rows) / Automatic (columns) | Composed OTUs | Single OTUs | Total |
---|---|---|---|
Composed OTUs | 92 | 12 | 104 |
Uncertain OTUs | 11 | 12 | 23 |
Single OTUs | 9 | 518 | 527 |
Total | 112 | 542 | 654 |
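The agreement between the expert and automatic classifications can be derived from the counts of this contingency table. The sketch below is a worked arithmetic example; leaving out the OTUs labelled as uncertain by the expert is a choice made for this illustration and is not necessarily the measure used by the authors.

```python
# Counts copied from the contingency table (outer key: expert, inner key: automatic).
table = {
    "composed": {"composed": 92, "single": 12},
    "uncertain": {"composed": 11, "single": 12},
    "single": {"composed": 9, "single": 518},
}

def overall_agreement(table, labels=("composed", "single")):
    """Share of OTUs with a definite expert label that receive the same automatic label."""
    agree = sum(table[label][label] for label in labels)
    total = sum(table[row][col] for row in labels for col in labels)
    return agree / total

def per_class_agreement(table, label):
    """Share of the OTUs with expert label `label` that are recovered automatically."""
    row = table[label]
    return row[label] / sum(row.values())

print(f"overall:  {overall_agreement(table):.3f}")              # (92 + 518) / 631
print(f"composed: {per_class_agreement(table, 'composed'):.3f}")  # 92 / 104
print(f"single:   {per_class_agreement(table, 'single'):.3f}")    # 518 / 527
```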
We then applied the procedure to the whole dataset (the 32 samples). We tested the hypothesis of a link between the OTU type (single or composed) and its size. Suppl. material
Link between OTU size and its classification as composed or single. Statistics of the ranks (the ranks are ordered from smallest to largest size) and the p-value of the Wilcoxon Mann-Whitney test.
Number of OTUs | 2529 |
---|---|
Mean rank for single OTUs | 1163.5 |
Mean rank for composed OTUs | 1778.5 |
p-value | 1.535 x 10^-55 |
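The mean ranks and the p-value of a Wilcoxon Mann-Whitney test can be reproduced in principle with a short rank-sum computation. The sketch below uses a normal approximation without tie or continuity correction, which is an assumption of this illustration (statistical packages usually apply such corrections), and the OTU sizes are invented toy values.

```python
import math

def average_ranks(values):
    """Ranks from smallest to largest (1-based), ties receiving their mean rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in order[pos:end + 1]:
            ranks[k] = mean_rank
        pos = end + 1
    return ranks

def mann_whitney(sizes_a, sizes_b):
    """Return (mean rank of a, mean rank of b, two-sided p-value)."""
    n_a, n_b = len(sizes_a), len(sizes_b)
    ranks = average_ranks(list(sizes_a) + list(sizes_b))
    rank_sum_a = sum(ranks[:n_a])
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    mean_u = n_a * n_b / 2
    sd_u = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)   # no tie correction
    z = (u_a - mean_u) / sd_u
    p_value = math.erfc(abs(z) / math.sqrt(2))           # normal approximation
    return rank_sum_a / n_a, sum(ranks[n_a:]) / n_b, p_value

# Hypothetical OTU sizes (numbers of reads), for illustration only.
single_sizes = [15, 16, 18, 20, 22, 25, 30, 17, 19, 21]
composed_sizes = [40, 55, 80, 120, 35, 60, 33, 90]
mean_single, mean_composed, p = mann_whitney(single_sizes, composed_sizes)
print(mean_single, mean_composed, p)
```

In this toy example every composed OTU is larger than every single OTU, so the mean ranks are 5.5 and 14.5 and the p-value is small.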
Amongst the 180 OTUs that were fully annotated with a taxon (see Section Data), eight were categorised as composed. We observed three situations. For two of them, there are two or three species present in the OTU and the dissimilarity array Dotu and graph Gotu are clearly structured into two blocks separating one species from the other(s). This is the typical situation that we target when identifying composed OTUs. Three other OTUs are monospecific and there is no obvious structure in Dotu or Gotu. However, they have the particularity that reads are loosely connected to the others, leading to a large value of θ, larger than θc. Finally, the last three OTUs are monospecific (or nearly) and Dotu and Gotu are nevertheless structured into two blocks. An example of each situation is given in Suppl. material
The method to identify OTUs with noise is a supervised method that requires a training set to learn the SVM. The most discriminant factors when studying community diversity are the season and the water column. This is the reason why we built the training set on one location (Teychan) and the test set on the other three locations. Both sets contain samples associated with different and balanced values for the season and the water column. This training step is performed using only OTUs that have been categorised as with or without noise by the degree-based classifier (uncertain OTUs cannot be used here).
For each choice of features (pair of coefficients of the Λ matrix), we ran a 10-fold cross validation to estimate the error of prediction. We obtained the best Area Under Curve value (AUC = 0.951) with the features f1 = max (Λ(1, 1), Λ(2, 2)) and f2 = Λ(1, 2). The feature f1 represents the mean dissimilarity between two reads of the SBM block with the larger mean intra dissimilarity. If there are noise reads, they should be in this block. The feature f2 represents the mean inter-block dissimilarity in the SBM model. The SVM classifier frontier is defined by the expression y = 9.452 + 0.569 f1 + 0.876 f2. Contingency Table
Comparison of the degree-based classification and the SVM classification of the single OTUs into the ‘with noise’ and ‘without noise’ categories, on the Teychan dataset (training set).
Expert (rows) / Automatic (columns) | OTUs with noise | OTUs without noise | Total |
---|---|---|---|
OTUs with noise | 375 | 6 | 381 |
Uncertain OTUs | 87 | 26 | 113 |
OTUs without noise | 6 | 42 | 48 |
Total | 468 | 74 | 542 |
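The two features f1 = max (Λ(1, 1), Λ(2, 2)) and f2 = Λ(1, 2) used by the SVM can be illustrated on a toy OTU. The sketch below does not estimate a Stochastic Block Model: it assumes that a hard assignment of reads to two blocks is already available (hypothetical input) and replaces the SBM parameters by empirical block means, which is a simplification.

```python
def block_means(D_otu, blocks):
    """Empirical 2 x 2 matrix of mean dissimilarities for blocks labelled 0 and 1."""
    sums = [[0.0, 0.0], [0.0, 0.0]]
    counts = [[0, 0], [0, 0]]
    n = len(D_otu)
    for i in range(n):
        for j in range(i + 1, n):
            a, b = sorted((blocks[i], blocks[j]))
            sums[a][b] += D_otu[i][j]
            counts[a][b] += 1
    lam = [[0.0, 0.0], [0.0, 0.0]]
    for a in range(2):
        for b in range(a, 2):
            if counts[a][b]:
                lam[a][b] = lam[b][a] = sums[a][b] / counts[a][b]
    return lam

def svm_features(lam):
    """f1: larger mean intra-block dissimilarity; f2: mean inter-block dissimilarity."""
    f1 = max(lam[0][0], lam[1][1])
    f2 = lam[0][1]
    return f1, f2

# Toy OTU (hypothetical values): four core reads (block 0) and two satellite reads (block 1).
blocks = [0, 0, 0, 0, 1, 1]
D_otu = [
    [0, 1, 1, 1, 6, 6],
    [1, 0, 1, 1, 6, 6],
    [1, 1, 0, 1, 6, 6],
    [1, 1, 1, 0, 6, 6],
    [6, 6, 6, 6, 0, 8],
    [6, 6, 6, 6, 8, 0],
]
lam = block_means(D_otu, blocks)
print(svm_features(lam))   # (8.0, 6.0)
```

Here the satellite block has a large mean intra-block dissimilarity and is far from the core, so both features are large, the pattern associated with noise in the text.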
The SVM classifier obtained on the training set is applied to the OTUs of the 24 samples of the test set (i.e. all samples, except those in the Teychan dataset). Since the expert method can also be automated, we can compare the results of the two classifiers. They are reported in contingency Table
Comparison of the degree-based classification and the SVM classification of the single OTUs into the ‘with noise’ and ‘without noise’ categories, on the test set.
Expert (rows) / Automatic (columns) | OTUs with noise | OTUs without noise | Total |
---|---|---|---|
OTUs with noise | 1228 | 0 | 1228 |
Uncertain OTUs | 277 | 24 | 301 |
OTUs without noise | 16 | 29 | 45 |
Total | 1521 | 53 | 1574 |
For OTUs categorised as single, we test the hypothesis of a link between the OTU size and its category (with or without noise). The Wilcoxon Mann-Whitney test has been used (based on the single OTUs of the 32 samples) and the results show that there is strong evidence for such a link (see Table
Link between OTU size and its classification as single with or without noise. Statistics of the ranks (the ranks are ordered from smallest to largest size) and the p-value of the Wilcoxon Mann-Whitney test.
Number of OTUs | 2116 |
---|---|
Mean rank for single OTUs without noise | 552.5 |
Mean rank for single OTUs with noise | 1089.7 |
p-value | 3.7 x 10^-22 |
Amongst the 180 OTUs that were fully annotated, 153 were categorised as single with noise and 23 as single without noise. Ignoring the artefactual presence of sequences of Rhizosolenia fallax species, almost all were monospecific (only two exceptions).
Having applied the two procedures to each of the 32 samples balanced for season, location, water column for identification of composed, single with noise and single without noise OTUs, we computed the proportion of each type per sample. In Fig.
Visualisation of the proportion of composed, single with noise and single without noise OTUs for each sample. Left: dots coloured by seasons, centre: dots coloured by water column, right: dots coloured by location.
The central ternary plot of Fig.
We then considered two other sets of 16 values: the list of percentage of OTUs with noise (amongst the single OTUs) in the benthic samples and in the pelagic samples. We also applied a Wilcoxon rank test and we obtained a p-value of 0.04763. We concluded that there is no evidence that the fraction of OTUs with noise (amongst the single OTUs) in a sample is different for benthic and pelagic conditions.
We did not test whether the other environmental conditions (season, location) have or do not have an influence on the composition in the sample since the number of observations per condition would be too small (8).
The discussion of the quality of the different types of OTUs is organised along a gradient of complexity of the structure of the OTUs, as follows:
The expected structure of the graph Gotu built from the dissimilarity array Dotu is a clique if the dissimilarities are the age of the Most Recent Common Ancestor (MRCA). In such an ideal case, the OTU is obviously reliable. However, in practice, we work with evolutionary distances computed from local alignment scores. The discrepancy between the age of the MRCA and evolutionary distances within a set of sequences increases with the age of the MRCA. It can, therefore, be expected that cliques represent clusters with a relatively young MRCA and that the evolutionary distances within the cluster are closely related to the age of the MRCA. This allows us to postulate that cliques built from evolutionary distances are OTUs of good quality. There are four cliques over all of the 32 samples of diatoms. Three of them have no annotated reads. This may mean that they represent species that are absent from the reference database. One of them is partially annotated, always with the same species. The fact that some reads in the clique are not recovered probably means that mapping reached its limit in terms of quality, because if a query maps on references with different taxa, the mapping is said to be ambiguous and the read is not annotated.
Let us recall that the noise (or the absence of noise) in a single OTU is detected, based on the value of two features f1 and f2, where f1 represents the mean dissimilarity between two sequences of the SBM block with the largest mean intra-block dissimilarity and f2 represents the mean inter-block dissimilarity. A single OTU is typed as “without noise” if the parameters f1 and f2 are both small, as illustrated in Fig.
In order to provide in what follows indications about the quality of an OTU (which means it can be accepted as an OTU for further studies) that is not a clique nor a single OTU without noise, we referred to an external expert evaluation. Although we are agnostic as to whether an OTU has or does not have a taxonomic meaning, we used the mapping of reads on a reference database as external information. If all the reads in an OTU are annotated and assigned to the same species, then OTU picking and taxonomy converge, suggesting that the OTU can be considered of good quality. Otherwise it is questionable. Hence, we focused on fully annotated OTUs in the rest of the discussion.
A single OTU is typed as “with noise” if features f1 and f2 are both large. Such an OTU displays a minority of satellite reads, which are close to (at a distance smaller than the gap) only a small fraction of the remaining reads (the core, the main densely connected entity). In the subsample of fully annotated OTUs, almost all of the single OTUs with noise are monospecific ones, regardless of the quantity or intensity of noise. Whether such a conclusion can be extended beyond fully annotated OTUs is an open question and deserves further studies on a diversity of organisms to progress along this line. Indeed, a partial covering only by mapping can be due to the fact that uncovered reads either belong to another species absent in the reference database, lowering the acceptability of the OTU or that they belong to the same species, but are labelled as unknown due to imperfections and errors in the mapping or the reference database.
Composed OTUs are very likely to be large OTUs and to be composed of two or more entities each of which is a candidate to be a more reliable OTU. However, in the subsample of fully annotated OTUs, we observed some composed OTUs with a different profile: either monospecific OTUs with, overall, a low level of connections in Gotu or monospecific OTUs with a clear structure divided into two blocks. Both cases lead to large values of missing edges and the OTUs are, therefore, typed as composed. In the latter case, one possible reason for the block pattern of the dissimilarity array may be a structure in the intraspecific molecular diversity. However, the number of specimens in one OTU is often too small to check with population genetics indices (see
Finally, the large spurious OTUs, automatically detected by θ > θc, should be reshaped as sets of new and smaller OTUs. Two ways to do this are to build them as outcomes of either unsupervised clustering of the dissimilarity array of the composed OTU or of community detection (see
Recent advances in massively parallel sequencing technology have led to the rapid production of millions of reads. This has opened the way to the analysis of many environmental communities, leading to further exploration of their diversity and ecology, at a pace that was unimaginable beforehand. The building blocks of such studies are sets of OTUs obtained by clustering the reads of a given sample. In this context of massive data, it is no longer possible to scrutinise each OTU one by one to assess its quality and decide to keep it or not, or to reshape it. Here, we propose a tool to make progress in assessing automatically the quality of an OTU, with OTUs streaming through a pipeline. It relies on the comparison between the OTU’s inner structure (given as its pairwise dissimilarity array) and an ideal one, and on the characterisation of two ways in which the structure of an OTU can deviate from the ideal situation: first, we distinguish composed vs. single OTUs. Second, amongst the single OTUs, we distinguish OTUs with and without noise. We applied the method on 32 samples of diatoms collected in Arcachon Bay (France) that represent contrasted environmental conditions and we obtained good agreement with expert categorisation of OTUs. We suggest that single OTUs without noise can be used as such for further ecological studies. Composed OTUs should be post-treated with classical clustering or community detection tools. The quality of single OTUs with noise remains to be further tested via supplementary studies on a diversity of organisms.
Our method can be implemented in a pipeline and used automatically and sequentially on a large number of OTUs belonging to one or different samples. This builds a quality filter that enhances the reliability of subsequent studies in ecology and diversity structures that are undertaken on these same data, by strengthening their foundations.
Furthermore, the impact of the dissimilarities and classification methods on OTU quality deserves further investigation and the optimal choice can depend on the sample studied. Our tool could also provide a way to identify, for a given sample, the dissimilarities and classification methods that lead to the set of OTUs with the best intrinsic quality, for example, distances computed from alignment scores (see
We thank the participants of the Malabar project for their authorisation to use the data produced in this project, especially the Laboratoire Environnement-Ressources of IFREMER at Arcachon for the field campaign, Emilie Chancerel and Franck Salin at INRAE BioGeCo for the production of DNA sequences. Computer time for the preparation of the data in this study was provided by the computing facilities of the MCIA (Mésocentre de Calcul Intensif Aquitain). The dissimilarity arrays have been produced in the Malabar project, supported by “Cote Labex” Call for Research Projects, Year 2017.
The authors have declared that no competing interests exist.
No ethical statement was reported.
No funding was reported.
A. Franc: conceptualisation, methodology, validation, formal analysis, investigation, original draft writing, review and editing. N. Peyrard: conceptualisation, methodology, validation, formal analysis, investigation, original draft writing, review and editing. M.-J. Cros: software, validation, formal analysis, investigation, review and editing. J.-M. Frigerio: data curation, review and editing.
Marie-Josée Cros https://orcid.org/0000-0002-6395-5563
Jean-Marc Frigerio https://orcid.org/0000-0003-0471-2075
Nathalie Peyrard https://orcid.org/0000-0002-0356-1255
Alain Franc https://orcid.org/0000-0001-9448-8569
The codes for learning the noise classifier and for determining the type of OTUs are available in the GitLab project https://forgemia.inra.fr/alain.franc/otu_shape, where the user can find a documentation and a tutorial on a smaller dataset than the one used in our study. The code and the data to replicate our study are available in a Figshare project (
Supplementary information
Data type: pdf
Explanation note: (A) Stochastic Block Model; (B) estimation of θ density for composed OTU identification; (C) figures; (D) tables.