Software Description
Corresponding author: Johan Bengtsson-Palme (johan.bengtsson-palme@microbiology.se)
Academic editor: Florian Leese
© 2019 Shruthi Magesh, Viktor Jonsson, Johan Bengtsson-Palme.
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Magesh S, Jonsson V, Bengtsson-Palme J (2019) Mumame: a software tool for quantifying gene-specific point-mutations in shotgun metagenomic data. Metabarcoding and Metagenomics 3: e36236. https://doi.org/10.3897/mbmg.3.36236
Metagenomics has emerged as a central technique for studying the structure and function of microbial communities. Often the functional analysis is restricted to classification into broad functional categories. However, important phenotypic differences, such as resistance to antibiotics, are often the result of just one or a few point mutations in otherwise identical sequences. Bioinformatic methods for metagenomic analysis have generally been poor at accounting for this fact, resulting in a somewhat limited picture of important aspects of microbial communities. Here, we address this problem by providing a software tool called Mumame, which can distinguish between wildtype and mutated sequences in shotgun metagenomic data and quantify their relative abundances. We demonstrate the utility of the tool by quantifying antibiotic resistance mutations in several publicly available metagenomic data sets. We also found that sequencing depth is a key factor in detecting rare mutations. Therefore, much larger numbers of sequences may be required for reliable detection of mutations than for most other applications of shotgun metagenomics. Mumame is freely available online (http://microbiology.se/software/mumame).
Antibiotic resistance, bioinformatic tools, metagenomics, mutation frequencies, mutation detection, statistical methods
The revolution in sequencing capacity has created an unprecedented ability to glimpse into the functionality of microbial communities, using large-scale shotgun metagenomic techniques (
Because of the immense increase in available sequence data, it would be desirable to study these mutations from shotgun metagenomic libraries, much as other traits have been studied at a large scale (
In this study, we provide a partial remedy to these problems through the introduction of a software tool, Mumame (Mutation Mapping in Metagenomes), that can quantify and distinguish between wildtype and mutated gene variants in metagenomic data, and through suggesting a statistical framework for handling the output data of the software. In contrast to available tools for investigating nucleotide variants, including StrainPhlAn (
Finally, we demonstrate the ability of Mumame to detect relevant differences between environmental sample types, estimate the sequencing depths required for the method to perform reliably through simulations, and exemplify the utility of the software on detecting resistance mutations in publicly available metagenomes. The Mumame software package is open-source and freely available (http://microbiology.se/software/mumame or https://github.com/bengtssonpalme/mumame).
Mumame is implemented in Perl and consists of two commands: mumame, which performs read alignment to a database of mutations, and mumame_build, which builds the database for the former command. The mumame_build command takes a FASTA sequence file and a list of mutations (CSV format) as input. For each entry in the mutation list, it finds the corresponding sequence(s) in the FASTA file, either by sequence identifier or by CARD ARO accessions (
The main mumame command takes any number of input files containing DNA sequence reads in FASTA or FASTQ format and aligns those against the Mumame database using Usearch (
The main output of Mumame is a file with the suffix “.table.txt”. This file contains the reads from each library aligned to the mutation database, with mutation counts in the first set of columns and wildtype counts in the second set of columns. The last line of this file contains the total number of reads in each library, which can be used, e.g., for normalization purposes. The software also saves the output from the Usearch run and, optionally, the read alignments to the database. The output table generated by Mumame can be analyzed using the R script (
The Mumame software is freely available (http://microbiology.se/software/mumame or https://github.com/bengtssonpalme/mumame) and can also be installed via Conda, using the command “conda install -c bengtssonpalme mumame”.
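As an illustration of how the count table described above might be post-processed, the following Python sketch computes the fraction of mutated reads per library and normalizes counts to reads per million. It is not part of Mumame. The exact column layout is an assumption made for this example (tab-separated, a header line, a name column followed by one mutation count and then one wildtype count per library, and a final line with the library totals), and the function names and example values are hypothetical.

```python
import csv
import io


def parse_count_table(text, n_libraries):
    """Parse a count table under the ASSUMED layout described above.

    Returns (counts, totals), where counts maps each row name to a pair
    (mutated_counts, wildtype_counts) and totals lists reads per library.
    """
    rows = [r for r in csv.reader(io.StringIO(text), delimiter="\t") if r]
    _header, *body, totals_row = rows
    totals = [int(x) for x in totals_row[1:1 + n_libraries]]
    counts = {}
    for row in body:
        mutated = [int(x) for x in row[1:1 + n_libraries]]
        wildtype = [int(x) for x in row[1 + n_libraries:1 + 2 * n_libraries]]
        counts[row[0]] = (mutated, wildtype)
    return counts, totals


def mutation_fractions(mutated, wildtype):
    """Fraction of mutated reads per library; None where no reads were found."""
    return [m / (m + w) if (m + w) > 0 else None
            for m, w in zip(mutated, wildtype)]


def per_million(values, totals):
    """Normalize counts to reads per million sequenced reads."""
    return [v * 1e6 / t for v, t in zip(values, totals)]


# Invented example data for two libraries (not real Mumame output).
example = (
    "Mutation\tlibA_mut\tlibB_mut\tlibA_wt\tlibB_wt\n"
    "gyrA_S83\t2\t15\t18\t5\n"
    "parC_S80\t0\t0\t0\t4\n"
    "Total\t1000000\t2000000\n"
)

counts, totals = parse_count_table(example, n_libraries=2)
for name, (mutated, wildtype) in counts.items():
    print(name, mutation_fractions(mutated, wildtype), per_million(mutated, totals))
```

Returning None for libraries without any matching reads keeps "no data" distinct from "no mutations", which matters when few reads cover the target region.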
To quantify the abundances of fluoroquinolone resistance mutations in the gyrA and parC genes (
Finally, we investigated data from the experiment by
To assess the limitations of the method in terms of sequencing depth, the samples from the highest and lowest ciprofloxacin concentrations generated by
As a proof-of-concept that our method to identify point mutations in metagenomic sequence data is functional, we used Mumame to quantify the mutations in amplicon data from the gyrA and parC genes. These genes are targets of fluoroquinolone antibiotics, and often acquire mutations that confer high levels of resistance. We quantified such mutations in an amplicon data set specifically targeting these two genes in Escherichia coli. This data set derives from an exposure study with increasing ciprofloxacin concentrations, and enrichments of mutations in the classical fluoroquinolone resistance determining positions S83 and D87 (gyrA) and S80 and E84 (parC) have previously been verified using other bioinformatic methods (
We next evaluated the performance of Mumame on the real shotgun data that was generated from the same samples as the amplicon libraries. Ideally, this analysis should generate virtually the same result as the amplicon analysis. Indeed, we found similar results for the A67 and S83 gyrA mutations (Fig.
Fluoroquinolone resistance mutations in ciprofloxacin-exposed bacterial communities. Total mutation frequencies quantified using Mumame for three known mutations conferring resistance to fluoroquinolone in the E. coli gyrA gene based on amplicon sequencing (A) and shotgun metagenomic data (B) from the same samples. Corresponding data for the S80 mutation in parC is shown in (C) for amplicon data, and (D) for shotgun data.
Noting the much less stable levels of mutations in the shotgun metagenomes, we next investigated the effects of sequencing depth on the ability of our method to detect significantly altered mutation frequencies. For this analysis, we used downsampled data from the shotgun metagenomic library of the ciprofloxacin exposure study (Fig.
Influence of sequencing depth on detected mutations and their effect sizes. Relationship between the number of investigated reads and number of mutations with significantly altered frequencies (A) and the average effect size for those mutations (B); as assessed using Mumame on shotgun metagenomic data from a ciprofloxacin exposure experiment.
After validating the method and testing the limit of detection, we used Mumame to quantify resistance mutations in a similar controlled aquarium setup under exposure to the antibiotic tetracycline (
As a final investigation of the performance of the method, we also let Mumame quantify the fluoroquinolone resistance mutations in river and lake sediments polluted by antibiotic manufacturing waste, primarily ciprofloxacin (
Resistance mutations in antibiotic-polluted sediments. Relative frequency of gyrA and parC sequences with resistance mutations in samples taken downstream, at, or upstream of the pharmaceutical production wastewater treatment plant, as well as in a lake polluted by dumping of pharmaceutical production waste. The numbers at the top of the bars show the total number of gyrA/parC sequences (wildtype or mutated) identified in each sample.
Metagenomic analysis is often restricted to investigating gross compositional changes in the taxonomy and functional genes of microbial communities. Unfortunately, this obscures important variation between individual sequence variants that may have large impacts on phenotypes (
The results of the Mumame evaluation also provide a few other important clues to potential pitfalls in inferring mutation frequencies from shotgun metagenomic data. One such aspect is the disparity between the mutation frequencies described by amplicon sequencing and by shotgun data. In particular, the A67 and S83 mutations in gyrA could be identified relatively consistently, while the D87 mutation was seemingly less frequent in the shotgun data, which is somewhat troubling if the goal is to quantify the actual abundances of such mutations. At the same time, the statistical significance of those differences could still be detected. For the A67 and S83 mutations, only 5 million reads were required for a significant effect to be detected, while for the D87 mutation a sequencing depth of 50 million reads was required. This is not necessarily a shortcoming of the Mumame software, but may just as well be due to the much noisier nature of the relatively few counts from metagenomic sequence data compared to the large number of reads corresponding to the same genes deriving from amplicon data (
Another important potential problem highlighted by our evaluation is the need to produce very large sequence data sets to be able to identify and quantify mutated (and wildtype) sequences with any certainty. As a rule of thumb, the targeted regions represent less than 0.005% of the bacterial genome, and each bacterial strain may correspond to only a fraction of a percent of the reads in the shotgun sequence data (depending on its abundance). This means that to identify a single read from a resistance region in the data, one would need to sequence, on average, more than five million reads. To get a reasonably confident measure of reads stemming from wildtype strains versus strains with mutations, approximately 10 reads from each group would be needed per sample (or, say, 20 reads in total). That would, as a rough estimate, correspond to a hundred million reads per sample. This is, unfortunately, far more sequences than are typically generated per sample by shotgun metagenomic sequencing projects. Naturally, these numbers depend on the proportions of the targeted microorganisms as well as their genome sizes, but ultimately this still presents the largest limitation to mutation studies based on metagenomic sequence data, regardless of how sophisticated the bioinformatic methods used are. Potentially, this problem could be partially alleviated by analyzing sufficiently large cohorts and performing the statistical analysis for general trends, but even large cohorts would be insufficient for mutations rare enough to pass below the detection limit.
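The rule-of-thumb arithmetic above can be reproduced with a short Python sketch. The 0.005% target fraction and the 20 required reads come from the text. The strain abundance of 0.4% of all reads is an assumption chosen here only because it reproduces the "five million reads" figure; the text itself says "a fraction of a percent". The function names are hypothetical, and the calculation ignores read length and mapping efficiency.

```python
def reads_per_target_read(target_fraction_of_genome, strain_fraction_of_reads):
    """Expected number of sequenced reads per read from the target region."""
    if not (0 < target_fraction_of_genome <= 1 and 0 < strain_fraction_of_reads <= 1):
        raise ValueError("fractions must be in the interval (0, 1]")
    # Probability that a random read comes from the target region of the strain.
    p_target = target_fraction_of_genome * strain_fraction_of_reads
    return 1.0 / p_target


def required_depth(target_reads, target_fraction_of_genome, strain_fraction_of_reads):
    """Sequencing depth needed to expect `target_reads` reads from the region."""
    return target_reads * reads_per_target_read(
        target_fraction_of_genome, strain_fraction_of_reads
    )


TARGET_FRACTION = 0.005 / 100  # 0.005% of the genome (upper bound from the text)
STRAIN_FRACTION = 0.4 / 100    # ASSUMPTION: the strain contributes 0.4% of reads

single = reads_per_target_read(TARGET_FRACTION, STRAIN_FRACTION)
total = required_depth(20, TARGET_FRACTION, STRAIN_FRACTION)
print(f"Reads needed per target read: {single:,.0f}")
print(f"Reads needed for 20 target reads: {total:,.0f}")
```

Because the required depth scales inversely with strain abundance, halving the assumed abundance doubles the required number of reads, which is why the estimate is so sensitive to community composition.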
In terms of interpreting the results from the exposure experiments, it is interesting to note the overall clear increase of fluoroquinolone resistance mutations at the highest ciprofloxacin concentration, which nearly perfectly corresponds to increases in mobile qnr fluoroquinolone resistance genes in the same samples (
While we did not have data from an experimental setup suitable to address differences between sediments exposed to different degrees of fluoroquinolone pollution, the quantification of resistance mutations seems to provide an important piece of information to explain the results of previous studies of resistance gene abundances in these river samples (
We have here shown the utility of the Mumame tool for finding resistance mutations in shotgun metagenomic data. In this paper, we have used the CARD database (
This paper presents a software tool called Mumame to analyze shotgun metagenomic data for point mutations, such as those conferring antibiotic resistance to bacteria. Mumame can distinguish between wildtype and mutated gene variants in metagenomic data and quantify them, given a sufficient sequencing effort. We also provide a statistical framework for handling the generated count data and accounting for factors such as differences in sequencing depth. Importantly, our study also shows that a high sequencing depth, preferably more than 50 million sequenced reads per sample, is needed to get reasonably accurate estimates of mutation frequencies, particularly for rare genes or species. The Mumame software package is freely available from http://microbiology.se/software/mumame. We expect Mumame to be a useful addition to metagenomic studies of, for example, antibiotic resistance, and to increase the level of detail at which metagenomes can be screened for phenotypically important differences.
This work was funded by the Swedish Research Council for Environment, Agricultural Sciences, and Spatial Planning (FORMAS; grant 2016-00768).
Table S1
Figure S1