Mutation and RNA Virus Populations
Introduction
It is now widely appreciated that retroviruses and RNA viruses have very high mutation rates. It is obvious that these viruses should generate many mutants, including drug-resistant mutants. What is the magnitude of the problem? A quantitative consideration is essential for addressing this and other questions about how these viruses evolve.
Mutation During Replication
Suppose we start with a single 'wildtype' sequence, called the 'master sequence' by Eigen [Eigen and Biebricher, 1988].
During replication of this sequence, mutations will occur.
- Assume that the mutation rate, the probability that an incorrect base is inserted at any nucleotide position, is µ.
- For convenience below, we will assume that mutation to each of the 3 mutant bases is equally probable, with a rate of µ/3.
- At any nucleotide position, there is a probability of µ/3 for inserting a particular incorrect base, and a probability of (1-µ) for inserting the correct base during replication.
- We will use E to designate the number of mutations in a progeny genome. The wildtype or master sequence has E=0, single-hit mutants have E=1, etc.
- E is the Hamming distance, the number of differences between a copy and the original item of information, regardless of the nature of the change, or where it occurs.
- Assume that the genome length is L nt.
- Conveniently, site-specific low mutation rates can be compensated for, e.g., by decreasing L, if necessary. As L is usually large (>>10^3), the few unusual sites that might exist can simply be ignored for the following calculations.
- A population of a virus with a long genome and relatively low mutation rates is similar to a population of a virus with a shorter genome but with higher mutation rates. Thus, one can choose some arbitrary L for the examples shown below, and generalize the results for other L's by shifting the graphs left or right along the X-axis. Alternatively, you can download the spreadsheet used to obtain the results shown here, to look at other genome sizes.
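The assumptions listed above can be sketched as a small simulation. This is only an illustration, not the method used to produce the results on this page: the function names `replicate` and `hamming`, the random master sequence, the fixed seed, and the sample of 100 progeny are all assumptions made for the example.

```python
import random

BASES = "ACGU"

def replicate(template, mu, rng):
    """Copy `template`, miscopying each base with probability mu.

    A miscopied base is replaced by one of the 3 other bases, chosen
    uniformly, so each particular incorrect base has probability mu/3.
    """
    copy = []
    for base in template:
        if rng.random() < mu:
            copy.append(rng.choice([b for b in BASES if b != base]))
        else:
            copy.append(base)
    return "".join(copy)

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ (E)."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))

rng = random.Random(1)  # fixed seed, only so that a run is repeatable
master = "".join(rng.choice(BASES) for _ in range(11703))  # arbitrary master sequence
progeny = [replicate(master, 1e-4, rng) for _ in range(100)]
errors = [hamming(master, p) for p in progeny]
print("fraction of progeny with E=0:", errors.count(0) / len(errors))
print("largest E observed:", max(errors))
```

With only 100 progeny the printed fractions fluctuate from run to run; the sections that follow give the exact expected values.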
Some nomenclature and symbols
- Error class. It is convenient to consider all sequences with the same number of mutations as a group. (The mutations may be anywhere in the genome.) Such a group is called an error class. For example, all mutants with E mutations belong to the E-error class. They are also referred to as mutants that have a Hamming distance of E from the reference, master sequence.
- Sampling density = Fraction of the total possible sequences that are actually present in a population. When all possible sequences are present in the population, the sampling density is 1, and the sampling is said to be saturated.
- Exponents and subscripts: Powers of ten are written as, e.g., 2 x 10^3 (= 2000), and subscripts as, e.g., L_i. Where a superscript has been lost, so that the text reads 2 x 103, it signifies 2 x 10 to the 3rd power = 2000.
The probability of producing a perfect copy of the master sequence
In this case the polymerase makes no errors: it inserts the correct base, with a probability of (1-µ), at each of the L nt. The probability of this happening, i.e., the relative abundance of the master sequence in the progeny population, is:
$$p(E = 0) = (1-\mu)^{L}$$
Many families of RNA viruses have segmented genomes (e.g., Arenaviruses, Bunyaviruses, Orthomyxoviruses, Reoviruses). The probability of replicating a wildtype copy of a virus with a segmented genome is the product of the probability of making a wildtype copy of each of the segments:
$$p(E = 0) = \prod_{i} (1-\mu)^{L_i} = (1-\mu)^{\sum_{i} L_i}$$
where L_i is the length of the ith genome segment. Thus viruses with segmented genomes produce wildtype progeny genomes with the same probability as an unsegmented virus whose genome length equals the sum of the genome segment lengths of the segmented virus.
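The equivalence between a segmented genome and an unsegmented genome of the same total length can be checked numerically. The sketch below is illustrative only: the segment lengths are hypothetical round numbers, not those of any real virus, and the function names are made up for the example.

```python
import math

def p_wildtype(length, mu):
    """Probability of copying `length` nt without any error: (1-mu)^length."""
    return (1.0 - mu) ** length

def p_wildtype_segmented(segment_lengths, mu):
    """Product of the per-segment probabilities of an error-free copy."""
    return math.prod(p_wildtype(n, mu) for n in segment_lengths)

# Hypothetical segment lengths (nt), chosen only for illustration.
segments = [2300, 2300, 2200, 1800, 1600, 1400, 1000, 900]
mu = 1e-4

print("segmented:  ", p_wildtype_segmented(segments, mu))
print("unsegmented:", p_wildtype(sum(segments), mu))
```

The two printed values agree up to floating-point rounding, because multiplying powers of (1-µ) adds their exponents.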
The probability of producing a copy with 1 specific mutation
At nucleotide position i, the polymerase inserts a particular incorrect base, with a probability of µ/3, and it inserts the correct base, with a probability of (1-µ), at each of the remaining (L-1) positions:
$$p(E = 1,\ \text{at position } i) = \frac{\mu}{3}\,(1-\mu)^{L-1}$$
Note that a different mutant, with a single mutation at a different site, has an identical probability:
$$p(E = 1,\ \text{at position } j) = \frac{\mu}{3}\,(1-\mu)^{L-1}$$
The number of different sequences, all with the same number of mutations, will be considered below.
The probability of producing a copy with E specific mutations
At each of E specific positions, the polymerase inserts a specific incorrect base, each with a probability of µ/3, and it inserts the correct base, with a probability of (1-µ), at each of the remaining (L-E) positions. The probability of this happening, i.e., the relative abundance of this specific sequence, is:
$$p(\text{sequence with } E \text{ mutations, to } E \text{ specific bases, at positions } i, j, k, \ldots) = \left(\frac{\mu}{3}\right)^{E} (1-\mu)^{L-E} \qquad \text{[Eqn 1]}$$
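As a quick numerical check, the probability of one specific sequence, a factor of µ/3 for each of the E specified substitutions and (1-µ) for each of the remaining L-E positions, can be evaluated directly. The function name `p_specific` is an assumption of this sketch; L = 11703 and µ = 10^-4 are the example values used elsewhere on this page.

```python
def p_specific(E, L, mu):
    """Probability that one progeny genome is a given sequence carrying
    E specified substitutions: (mu/3)^E * (1-mu)^(L-E)."""
    return (mu / 3.0) ** E * (1.0 - mu) ** (L - E)

L = 11703   # genome length used in the examples on this page
mu = 1e-4   # one of the mutation rates considered on this page

for E in range(4):
    print("E =", E, " p =", p_specific(E, L, mu))
```

For E = 0 this gives about 0.31, the fraction of progeny that are perfect copies of the master sequence at this mutation rate.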
Effect of mutations on the population of progeny genomes
The relative abundance of each sequence in any population is given by the equations above. These relative abundances multiplied by the population size give the actual abundance of each sequence in the population.
We will look at populations of:
- 10^4 progeny genomes, e.g., those produced by a single master sequence during a single infection cycle;
- 10^9 progeny genomes, e.g., those produced in a culture infected by ca. 10^5 master sequences, each producing 10^4 progeny genomes.
To reiterate, the graphs shown below plot the probabilities, as calculated above, multiplied by the population size (10^4 or 10^9). To obtain the corresponding values for some other population size, divide the values in either graph by its population size (10^4 or 10^9), and multiply by the population size desired. You can also download the spreadsheet used to obtain the results shown here, to look at other population sizes.
Figure 1. The abundance of individual sequences in the progeny population
The results are based on µ = 10^-5 to 10^-3, and L = 11703 (the genome length of Sindbis virus, an Alphavirus in the Togaviridae family).
Panel: Abundance of individual sequences with 0, 1 or 2 errors in a population of 10^4 progeny genomes.
Panel: Abundance of individual sequences with 0 to 3 errors in a population of 10^9 progeny genomes.
For the 10^4 population:
For the 10^9 population:
- As with the smaller population, the abundance of the master sequence (E=0) decreases with increasing µ, and it is the most abundant individual over the range of mutation rates considered here.
- There are up to 10^4 copies of any arbitrary 1-error sequence.
- Corollary: All possible 1-error sequences are present, each with an abundance of up to 10^4 copies! Coffin [1995] reached an identical conclusion, though using a different approach.
- Thus, increasing the population size from 10^4 to 10^9 dramatically increases the abundance and sampling density of the mutants.
This is an example of the obvious: as more virus genomes are produced, more kinds of mutants are generated, and they will be more abundant. It should be equally obvious that the earlier the intervention, the less likely it is that resistance to antivirals will develop.
- Imagine that a virus can mutate to drug resistance with a point mutation. The drug-resistant variant will be produced and is certainly present in any population more than a million in size, even before drug treatment. This prediction has been verified for drug-resistant mutants of HIV, in that resistant viruses can be found in populations that have never encountered the drug [e.g., Nájera, et al., 1995; Tucker, et al., 1998].
- The probability for the presence of any specific 2-error sequence in the 10^9 population is 10^-3 to 10^-1, with a maximum of 0.43.
- However, if the population size had been 10^10, then there are several copies of any specific 2-error sequence in the population, for mutation rates from 4 x 10^-5 to 4 x 10^-4.
- The probability for the presence of any individual mutant with 3 or more errors is low (<10^-4).
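The figures in the list above can be reproduced with a few lines of Python. This is a sketch under the same assumptions as the rest of the page (independent errors, each specific incorrect base with probability µ/3); the function name `expected_copies` is made up for the example. It also uses the fact that µ^E (1-µ)^(L-E) is largest at µ = E/L, which follows from setting the derivative of its logarithm, E/µ - (L-E)/(1-µ), to zero.

```python
def expected_copies(E, L, mu, population):
    """Expected number of copies of one specific E-error sequence among
    `population` progeny genomes."""
    return population * (mu / 3.0) ** E * (1.0 - mu) ** (L - E)

L = 11703

# Expected copies of a specific 0- to 3-error sequence in 10^9 progeny genomes.
for mu in (1e-5, 1e-4, 1e-3):
    row = [expected_copies(E, L, mu, 1e9) for E in range(4)]
    print("mu =", mu, ["%.3g" % x for x in row])

# The abundance of a specific E-error sequence peaks at mu = E / L.
peak_1 = expected_copies(1, L, 1 / L, 1e9)
peak_2 = expected_copies(2, L, 2 / L, 1e9)
peak_3 = expected_copies(3, L, 3 / L, 1e9)
print("peak abundance, E=1:", peak_1)
print("peak abundance, E=2:", peak_2)
print("peak abundance, E=3:", peak_3)
```

The peak for a specific 1-error sequence comes out slightly above 10^4 copies, the peak for a specific 2-error sequence near 0.44 (in line with the maximum of about 0.43 quoted above), and the peak for a specific 3-error sequence below 10^-4.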
The number of mutants in each error class
The derivation above applies to individual, specific sequences (e.g., a mutant with a G to C change at nucleotide position 3456).
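It is important to realize that there are many different sequences all with the same number of mutations. The E mutated positions can be chosen from the L positions in $\binom{L}{E}$ ways, and each mutated position can carry any of the 3 incorrect bases, so the number of different sequences in the E-error class is:
$$N(E) = \binom{L}{E}\, 3^{E} \qquad \text{[Eqn 2]}$$
For example, with L = 11703 there are 3 x 11703 = 35109 different 1-error sequences.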
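A minimal sketch of this count, assuming that the E mutated positions are chosen from the L positions and that each carries one of the 3 incorrect bases. The function name `n_sequences` is illustrative.

```python
import math

def n_sequences(E, L):
    """Number of distinct sequences exactly E substitutions away from the
    master sequence: C(L, E) choices of positions times 3^E choices of bases."""
    return math.comb(L, E) * 3 ** E

L = 11703
for E in range(4):
    print("E =", E, " distinct sequences:", n_sequences(E, L))
```

For L = 11703 this gives 35109 sequences in the 1-error class and 616268277 (about 620 million) in the 2-error class, the class sizes used later on this page.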
The abundance of each error class in the population
The relative abundance of each error class is the product of the abundance of a specific sequence [Eqn 1] and the number of possible sequences [Eqn 2] in that error class:
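$$p(E) = \binom{L}{E}\, 3^{E} \left(\frac{\mu}{3}\right)^{E} (1-\mu)^{L-E} = \binom{L}{E}\, \mu^{E} (1-\mu)^{L-E} \qquad \text{[Eqn 3]}$$
i.e., the binomial distribution.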
Thus, even though the probability of a specific sequence, e.g., one with 2 mutations, is very low, there are very many of them. So, they are quite abundant when considered as a group. Illustrative data are shown in Figure 2.
(Reminder: the graphs below plot the probabilities multiplied by 10^4 or 10^9. To obtain data for a different population size, divide the values in the graphs by 10^4 or 10^9, and multiply by the new population size.)
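The abundance of a whole error class is the binomial probability of exactly E errors in L positions, multiplied by the population size. The sketch below evaluates it in log space with `math.lgamma` so that the large binomial coefficients do not overflow; that choice, and the function name `p_error_class`, are assumptions of this example rather than a description of the spreadsheet.

```python
import math

def p_error_class(E, L, mu):
    """Binomial probability of exactly E errors in L positions:
    C(L, E) * mu^E * (1-mu)^(L-E), computed via logarithms."""
    log_p = (math.lgamma(L + 1) - math.lgamma(E + 1) - math.lgamma(L - E + 1)
             + E * math.log(mu) + (L - E) * math.log1p(-mu))
    return math.exp(log_p)

L, mu = 11703, 1e-4
for population in (1e4, 1e9):
    counts = [population * p_error_class(E, L, mu) for E in range(9)]
    print("population %.0e:" % population, ["%.3g" % c for c in counts])
```

At µ = 10^-4 the first four classes (E = 0 to 3) come to roughly 31%, 36%, 21% and 8% of the progeny.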
Figure 2. Abundance of each error class in a population.
The results were calculated using a genome length of L = 11703 nt, and µ = 10^-5 to 10^-3.
Panel: Population of 10^4 progeny genomes.
Panel: Population of 10^9 progeny genomes.
Considering the population of 10^4, e.g., the progeny from a single round of infection by a single master sequence:
- The master sequence (E=0) goes from more to less abundant than the mutants as a group.
- The population contains large numbers of mutants; even mutants with large numbers of errors are present in the population.
For example, when µ = 10^-4:
- The master sequence constitutes 31% of the genomes in the population, i.e., 69% of the genomes in the progeny from a single round of infection are mutants!
- About 3600 (ca. 10%) of the 35109 possible 1-error sequences are present. The abundance of the 1-error class as a whole (ca. 3.6 x 10^3 copies) is actually a little higher than the abundance of the master sequence (ca. 3.1 x 10^3 copies).
- About 2100 of the 2-error sequences, ca. 3 x 10^-6 of the possible, are present, at ca. 1 copy each.
- About 830 different 3-error mutants are present.
- Mutants with 4, 5, or more errors are present at progressively lower numbers, down to a few copies of the 7-error class of mutants.
- Mutants with 8 or more errors are unlikely to be present.
Similar, but even more dramatic conclusions apply to the larger population:
- The master sequence (E=0) goes from more to less abundant than the mutants as a group.
- The population contains large numbers of mutants; even mutants with large numbers of errors are present in the population.
For example, when µ = 10^-4:
- The master sequence constitutes 31% of the genomes in the population, i.e., 69% of the genomes are mutants!
- About 10^4 copies of each of the 35109 possible 1-error sequences are present. The abundance of the 1-error mutant class as a whole, about 3.6 x 10^8 copies, is a little higher than the abundance of the master sequence (ca. 3.1 x 10^8 copies).
- About one-third of the total possible 2-error sequences, ca. 2 x 10^8 different mutants, are present, at ca. 1 copy each. Had the population been 10-fold larger, each of the 620 million different 2-error sequences would be present in the population, for mutation rates from 4 x 10^-5 to 4 x 10^-4.
- Almost 10^8 different 3-error sequences are present. They represent only 0.001% of the possible 3-error sequences.
- Mutants with 4, 5, or more errors are present at progressively lower numbers.
- Remarkably, there are several hundred 10-error mutants in the population. However, only an astronomically small fraction (ca. 10^-37) of the possible 10-error sequences is sampled.
- The typical manual or automated sequencing procedure cannot detect bases whose abundance is less than about 20% of the total. Thus the master sequence and the consensus sequence of the population are the same over much of the range of mutation rates considered here.
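The sampling densities quoted in the list above can be approximated as follows. This sketch assumes, as a rough estimate that ignores sampling noise, that the fraction of an error class that is present equals the expected number of copies per specific sequence, capped at 1 once the class is saturated. The function names are illustrative.

```python
import math

def class_size(E, L):
    """Number of distinct E-error sequences: C(L, E) * 3^E."""
    return math.comb(L, E) * 3 ** E

def copies_per_sequence(E, L, mu, population):
    """Expected copies of one specific E-error sequence in the population."""
    return population * (mu / 3.0) ** E * (1.0 - mu) ** (L - E)

def sampling_density(E, L, mu, population):
    """Approximate fraction of the E-error class that is present
    (assumption: expected copies per sequence, capped at 1)."""
    return min(1.0, copies_per_sequence(E, L, mu, population))

L, mu, population = 11703, 1e-4, 1e9
for E in (1, 2, 3, 10):
    density = sampling_density(E, L, mu, population)
    present = density * class_size(E, L)
    print("E = %2d  density = %.3g  sequences present = %.3g" % (E, density, present))
```

For E = 10 the estimate gives a few hundred sequences present out of roughly 10^39 possible ones, consistent with the density of ca. 10^-37 quoted above.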
The sequence diversity of the population
The sequence diversity of the population consists of the single master sequence, plus all the different mutants in the population.
Diversity of progeny genomes in a population of 10^4
The sequence diversity in a population size of 10^4, e.g., progeny from a single infection, is shown using a linear (left panel) or a log (right panel) scale, with L = 11703, and µ = 10^-5 to 10^-3:
Figure 3. Sequence diversity in a population of 10^4 progeny genomes
- The single master sequence (E=0) contributes a diversity of 1 over the entire range of mutation rates considered.
- At mutation rates between 10^-5 and 10^-4:
  - Mutants of the 1-error class contribute the bulk of the diversity, with some 1000 to 3600 sequences. Sampling of the 1-error class is not saturated in this range. Only 1 or very few copies of each sequence are present in the population.
  - The 2-error class contributes from 60 to 2100 sequences. Only 1 or very few copies of each of these sequences are expected to be present.
- At higher mutation rates, mutants of the 2-, 3-, and higher error classes in turn contribute the bulk of the diversity. Again, each of the error classes is sampled at very low densities, and each sequence, when present, is present at a low copy number.
- The sequence diversity of the population, e.g., almost 7000 at a mutation rate of 10^-4, is only slightly less than the population size!
Similar conclusions apply when we consider a population of 10^9:
Figure 4. Sequence diversity in a population of 10^9 progeny genomes
- The single master sequence (E=0) contributes a diversity of 1 over the entire range of mutation rates considered.
- Figure 1 shows that any of the 1-error sequences is present at 1 to 10^4 copies each, over the entire range of mutation rates considered. As there are 3.5 x 10^4 1-error sequences, their contribution is a constant 3.5 x 10^4 over the whole range of mutation rates.
- Other error classes contribute variable amounts of diversity (e.g., about 2 x 10^8 sequences are contributed by the 2-error class at µ = 10^-4), because the sampling of these classes is not saturated.
- At mutation rates below ca. 10^-4, mutants of the 2-error class contribute the bulk of the diversity. At these mutation rates, about 1% to one-half of the possible 2-error sequences is randomly sampled (see Figure 1). Those that are present occur at low copy numbers.
- At higher mutation rates, mutants of the 3-, 4-, 5-error and higher error classes in turn contribute the bulk of the diversity. Again, each of the error classes is sampled at very low densities, and each mutant is present at a low copy number.
- The sequence diversity of the population is only slightly less than the population size!
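The diversity figures discussed in this section can be estimated by adding up, for each error class, the number of different sequences it contributes. The sketch below assumes that a class contributes either its total abundance (when it is sparsely sampled, at about 1 copy per sequence) or its full size (when it is saturated), whichever is smaller. That rule, the cutoff `max_E`, and the function names are assumptions of this example.

```python
import math

def class_abundance(E, L, mu, population):
    """Expected number of genomes with exactly E errors (binomial, log space)."""
    log_p = (math.lgamma(L + 1) - math.lgamma(E + 1) - math.lgamma(L - E + 1)
             + E * math.log(mu) + (L - E) * math.log1p(-mu))
    return population * math.exp(log_p)

def class_size(E, L):
    """Number of distinct E-error sequences: C(L, E) * 3^E."""
    return math.comb(L, E) * 3 ** E

def diversity(L, mu, population, max_E=60):
    """Approximate number of different sequences in the population
    (assumption: each class contributes min(abundance, class size))."""
    return sum(min(class_abundance(E, L, mu, population), class_size(E, L))
               for E in range(max_E + 1))

L = 11703
for population in (1e4, 1e9):
    for mu in (1e-5, 1e-4, 1e-3):
        print("N = %.0e  mu = %.0e  diversity = %.4g"
              % (population, mu, diversity(L, mu, population)))
```

With L = 11703, µ = 10^-4 and 10^4 progeny genomes, this estimate comes to roughly 6900 different sequences, matching the "almost 7000" quoted above.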
The total diversity available for selection is enormous!
- Different virus populations will contain different, random samplings of the unsaturated error classes. Thus the number of sequences potentially available for selection depends on the specific composition of any particular population, and it also strongly depends on the total number of progeny genomes that are present in the several populations.
- If any of the mutants, with E mutations, is viable, its progeny constitutes additional diversity originating at E steps away from the master sequence.
- These ensembles of diverse and inter-converting (L-dimensional hypercube) sequences are called quasispecies by Eigen, and in all likelihood they constitute the unit for selection. (Selection operating on quasispecies will be considered on a separate page).
An analogy may be useful, to help visualize the quasispecies
This is a picture of the globular star cluster called M13 (Messier 13), in the Hercules constellation.
We can imagine that each point in regular 3-dimensional space corresponds to a sequence. The stars represent those sequences that are actually present in the population. At the center of the cluster is the master sequence. Immediately surrounding it are sequences with 1 error. Sequences with 2, 3, and more errors are progressively farther out.
The sampling of sequence space by the virus population is locally dense, with very diverse, random, but low density sampling at farther distances from the master sequence. For example, with L = 11703, µ = 10^-4, and a population size of 10^9, the space within a Hamming distance of 0 or 1 is saturated, and it is 33% saturated at a Hamming distance of 2.
What about even larger virus populations?
Take HIV as an example: some 30 million humans are infected globally. Before the onset of AIDS, each produces about 10^9 to 10^10 progeny genomes per day. This is equivalent to a daily global HIV production of ca. 10^17 genomes.
With 10^17 genomes, all of error classes 0 to 3 are saturated, error class 4 is about 4% saturated, and the saturation of error class 5 is ca. 10^-6. The saturation of error classes 6 and higher is less than 10^-10.
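The saturation figures for the global population can be checked with the same arithmetic. This sketch assumes the example values used earlier on this page (L = 11703 and µ = 10^-4), which are not HIV-specific parameters but do reproduce the numbers quoted here; the function name `copies_per_sequence` is illustrative.

```python
def copies_per_sequence(E, L, mu, population):
    """Expected copies of one specific E-error sequence in the population."""
    return population * (mu / 3.0) ** E * (1.0 - mu) ** (L - E)

# Daily global production: ca. 30 million infected people,
# each producing 10^9 to 10^10 genomes per day.
infected = 30e6
low, high = infected * 1e9, infected * 1e10
print("daily production between %.0e and %.0e genomes" % (low, high))

L, mu = 11703, 1e-4   # assumed example values, as used earlier on this page
population = 1e17
for E in range(3, 7):
    copies = copies_per_sequence(E, L, mu, population)
    print("E = %d  copies per specific sequence = %.3g  saturation = %.3g"
          % (E, copies, min(1.0, copies)))

# Smallest population order of magnitude at which error class 3 is saturated.
print("E = 3 at 10^14 genomes:", copies_per_sequence(3, L, mu, 1e14))
```

Under these assumptions each specific 3-error sequence is expected about a thousand times in 10^17 genomes, and about once in 10^14 genomes.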
Summary
We have so far limited ourselves to considering only the number and diversity of mutants produced during viral replication. A priori, it is not possible to predict whether the mutants are viable, or what their fitnesses might be.
Nevertheless, we can hazard some reasonable deductions, based on what is known about specific viral mutants. For example, some HIV mutants with multiple mutations are known to be viable, and are multiply drug-resistant. The fact that error class 3 is saturated with viral populations as small as 10^14 suggests that any specific, triple mutant of HIV that is multi-drug resistant is generated a thousand times each day in the global population.
How selection might act upon the mutants is considered in more detail in the next page.
References
Those that are not available through PubMed of the National Library of Medicine, USA.
- Eigen, M., and C. K. Biebricher. 1988. Sequence space and quasispecies distribution, p. 211-245. In E. Domingo, J. J. Holland, and P. Ahlquist (ed.), RNA Genetics: Variability of RNA Genomes, vol. 3. CRC Press Inc., Boca Raton, FL.