Cytosine Deamination and Evolution: A View From Both Sides
By Mike Gene (2/1/03)
I have previously illustrated that the effects of cytosine deamination, expressed through the conventional genetic code, appear to bias mutational consequences by increasing a protein's hydrophobicity and predisposition to form secondary structures. [1] This form of bias may have been designed into life's matrix to facilitate evolution. [2]
Up to this point, however, I have only explored the consequences of deamination in the coding strand of DNA (the strand of the double helix that is not transcribed and thus shares the sequence of the RNA product formed from the transcribed strand). Yet, since C is normally paired with G, C-T transitions on either strand of any gene will also result in G-A transitions on the opposite strand. In other words, a gene with the sequence CGATGTAACGTAGTA that is subject to cytosine deamination will experience transitions at both the C and G sites.
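The strand relationship can be sketched in a few lines of Python. This is only an illustration: the function names are my own, and it converts every cytosine on a strand at once, whereas real deamination events are rare and hit individual bases.

```python
# Watson-Crick pairing used to derive the opposite strand.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(seq):
    """Return the opposite strand, read 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))

def deaminate_all(seq):
    """Illustrative only: turn every C on this strand into T."""
    return seq.replace("C", "T")

coding = "CGATGTAACGTAGTA"  # example sequence from the text

# Deamination on the coding strand appears directly as C->T.
coding_hit = deaminate_all(coding)

# Deamination on the template strand appears, once the change is
# copied back to the coding strand, as G->A.
template_hit = reverse_complement(deaminate_all(reverse_complement(coding)))

print(coding)        # original coding strand
print(coding_hit)    # C sites changed to T
print(template_hit)  # G sites changed to A
```

Comparing the three printed strings position by position shows that the C sites and the G sites of the coding strand are affected by deamination on opposite strands.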
I originally focused only on the C-T transitions because of an asymmetry that exists as a function of transcription. As explained by Klapacz and Bhagwat [3]
quote:
Transcription is an inherently asymmetric process that separates transiently the two strands of DNA and copies one strand as RNA. One DNA strand (the transcribed or template strand [TS]) is paired with 8 to 9 nucleotides of RNA in the transcription bubble and is enveloped by the RNA polymerase. The other DNA strand (the nontranscribed or nontemplate strand [NTS]) is unpaired and is thought to lie on the outside of the RNA polymerase (9). This asymmetry creates differential sensitivities of the two DNA strands within the bubble for chemical probes such as hydroxyl radicals and permanganate (8, 11). Beletskii and Bhagwat have shown that there is also asymmetry in the susceptibility of cytosines in the two strands to deamination (3). Cytosines in the NTS are up to 10 times more likely to deaminate to uracil than those in the TS (1, 3), and in a strain of Escherichia coli defective in uracil excision (genotype ung), this causes a strand-dependent increase in C-to-T mutations. We refer to instances of this phenomenon as transcription-induced mutations (TIM). The extent of this susceptibility of cytosines in the NTS to deamination is roughly proportional to the frequency of transcription of the gene (1). This phenomenon has been seen with plasmid-borne as well as chromosomal genes (2) and with genes transcribed by the T7 RNA polymerase (4). Further, the frequency of cytosine deamination is directly related to the length of time the transcription bubble stays open (4).
Nevertheless, since the difference in susceptibility is at most about 10-fold, it would be prudent to survey the effects of the GA transitions that are coupled to the CT transitions as a consequence of cytosine deamination.
There are 48 possible transitions that involve codons with G residues. Of those 48, 15 are silent and 2 are nonsense mutations, leaving 31 amino acid substitutions. The amino acids replaced in these 31 substitutions are methionine, valine, alanine, aspartate, glutamate, cysteine, arginine, serine, and glycine. As a consequence of GA transitions, this pool is changed to isoleucine, methionine, threonine, asparagine, lysine, tyrosine, histidine, glutamine, serine, aspartate, arginine, and glutamate. Thus, while CT transitions started with a small pool (7) and expanded it slightly (9), GA transitions begin with a larger pool (9) and expand it further (12). Using the hydrophobic scale employed in the original analysis [1] and weighting these values with the number of codons for each amino acid, the average change in hydrophobicity between the two pools is -0.04 (the negative sign means that the amino acid pool unleashed by GA transitions is only slightly less hydrophobic). This is in striking contrast to the change that is mediated by CT transitions, where there is an average increase in hydrophobicity of 0.364. Furthermore, 86% of C->T mutations increase hydrophobicity (the Pro-Ser changes are the only exceptions), while 15/31 G->A mutations (48%) decrease hydrophobicity. It should be pointed out, however, that 8 of these 15 involve glycine codons and these account for most of the loss in hydrophobicity (where glycine is replaced by serine, aspartate, glutamate, and arginine). If we omit the glycine residues (see below), the G->A mutations do unleash a more hydrophobic pool, but only with an average increase in hydrophobicity of 0.07.
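The codon counts above can be checked by brute force. The following is a minimal sketch, assuming the standard genetic code; the helper name `classify` and the result layout are my own. It does not attempt the hydrophobicity averages, since those depend on the scale used in the original analysis [1].

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def classify(src, dst):
    """Apply every single-base src->dst change to every codon and tally
    silent, nonsense, and missense (amino acid substitution) outcomes."""
    counts = {"silent": 0, "nonsense": 0, "missense": 0}
    substitutions = []  # (original amino acid, new amino acid) pairs
    for codon, aa in CODE.items():
        for i, base in enumerate(codon):
            if base != src:
                continue
            new = CODE[codon[:i] + dst + codon[i + 1:]]
            if new == aa:
                counts["silent"] += 1      # includes stop -> stop
            elif new == "*":
                counts["nonsense"] += 1
            else:
                counts["missense"] += 1
                substitutions.append((aa, new))
    return counts, substitutions

for src, dst in (("C", "T"), ("G", "A")):
    counts, subs = classify(src, dst)
    before = {a for a, _ in subs}   # pool of amino acids replaced
    after = {b for _, b in subs}    # pool of amino acids introduced
    print(src, "->", dst, counts, len(before), "->", len(after))
```

For G->A this reproduces the 15 silent, 2 nonsense, and 31 substitution counts and the 9-to-12 pool sizes; for C->T it gives 27 substitutions and the 7-to-9 pool sizes.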
Another expression of this asymmetry can be seen by considering the number of conservative changes involved with both sets of mutations. As seen before, if we apply the Gonnet PAM250 matrix to the effects of CT transitions, only 4 of the 27 substitutions involve switches among the "strong" group of amino acids (L -> F; H -> Y) and another 8 occur among the "weak" group (A -> V; P -> S). Thus the majority of these mutations involve nonconservative changes in amino acid properties. In contrast, the effects of GA transitions entail mostly conservative changes: 19/31 amino acid substitutions draw from the same "strong" group, 6/31 draw from the "weak" group, and only 6/31 entail nonconservative changes (cys -> tyr; gly -> arg; gly -> glu).
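These tallies can also be reproduced programmatically. The sketch below assumes that the "strong" and "weak" groups are the Gonnet PAM250 conservation groups used by Clustal for its alignment annotation; that is my assumption about what was used here, and the function names are illustrative.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Assumed: Clustal's Gonnet PAM250 "strong" and "weak" groups.
STRONG = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK", "MILV", "MILF", "HY", "FYW"]
WEAK = ["CSA", "ATV", "SAG", "STNK", "STPA", "SGND",
        "SNDEQK", "NDEQHK", "NEQHRK", "FVLIM", "HFY"]

def conservation(a, b):
    """Label a substitution by the most conservative group holding both."""
    if any(a in g and b in g for g in STRONG):
        return "strong"
    if any(a in g and b in g for g in WEAK):
        return "weak"
    return "nonconservative"

def tally(src, dst):
    """Count amino acid substitutions caused by src->dst, by category."""
    counts = {"strong": 0, "weak": 0, "nonconservative": 0}
    for codon, aa in CODE.items():
        for i, base in enumerate(codon):
            if base != src:
                continue
            new = CODE[codon[:i] + dst + codon[i + 1:]]
            # Skip silent changes and anything involving a stop codon.
            if new != aa and "*" not in (aa, new):
                counts[conservation(aa, new)] += 1
    return counts

ct_tally = tally("C", "T")
ga_tally = tally("G", "A")
print("C->T", ct_tally)
print("G->A", ga_tally)
```

Under that assumption the output matches the counts in the text: 4 strong and 8 weak out of 27 for C->T, and 19 strong, 6 weak, and 6 nonconservative out of 31 for G->A.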
Thus, at this level, it would appear that adding in the effects of GA transitions does not significantly alter the effects of CT transitions, as the former mutations result in conservative changes with modest changes in hydrophobicity, leaving the Increasing Hydrophobicity Effect, as mediated by CT transitions, largely intact. In Figure 1a, the hydrophobicity change (weighted by codon usage) for each amino acid substitution mediated by CT transitions is plotted. Figure 1b shows the same analysis for GA transitions. Note that in 1b, most of the changes cluster around zero.

FIGURE 1. Hydrophobicity Changes Associated with Mutations. The original amino acid in all cases is assigned a value of 0 and the weighted change in hydrophobicity is the end point. A positive slope indicates an increase in hydrophobicity associated with that particular amino acid substitution.
Since three of the same amino acids are affected by CT and GA transitions (serine, alanine, and arginine), the net effect of both mutations was calculated and factored into Figure 1c, which shows the complete effect of both CT and GA transitions. Note that with the exception of the addition of one line with a significant negative slope, and some changes that have little effect, Figure 1c is very similar to 1a.
Another way to approach this phenomenon is to consider only the residues that experience nonconservative mutations. Let us assign the following values to every amino acid: 2 = nonconservative change; 1 = change that draws from the same weakly conserved set of amino acids; 0 = no change or a change that draws from the same strongly conserved set of amino acids. Each value is then weighted by the number of codons involved. The results are shown in Figure 2.

FIGURE 2. The twenty amino acids are represented on the x-axis. The y-axis scores the nature of the amino acid substitution and the number of codons for that amino acid that are changed. For example, there are four codons for alanine that are changed to valine. Since alanine and valine are part of the same "weak" group of amino acids, the score is 4 (4x1). The numbers above each bar represent the weighted hydrophobicity change. Blue bars represent amino acids changed through CT transitions; red bars represent amino acids changed by GA transitions.
As can be seen from Figure 2, only a small set of amino acids is targeted for radical replacement, and most of the changes are mediated through C-T transitions. The net effect of adding GA transitions is to include cysteine and glycine as targets. With the exception of glycine and proline, all the amino acids are replaced by residues with increasing hydrophobicity. The glycine and proline changes are significant in that they involve large increases in hydrophilicity.
That both glycine and proline are most commonly targeted by CG->TA transitions is most interesting. Glycine has the smallest residue volume, while proline is about twice as large. What unites these two amino acids is that they are considered "strong helix breakers," where both are the least likely to be found in an alpha helix (both with a Chou-Fasman probability of 0.57). In my original essay, I noted that almost all of the substitutions involve an increased predisposition to form both alpha helices and beta sheets. I originally used a contrived scale that partitioned the amino acids according to the Chou-Fasman rules. [1] Since CG->TA transitions target both of the strong helix breakers, I analyzed the effects of CT and GA transitions using quantitative scores of the conformational preference for each amino acid (these data are found in Creighton's text, Proteins, p. 256).
Consider proline as an example. Four CT transitions change proline to leucine. The frequency of prolines found in alpha helices is 0.34, while the leucine frequency is 1.34. The difference is 1.0 and, since four codons are involved, a score of +4.0 signifies an increased predisposition to form an alpha helix. Another four proline codons changed by C-T transitions yield serine, with a frequency of 0.57. These four codons yield a score of +0.92. Thus, together, proline substitutions are given a score of +4.92 (a negative score would indicate a decreased preference for alpha helices). The results of this type of analysis for all amino acid changes are shown in Figure 3.
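The proline arithmetic can be written out as a short calculation. This sketch uses only the three alpha helix frequencies quoted above; the function name and the dictionary layout are illustrative, and a full analysis would need the complete table from Creighton.

```python
# Alpha helix frequencies quoted in the text (one-letter amino acid codes).
HELIX_PREF = {"P": 0.34, "L": 1.34, "S": 0.57}

def preference_score(original, codon_counts, pref=HELIX_PREF):
    """Sum, over each replacement amino acid, the number of codons changed
    times the difference in preference relative to the original residue."""
    return sum(n * (pref[new] - pref[original])
               for new, n in codon_counts.items())

# Proline: four codons become leucine and four become serine.
pro_to_leu = preference_score("P", {"L": 4})
pro_to_ser = preference_score("P", {"S": 4})
proline_total = preference_score("P", {"L": 4, "S": 4})

print(round(pro_to_leu, 2), round(pro_to_ser, 2), round(proline_total, 2))
```

A positive total indicates a net increase in alpha helix preference; the same function could be applied to the beta sheet and turn frequencies given the corresponding tables.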

FIGURE 3. Changes in secondary structure preference as a consequence of CT transitions (blue), GA transitions (red), and CG-TA transitions (yellow). a. Alpha helix preferences; b. beta sheet preferences; c. turn preferences
When both forms of mutations are considered together, the vast majority of substitutions lead to an increased predisposition to form beta sheets (Figure 3b). This increased preference is mostly driven by the CT transitions, but is modestly supplemented by the GA transitions. The alpha helix preference pattern is not as clear cut, and seems to depend mostly on the removal of both strong breakers, proline and glycine. A clear pattern is seen in the turn preference, where all of the substitutions with significant magnitude point to a distinct drop in preference for the turns that are located between helices and sheets.
SUMMARY
The Increasing Hydrophobicity Effect (IHE) as mediated by C->T transitions is not significantly altered by factoring in the G->A transitions that would be coupled (Figure 1). Most of the amino acid substitutions brought on line by including G->A transitions are either conserved or of modest effect (Figures 1 and 2). The only exception is the inclusion of glycine in the amino acid pool that is targeted, meaning that the net effect of CG->TA transitions is to target both of the strongest helix breakers for possible removal, thus further facilitating the formation of new secondary structure (Figure 3). Basically, what this means is that the CT mutations elicit the IHE and increase the predisposition to form secondary structure, while the only thing the incorporation of the GA mutations seems to add is an increased predisposition to form alpha helices and eliminate turns.
The most fascinating aspect of these relationships is their dependence on the double-stranded genome. The consequences of CT mutations predominate in this dynamic and are also enhanced by the transactions that take place because a double-stranded DNA molecule is transcribed by a large protein complex. And it is also quite remarkable that proline codons are C-rich, while glycine codons are G-rich, meaning that when both strands are factored in, both of "the strong helix breakers" are targeted for removal [4]. It is quite a coincidence that the two residues that have long been noticed to share this feature [5] just happen to be subject to the most common form of base substitution.
1. Evolutions Design, TeleoLogic BR No.18
2. The IHE and Beyond, TeleoLogic BR No.19
3. Klapacz J, Bhagwat AS. Transcription-dependent increase in multiple classes of base substitution mutations in Escherichia coli. J Bacteriol 2002 Dec;184(24):6866-72.
4. A nice example from myoglobin is shown here.
5. A typical example is found at Basic Biology for Mathematicians.