Randomly Generated Biologically Functional Proteins?
By Mike Gene
Keefe and Szostak recently reported the isolation of ATP-binding proteins from a random sequence pool and conclude that functional proteins are sufficiently common in protein sequence space. However, their methods introduced elements that detract from any attempt to assess the role of purely stochastic means, and the function they isolate appears to be functionless in a biological context.
How likely is it that a randomly generated sequence of amino acids would fold into a functional protein? Frances H. Arnold, Dickinson Professor of Chemical Engineering and Biochemistry at California Institute of Technology, helps us to appreciate this problem from the perspective of a scientist who designs proteins:
Evolution may sound easy - just make mutations and see what happens. But that's not the case if you care about where you're going. Without a good strategy, your experiments are doomed to failure. That's because a typical protein has some 300 amino acids in its chain, and, with 20 letters in the amino-acid alphabet, there are 20^300 ways to string those letters together. That's huge beyond imagination; huge beyond the number of protons in the universe. And this sequence space, if you will, is mostly empty - at least, mostly empty of the function you're interested in. So if you just wander around willy-nilly, it's not going to be a very useful exercise. For that reason, we do what nature does - we carry out local explorations of the space around existing, functioning molecules. We jump a natural enzyme through a new hoop, and accumulate mutations that help it jump higher. [1]
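As a rough check on the scale Arnold describes, the size of the sequence space can be computed directly. This is only a sketch: the figure of about 10^80 protons in the observable universe is a commonly quoted order-of-magnitude estimate assumed here, not a number taken from the quotation, and the variable names are illustrative.

```python
import math

ALPHABET_SIZE = 20        # amino acids in the standard alphabet
CHAIN_LENGTH = 300        # "typical protein" length used in the quotation
PROTONS_LOG10 = 80        # ASSUMPTION: ~10^80 protons, order of magnitude only

# Python integers are arbitrary precision, so this is exact.
sequences = ALPHABET_SIZE ** CHAIN_LENGTH

# Work in log10 to express the result as a power of ten.
log10_sequences = CHAIN_LENGTH * math.log10(ALPHABET_SIZE)
excess_orders = log10_sequences - PROTONS_LOG10

print(f"20^300 is about 10^{log10_sequences:.1f}")
print(f"That exceeds 10^{PROTONS_LOG10} by about {excess_orders:.0f} orders of magnitude")
```

Under that assumption, the number of possible 300-residue chains exceeds the proton count by hundreds of orders of magnitude, which is the point of the quotation.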
Yet a recent study published in Nature suggests functionality in populations of random polypeptides is pretty common. [2] The abstract to the article is as follows:
Functional primordial proteins presumably originated from random sequences, but it is not known how frequently functional, or even folded, proteins occur in collections of random sequences. Here we have used in vitro selection of messenger RNA displayed proteins, in which each protein is covalently linked through its carboxy terminus to the 3' end of its encoding mRNA, to sample a large number of distinct random sequences. Starting from a library of 6 x 10^12 proteins each containing 80 contiguous random amino acids, we selected functional proteins by enriching for those that bind to ATP. This selection yielded four new ATP-binding proteins that appear to be unrelated to each other or to anything found in the current databases of biological proteins. The frequency of occurrence of functional proteins in random-sequence libraries appears to be similar to that observed for equivalent RNA libraries.
The authors end their article with this rather bold conclusion: "we suggest that functional proteins are sufficiently common in protein sequence space (roughly 1 in 10^11) that they may be discovered by entirely stochastic means, such as presumably operated when proteins were first used by living organisms." Unfortunately, there seem to be two fundamental flaws in this paper that render such a conclusion unwarranted (I strongly suggest you obtain and read this paper to get the most out of the following analysis).
The protein "function" was not randomly generated.
As a consequence of their methods, these researchers were able to isolate proteins that did bind ATP with very respectable affinity. However, to find such proteins, the researchers essentially had to abandon a truly stochastic process. Let me explain.
The experiment starts out fine. They generate 6 x 10^12 random proteins, incubate with an ATP-agarose affinity matrix, wash off the unbound material, retrieve the bound material, and then use the bound material to start the cycle over again. Basically, they are purifying this random population such that only those sequences that bind ATP remain. After eight rounds of this cycle, the fraction of ATP binders rose from 0.1% to 6.2%. At this point, the researchers decided to take a look at the sequences that were binding and found the binders to be dominated by four distinct families of proteins (none show similarity to each other or any other biological protein). This, however, is not all that impressive, because it basically means that these four classes of ATP-binders bound ATP very weakly (in other words, the binders were about 20 times more likely to exist in an unbound state than bound to ATP). In fact, the researchers themselves noted, "One possible explanation for this low level of ATP-binding is conformational heterogeneity, possibly reflecting inefficient folding of these primordial protein sequences."
At this point, the researchers switched gears. They used PCR to introduce point mutations into the binders for 3 consecutive rounds, at a mutagenic rate of 3.7% per amino acid for each round. After these three cycles of mutagenesis [3], the researchers went back to the original procedure. But in my opinion, this is cheating the "entirely stochastic means." When point mutations are introduced, most of the protein sequence is held constant so that we can then sample in the nearby area. Furthermore, because of the structure of the genetic code, a single-nucleotide change can reach only a subset of the other amino acids, so the search was no longer truly random.
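To put a number on "most of the protein sequence is held constant," here is a minimal sketch. It assumes, purely for illustration, that each position mutates independently at the quoted 3.7% per round and that the protein has the 80 random positions mentioned in the abstract; the paper's actual mutation spectrum is not modeled, and the names below are hypothetical.

```python
RATE_PER_ROUND = 0.037   # mutagenic rate per amino acid per round (from the text)
ROUNDS = 3               # consecutive mutagenesis rounds (from the text)
LENGTH = 80              # random positions per protein (from the abstract)


def fraction_untouched(rate: float, rounds: int) -> float:
    """Probability that a given position is never mutated.

    ASSUMPTION: positions mutate independently with the same rate each round.
    """
    return (1.0 - rate) ** rounds


untouched = fraction_untouched(RATE_PER_ROUND, ROUNDS)
expected_changed = LENGTH * (1.0 - untouched)

print(f"Fraction of positions never mutated: {untouched:.3f}")
print(f"Expected positions hit at least once: {expected_changed:.1f} of {LENGTH}")
```

Under these assumptions, close to nine-tenths of each sequence is carried through the three rounds unchanged, so the search stays in the immediate neighborhood of the starting binders.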
Consider an analogy at this point. Say we want to know the odds of being dealt a Royal Flush as our first hand and must rely on purely empirical means. Okay, we begin to get discouraged when, after being dealt 5^7 hands, we don't get a Royal Flush. So we change the rules slightly. We start again, and after 1000 deals, are dealt a Jack, Queen, and Ace of Hearts. We then keep the cards and ask to be dealt only two this time around. Clearly, no matter how long it takes, our chances of getting a Royal Flush are now much higher than if we had to be dealt a fresh 5 cards every time around.
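The card analogy can be made concrete with a short calculation. This sketch assumes a standard 52-card deck, that the two discarded cards are set aside rather than reshuffled, and that neither discard was the Ten or King of Hearts; those details are not specified in the analogy itself.

```python
from math import comb

DECK_SIZE = 52
HAND_SIZE = 5

# Fresh deal: exactly 4 of the C(52, 5) possible hands are Royal Flushes.
p_fresh = 4 / comb(DECK_SIZE, HAND_SIZE)

# Holding J, Q, A of Hearts and drawing two replacements.
# ASSUMPTION: the two discards are set aside, so 47 unseen cards remain,
# and exactly one pair (Ten and King of Hearts) completes the hand.
unseen = DECK_SIZE - HAND_SIZE
p_draw_two = 1 / comb(unseen, 2)

improvement = p_draw_two / p_fresh

print(f"Fresh deal:        1 in {round(1 / p_fresh):,}")
print(f"Hold 3, draw 2:    1 in {round(1 / p_draw_two):,}")
print(f"Improvement factor: about {improvement:.0f}x")
```

Holding three of the needed cards improves the odds by a factor of several hundred, which is why the two situations cannot be treated as the same kind of search.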
This is essentially what the researchers did. They had something that "kind-of" bound ATP and held it mostly constant as they tweaked around its periphery. And what's more, when you analyze the results of such tweaking, something interesting emerges. First, the end result was that only one of the original four protein families (family B) was now present and modified. Were the other three competed away so quickly or could they not be improved? It would be interesting to repeat the experiment without a representative of family B. Even more telling, the modifications seemed to selectively enrich 20 positions, and among these, 7 of 20 were positively charged amino acids (something that is expected to non-specifically tighten association with a negatively charged molecule such as ATP).
The bottom line is that what was found by "entirely stochastic means" were the four weak binders. The proteins with the respectable binding affinities were essentially found by introducing manipulations that add to "entirely stochastic means," thus rendering the calculations essentially useless for these proteins. The notion that roughly 1 in 10^11 randomly generated proteins will contain "functional" species is without support.
Is the "function" really a function?
The function uncovered by this paper is very weak ATP-binding. But I seriously question whether we can call this a function from a biological perspective. In biology, a function usually looks like this: X is modified by Y such that X now modifies Z. In other words, there is a consequence entailed by function. Yet there is no consequence in the proteins described in this paper - they simply weakly bind ATP. Reductionism can cause us to lose sight of the important biological context.
Of course, at the core, all proteins basically do is bind things. Even enzyme catalysis is about the binding of transition states. But to view proteins as mere binders is like viewing carpentry as merely nailing boards together. In biology, what matters is not merely binding, but how things bind, where things bind, and when they bind. What matters is the context of binding. And it is not surprising that the Nature paper does not describe ATP hydrolysis activity of these proteins. That would entail not just binding ATP, but also the specificity required to position the substrate so that it could bind another network of amino acids that position themselves to stabilize the transition state.
Thus, to bring this paper into a position where it is relevant to "living organisms", we have to go back to the drawing board and demonstrate the following:
Starting with a random pool of polypeptides, isolate proteins that both specifically bind and hydrolyze ATP.
But even this would not be sufficient, as ATP hydrolysis by itself is not important to biology (in fact, it is simply an energy sink). A further modification would look as follows:
Attach protein A to a column instead of ATP. Subject it to a random pool of polypeptides that will stick to the protein in an ATP-dependent fashion. That is, without ATP, the randomly generated polypeptide would not bind, but upon binding ATP, it could bind.
This would be a much more impressive result, but it still has the problem of front-loading the whole system with protein A: the binding polypeptide is isolated as a function of protein A, so the likelihood of protein A's existence would have to be factored into the equation. If any researchers want to demonstrate that biologically functional proteins are sufficiently common in protein sequence space, then they will need to do as follows:
From a randomly generated pool of polypeptides, isolate two (or more) proteins among the pool that interact with each other in an ATP-dependent fashion.
I predict that if such proteins are ever found this way, a random pool much larger than 6 x 10^12 will be needed.
Further observations
There are some interesting features of the strong ATP-binders that were generated by the additional input of point mutations. The researchers uncovered evidence that their ATP-binding ability might require a coordinated metal ion to constrain the protein's conformation. They note that this "result suggest metal ion coordination may be one of the simplest ways of generating folded proteins while minimizing the information required to specify a function sequence." Yet, they also note "no known biological nucleotide-binding domain is a zinc-stabilized structure." ATP-binding is a universal feature of life and should be expected to reflect ancient states. Thus, it is interesting that the "simplest way" of generating folded ATP-binding proteins was not the way life apparently chose. Could this again reflect that mere ATP-binding is biologically irrelevant?
Furthermore, in order to characterize the strong binders, they themselves had to be fused to an E. coli protein as "the solubilities of the free proteins were too low to permit full characterization." It seems likely that even these improved proteins were still floppy enough to interact with each other and aggregate. This would seem to be another indicator that these are proteins that would be irrelevant to life forms.
Finally, one more comment on the notion that functional proteins are sufficiently common in protein space such that some ill-defined primordial organism would find them. What is this elusive mechanism that goes about sampling from a hundred thousand million candidate proteins to find one success case (and keep in mind that the "success" in this experiment is so unimpressive that there is no reason to think any biological organism would view it as functional)? No one has ever offered the slightest evidence to suggest the prebiotic soup was charged with so many 80-residue proteins. We could always appeal to the imaginary processes of the imaginary riboorganisms, but why would they expend so much energy generating so much waste? Given that even this most meager, unimpressive, and biologically irrelevant success would only come about every hundred thousand million proteins, it would seem that selection would prevent organisms from expending energy digging this dry hole. In fact, evolution itself seems to indicate that this number is not "sufficiently common." It is well known in evolutionary biology that evolution reuses things rather than inventing them de novo. Standard evolutionary theory teaches that new proteins are formed by reshuffling (exon shuffling) or tweaking copies of pre-existing proteins (gene duplication). Evolutionary theory does not seriously factor in de novo proteins generated by random processes because it apparently does not happen (or if it happens, it is so rare as to be unimportant to the general processes of evolution) [4]. So just where does this study apply to biology?
Citations
1. http://pr.caltech.edu/periodicals/EandS/articles/arnold/arnold.html
2. Keefe AD, Szostak JW. Functional proteins from a random-sequence library. Nature 2001 Apr 5;410(6829):715-8
3. It is interesting to note that the mutagenesis rounds actually decreased the % binders - see figure 2.
4. The closing sentence of the article is interesting: "However, this frequency is still low enough to emphasize the magnitude of the problem faced by those attempting de novo protein design."
Tangential Addendum
Let's take the human ubiquitin protein. It is 76 amino acids in length and has the following sequence:
MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGG
Now, if we could generate a random pool of 76mers containing 1 x 10^99 peptides, human ubiquitin might very well be among them. That is, since there is no natural law that prevents the above sequence from forming randomly, it can happen.
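The pool size of 10^99 is close to the total number of possible 76mers, which the following sketch makes explicit. It assumes each peptide in the pool is drawn uniformly and independently from all 20^76 sequences, and it uses a Poisson approximation for the chance of at least one exact match; both are simplifying assumptions introduced here for illustration.

```python
import math

ALPHABET_SIZE = 20
LENGTH = 76               # residues in human ubiquitin
POOL_SIZE = 10 ** 99      # hypothetical pool size from the text

# Exact count of distinct 76mers (arbitrary-precision integer).
sequence_space = ALPHABET_SIZE ** LENGTH

# ASSUMPTION: uniform, independent sampling of sequences.
expected_copies = POOL_SIZE / sequence_space

# Poisson approximation: P(at least one copy) = 1 - exp(-expected copies).
p_at_least_one = 1.0 - math.exp(-expected_copies)

print(f"Distinct 76mers: about 10^{math.log10(sequence_space):.1f}")
print(f"Expected copies of one specific sequence: {expected_copies:.2f}")
print(f"Chance it appears at least once: {p_at_least_one:.2f}")
```

Under these assumptions, a specific 76-residue sequence is more likely than not to appear in such a pool, but it is far from guaranteed, which fits the wording "might very well be among them."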
But from here, things get tricky. What is the function of ubiquitin? Its "official" biological function is as follows:
INVOLVED IN THE ATP-DEPENDENT SELECTIVE DEGRADATION OF CELLULAR PROTEINS, THE MAINTENANCE OF CHROMATIN STRUCTURE, THE REGULATION OF GENE EXPRESSION, THE STRESS RESPONSE, AND RIBOSOME BIOGENESIS.
Okay. But now let's express human Ub in E. coli. What's this? The 76mer with the exact same sequence no longer has a biological function, even though it is expressed in a biological organism. Worse yet, let's send a crop-duster back into prebiotic times and spray the oceans with human Ub. Will it have a "biological function?" It certainly won't be INVOLVED IN THE ATP-DEPENDENT SELECTIVE DEGRADATION OF CELLULAR PROTEINS, THE MAINTENANCE OF CHROMATIN STRUCTURE, THE REGULATION OF GENE EXPRESSION, THE STRESS RESPONSE, AND RIBOSOME BIOGENESIS.
If we are to accept that these proteins demonstrate "biological function," it is important to define "biological function." Y'see, human Ub has no function when transformed and expressed in E. coli (as is the case for most human proteins). And it has no function on a sterile planet. What this means is that the "function" or "biological activity" is not sufficiently encoded in the amino acid sequence. That was merely a necessary ingredient. Something else is needed - a context that extracts the functional potential encoded in the sequence. E. coli and sterile planets fail to provide the appropriate matrix to unleash Ub-function. In other words, one could argue that in E. coli and on the sterile earth, human Ub has no informational content (but here things get philosophical, so I won't go there). But this is why any informational calculation that focuses only on the sequence of a protein-in-question is a vast underestimate. The protein's function is a symbiotic interplay between that sequence and the matrix.
So let's again turn from Ub to the Keefe and Szostak proteins that weakly bind ATP. If you ask someone to explain why these proteins exhibit biological function, the answer is usually along the lines that since they bind ATP, and ATP binding is important to life, it's a biological function. Of course, this still doesn't define "biological function" to help us determine if this specific activity qualifies as such. All we have is a vague analogy. That is, these proteins have the superficial appearance of biological function and that's all. These proteins kinda look like something we might find in a cell (as long as we don't look too closely), so hey, let's call their activity a "biological function." That's not a good way of thinking about things. Let's consider this "function." First, has anyone expressed these ATP-binding proteins in any biological organism and determined them to have a biological function? No. Well, then there is no empirical justification for declaring these proteins to have biological function. It's that simple and that easy to derail any claim about "biological function."
But let's simply focus on "function." Remember, as we saw with human Ub, 'function' needs a matrix. Well, what is the function of the Keefe and Szostak proteins? They weakly bind ATP. And? Well? What's the function of this ATP binding? To get a paper published. I'm not trying to be cynical. Apart from serving publishing goals, what else do these weak ATP-binders do? What else can they do? The matrix that extracts the "function" of these proteins is completely artificial - Keefe and Szostak's lab and the world of the scientific literature. Drop these proteins in E. coli and do they function? No. Drop them in the ocean and do they function? Nope. Drop them on the prebiotic earth. Would they function? No one has offered a single solid reason to think they would. After all, consider the simple fact that no one has generated ATP in a realistic prebiotic setting. The one thing these proteins bind to wasn't even there. But let's pretend ATP was present! And these proteins did bind ATP now-and-then. How is this a "function?"
So y'see, when it comes to the origin of life (OOL), there is simply no reason to think all the research into "randomly generated function" is even relevant. You first need to establish the functions needed to convert earth processes into the first life forms at the base of our biotic history. What was the minimal set of "functions" involved? Then you can go on a fishing expedition to see how easy it would be to generate this. To put it another way, the answer to the question, 'Can random forces generate CSI?' is: it depends. It depends on a) the function you are invoking and b) the nature of the historical circumstances during the time/place you invoke the function. Define those and the question can be answered. In other words, I don't see this as a philosophical generalization. Never have. I see it as a historical question.
Thus, if we are talking about primordial planets and the OOL, the real question is this:
Did the context of the pre-biotic Earth suffice for generating the type of CSI needed to spawn the type of Life that is known to exist?