How not to sample protein space

Mung has drawn our attention to a post by Kirk Durston at ENV. This is my initial reaction to his method to establish the likelihood of generating a protein with AA permease (amino acid membrane transport) capability.

Durston: “Hazen’s equation has two unknowns for protein families: I(Ex) and M(Ex). However, I have published a method to solve for a minimum value of I(Ex) using actual data from the Protein Family database (Pfam),

Translation: I have published a method to solve for a minimum value of I(Ex) among proteins that presently exist.

I downloaded 16,267 sequences from Pfam for the AA permease protein family. After stripping out the duplicates, 11,056 unique sequences for AA Permease remained.

Translation: I took some proteins that actually exist. I implicitly assume that they are a representative, unbiased sample of all the AA permeases that could exist.

the results showed that a minimum of 466 [think he means 433 – that’s the number he plugs in later anyway] bits of functional information are required to code for AA permease.

Translation: the results show that the smallest number of bits in this minuscule and biased sample of the entire space is 433.

Using Hazen’s equation to solve for M(Ex), we find that M(Ex)/N is less than 10^-140 where N = 20^433.
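A quick arithmetic check on the figures quoted above. This is only a sketch: it assumes Hazen's equation takes the form I(Ex) = -log2(M(Ex)/N), which is how the quoted passage appears to use it, and the function and variable names are illustrative.

```python
import math

def log10_fraction_from_bits(bits):
    """log10 of M(Ex)/N, assuming I(Ex) = -log2(M(Ex)/N)."""
    return -bits * math.log10(2)

frac_466 = log10_fraction_from_bits(466)  # exponent implied by 466 bits
frac_433 = log10_fraction_from_bits(433)  # exponent implied by 433 bits
log10_n = 433 * math.log10(20)            # log10 of N = 20^433

print(f"466 bits -> M(Ex)/N ~ 10^{frac_466:.1f}")
print(f"433 bits -> M(Ex)/N ~ 10^{frac_433:.1f}")
print(f"N = 20^433 ~ 10^{log10_n:.1f}")
```

Under that assumption, 466 bits reproduces the quoted 10^-140, whereas 433 bits would give roughly 10^-130, so 433 may simply be the sequence length used in N = 20^433.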

Translation: starting from my extremely tiny sample of protein space, multiplying up any distortions (eg those due to common origin or evolution) and ignoring redundancy, modularity, exaptation, site-specific variations in constraint and the possibility of anything more economically specified than an existing protein, the chance of hitting a 433-bit AA permease by a mechanism not actually known in biology is – ta-dah! – 1 in 10^140.

441 thoughts on “How not to sample protein space”

  1. colewd,

    Allan has said that you can cut these odds by substructures like helixes but we have no identified mechanism how the genome would create this organization on its own.

    That’s just tripe. I have told you how.

  2. Allan Miller,

    That’s just tripe. I have told you how.

    I guess your explanation was grossly inadequate IMHO. Partial explanations are not real explanations. To explain a process you need all the process steps. I think your idea is clever but only a baby step toward explaining how a 500 aa protein can form from a stochastic process linking one generation to the next.

  3. Mung,

    That evolution stops because it’s found an adaptive peak for some protein function. More explicitly, that exploration of sequence space ceases. One of your objections to Kirk is based on this misguided notion.

    Wrong.

    You are missing the point. Mutation is constantly probing in all directions – in that sense, evolution does not stop. But if all sequences are adaptively downhill from the current one, they will not fix (or are much less likely to) in the population. Selection keeps the population on top of the hill for that sequence, a narrow band of the broader space of possibilities.

    The concept of “a completely unrelated novel sequence which functionally replaces the first” is a straw-man.

    No it isn’t. It is completely germane to the sampling issue. If you ‘sample’ but cannot go to new regions of space, you aren’t sampling the space, you are just wandering round your own little neighbourhood. If there are no de novo sequences, only commonly descended ones, Kirk’s case is hardly boosted.

    If a gene is duplicated and subsequently diverges from the original it doesn’t have to functionally replace the first. Would you say this is not a continuation of the exploration of sequence space?

    It is irrelevant. Kirk is talking about a specific function. That was what my ‘misguided’ sketch was intended to illustrate – the reason why selection biases the population of surviving sequences for that function. You claim I am ‘misguided’ because I did not go off on your wild tangent.

    I’ll try again. Remember we are talking about function X, not some other function that is not X, nor sequences derived from one that has function X.

    A) An initial population has no function X.
    B) A poor version of function X arises and provides advantage. Therefore it fixes.
    C) A better version of function X arises. Therefore it fixes.
    D) C iterated; eventually the population reaches an adaptive peak with respect to function X for that sequence.

    Now in order to ‘thoroughly explore’ sequence space, there has to be an opportunity for a different sequence, unrelated to the first, to spread, in order to expand the sample outside of the immediate neighbourhood. Exploring your neighbourhood is not a thorough sample.

    But there already is a tuned version in residence. Any novel sequence needs tuning before it can compete with the first. And therefore, once a functional permease has arisen and been tuned, opportunities for another are severely curtailed. New random sequences and related variants of the resident one are all highly unlikely to be better than the resident tuned version. And therefore, selected space, as represented by surviving lineages, in no way adequately represents actual space.
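Steps A to D and the resident-versus-newcomer point can be sketched as a toy simulation. Everything here is an assumption made for illustration: a single-peak fitness function (number of sites matching a hypothetical best sequence, `PEAK`), a 30-residue sequence, and point mutation as the only source of variation. Real landscapes are far more rugged.

```python
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
rng = random.Random(1)             # fixed seed so the run is repeatable

LENGTH = 30
# Hypothetical best sequence for function X; purely illustrative.
PEAK = [rng.choice(ALPHABET) for _ in range(LENGTH)]

def fitness(seq):
    """Toy fitness: number of sites matching the hypothetical peak."""
    return sum(a == b for a, b in zip(seq, PEAK))

def random_sequence():
    return [rng.choice(ALPHABET) for _ in range(LENGTH)]

# Steps A-D: start from an arbitrary sequence, let point mutations arise,
# and fix only those that are not downhill.
resident = random_sequence()
history = [fitness(resident)]
for _ in range(20000):
    mutant = list(resident)
    mutant[rng.randrange(LENGTH)] = rng.choice(ALPHABET)
    if fitness(mutant) >= fitness(resident):
        resident = mutant
    history.append(fitness(resident))

# Newcomers: unrelated random sequences that would have to compete
# with the tuned resident.
newcomers = [fitness(random_sequence()) for _ in range(1000)]
resident_fitness = fitness(resident)
best_newcomer = max(newcomers)
print(resident_fitness, best_newcomer)
```

The surviving lineage ends up clustered around one peak, and untuned newcomers start far below it, which is the sense in which the survivors are a biased sample of the space.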

  4. colewd,

    I guess your explanation was grossly inadequate IMHO. Partial explanations are not real explanations. To explain a process you need all the process steps.

    For such generalities, one has to explain a process in principle. I have given several in-principle mechanisms by which sequences can lengthen and pieces become translocated between proteins. Further, when we look at actual proteins we have excellent evidence that this has actually happened.

    I think your idea is clever but only a baby step toward explaining how a 500 aa protein can form from a stochastic process linking one generation to the next.

    And therefore – if I cannot provide a detailed step-by-step audit for a protein of your choice – it formed in a big bang from a space with 20^500 members by some random-acid-gluing mechanism that does not even occur in biology? You don’t think that’s a little – ah – inconsistent?

  5. colewd: In multicellular life average protein size is around 500 aa. So given this probability that would create a range of 10^50 to 10^320.

    … a range of what?

    colewd: The total evolutionary resources available since the beginning of the earth is less than 10^50.

    What is a total evolutionary resource? Do you mean the maximum number of point mutations?

    colewd: Allan has said that you can cut these odds by substructures like helixes but we have no identified mechanism how the genome would create this organization on its own.

    Yes we do, it’s all the other types of mutations known empirically to happen. Gene duplications, exon shuffling, insertions (not restricted to single base-pairs at a time), deletions (also not restricted to single base-pairs at a time), frameshift mutations, transposons, gene fusions and so on.

    You really need to get beyond this view of mutation as restricted to just a single base-pair being replaced by another.
    Sometimes smaller chunks of protein coding genes are duplicated and inserted into other genes (which would qualify as a partial gene fusion), for example. Sometimes an entire protein coding region is duplicated and placed end-to-end with another protein coding gene, producing a giant protein that folds differently. Sometimes a single base-pair is deleted in a protein coding gene, causing a shift in the reading frame at some point in the coding region. It can be the whole gene being affected, or it can be the last 20 amino acids of that 500 amino-acid gene, or what have you. All of these things are known to happen, they are among the types of mutations that can happen.

    The causes of these mutation types are known. Skipping polymerases due to Brownian motion, chance misalignments of chromosomes during homologous recombination, certain bases being more prone to methylation and misrepairs bla bla bla. This stuff isn’t just being pulled out of a hat, there’s half a century of experiments in biochemistry and molecular biology detecting the types and figuring out the causes of these mechanisms.
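The mutation types listed above can be sketched as simple string operations. This is a minimal illustration, not a model of the underlying biochemistry: the 18-base `gene` is hypothetical and all function names are made up for this example.

```python
def point_substitution(seq, pos, base):
    """Replace the single base at pos."""
    return seq[:pos] + base + seq[pos + 1:]

def insertion(seq, pos, segment):
    """Insert a segment of any length before pos."""
    return seq[:pos] + segment + seq[pos:]

def deletion(seq, start, length):
    """Remove length bases beginning at start."""
    return seq[:start] + seq[start + length:]

def tandem_duplication(seq, start, length):
    """Copy a segment and place the copy directly after the original."""
    segment = seq[start:start + length]
    return seq[:start + length] + segment + seq[start + length:]

def fusion(seq_a, seq_b):
    """Join two coding regions end to end."""
    return seq_a + seq_b

def codons(seq):
    """Split into complete triplets, dropping any trailing partial codon."""
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]

gene = "ATGGCTGAAAAATTGCTG"  # hypothetical 18-base coding sequence

substituted = point_substitution(gene, 4, "A")  # changes one codon only
duplicated = tandem_duplication(gene, 3, 6)     # lengthens by two codons
frameshifted = deletion(gene, 4, 1)             # shifts every later codon

print(codons(gene))
print(codons(substituted))
print(codons(duplicated))
print(codons(frameshifted))
```

A substitution alters one codon, while the single-base deletion changes every codon downstream of it, and the duplication lengthens the sequence in a single step.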

  6. Allan Miller,

    First, one has to explain a process in principle. I have given several in-principle mechanisms by which sequences can lengthen and pieces become translocated between proteins. Further, when we look at actual proteins we have excellent evidence that this has actually happened.

    Yes, you have given an argument in principle. Also some good ideas but IMHO not nearly enough to overcome the sequential space challenge that Kirk is articulating. At this point I think stochastic mechanisms are not likely to be the driver of new proteins. I have other reasons to believe this but will wait until the right time to add to the comments. BTW I think you are playing a remarkable chess game even though you started without a queen:-) If you give me your email in a private message I will send you some PDF papers on alternative splicing.

  7. Rumraket,

    Yes we do, it’s all the other types of mutations known empirically to happen. Gene duplications, exon shuffling, insertions (not restricted to single base-pairs at a time), deletions (also not restricted to single base-pairs at a time), frameshift mutations, transposons, gene fusions and so on.

    Can you identify a protein from a single helix structure? If not how does the process start? IMHO I don’t think any of the above can create a functioning sequence of 100aa or more. They may move functioning sequences around but where did the sequences come from?

  8. Mung: That evolution stops because it’s found an adaptive peak for some protein function. More explicitly, that exploration of sequence space ceases.

    Nobody says this happens. What we’re saying is that it is unlikely for a newly discovered protein with the same function as an already existing protein to fix in the population, because the already existing one would probably be functioning much better because it had already been tuned for selection. In other words, the new one will get constantly outcompeted.

    Mung: The concept of “a completely unrelated novel sequence which functionally replaces the first” is a straw-man. If a gene is duplicated and subsequently diverges from the original it doesn’t have to functionally replace the first. Would you say this is not a continuation of the exploration of sequence space?

    Duplicated genes only diverge when either selection is relaxed, or there is a second function that can be tuned further by separating the selection into two separate genes. This has happened remarkably frequently in evolution from functionally promiscuous ancestor sequences.

    A good example is enzymes that work on a host of related carbohydrates. The rate of substrate conversion cannot be fully optimized in the same gene for all the individual carbohydrates, so multiple duplications have made it possible for the individual proteins to be tuned by selection towards each individual substrate.
    Reconstruction of Ancestral Metabolic Enzymes Reveals Molecular Mechanisms Underlying Evolutionary Innovation through Gene Duplication

    Abstract

    Gene duplications are believed to facilitate evolutionary innovation. However, the mechanisms shaping the fate of duplicated genes remain heavily debated because the molecular processes and evolutionary forces involved are difficult to reconstruct. Here, we study a large family of fungal glucosidase genes that underwent several duplication events. We reconstruct all key ancestral enzymes and show that the very first preduplication enzyme was primarily active on maltose-like substrates, with trace activity for isomaltose-like sugars. Structural analysis and activity measurements on resurrected and present-day enzymes suggest that both activities cannot be fully optimized in a single enzyme. However, gene duplications repeatedly spawned daughter genes in which mutations optimized either isomaltase or maltase activity. Interestingly, similar shifts in enzyme activity were reached multiple times via different evolutionary routes. Together, our results provide a detailed picture of the molecular mechanisms that drove divergence of these duplicated enzymes and show that whereas the classic models of dosage, sub-, and neofunctionalization are helpful to conceptualize the implications of gene duplication, the three mechanisms co-occur and intertwine.

    PS: How the hell do I post images instead of just linking them?

  9. colewd,

    Yes, you have given an argument in principle. Also some good ideas but IMHO not nearly enough to overcome the sequential space challenge that Kirk is articulating.

    Sigh. The whole point of my objection to the ‘sequence space challenge’ is that it ignores everything but point mutation. So naturally, if there are things other than point mutation, they must be incorporated in a response. I don’t see these things really being taken on board here.

    The sequence space challenge (in its basic form; Kirk’s is slightly different) becomes irrelevant if one accepts gross rearrangement (including end joining, insertion and STOP->codon mutation, which all lengthen proteins), and reduced alphabet, which limits the base.

    There is no point obsessing about 20^500 space if a 500-aa protein derives by extension from primordially shorter units and a smaller amino acid alphabet.

    Extension and alphabet expansion are certainly well within the grasp of stochastic processes.
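To put rough numbers on the difference, here is a small sketch comparing space sizes. The 30-residue unit and the 8-letter alphabet are hypothetical values chosen only for illustration. Nobody in the thread proposed them.

```python
import math

def log10_space(alphabet_size, length):
    """log10 of the number of sequences of a given length and alphabet."""
    return length * math.log10(alphabet_size)

full = log10_space(20, 500)   # the 20^500 space quoted in the thread
short = log10_space(20, 30)   # hypothetical 30-residue unit, full alphabet
reduced = log10_space(8, 30)  # hypothetical 30-residue unit, 8-letter alphabet

print(f"20^500 ~ 10^{full:.1f}")
print(f"20^30  ~ 10^{short:.1f}")
print(f"8^30   ~ 10^{reduced:.1f}")
```

The exponent scales linearly with length and with the logarithm of the alphabet size, so both shorter starting units and a smaller alphabet shrink the space enormously.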

  10. Allan Miller: There is no point obsessing about 20^500 space if a 500-aa protein derives by extension from primordially shorter units and a smaller amino acid alphabet.

    Perhaps I’ve been asleep for part of this thread.

    Are we still doing tornado in a junkyard?

    Protein coding strings zapped into existence at their current length?

  11. petrushka: Perhaps I’ve been asleep for part of this thread.
    Are we still doing tornado in a junkyard?
    Protein coding strings zapped into existence at their current length?

    Yep.

  12. I guess my inability to figure out why a grown-up would make such a silly case is why I have found the whole thread to be rather weird.

  13. I don’t know where else to ask, so can anyone here tell me how the hell do I post images instead of just linking them?

  14. Rumraket:
    I don’t know where else to ask, so can anyone here tell me how the hell do I post images instead of just linking them?

    You should have a large box labelled “Leave a Reply”. Within that box is the area where you type your comment. Just above the area where you type your reply and within the larger box is a smaller box labelled Upload file [<- CLICK ON IT]. That file can be an image. Hope that helps.

  15. colewd:
    Allan Miller,

    I guess your explanation was grossly inadequate IMHO. Partial explanations are not real explanations. To explain a process you need all the process steps. I think your idea is clever but only a baby step toward explaining how a 500 aa protein can form from a stochastic process linking one generation to the next. [emphasis added]

    There you have it: The canonical Creationist “not detailed enough!” rationalization for rejecting evolution, stated explicitly and directly. Given that Creationists favor an ‘explanation’—namely, “God did it”—which has essentially no details whatsoever, one may be forgiven for noting that their “not detailed enough!” rationalization is applied in a conveniently selective manner, and that their real reason for rejecting evolution has little (if anything) to do with their ostensibly stated reason for rejecting evolution.

  16. Allan Miller: The whole point of my objection to the ‘sequence space challenge’ is that it ignores everything but point mutation. So naturally, if there are things other than point mutation, they must be incorporated in a response. I don’t see these things really being taken on board here.

    This is the sort of comment by you that I have in mind when I say you contradict yourself. These are other ways of moving about in sequence space and you admit their existence and even argue that we’re not paying enough attention to them!

    Hopefully in the not too distant future I’ll have an OP up on Compositional Evolution.

  17. Does this work?
    Edit: seems it did. Anyway as can be seen here from the figure in the paper I referenced above, an ancestrally promiscuous enzyme capable of metabolizing several different carbohydrates could not improve in catalytic rate towards any substrate without sacrificing rate for another. Duplications opened up the capacity for selection to optimize both enzymes towards both classes of substrates.

    Legend: Figure 2. Duplication events and changes in specificity and activity in evolution of S. cerevisiae MalS enzymes.
    The hydrolytic activity of all seven present-day alleles of Mal and Ima enzymes as well as key ancestral (anc) versions of these enzymes was measured for different α-glucosides. The width of the colored bands corresponds to kcat/Km of the enzyme for a specific substrate. Specific values can be found in Table S2. Note that in the case of present-day Ima5, we were not able to obtain active purified protein. Here, the width of the colored (open) bands represents relative enzyme activity in crude extracts derived from a yeast strain overexpressing IMA5 compared to an ima5 deletion mutant. While these values are a proxy for the relative activity of Ima5 towards each substrate, they can therefore not be directly compared to the other parts of the figure. For ancMalS and ancMal-Ima, activity is shown for the variant with the highest confidence (279G for ancMalS and 279A for ancMal-Ima).

  18. petrushka: I guess my inability to figure out why a grown-up would make such a silly case is why I have found the whole thread to be rather weird.

    I think you find the thread weird because it’s about actual science.

  19. Mung: This is the sort of comment by you that I have in mind when I say you contradict yourself. These are other ways of moving about in sequence space and you admit their existence and even argue that we’re not paying enough attention to them!

    No but Mung what people are objecting to is the claim that evolution has thoroughly explored all of sequence space. It can’t, even with modular protein evolution included. So what we’re essentially arguing is that there’s a middle ground between two extremes, at the one end the view of evolution as proceeding exclusively from a stepwise point-mutation manner, and the other end evolution as being almost an all-powerful sampler of amino acid sequence space that has discovered all possible function. Neither picture is correct.

    So evolution finds new protein domains (and their functions) by modular evolution, but optimizes those domains through selection and mostly point-mutations towards particular tasks. And that duplications allow this process to happen in parallel. Evolution finds a function by modular evolution, this function is then optimized towards some local adaptive peak with selection. Simultaneously, duplications happen, allowing further exploration into other parts of sequence space, further away from the current function. This is still not enough to have discovered all possible function in all of sequence space, but it is enough to discover new functions while retaining and optimizing already existing ones.

    Once you include all the known processes of mutations, drift and selection, evolution actually would work to produce what we see in terms of diversity of proteins in extant life, given the estimated resources of populations of organisms that have existed during life on Earth.

  20. Rumraket: No but Mung what people are objecting to is the claim that evolution has thoroughly explored all of sequence space. It can’t, even with modular protein evolution included

    So let me try to lay out the logical alternatives, as I see them. The reasons that evolution has not explored all of sequence space are:

    1.) Evolution has not had enough time.
    2.) Evolution has not had enough resources.
    3.) Evolution is constrained.

    Regardless of the amount of time available to evolution, all of sequence space cannot possibly be explored.

    Regardless of all the resources available to evolution, all of sequence space cannot possibly be explored.

    Because evolution is constrained to specific outcomes by prior outcomes.

    Personally, I would take that to be a design argument.

  21. Why would the cell need to allow amino acids a method of entry? How did the cell survive prior to coming up with a solution to the problem of how to allow entry of amino acids?

  22. Mung:
    Why would the cell need to allow amino acids a method of entry?

    How did the cell survive prior to coming up with a solution to the problem of how to allow entry of amino acids?

    Your question is formulated in such a way as to assume there was a period in life when cells could not import amino acids.

    There probably was never such a period in the history of life. Extant amino acid permeases are more likely a byproduct of modern phospholipid membranes being pretty much impermeable to amino acids without a permease.

    Speculation of course, but to elaborate, the origin of such permeases probably coincided with a period of transition from something else than phospholipid membranes, probably simpler, less effective lipid or fatty acid membranes of some sort (which, by the way, are permeable to small molecules such as amino acids).
    So if modern membranes evolved from such a state of life, as phospholipid biosynthesis evolved, membranes would gradually become less and less permeable as phospholipids were incorporated into the cell membrane. During this period, peptides synthesized that could function as simple pore-forming channels would be highly favored.

  23. Rumraket: Your question is formulated in such a way as to assume there was a period in life when cells could not import amino acids. There probably was never such a period in the history of life.

    Perhaps cells could do without AA permease. But then what advantage does AA permease provide? Let’s imagine a cell membrane that allows amino acids to freely enter and to freely leave [no AA permease demon].

    Amino acids good, other stuff bad.

  24. Mung: So let me try to lay out the logical alternatives, as I see them.

    Alternatives to what?

    Mung: The reasons that evolution has not explored all of sequence space are:

    1.) Evolution has not had enough time.
    2.) Evolution has not had enough resources.
    3.) Evolution is constrained.

    Regardless of the amount of time available to evolution, all of sequence space cannot possibly be explored.

    No, it is not regardless of, it is because of. The time available to evolution (roughly thought to be about 4 billion years) is not enough to thoroughly explore all of sequence space.

    Obviously as you add more time, a larger space can be explored.

    Mung: Regardless of all the resources available to evolution, all of sequence space cannot possibly be explored.

    Same as above, it is not regardless of, it is because of there being physical limitations that all of sequence space has not been explored.

    Mung: Because evolution is constrained to specific outcomes by prior outcomes.

    No, you can’t go from “it is constrained in the range of probable outcomes” to “it is constrained to specific outcomes”.

    That’s like saying you can throw 10 dice, then randomly mutate the result for 3 of the dice, and then declare that because the outcome was constrained by the result of the first throw, the end-product must be designed to be what it is. That just doesn’t follow; it isn’t even implied.

    Mung: Personally, I would take that to be a design argument.

    But there’s nothing in that which implies design. At all. The only thing it’s implying is that there is some degree of determinism, that the range of possible effects is constrained by prior causes. That would be true whether there is design or not.
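The dice analogy above can be run directly. This is only a sketch of the analogy: the seed and the variable names are arbitrary choices, not part of the argument.

```python
import itertools
import random

rng = random.Random(7)  # fixed seed so the run is repeatable

first_throw = [rng.randint(1, 6) for _ in range(10)]

# Re-roll three randomly chosen dice; the other seven are carried over.
rerolled = rng.sample(range(10), 3)
second = list(first_throw)
for i in rerolled:
    second[i] = rng.randint(1, 6)

changed = sum(a != b for a, b in zip(first_throw, second))

# Every outcome reachable from this first throw by re-rolling those three dice.
reachable = set()
for faces in itertools.product(range(1, 7), repeat=3):
    outcome = list(first_throw)
    for i, face in zip(rerolled, faces):
        outcome[i] = face
    reachable.add(tuple(outcome))

print(changed, len(reachable))
```

The first throw narrows the second result to 6^3 = 216 of the 6^10 possible outcomes, but it does not pick out any one of them: the result is constrained in range, not specified.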

  25. Mung: Perhaps cells could do without AA permease. But then what advantage does AA permease provide? Let’s imagine a cell membrane that allows amino acids to freely enter and to freely leave [no AA permease demon].

    Amino acids good, other stuff bad.

    I thought I already described this. Phospholipid membranes evolved for other reasons, which I assumed you knew (they’re more stable than fatty acid membranes).
    But that came with a price in the sense of the membrane being less permeable (the more % of the membrane lipids are phospholipids, the less permeable it is to small molecules like amino acids, but the more stable it also is).
    That means there were two selective pressures. On the one hand, more stable cells survive more environmental disturbances, on the other hand if they can’t take in important nutrients, they can’t grow and divide and leave descendants. So the second selective pressure would be towards forming channels. This of course implies the membrane didn’t jump from permeable to fully impermeable in a single generation.

    So to answer the question, what advantage did AA permease provide? It allowed amino acids to enter cells that were more stable due to incorporating phospholipids in their membranes, so they could stay functional while becoming more resistant to physical disturbance.

  26. I thought I remembered something about this particular question a few years ago and went and dug up this paper:
    Physical effects underlying the transition from primitive to modern cell membranes

    Itay Budin, and Jack W. Szostak

    “To understand the emergence of Darwinian evolution, it is necessary to identify physical mechanisms that enabled primitive cells to compete with one another. Whereas all modern cell membranes are composed primarily of diacyl or dialkyl glycerol phospholipids, the first cell membranes are thought to have self-assembled from simple, single-chain lipids synthesized in the environment. We asked what selective advantage could have driven the transition from primitive to modern membranes, especially during early stages characterized by low levels of membrane phospholipid. Here we demonstrate that surprisingly low levels of phospholipids can drive protocell membrane growth during competition for single-chain lipids. Growth results from the decreasing fatty acid efflux from membranes with increasing phospholipid content. The ability to synthesize phospholipids from single-chain substrates would have therefore been highly advantageous for early cells competing for a limited supply of lipids. We show that the resulting increase in membrane phospholipid content would have led to a cascade of new selective pressures for the evolution of metabolic and transport machinery to overcome the reduced membrane permeability of diacyl lipid membranes. The evolution of phospholipid membranes could thus have been a deterministic outcome of intrinsic physical processes and a key driving force for early cellular evolution.”

    “Discussion
    The experiments presented here demonstrate that the synthesis of diacyl phospholipids would have been highly beneficial for early protocells featuring membranes composed of fatty acids and their derivatives. The chemical pathway from fatty acids to the simplest phospholipid, phosphatidic acid, occurs via successive acyl- and phosphotransfer reactions. Although the intermediates in this pathway, glycerol monoesters and lysophospholipids, stabilize fatty acid bilayers to divalent cations (15) and varying pH (27), they exchange rapidly between bilayers (28) and thus would not stay localized to a single cell. Thus, there is no selective advantage for a genomically encoded catalyst that would enable internal synthesis of these intermediates, even though an environmental source of such lipids would be beneficial. In contrast, diacyl lipids, such as phospholipids, are firmly anchored to the membrane (t1∕2 of hours to days; refs. 28 and 29) because of their decreased solubility. Therefore, the synthesis of phosphatidic acid by the acylation of a lysophospholipid with an activated fatty acid is the first step in this pathway for which a genomically encoded catalyst would confer a selective advantage. An acyltransferase ribozyme that catalyzes this reaction, analogous to the protein acyltransferases ubiquitous in phospholipid synthesis, would therefore be sufficient to drive protocell growth and could have been selected for during early cellular evolution. Once such acyltransferases had evolved, there would have been a selective advantage for the synthesis of phospholipid precursors, because they would remain associated with their host cell via incorporation into diacyl lipids.

    We have argued that phospholipid-driven competition could have led early cells into an evolutionary arms race leading to steadily increasing diacyl lipid content in their membranes. We have also shown that such a transition in membrane composition would have come at the expense of membrane permeability. Cells adopting increasingly phospholipid membranes would have therefore been effectively sealing themselves off from previously available nutrients in their environment. What selective pressures would such a predicament impose on early, heterotrophic cells? One possibility is that membrane transporters, a hallmark of modern cells, would have emerged as a means for overcoming low membrane permeability. Although protein channels and pumps are complex molecular assemblies, early transporters could have formed from short peptides (30) or nucleic acid assemblies (31, 32), perhaps in complexes with cationic lipids. Additionally, cells could have evolved the ability to synthesize their own building blocks from simpler, more permeable substrates (metabolism) (Fig. 5). Early catalysts, such as the phospho- and acyltransferases proposed here for phospholipid synthesis, could have been adapted for metabolic tasks such as sugar catabolism and peptide synthesis (33), respectively. The emergence of phospholipid membranes would also have allowed early cells to utilize ion gradients (30), which rapidly decay in fatty acid membranes (34), and to explore new environmental niches characterized by lower monoacyl lipid concentrations. Hence, early changes in cell membrane composition and permeability, driven by the simple physical phenomena demonstrated here, could have been an important driver of the evolution of metabolism and membrane transport machinery.”

  27. The simplest of these functions is that of a permeability barrier, which limits free diffusion of solutes between the cytoplasm and the external environment. Although such barriers are essential for cellular life to exist, a mechanism by which selective permeation allows specific solutes to cross the membrane must also exist. In contemporary cells, such processes are carried out by transmembrane proteins, which act as channels and transporters. Examples include the proteins that facilitate the transport of glucose and amino acids into the cell …

    ETA: – Protocells: Bridging Nonliving and Living Matter

  28. Yeah I think to some a working definition of life is cellular life. That without the boundary, there really isn’t a good way to identify and delineate a living entity from its non-living surroundings. It’s a view I agree with, which is why I’m not satisfied with appeals to things like self-replicating RNA to explain the origin of life. When I ask “how did life come to exist?” what I really am asking for is “how did there come to be cells that could grow, divide and evolve?”. I wouldn’t call a self-replicating string of RNA “life”. That’s just me.

  29. Allan Miller,

    Sigh. The whole point of my objection to the ‘sequence space challenge’ is that it ignores everything but point mutation. So naturally, if there are things other than point mutation, they must be incorporated in a response. I don’t see these things really being taken on board here.

    I want to take a break here because I don’t completely agree with this statement but I am very glad we are expanding the discussion beyond point mutations. Have you seen Larry Moran’s last OP?

  30. Allan Miller,

    There is no point obsessing about 20^500 space if a 500-aa protein derives by extension from primordially shorter units and a smaller amino acid alphabet.

    I agree you have an argument here but the 20^500 cannot be dismissed out of hand as an obstacle. 10^20 cannot be dismissed out of hand as an obstacle. Oh Rumraket is he trying to use the tornado in a junk yard argument again? Give me an fn break.

  31. colewd,

    I am very glad we are expanding the discussion beyond point mutations.

    What? I did that on April 12th in this thread, and earlier in others. I keep repeating it ad nauseam.

    (I feel I may have to make this point several times more yet) neighbouring sequences do not merely consist of those one or two point mutations away from each other. Segments of sequence can be moved by copy/cut and paste, and reciprocal recombination, both within and between ORFs. This makes a massive difference to the number of paths available. Proteins do not appear to have been assembled by N random picks from a 20 acid library, nor modified solely by bit-position substitution.

    This is the reason bitwise analysis is deeply misleading. Well, that and the fact that the ‘alphabet’, even now, does not lead to a base-20 version of a binary switch at every site. Well, that and the fact that positions are not independent. Well, that and … !
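
    To make the difference between the two kinds of ‘neighbour’ concrete, here is a minimal sketch. The two 10-residue peptides are made up for illustration, as are the function names; it only shows that a single crossover produces a sequence that no single point mutation can reach.

```python
# Hypothetical toy example: point-mutation neighbours vs one reciprocal crossover.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def point_neighbours(seq: str) -> set[str]:
    """All sequences exactly one substitution away from seq."""
    return {
        seq[:i] + aa + seq[i + 1:]
        for i in range(len(seq))
        for aa in AMINO_ACIDS
        if aa != seq[i]
    }


def reciprocal_crossover(a: str, b: str, cut: int) -> tuple[str, str]:
    """Swap the tails of two equal-length sequences at position cut."""
    return a[:cut] + b[cut:], b[:cut] + a[cut:]


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


parent_a = "MKTLLVAGAV"  # made-up peptide
parent_b = "MSDQERWYHP"  # made-up peptide
child, _ = reciprocal_crossover(parent_a, parent_b, cut=5)

print(len(point_neighbours(parent_a)))      # 19 substitutions x 10 sites
print(child, hamming(parent_a, child))      # one event, several residues changed
print(child in point_neighbours(parent_a))  # not reachable by one point mutation
```

    One recombination event here changes five residues at once, and every one of those five comes from a stretch that already existed in another sequence.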

  32. Mung,

    Me: The whole point of my objection to the ‘sequence space challenge’ is that it ignores everything but point mutation. So naturally, if there are things other than point mutation, they must be incorporated in a response. I don’t see these things really being taken on board here.

    Mung: This is the sort of comment by you that I have in mind when I say you contradict yourself. These are other ways of moving about in sequence space and you admit their existence and even argue that we’re not paying enough attention to them!

    I think your perception of inconsistency in my position is completely off the mark. I perceive for my part that you routinely misunderstand me. The fault may lie in my failure to be clear, but there is no contradiction between that statement and anything else I have written. I have indeed pointed out the importance of crossover and other recombinational methods for GAs.

    Changing a protein involves changing anything from a single bit to deleting and inserting whole chunks. Bitwise analysis implicitly assumes a limited number of mechanisms – either a protein arises by serially extending a protein one random residue at a time by some means until there are 500, or by taking an existing 500 aa protein and subjecting every bit to random change at the same time. Neither of these happens.

    If instead such a protein arises by end-joining two existing 250-acid sequences, it is irrelevant that there are 20^500 members of the final peptide’s space.
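
    A back-of-envelope sketch of that arithmetic, with the caveat that the pool size below is a purely hypothetical number chosen for illustration, not an estimate of anything in biology:

```python
import math

# Raw size of the space of all 500-residue sequences over 20 amino acids,
# expressed as a power of ten.
log10_full_space = 500 * math.log10(20)


def end_join_candidates(pool_size: int) -> int:
    """Ordered pairs (head, tail) drawn from a pool of existing 250-mers.

    Joining a sequence to a copy of itself is counted as a valid pair.
    """
    return pool_size * pool_size


pool_size = 10_000  # hypothetical number of existing 250-residue sequences

print(f"20^500 is about 10^{log10_full_space:.1f}")
print(f"end-joined candidates from the pool: {end_join_candidates(pool_size):,}")
```

    The candidates actually on offer to an end-joining process are set by the size of the existing pool, not by the size of the full 500-residue space.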

  33. It’s always curious when people ask the ‘what use would it be?’ question, then say ‘must be designed’. If it’s no use, why design it?

  34. I find it occasionally useful to reduce things to extremes. If some cataclysm obliterated every species on earth but one, and Kirk came along, oblivious to this, and did his sampling exercise on the one species left, would we argue that this represented a thorough shakedown of protein space?

    This is the problem with this dataset. It is not a sample of the number of ‘solutions’ visited by evolution, it’s a sample of what’s left. It’s only a set with multiple members*** because there have been more taxonomic bifurcations than extinctions in the clade with living survivors.

    ***eta – rather: only capable of being a set with multiple members.

  35. Rumraket,

    When I ask “how did life come to exist?” what I really am asking for is “how did there come to be cells that could grow, divide and evolve?”. I wouldn’t call a self-replicating string of RNA “life”. That’s just me.

    I would! If there were such a thing …

    It could be that vesicles were a necessary physicochemical environment for it to get off the ground. But until you get replication, it’s just foam. The moment you get replication, and its exponent exceeds 1, you have the same basic condition you have now, and all that flows from it. There is not a clear demarcation anywhere else, on subsequent amendment. If we need a dichotomous category, that is.

  36. An important point about multi-residue mutational methods that lead to chimaeras of some kind is that both components have been through selection – they have been ‘tuned’. The space of proteins has been winnowed of sequences that do not fold readily or repeatably. It has been enriched in sequences that (for example) bind to a membrane. So if a peptide would benefit from occupying a membrane location, the means to get it there is already available and does not have to be re-invented every time.

    It is not simply a question of adding random residues, but randomly combining existing stretches.

  37. colewd: I want to take a break here

    What usually happens is that they go away and when they come back it’s like the previous conversations never happened.

    The longer you stay here the more education you will receive and the more you will see what you thought of as unassailable is in fact melting away under the light of knowledge.

    That’s already happening. Every question you have is answered, every objection comes down to insufficient knowledge which is soon clarified. Given that your current position is static and unchanging and bereft of answers, you already know you are on a downhill slope. That is if you truly are a truth seeker rather than a Mung.

  38. Well, I’m back, but dismayed at the volume of contributions to the conversation since I left. I’ve skimmed through quite a few and have chosen some to respond to. I think I will make this my last contribution to this thread since this week is shaping up to be much fuller than I had anticipated. I will look forward, however, to posting an occasional new topic here for some informal peer review, as I appreciate the quality of many of the comments.

    Allan Miller, “If some cataclysm obliterated every species on earth but one, and Kirk came along, oblivious to this, and did his sampling exercise on the one species left, would we argue that this represented a thorough shakedown of protein space?”

    No, although I might be tempted if that taxon survived under a huge range of extreme environments and conditions, and was able to supply a few thousand different sequences. There are some protein families I would love to run through my software, but they simply do not provide a wide enough range of taxa or sequences.

    The ‘adaptive peak’ issue that Allan Miller has been referring to.

    I tend to see functional sequence space with some topography, as probably most of us do, with local maxima and minima. There may be some local maxima that confer a sufficient fitness advantage such that less advantageous sequences are abandoned in favour of those closer to the maximum. However, looking at the sheer thousands of sequences for a protein family, and even the number of different sequences for a given taxon in many cases, has a way of persuading me that a lot of sequences do not offer a high enough selective advantage to cause a significant migration to that local maximum. I speak in generalities, of course, but genetic drift seems to result in a huge variety of sequences, and natural selection just weeds out those that are so bad they can’t measure up to the minimum fitness requirements along the boundaries of functional sequence space. For example, several years ago I was looking at 35 different human P53 sequences, almost all of which were thought to be defective. Yet, their carriers live long enough to reproduce and pass them on.

    Allan Miller, “The whole point of my objection to the ‘sequence space challenge’ is that it ignores everything but point mutation.” There are three things that come to mind here.

    a) I see random insertions of fragments into a gene, and deletions, as a very risky thing. To clarify, it is likely to be more damaging to the gene than a mere point mutation. I recall a paper a while back that explored the potential benefits of insertions vs point mutations, that made this point, but I cannot recall the reference.

    b) Alternative splicing is something entirely different and something I would predict from an ID perspective as a way to compress information in the genome.

    c) Starting off with hypothetical ‘primordial’ short proteins is not going to help. It transforms the problem into fewer ‘sites’ but more options per ‘site’. I also have grave doubts about early proteins having been assembled from a smaller number of amino acids. I went over this with Dryden after reading his paper. My own software can be run for different and fewer options, not just for 20 amino acids, and the results show that it would make things much more difficult to achieve more than simple 3D structures. I did not continue on this line, as I was doing my Ph.D. at the time and my supervisor insisted that I focus on just one narrow area of research.

    A proposed test: There is a paper by Keefe and Szostak that searches sequence space for short proteins that will bind to ATP in vitro. The number they come up with for M(Ex)/N is about 10^-11, which is getting pretty close to what I would expect for short proteins. There are a lot of ATP family and domain options in Pfam, and I am not sure which one to use, since all the Keefe and Szostak protein does in vitro is bind to ATP. If there is one on Pfam whose function and length might be similar, and it had at least a thousand sequences, then I could run it through and see what I got for M(Ex)/N. I would not be surprised if my results are not that far off, although I would be using functional in vivo proteins, whereas the Keefe and Szostak proteins are in vitro, with a lower bar for functionality. Their proteins need about 45 amino acids, but preferably closer to 70, so M(Ex)/N will tend to be higher for shorter sequences.
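
    For anyone wanting to convert between the two ways these figures are quoted, here is a minimal sketch. It assumes Hazen’s definition of functional information as I(Ex) = -log2(M(Ex)/N); the function names are my own and this is only the unit conversion, not the Pfam-based estimation software itself.

```python
import math


def functional_information_bits(functional_fraction: float) -> float:
    """I(Ex) in bits, assuming I(Ex) = -log2(M(Ex)/N)."""
    return -math.log2(functional_fraction)


def log10_fraction_from_bits(bits: float) -> float:
    """log10 of M(Ex)/N implied by a given I(Ex) in bits."""
    return -bits * math.log10(2)


# Keefe and Szostak figure quoted above: M(Ex)/N of about 10^-11.
print(f"{functional_information_bits(1e-11):.1f} bits")  # about 36.5 bits

# The two bit values that appear in the original post.
for bits in (433, 466):
    print(f"{bits} bits -> M(Ex)/N of about 10^{log10_fraction_from_bits(bits):.1f}")
```

    Under that definition, a fraction of 10^-11 corresponds to roughly 36.5 bits, and 466 bits corresponds to a fraction of roughly 10^-140.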

    Well, I think I need to retire from this discussion at present, but look forward to getting involved in other discussions in the future.
