How not to sample protein space

Mung has drawn our attention to a post by Kirk Durston at ENV. This is my initial reaction to his method to establish the likelihood of generating a protein with AA permease (amino acid membrane transport) capability.

Durston: “Hazen’s equation has two unknowns for protein families: I(Ex) and M(Ex). However, I have published a method to solve for a minimum value of I(Ex) using actual data from the Protein Family database (Pfam),

Translation: I have published a method to solve for a minimum value of I(Ex) among proteins that presently exist.

I downloaded 16,267 sequences from Pfam for the AA permease protein family. After stripping out the duplicates, 11,056 unique sequences for AA Permease remained.

Translation: I took some proteins that actually exist. I implicitly assume that they are a representative, unbiased sample of all the AA permeases that could exist.

the results showed that a minimum of 466 [think he means 433 – that’s the number he plugs in later anyway] bits of functional information are required to code for AA permease.

Translation: the results show that the smallest number of bits in this minuscule and biased sample of the entire space is 433.

Using Hazen’s equation to solve for M(Ex), we find that M(Ex)/N is less than 10^-140 where N = 20^433.

Translation: starting from my extremely tiny sample of protein space, multiplying up any distortions (eg those due to common origin or evolution) and ignoring redundancy, modularity, exaptation, site-specific variations in constraint and the possibility of anything more economically specified than an existing protein, the chance of hitting a 433-bit AA permease by a mechanism not actually known in biology is – ta-dah! – 1 in 10^140.
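As a rough check on the arithmetic in the quoted passage, here is a minimal sketch. It assumes Hazen's equation takes the form I(Ex) = -log2(M(Ex)/N), which is how Durston appears to use it; the function names are made up for illustration. On that reading, 466 bits corresponds to a fraction of roughly 10^-140, while 433 bits would give roughly 10^-130. That may mean 433 is the sequence length (it is the exponent in N = 20^433) rather than the bit count, though this is only an inference from the numbers.

```python
import math


def fraction_log10_from_bits(bits: float) -> float:
    """log10 of M(Ex)/N implied by I(Ex) = -log2(M(Ex)/N) (assumed form)."""
    return -bits * math.log10(2)


def sequence_space_log10(length: int, alphabet: int = 20) -> float:
    """log10 of N = alphabet ** length, the size of the full sequence space."""
    return length * math.log10(alphabet)


if __name__ == "__main__":
    for bits in (433, 466):
        exponent = fraction_log10_from_bits(bits)
        print(f"{bits} bits -> M(Ex)/N ~ 10^{exponent:.1f}")
    print(f"N = 20^433 ~ 10^{sequence_space_log10(433):.1f}")
```

None of this touches the sampling objection; it only shows which of the two quoted numbers is consistent with the 10^-140 figure.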

441 thoughts on “How not to sample protein space”

  1. Allan Miller:

    stcordova,

    Sampling function protein space is like sampling the space of functional locks and keys

    No it isn’t.

    A most uncharitable reading of my intended meaning.

    http://bioinformatics.oxfordjournals.org/content/22/16/2012.short

    Motivation: Protein–protein interaction networks are one of the major post-genomic data sources available to molecular biologists. They provide a comprehensive view of the global interaction structure of an organism’s proteome, as well as detailed information on specific interactions. Here we suggest a physical model of protein interactions that can be used to extract additional information at an intermediate level: It enables us to identify proteins which share biological interaction motifs, and also to identify potentially missing or spurious interactions.

    Results: Our new graph model explains observed interactions between proteins by an underlying interaction of complementary binding domains (lock-and-key model).

    See that Allan! Lock-and-Key model!

    That is one facet of protein space I was referring to, but if you want to mince words and render the most uncharitable reading, that’s up to you, but it’s not like what I said about lock and key doesn’t have mention in literature about proteins.

    It’s not just about catalytic activity but binding as well.

    And even beyond that lock-and-key model, the lock and key serve as a metaphor for arbitrary levels of complexity to achieve the same task. We can use very complex locks and keys or simple ones, the problem for evolution is why excessively complex protein cascades evolved to accomplish tasks that are done more simply by simpler organisms.

  2. DNA_Jock: Also, Frankie’s claim was originally

    Frankie: ATP binding via designed AA sequences Not one of those sequences arose via stochastic processes…

    Which betrays quite the failure to understand what Keefe & Szostak did.

    Is “Mutagenic PCR” stochastic, Frankie?

    Which is why I have him on ignore. He can’t get even the most basic facts right.

  3. Maybe Frankie can explain in his own words the relationship of bits and fits. Also is it better to double toast a bagel on ‘2’ or run it through a higher number once?

  4. In essence, it appears to be physically implausible for the large protein structures we see in biology to have been built up from tiny ancestral structures in a way that: 1) employed only simple mutation events, and 2) progressed from one well-formed structure to another. Simply put, the reason for this is that folded protein structures consist of discrete multi-residue units in hierarchical arrangements that cannot be built through continuous accretion. The material on the outer surface of an accretive structure, such as a stalagmite, is converted to interior material as successive layers are added. For structures of that kind the distinction between exterior and interior is one of time-dependent geometry rather than of substance. By contrast, the process by which proteins fold involves a substantive distinction between interior and exterior that is evident in the final folded form. Since an evolutionary progression from tiny protein structures to large globular ones would have to repeatedly convert exterior surface to interior interface, this means that any such progression would have to coordinate the addition of appropriate new residues with the simultaneous conversion of existing ones. Considering that these structural additions and conversions would both involve many residues, it seems inescapable that one or the other of the above two conditions would be violated. Furthermore, on top of these conditions is the primary consideration in this section – that of function. – Doug Axe

    There is another problem, too. Long polypeptides require chaperones or else they do not find their functional shape.

  5. Translation: I have published a method to solve for a minimum value of I(Ex) among proteins that presently exist.

    Science works with what we have and what we know. We don’t know if there were different proteins for the same function in the past or not. So we start with what we do have and go from there.

    Translation: I took some proteins that actually exist. I implicitly assume that they are a representative, unbiased sample of all the AA permeases that could exist.

    Imagined protein sequences- how do we include those? We have to go with what we do have. That is just a limitation of science.

    Translation: the results show that the smallest number of bits in this minuscule and biased sample of the entire space is 433.

    And it is open to revision with further discovery. It is also open to confirmation with further discovery. That’s science for ya.

  6. stcordova: We can use very complex locks and keys or simple ones, the problem for evolution is why excessively complex protein cascades evolved [in order] to accomplish tasks that are done more simply by simpler organisms.

    You grow ever more bizarre in your comments, Salvador. Care to explain why you’re asserting purpose when ostensibly identifying a problem for a theory of evolution without purpose?

  7. Tom English: You grow ever more bizarre in your comments, Salvador. Care to explain why you’re asserting purpose when ostensibly identifying a problem for a theory of evolution without purpose?

    Two different types of purpose, Tom. Evolutionism – your alleged theory of evolution – isn’t about purpose as in the purpose for life is X. But when it comes to functionality, purpose refers to that – the purpose of this enzyme is to do X. That type of purpose evolutionism can deal with.

  8. stcordova,

    A most uncharitable reading of my intended meaning.

    Disagreement and ‘uncharitable reading’ are two different things. I am well aware that the term ‘lock-and-key’ appears in the literature, specifically in relation to certain aspects of match between a protein fold and the thing it wraps around or binds with. But it is not necessary that both fold A and Substrate B match at the start. Therefore I simply disagree that sampling protein space is ‘like sampling the space of functional locks and keys’. I don’t buy your metaphor, sorry if that appears ‘uncharitable’.

  9. Frankie,

    Not necessarily but that isn’t even relevant. For DNA to do that requires an existing suite of proteins

    It remains the fact that DNA is the source of protein sequence. I don’t know why you continue to deny this simple point.

  10. Rumraket,

    Which is why I have him on ignore. He can’t get even the most basic facts right.

    I know, I had him on ignore too. I do know better than to respond to bollocks, but it still has some (limited) entertainment value.

  11. My latest favourite thing is the amphipathic alpha helix. If there is asymmetry of hydrophobic and polar residues you get a hydrophobic moment – an asymmetric affinity for membrane and cytosol, leading to membrane binding or pore formation. Where is the lock and where is the key?
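The hydrophobic moment mentioned in this comment can be put into numbers. The sketch below follows the usual Eisenberg-style definition (sum each residue's hydrophobicity as a vector pointing out from the helix axis, with successive residues rotated by about 100 degrees). The choice of the Kyte-Doolittle scale and the toy Leu/Lys peptide are assumptions made for illustration, not anything taken from the thread.

```python
import math

# Kyte-Doolittle hydropathy values (an assumed choice of scale).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydrophobic_moment(seq: str, delta_deg: float = 100.0) -> float:
    """Mean per-residue hydrophobic moment for an idealised helix.

    delta_deg is the angular step between successive residues
    (about 100 degrees for an alpha helix).
    """
    delta = math.radians(delta_deg)
    sin_sum = sum(KD[aa] * math.sin(i * delta) for i, aa in enumerate(seq))
    cos_sum = sum(KD[aa] * math.cos(i * delta) for i, aa in enumerate(seq))
    return math.hypot(sin_sum, cos_sum) / len(seq)


def toy_amphipathic(n: int, delta_deg: float = 100.0) -> str:
    """Hypothetical helix: Leu on one face, Lys on the opposite face."""
    delta = math.radians(delta_deg)
    return "".join("L" if math.cos(i * delta) > 0 else "K" for i in range(n))


if __name__ == "__main__":
    uniform = "L" * 18             # hydrophobic all the way round
    amphipathic = toy_amphipathic(18)
    print(amphipathic)
    print(f"uniform:     {hydrophobic_moment(uniform):.3f}")
    print(f"amphipathic: {hydrophobic_moment(amphipathic):.3f}")
```

A helix that is hydrophobic all the way round has essentially no moment over five full turns, whereas segregating the two residue types onto opposite faces gives a large one. The moment is a graded quantity, so it can change one substitution at a time.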

  12. Allan Miller:
    Frankie,

    It remains the fact that DNA is the source of protein sequence. I don’t know why you continue to deny this simple point.

    It can’t be for the reasons already given. I don’t know why YOU continue to deny that simple point.

    The whole issue with ID and the OoL is the chicken-egg problem-> DNA can’t do anything without all of the required proteins and RNAs and allegedly only DNA can provide those. The only way around it is ID, ie top-down.

  13. Allan Miller:
    Frankie,

    It remains the fact that DNA is the source of protein sequence. I don’t know why you continue to deny this simple point.

    So you have refuted the protein-first hypothesis? Really? How did those scientists who advocate it react?

  14. Using another analogy that I bet Allan also doesn’t care for, buildings are built from blueprints. Take that you stupid construction workers!

  15. Mung,

    Using another analogy that I bet Allan also doesn’t care for, buildings are built from blueprints. Take that you stupid construction workers!

    Er … are you sure you know what an analogy is?

  16. I recognized, seven or eight years ago, that Joe G’s command of the facts and powers of ratiocination were vastly superior to my own, and resolved never again to embarrass myself by challenging them.

  17. You grow ever more bizarre in your comments, Salvador. Care to explain why you’re asserting purpose when ostensibly identifying a problem for a theory of evolution without purpose?

    I don’t think my statements are at all bizarre. It is evident that multicellular eukaryotic organisms are more complex than they need to be in order to reproduce, compared with single-celled prokaryotic organisms.

    Why do multi-cellular organisms exist when it is empirically evident they are selected against in the wild in the present day and there is no reason to expect they were ever selected for in the past when competing with simpler organisms?

    Care to explain why you’re asserting purpose when ostensibly identifying a problem for a theory of evolution without purpose?

    I did not intend to assert purpose, but in deference to your objection, how about I reword “to” to “that”?

    excessively complex protein cascades evolved that accomplish tasks that are done more simply by simpler organisms.

    There, the problem still stands no matter how it is worded. Why do such improbable constructs exist? Natural selection isn’t the answer because what happens in nature is selection against such constructs, not for them as evidenced by current extinction and extreme selection pressure being put on species going extinct. Does the extreme pressure and lack of habitats and food for birds seem to make them more complex, or does it simply kill species of birds without replacing them?

    You grow ever more bizarre in your comments, Salvador. Care to explain why you’re asserting purpose when ostensibly identifying a problem for a theory of evolution without purpose?

    I think you’re straining at gnats. That’s ok, I rephrased the problem in wording that should be less objectionable. The problem of an exceptional configuration of chemicals vs. expected configurations still stands.

  18. I don’t buy your metaphor, sorry if that appears ‘uncharitable’.

    Searching for the space of functional proteins is analogous to searching the functional space of lock and key systems, the space is infinitely large, but an infinitely large space of workable systems doesn’t make complex systems highly probable.

    There are an infinite number of ways to make lock and key systems just as there are conceivably infinite number of ways to make functional protein cascades or Rube Goldberg machines. The existence of an infinite number of solutions doesn’t imply the probability is high that they will exist.

    If you want to name the problem something else, go ahead, but arguing over what it should be called or whether you buy the metaphor I provided or not doesn’t solve the problem of extravagant un-necessary complexity.

  19. stcordova: Why do multi-cellular organisms exist

    I recommend Nick Lane’s The Vital Question: Why is life the way it is?.

    In very short: The endosymbiosis that resulted in mitochondria and a cell nucleus conferred eukaryotes on the order of 200,000 times more energy to expend on synthesizing protein, leaving stochastic change to sample this vastly increased space of possible complexity.

    So basically the selective pressure you refer to as being in effect on bacteria, simply doesn’t exist for eukaryotes.

  20. stcordova,

    Searching for the space of functional proteins is analogous to searching the functional space of lock and key systems, the space is infinitely large, but an infinitely large space of workable systems doesn’t make complex systems highly probable.

    You are talking about something else. You have escalated the sampling of protein space to become ‘sampling’ of the entire proteome of entire organisms. That’s even less like the kind of sampling being discussed here.

    Nobody (who is not a Creationist) proposes that the entire protein repertoire of an organism arises all at once. I’m not even sure what mechanism you think we think would be involved in such a ‘sampling’.

    Evolution does not need to find optimal binding on a first pass. Proteins are capable of extensive tuning and co-evolution.

    It’s not just that it’s a bad metaphor. It is indicative of an incorrect grasp of the mechanism you critique, regardless of how you prefer to frame it. Evolution is not about randomly sampling the space of ‘good fits’.

  21. stcordova,

    Why do multi-cellular organisms exist when it is empirically evident they are selected against in the wild in the present day and there is no reason to expect they were ever selected for in the past when competing with simpler organisms?

    Yes, they are empirically so selected against that there aren’t any. Oh, hang on …

    There are about 16 different multicellular groups, each of independent origin. So while it does not happen every day, it is probably not the most maladaptive amendment ever to stalk the earth. Multicellularity is simply colonialism – the accretion of multiple cells after mitosis instead of their immediate dispersal. By accreting, organisms gain an increase in size, and the opportunity for differentiation and specialisation of cell types. I find it pretty easy to see how that could be advantageous in a world of small, tasty single cells. YMMV.

  22. Frankie – the point is not whether Kirk has ever had anything peer reviewed, but whether the argument presented at ENV has been. Not that it matters much either way.

  23. One trick missed by the protein-space advocate is the truncation of peptides by STOP. If there were one STOP in the genetic matrix (heh heh – notice how I avoided saying ‘code’ there? Mwahahaha), the mean peptide length in a fully randomised set of bases would be around 63. Very few would get to more than a couple of times that. Our 3 STOPs would give a slice mean of about 20. The more STOPs (which I think was the case in primordial proteins, many more even than our 3) the shorter the mean length becomes.

    The fact that I can point to this issue, and then smile urbanely, should suggest that this too is not the problem it might seem to the bit-wise.
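The figures of about 63 and about 20 follow from treating translation as a run of independent codon draws that ends at the first STOP. Here is a minimal sketch under that assumption (uniformly random bases, 64 equally likely codons, no requirement for a start codon); the function names are invented for illustration.

```python
from fractions import Fraction


def mean_peptide_length(n_stops: int, n_codons: int = 64) -> Fraction:
    """Expected number of sense codons read before the first STOP.

    Assumes each codon is drawn independently and uniformly, so the
    length is geometric with p = n_stops / n_codons and mean (1 - p) / p.
    """
    p = Fraction(n_stops, n_codons)
    return (1 - p) / p


def prob_at_least(length: int, n_stops: int, n_codons: int = 64) -> float:
    """Probability that at least `length` sense codons precede the first STOP."""
    p = n_stops / n_codons
    return (1 - p) ** length


if __name__ == "__main__":
    for stops in (1, 3, 6):
        mean = mean_peptide_length(stops)
        print(f"{stops} STOP codon(s): mean length ~ {float(mean):.1f}")
    print(f"P(length >= 126 | 1 STOP) = {prob_at_least(126, 1):.3f}")
```

With one STOP the mean is exactly 63 residues, and with three it is 61/3, a little over 20. Adding more STOPs shortens it further, as the comment says.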

  24. Allan Miller:
    The fact that I can point to this issue, and then smile urbanely, should suggest that this too is not the problem it might seem to the bit-wise.

    No no, that is just your subconscious hatred and fear of god’s judgement clouding your ability to reason correctly.

  25. Rumraket: The endosymbiosis that resulted in mitochondria and a cell nucleus conferred eukaryotes on the order of 200,000 times more energy to expend on synthesizing protein, leaving stochastic change to sample this vastly increased space of possible complexity.

    What does the size of the space have to do with anything?

  26. You are talking about something else. You have escalated the sampling of protein space to become ‘sampling’ of the entire proteome of entire organisms. That’s even less like the kind of sampling being discussed here.

    Ok, I’ll defer to another discussion then. We’ll fight another day. Thanks for your comments.

  27. Allan Miller,

    Evolution does not need to find optimal binding on a first pass. Proteins are capable of extensive tuning and co-evolution.

    This is certainly possible with enzymes but how about nuclear proteins that need to bind to multiple proteins and are extremely mutation sensitive?

  28. Allan Miller:
    Frankie – the point is not whether Kirk has ever had anything peer reviewed, but whether the argument presented at ENV has been. Not that it matters much either way.

    LoL! His formula and methodology are in that paper.

  29. I read Nick Lane’s book “The Vital Question”- Nothing but untestable speculation based on the assumption.

    Endosymbiosis is nothing more than “well they surely look like they coulda been bacteria to me (referring to mitochondria)”

    There is so much more to euks than mitochondria. But we understand that evos have to say something to try to support their claims

  30. Frankie,

    LoL! His formula and methodology are in that paper.

    ‘That paper’ does not contain the argument presented at ENV – the one I am critiquing. ‘That paper’ is not about the sampling of AA permease protein space.

  31. colewd,

    This is certainly possible with enzymes but how about nuclear proteins that need to bind to multiple proteins and are extremely mutation sensitive?

    I don’t know where you get ‘need’ from. An equivalent argument can be applied to any co-evolved pair of binding sites. Initial binding may be weak, subsequent modifications of either or both partners lead to a closer interaction. This pins both in place, and renders them ‘mutation sensitive’. But present mutation sensitivity can hardly be taken as evidence of lack of plasticity throughout history.

    I refer again to my model domain, the amphipathic alpha helix. Gradual asymmetry by substitution of hydrophilic and hydrophobic sites leads to membrane binding. Membrane siting proves beneficial, giving an advantage over cytosolic location. Further amendments are made in light of the fact that the protein is membrane bound. Eventually, this state becomes non-optional. And we marvel how such a state could arise ‘all at once’.

  32. Mung: What does the size of the space have to do with anything?

    Heh.

    Seriously though I phrased that pretty poorly. I think I should have written:
    “The endosymbiosis that resulted in mitochondria and a cell nucleus conferred eukaryotes on the order of 200,000 times more energy to expend on synthesizing protein, leaving stochastic change to sample a newly opened space of vastly increased complexity.”

    Nick Lane spends most of the book fleshing out why this was so. To expand a bit on what I already wrote, the increased capacity for protein generation would result primarily from the larger membrane surface area dedicated to ATP synthesis that came with the evolution of mitochondria. As you probably know, mitochondria have a very large folded membrane that takes up most of their volume. This gives an astonishing number of ATP-synthases to produce ATP.

    As the capacity for ATP generation grew over time (a host of selective pressures drove this tendency forward, all of them explained in the book), it allowed vastly increased genome sizes because the eukaryotes now had the energy supply to produce many more proteins, both in the number of novel protein coding genes their genome could contain, but also in the amount of intracellular protein they could synthesize, giving rise to such things as both intra and extra-cellular structures that would eventually come to be some of the necessary preconditions for multicellularity.

    Sal is speaking of an evolutionary conundrum that has an actual answer in the literature. He just needs to read it.

  33. Allan Miller:
    Frankie,

    ‘That paper’ does not contain the argument presented at ENV – the one I am critiquing. ‘That paper’ is not about the sampling of AA permease protein space.

    Same formula, same methodology, different proteins. But I understand why you would want to avoid it.

    Can you show that his sample space is wrong? No

    Do you have any evidence that there were other AAp’s in the past? No

    Can you show that stochastic processes can produce any AAp? No

  34. Allan Miller,

    I don’t know where you get ‘need’ from. An equivalent argument can be applied to any co-evolved pair of binding sites. Initial binding may be weak, subsequent modifications of either or both partners lead to a closer interaction.

    An example of need here is the cell cycle. This is a requirement for first life and evidence shows that it requires extreme precision. Experiments show that single mutations can shut down the proper interaction of proteins that control this cycle. I am very skeptical of the simple to complex story although your arguments regarding helixes are interesting. Is life possible without a high level of precision?

  35. colewd,

    An example of need here is the cell cycle. This is a requirement for first life and evidence shows that it requires extreme precision.

    In modern cells it ‘requires’ that which it has in modern cells. This has been tuned for 3.8 billion years. How can you be so certain it was always so? Do you not accept the possibility that things can be tuned?

    Experiments show that single mutations can shut down the proper interaction of proteins that control this cycle.

    1) Citation please.

    2) Single mutations at certain sites are indeed not tolerated in some – indeed many – proteins. This can be as readily due to co-evolution or subsequent evolution of downstream dependence as primitive ‘need’. On what basis do you dismiss these perfectly reasonable logical possibilities for the modern facts?

    I am very skeptical of the simple to complex story although your arguments regarding helixes are interesting. Is life possible without a high level of precision?

    Yeah, why not? At least, there is a perfectly rational possibility that the modern level of precision is the result of 3.8 billion years of evolution, ie serial extinction of the less precise.

  36. Allan Miller,

    In modern cells it ‘requires’ that which it has in modern cells. This has been tuned for 3.8 billion years. How can you be so certain it was always so? Do you not accept the possibility that things can be tuned?

    Too bad you cannot support the claim of a more primitive cell. But hey the lack of evidence has never stopped you

  37. Frankie,

    Can you show that his sample space is wrong? No

    Yes. Did so above. He assumes evolution. If you do so, that has implications that must be addressed. If you assume evolution, you have to assume the whole of it, which includes extinct proteins and common descent. And on that basis, yes his sample space is wrong, because it only includes modern organisms. That is not an unbiased sample of the space – regardless how those modern organisms came to exist, but particularly if you are trying to show limits on evolution, rather than being a robo-denialist like yourself.

    You can’t simply ignore evolution in a method designed to show its limitations. As you try to do here:

    Do you have any evidence that there were other AAp’s in the past? No

    ie, if one cannot find an extinct ancestral sequence, there was no such thing? Which would rather have Durston wasting his time if that was accepted up front; I’m sure he’s grateful for your contribution.

    What is the point of Durston sampling protein space in order to find a limit on evolution if the proteins – every single one of them – are argued, absent satisfactory evidence to the contrary (sequence alignment being insufficient for the likes of you), to have simply popped into existence anyway?

    He is investigating evolution. Unlike you, he does not simply assume the conclusion – that if no-one has any evidence that permeases are genetically related (haha) then they must all have been created anyway. Like, yesterday, for all we know. He is thinking a bit better than that.

    Can you show that stochastic processes can produce any AAp? No

    Your ‘go-to’ argument. But irrelevant. I don’t have to show stochastic processes doing anything. Durston is assuming that living things have sampled protein space, by whatever mechanism. My argument is with his method of sampling, of using modern proteins as a proxy for functional density in the ‘real’ space.
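The sampling objection can be illustrated with a toy simulation. The sketch assumes Durston's measure works roughly as "log2(20) minus the observed per-site entropy, summed over alignment columns", which is a simplified reading and not a reproduction of his code. It also assumes a deliberately extreme model in which every residue is acceptable at every site, so the true functional constraint is zero. All names and parameters are invented for illustration.

```python
import math
import random
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def site_entropy(column) -> float:
    """Shannon entropy (bits) of the residues observed in one column."""
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in Counter(column).values())


def estimated_fits(alignment) -> float:
    """Sum over columns of log2(20) - H(column): the apparent constraint."""
    h_null = math.log2(len(ALPHABET))
    return sum(h_null - site_entropy(col) for col in zip(*alignment))


def independent_sample(n, length, rng):
    """Sequences drawn uniformly from the whole (fully functional) space."""
    return ["".join(rng.choice(ALPHABET) for _ in range(length))
            for _ in range(n)]


def related_sample(n, length, rng, mut_rate=0.1):
    """Sequences descended from one ancestor; each site is redrawn
    uniformly at random with probability mut_rate (a star phylogeny)."""
    ancestor = [rng.choice(ALPHABET) for _ in range(length)]
    return ["".join(rng.choice(ALPHABET) if rng.random() < mut_rate else aa
                    for aa in ancestor)
            for _ in range(n)]


def compare(n=500, length=50, seed=0):
    """Return (fits from independent sample, fits from related sample)."""
    rng = random.Random(seed)
    return (estimated_fits(independent_sample(n, length, rng)),
            estimated_fits(related_sample(n, length, rng)))


if __name__ == "__main__":
    independent, related = compare()
    print(f"independent sample: {independent:.1f} bits")
    print(f"related sample:     {related:.1f} bits")
```

In this toy model the related sample reports far more "functional information" than the independent one, even though no site is constrained at all. The excess comes purely from shared ancestry. Real families sit somewhere between the two extremes, which is the reason the sample cannot simply be taken as unbiased.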

  38. Frankie,

    Too bad you cannot support the claim of a more primitive cell. But hey the lack of evidence has never stopped you

    I see no good reason to dismiss the possibility.

  39. Not to mention the fact that phylogenetic reconstructions strongly indicate simpler cells around the time of LUCA and before. Not simple cells to be sure, we’d probably still recognize them as small bacteria similar to ones we find in hydrothermal vents, but simpler than what we see today 3.5 billion years down the line. Naturally the very fact of a universal common ancestor makes phylogenetics beyond that point extremely difficult and rare.

    So it’s not that there is zero evidence for a simpler stage of cellular life, there is some but not much and at such extreme ages it is watered down by a lot of uncertainty.

    For example we can be pretty confident there was a stage of life before DNA-based genomes, when the primary genetic material was RNA rather than DNA. There are a host of indications that point in that direction. The mechanisms of biosynthesis of the constituents of DNA (the individual nucleotide triphosphates), for example, are all extensions to the pathways for biosynthesis of RNA.

    Of interest (at least to me) is when Ancestral Sequence Reconstruction is used to try to determine how the most ancient versions of key proteins in metabolism and replication looked and worked. Again of note is that such reconstructions imply that more ancient life had fewer but more promiscuous enzymes involved in metabolism. Also they almost universally seem to converge on a hyperthermophilic prokaryote of some sort. It is remarkable that, for example, the Thornton lab has even been able to reconstruct some of the steps in the evolution of ATP synthase (thereby demonstrating not only that ATP synthase is itself an evolved entity, but that a simpler version existed billions of years ago).

  40. Rumraket: I recommend Nick Lane’s The Vital Question: Why is life the way it is?.

    Me too! Optimistic, but a good read.

  41. Frankie: I read Nick Lane’s book “The Vital Question”- Nothing but untestable speculation based on the assumption.

    🙂

  42. Hey, I can never fully express how thankful I am Frankie is an ID proponent rather than an evolutionist. We just don’t seem to have anyone like him on “our side”. No other person has done more to make ID look ridiculous, perhaps with the exception of Robert Beyers (and I sometimes suspect Beyers is a Poe).

  43. Allan Miller,

    Yeah, why not? At least, there is a perfectly rational possibility that the modern level of precision is the result of 3.8 billion years of evolution, i.e. serial extinction of the less precise.

    I need to think about this but currently I am skeptical that we have identified a mechanism that can consistently tune at all. What is the mechanism that you think does this?
