How not to sample protein space

Mung has drawn our attention to a post by Kirk Durston at ENV. This is my initial reaction to his method to establish the likelihood of generating a protein with AA permease (amino acid membrane transport) capability.

Durston: “Hazen’s equation has two unknowns for protein families: I(Ex) and M(Ex). However, I have published a method to solve for a minimum value of I(Ex) using actual data from the Protein Family database (Pfam),

Translation: I have published a method to solve for a minimum value of I(Ex) among proteins that presently exist.

I downloaded 16,267 sequences from Pfam for the AA permease protein family. After stripping out the duplicates, 11,056 unique sequences for AA Permease remained.

Translation: I took some proteins that actually exist. I implicitly assume that they are a representative, unbiased sample of all the AA permeases that could exist.

the results showed that a minimum of 466 [think he means 433 – that’s the number he plugs in later anyway] bits of functional information are required to code for AA permease.

Translation: the results show that the smallest number of bits in this minuscule and biased sample of the entire space is 433.

Using Hazen’s equation to solve for M(Ex), we find that M(Ex)/N is less than 10^-140 where N = 20^433.

Translation: starting from my extremely tiny sample of protein space, multiplying up any distortions (eg those due to common origin or evolution) and ignoring redundancy, modularity, exaptation, site-specific variations in constraint and the possibility of anything more economically specified than an existing protein, the chance of hitting a 433-bit AA permease by a mechanism not actually known in biology is – ta-dah! – 1 in 10^140.
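The figures quoted above can be cross-checked with a few lines of arithmetic. This is a minimal sketch, assuming Hazen's equation has the form I(Ex) = -log2(M(Ex)/N), which the quoted passage does not spell out; the function names are hypothetical.

```python
import math

def log10_fraction_from_bits(bits: float) -> float:
    """log10 of M(Ex)/N, assuming M(Ex)/N = 2**(-I(Ex))."""
    return -bits * math.log10(2)

def bits_from_log10_fraction(log10_fraction: float) -> float:
    """Inverse of the above: I(Ex) in bits for a given log10(M(Ex)/N)."""
    return -log10_fraction / math.log10(2)

if __name__ == "__main__":
    for bits in (433, 466):
        exponent = log10_fraction_from_bits(bits)
        print(f"{bits} bits -> M(Ex)/N ~ 10^{exponent:.1f}")
    # Size of the full space N = 20**433, expressed as a power of ten.
    print(f"N = 20^433 ~ 10^{433 * math.log10(20):.1f}")
```

Under that assumption, 466 bits corresponds to roughly 10^-140 and 433 bits to roughly 10^-130, so the 10^-140 figure matches the 466 value, while 433 matches the exponent in N = 20^433.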

441 thoughts on “How not to sample protein space”

  1. colewd: The sequential space is too large and the density of solutions is too small by all the data I have looked at.

    Prove it. Show your math.

    What is the density of solutions? What data implied this? If the density of solutions is as low as you imply, how come the result of the Hayashi et al. paper I referenced?

  2. Moved a couple of comments to Guano. Address the post, not the poster. Discussion of moderation issues belongs in the Moderation Issues thread.

  3. colewd,

    What mechanism is allowing these punts to work?

    The fact that the proteins are viable members of the space in both (or several) cases, of course.

    What mechanism allows exon shuffling to work but not evolutionary change of the same type and magnitude?

  4. colewd,

    As far as alternative splicing goes we don’t understand where the codes are coming from so it’s too early to make a call, but because alternative splicing errors are so detrimental a trial and error process is highly unlikely to be the cause.

    Wrong, actually. A trial and error process by a ‘selfish intron’ can easily cause subdivision into exons. There is strong selection against inserts that disrupt the gene and do not come out cleanly, because they kill the host and hence themselves.

    Regardless, what about that fraction of the tree of life (the major part) that has no alternative splicing?

  5. The full set of even a 102-acid peptide would take a mole of universes (6 x 10^23) if there were a single example of each, actual size. Shame on the paper for not showing such a space in its entirety.

  6. colewd,

    There is an alternative answer to the designer and my favorite. We have no fn idea….

    Just because you have no fn idea does not mean no-one has.

    better than a bunch of “just so” stories that are misleading to science.

    Things I Learnt At Mama’s Knee pt 10. When all else fails, say ‘just so story’.

  7. Frankie:
    ID does not require a designer to initiate mutations. Genetic algorithms do not require programmers to intervene to make changes and drive them towards the solution.

    So, per ID, mutations might be caused by a designer or might not, no way to tell.

  8. Allan Miller: A trial and error process by a ‘selfish intron’ can easily cause subdivision into exons.

    A trial and error problem solving search process?

  9. I’d love it if Arthur Hunt would come by and comment on this paper that colewd keeps waving about like a pamphlet. I’m pretty sure Hunt isn’t saying ‘evolution isn’t possible’.

    This is his own summary:

    “To summarize, the claims that have been and will be made by ID proponents regarding protein evolution are not supported by Axe’s work. As I show, it is not appropriate to use the numbers Axe obtains to make inferences about the evolution of proteins and enzymes. Thus, this study does not support the conclusion that functional sequences are extremely isolated in sequence space, or that the evolution of new protein function is an impossibility that is beyond the capacity of random mutation and natural selection.”

    So why are we supposed to dissect it?

  10. newton: So, per ID, mutations might be caused by a designer or might not, no way to tell.

    That doesn’t follow as ID does not require any mutations to be caused by a designer. A GA doesn’t require designer intervention.

  11. Frankie: That doesn’t follow as ID does not require any mutations to be caused by a designer.

    Then it’s odd how concerned you are that mutations have not been shown to be random to your satisfaction.

    Frankie: A GA doesn’t require designer intervention.

    Sure they do. They get written, don’t they?

    Frankie: OTOH you think your belligerent agenda is enough to overthrow everything I and others have been saying for decades.

    You might have been saying it but that does not mean it needs to be overthrown. It’s just you saying it, is all. If it needed to be overthrown, well, we’d not be having this conversation now would we?

    Frankie: It is very telling that when all you have to do to silence ID

    But ID is silent! What’s happened lately? What’s going to happen?

    Frankie: find support for the claims of your position

    When asked to support your position, you ask instead for that. You know nobody is fooled right? If you could support your position you would.

  12. Frankie: Not to mention textbooks, which really just give a cursory nod and very general over-view because their focus is on the actual biology.

    Where was the first factual error and in what book?

  13. Rumraket,

    The question remains regarding how large a population is required to reach the fitness of the wild-type phage. The relative fitness of the wild-type phage, or rather the native D2 domain, is almost equivalent to the global peak of the fitness landscape. By extrapolation, we estimated that adaptive walking requires a library size of 10^70 with 35 substitutions to reach comparable fitness. Such a huge search is impractical and implies that evolution of the wild-type phage must have involved not only random substitutions but also other mechanisms, such as homologous recombination.

    This is consistent with the Hunt paper and the points I have been discussing with Allan. He claims here that evolution is beyond RMNS.

  14. Allan Miller,

    The Hunt paper gives a range for the probability of function in protein space. I am using this as a guide to test the evolutionary mechanisms. I used this paper because of the potential bias of the Axe paper. I do not think his data supports his conclusion; in fact, I think it almost falsifies it.

  15. Allan Miller,

    Wrong, actually. A trial and error process by a ‘selfish intron’ can easily cause subdivision into exons. There is strong selection against inserts that disrupt the gene and do not come out cleanly, because they kill the host and hence themselves.

    Can you explain the selfish intron process?

  16. colewd:
    Rumraket,
    This is consistent with the Hunt paper and the points I have been discussing with Allan. He claims here that evolution is beyond RMNS.

    No he doesn’t, he claims it is beyond an accumulation of single substitutions. Which is why he mentions, as an example, homologous recombination. That would still be a type of mutation.
    So would inversions, frameshifts, insertions, deletions, segmental duplications etc. etc.

    A substitution is a mutation where one base or amino acid is replaced with another, a substitute. Hence, substitution.
    An insertion is where one or more nucleotides or amino acids are inserted.
    You can guess what deletions are.
    A frameshift is a deletion or insertion in a coding region that causes the codon reading frame to move.
    A duplication is where something is copied and inserted elsewhere (duplications can also be subject to homologous recombination).

    Random mutation doesn’t mean just “random substitution”. Random mutation means all of the above and even more I don’t remember. What was tested was random substitution and the limits calculated were for random substitutions only, but substitutions are but one type of mutation.
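The mutation types listed above can be sketched as simple string operations on a DNA sequence. This is an illustrative toy, not a model of any real mutational process; the function names and the example sequence are hypothetical.

```python
def codons(seq: str) -> list[str]:
    """Split a sequence into complete codons, dropping any trailing partial codon."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]

def substitution(seq: str, pos: int, base: str) -> str:
    """Replace the base at pos with another base; length is unchanged."""
    return seq[:pos] + base + seq[pos + 1:]

def insertion(seq: str, pos: int, segment: str) -> str:
    """Insert one or more bases before position pos."""
    return seq[:pos] + segment + seq[pos:]

def deletion(seq: str, start: int, length: int) -> str:
    """Remove length bases starting at start."""
    return seq[:start] + seq[start + length:]

def duplication(seq: str, start: int, end: int, target: int) -> str:
    """Copy seq[start:end] and insert the copy at position target."""
    return insertion(seq, target, seq[start:end])

if __name__ == "__main__":
    dna = "ATGGCCAAGTTT"
    print(codons(dna))                          # original reading frame
    print(codons(substitution(dna, 3, "T")))    # one codon changes
    print(codons(deletion(dna, 3, 1)))          # frameshift: every downstream codon changes
    print(codons(duplication(dna, 3, 6, 9)))    # an extra in-frame codon appears
```

A single-base deletion or insertion shifts the reading frame, so it changes every downstream codon at once, which is a very different kind of step through sequence space from a substitution.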

  17. Frankie: At least I made a grand effort to learn about evolutionism by reading Darwin, Dawkins, Mayr, Gould- hundreds of evolutionary biologists.

    Or four using non choo-choo counting.

  18. colewd,

    This is consistent with the Hunt paper and the points I have been discussing with Allan. He claims here that evolution is beyond RMNS.

    Let me reiterate the prophetic point I made 6 days ago now.

    (I feel I may have to make this point several times more yet) neighbouring sequences do not merely consist of those one or two point mutations away from each other. Segments of sequence can be moved by copy/cut and paste, and reciprocal recombination, both within and between ORFs. This makes a massive difference to the number of paths available. Proteins do not appear to have been assembled by N random picks from a 20 acid library, nor modified solely by bit-position substitution.

    “RMNS” is a child’s crayon version of the evolutionary process, beloved of the critic. No, a simple ‘digital’ view where the only change is point mutation and every base is independent and different cannot get you very far. No-one says it does. But sequence swapping and copying (not restricted to exons, but a permanent genetic change) and the significance of gross effect over digital detail makes all the difference in the world.

    I have spent many painstaking posts illustrating this point. My writing is breathtaking in its clarity. No, not really, but I think I have made the point well enough. I’m stumped why it has not made a blind bit of difference, so many days on.

  19. I’m not sure what school grades by effort, or by the number of books owned or read.
    If raw knowledge were the criterion, Agassiz might have been the greatest biologist ever.

  20. Allan Miller: I’m stumped why it has not made a blind bit of difference, so many days on.

    I have no trouble understanding why, but site rules prohibit my discourse on the subject.

  21. Allan:

    I have spent many painstaking posts illustrating this point. My writing is breathtaking in its clarity. No, not really, but I think I have made the point well enough. I’m stumped why it has not made a blind bit of difference, so many days on.

    colewd,

    Are you wearing God goggles, by any chance?

  22. GlenDavidson,

    I had a set of those Darwin goggles but mine got foggy and Allan can’t seem to clear them up for me 🙂 Although he is now recommending a modified version. Will see if they work.

  23. keiths: Are you wearing God goggles, by any chance?

    Please address the post not the poster. Poor Patrick already has his hands full as it is.

  24. Allan Miller,
    Ever since the modern synthesis evolution has been all about differing accumulations of genetic accidents, errors and mistakes. All mutations are random, as in happenstance occurrences. Unfortunately the concept cannot be modelled and is scientifically sterile.

  25. Frankie: Unfortunately the concept cannot be modelled and is scientifically sterile

    Best pack up and go home I suppose!

  26. colewd:
    Rumraket,

    Try this
    https://aghunt.wordpress.com/…/axe-2004-and-the-evolution-of-enzyme.

    Nothing in that essay supports your claim that: “The sequential space is too large and the density of solutions is too small by all the data I have looked at.”

    On the contrary. It argues the diametrically opposite. So either you didn’t actually look at any data, or you looked at something else and not this, or you looked at this but you really were wearing god-goggles. Which one is it?

  27. colewd:
    Allan Miller,
    Does Rumraket agree with this?

    Yeah when creationists use the term random mutation, they usually exclusively imagine substitutions (or in other cases just single-nucleotide polymorphisms) and they have a whole host of misconceptions about what the word “random” means when referred to mutations. And in a sense it also leaves out genetic drift.

  28. Mung, can I ask what the purpose of your posts in this thread is? It honestly reads like you have no idea about any of the subjects but you know whose “side” you’re on.

    I like how most of your posts are just an occasional drop in to throw some lame one-liner about semantics (is it a “code”? Is it a “search”?)

    It’s similarly clear that colewd also doesn’t have much of a clue about the subject, but at least he’s trying to actually engage with it at an intellectual level.

  29. Rumraket: Mung, can I ask what the purpose of your posts in this thread is? It honestly reads like you have no idea about any of the subjects but you know whose “side” you’re on.

    Do you know what it means to address the post not the poster? If you cannot abide by the rules perhaps this is not the site for you.

  30. Rumraket,

    Nothing in that essay supports your claim that: “The sequential space is too large and the density of solutions is too small by all the data I have looked at.”

    On the contrary. It argues the diametrically opposite. So either you didn’t actually look at any data, or you looked at something else and not this, or you looked at this but you really were wearing god-goggles. Which one is it?

    I am looking at the range he quoted of 10^10 to 10^64 for a 100 AA enzyme. These numbers make current evolution highly improbable. He talks about the low end of 10^10 for a bacterium, and given bacterial populations the enzyme could be created, but when you get to multicellular life even the low end here makes transitions highly unlikely. This work is for single surface enzymes. What happens with multi surface interacting nuclear proteins? The same issue comes up in the paper you provided.

  31. I see a lot of comments and some great points to respond to. Again, time prevents me from responding to everything, so I’ll take some of Allan Miller’s points, since he was the original poster, but I may respond tangentially to some of the other points:

    The effect of removing insertions: Contrary to the assertion made by DNA_Jock, if I do not strip out insertions, M(Ex)/N does not go through the roof. It does quite the opposite … it drops closer to zero for a couple reasons. This isn’t just theory; I can actually see the results when I run a protein family through my program, gradually trimming out the insertions.

    What about the density of stable functional folds in overall sequence space? My method gives only an upper limit for M(Ex)/N for one functional protein family. If I wanted to estimate M(Ex)/N for all protein families that have a stable 3D structure, both functional and those that don’t happen to have a biological function, I would take the estimated total number of possible protein folds (I have a paper here that puts it at around 30,000 different folds), assume 3 structural domains per average protein family, to give about (30,000)^3 different protein families, classified by structure. I would then take that number and multiply it by the average M(Ex) per family to arrive at the total number of different sequences in 300 aa sequence space that will produce a stable 3D structure, both functional and non-functional. That number would be roughly (30,000)^3 x 10^65. That is a large number of possible 3D structures, but still occupies a vanishingly small portion of sequence space.
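The arithmetic in this estimate can be reproduced in log10 form. The sketch below simply takes the figures stated in the paragraph (30,000 folds, 3 domains per family, about 10^65 sequences per family, 300 aa) as inputs; it does not assess whether those inputs are right, and the function name is hypothetical.

```python
import math

def log10_folding_fraction(folds: int = 30_000,
                           domains_per_family: int = 3,
                           log10_m_per_family: float = 65.0,
                           length: int = 300) -> float:
    """log10 of (folding sequences / all sequences), using the stated inputs."""
    log10_families = domains_per_family * math.log10(folds)
    log10_folding_sequences = log10_families + log10_m_per_family
    log10_sequence_space = length * math.log10(20)  # 20 amino acids per site
    return log10_folding_sequences - log10_sequence_space

if __name__ == "__main__":
    print(f"fraction ~ 10^{log10_folding_fraction():.1f}")
```

With these inputs the folding sequences come to roughly 10^78 out of roughly 10^390, so the whole conclusion rests on the assumed 10^65 per family.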

    Does my method give a good estimate as to M(Ex)/N for the folding sequence space of a given protein family? This seems to be the central question that Allan Miller is hesitant to accept, so let me make a few more points on this.

    (1) Just to make sure everyone understands this, the number of input sequences in the multiple sequence alignment (MSA) is not the value I use for M(Ex). Instead, M(Ex)/N is solved for, once I have estimated the upper limit for I(Ex) from the data.

    (2) Increasing sample size adds new sequences only, not a mix of old and new.

    (3) To understand why I(Ex) vs number of sequences gives us an accurate idea of when we have sufficiently sampled sequence space, I think it is helpful to point out that this is a universal method that has many applications. To understand how it works, let us take the simplest case where there is only one functional sequence discovered by biological life and it begins to mutate outward in all directions from that point in sequence space, such that the circumference of the discovered sequence space is proportional to 2πr. As the evolutionary search moves outward, discovering new functional sequences, the number of novel functional sequences will be π((r’’)^2 – (r’)^2). Now here is the important point … if evolution has not adequately explored the stable folded sequence space for that protein family yet, then the number of novel functional sequences should be increasing according to N^2 while my sample size increases by only N, resulting in a curve that has a significant negative slope, nowhere near a horizontal asymptote.

    (4) My program also outputs a list of all the different amino acids that appear at each site in the MSA. I noticed early on in my research program that it appears as though evolution has had enough time to try all 20 amino acids at each site, at least for shorter proteins. That does not mean it has had time to try every possible sequence, of course. For example, for a 100 residue protein, it requires only 2000 mutations to try every amino acid at every site, but it would require 20^100 mutations to try every sequence in 100 aa sequence space. There will never be enough time for that. So the only way to estimate an upper limit for I(Ex) is to compute I(Ex) on the basis of the probability of each amino acid at each site, with the assumption that if a given site is critical for the 3D fold, it will skew the aa frequencies at that site. If a different site provides no functional input to the 3D structure, then we should expect to see all 20 aa’s at that site occurring with approximately equal probability, provided our sample size is large enough. If, however, that site is in a higher order relationship with another site, then we should expect that to affect the frequency of occurrence of each aa at that site. So the way to estimate I(Ex) is to compute the Shannon information using the frequency of each amino acid at each site. As sampling of functional sequence space begins to hit against the boundary, the change in aa frequency for each site will begin to stabilize, resulting in little change as new sequences are added, for the reason that the new sequences will tend to conform to the same aa frequencies already observed. At this point, the I(Ex) vs sample size curve will begin to approach a horizontal asymptote, indicating that the aa frequency space for each site in the sequence has been adequately sampled.

    (5) To better visualize how the aa frequency affects I(Ex), please select Figure 3 here. Looking at Figure 3, if you only have one sequence to consider, you will get a flat line right across all sites at 4.3 Functional Bits (Fits). As you add new sequences, one always observes the values decrease at all sites, but some much faster than others. Eventually, say, after one has a thousand or more unique sequences, you will begin to observe the ‘signature’ of that protein family begin to stabilize, with very little change as you add more novel sequences. This is because the novel sequences being added are essentially ‘filling in’ the gaps between earlier discovered sequences and, as such, fall into the same frequency distribution. It also means that evolution is bumping up against the border of functional sequence space, thus failing to provide sequences with new frequency distributions.
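The per-site calculation described in points (4) and (5) can be sketched as follows. This is a minimal illustration, not the program referred to in the comment: it assumes a gap-free alignment of equal-length sequences and a uniform ground state of log2(20) bits per site, and the names `site_fits` and `functional_bits` are hypothetical.

```python
import math
from collections import Counter

GROUND_STATE_BITS = math.log2(20)  # ~4.32 bits: all 20 amino acids equally likely

def site_fits(column: str) -> float:
    """Ground-state entropy minus the observed Shannon entropy at one alignment site."""
    counts = Counter(column)
    total = len(column)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return GROUND_STATE_BITS - entropy

def functional_bits(alignment: list[str]) -> list[float]:
    """Per-site values for a gap-free alignment; their sum is the I(Ex) estimate."""
    return [site_fits("".join(column)) for column in zip(*alignment)]

if __name__ == "__main__":
    print(functional_bits(["ACD"]))                       # one sequence: flat line
    print(functional_bits(["ACD", "ACE", "AGD", "AGE"]))  # variable sites drop
```

A single sequence gives log2(20), about 4.3, at every site, matching the flat line described in point (5). Each value can only fall as new residues are observed at a site, so the total depends directly on which sequences happen to be in the sample.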

    Testing: At the end of the day, we need a method to test whether evolution is ‘bumping up against’ the boundary of folding sequence space for a given protein family. I use a method that works, not just in this application, but in a wide variety of applications. I certainly would not claim, however, that it is the only test. If one wishes to entertain the hypothesis that evolution is not being significantly constrained by the boundaries of folding sequence space for a given protein family, then one must test that hypothesis. I fully understand all the concerns expressed here, but in science we need a test. I have a test and it indicates that evolution has had time to sample the boundary of folding sequence space for a protein family that provides a few thousand unique sequences across a wide taxonomic range. No one else has produced a different test to verify or falsify my claim. So proper scientific procedure is to produce a different test that is as good or better than mine. Without such a test, all the worries expressed by others, as legitimate as they may seem, are unsubstantiated.

  32. A fundamental question in biology is the following: what is the time scale that is needed for evolutionary innovations? There are many results that characterize single steps in terms of the fixation time of new mutants arising in populations of certain size and structure. But here we ask a different question, which is concerned with the much longer time scale of evolutionary trajectories: how long does it take for a population exploring a fitness landscape to find target sequences that encode new biological functions? Our key variable is the length, L, of the genetic sequence that undergoes adaptation.

    here

    Why would L matter? That sounds too much like the size of the sequence space.

    And finding target sequences that encode new biological functions? That sounds too much like evolution is a search.

  33. Mung: here

    I like that paper a lot, it basically reinforces the same point Allan has been making for a while now. That mere substitutions are not enough to make evolution work, but with duplication, recombination, exon shuffling and so on it is.

    Mung: Why would L matter? That sounds too much like the size of the sequence space.

    No, L is the length of some particular sequence.

    “Our key variable is the length, L, of the genetic sequence that undergoes adaptation.”

    Mung: And finding target sequences that encode new biological functions? That sounds too much like evolution is a search.

    You can call it what you want. I’m fine with the search-metaphor.

  34. Kirk,

    Testing: At the end of the day, we need a method to test whether evolution is ‘bumping up against’ the boundary of folding sequence space for a given protein family.

    Maybe, but I’m afraid this isn’t it.

    I use a method that works, not just in this application, but in a wide variety of applications. I certainly would not claim, however, that it is the only test. If one wishes to entertain the hypothesis that evolution is not being significantly constrained by the boundaries of folding sequence space for a given protein family, then one must test that hypothesis.

    Hmmmm. That is a tricky one. The organisms produced by 4 billion years of evolution have explored this portion of space, and it is declared bounded (without, I have to say, strong support). The critic who wishes to demonstrate that it is not bounded clearly cannot use the set of existing organisms to do that – the boundary, if it exists, is invisibly outside of the space occupied by current forms. I guess they must wait another 4 billion years.

    I fully understand all the concerns expressed here, but in science we need a test. I have a test and it indicates that evolution has had time to sample the boundary of folding sequence space for a protein family that provides a few thousand unique sequences across a wide taxonomic range.

    They aren’t that unique – there is clearly common descent. If all you did was discard exact duplicates, you still aren’t left with a particularly good sample of the whole space. By taking modern proteins, your space is essentially sampled by speciation and selection, not a thorough exploration at all.

    Say you have one ancestral sequence, and the lineages descending from it branch once every 10 million years, with no extinctions. All sequences change at least once between bifurcations. After 100 million years you have had 10 bifurcations and hence 1024 unique sequences. Would you seriously contend that those sequences represented a thorough probing of the possible sequence space for this particular version of this particular function? There are only 1024 of them because there have been 10 bifurcations. If there had been none, you’d have 1. So your sample is hugely conditioned by the rate of cladogenesis and extinction.
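The bifurcation example can be put in numbers. This is a rough sketch of the thought experiment only: it assumes perfectly regular branching with no extinction, as stated above, and the 100-residue protein length is an illustrative assumption that does not come from the example.

```python
import math

def surviving_sequences(total_years: float, years_per_bifurcation: float) -> int:
    """Number of lineages after regular bifurcations with no extinction."""
    bifurcations = int(total_years // years_per_bifurcation)
    return 2 ** bifurcations

def log10_fraction_sampled(n_sequences: int, length: int) -> float:
    """log10 of the share of 20**length sequence space covered by n_sequences."""
    return math.log10(n_sequences) - length * math.log10(20)

if __name__ == "__main__":
    n = surviving_sequences(100e6, 10e6)
    print(n)  # set by the branching history alone
    # Assumed 100-residue protein, for scale only.
    print(f"sampled fraction ~ 10^{log10_fraction_sampled(n, 100):.1f}")
```

The count of surviving sequences is fixed by the number of bifurcations, not by how much of the functional space exists to be found.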

    There may o

  35. [Gah, comment got chopped and I can’t edit …

    As I was saying… ]

    There may or may not be a lot of visitation of other sequences, within populations, but their preservation depends on the number of lineages surviving. And the amount of ‘visitation’ is itself dependent on the amount of purifying selection in operation, which is also buried in the family history of the surviving few.

    No one else has produced a different test to verify or falsify my claim. So proper scientific procedure is to produce a different test that is as good or better than mine. Without such a test, all the worries expressed by others, as legitimate as they may seem, are unsubstantiated.

    If your method is flawed, it’s flawed. It is not necessary to produce a ‘better test’ to point out those flaws.

  36. The most direct test currently technologically feasible is direct experimental sampling à la the Szostak lab and Hayashi papers I have linked (and the Hayashi paper further avoids the criticism Kirk leveled at the Szostak paper by conducting experiments with in vivo phage infectivity, rather than an arbitrarily decided in vitro small molecule binding function).
    They still don’t allow us to actually estimate the average density of all possible function in all of amino acid sequence space (because they still only test for one function, in a small sample, in one environment), but they do avoid the flaw Allan points out regarding the bias from common descent.

    And it is interesting to note that once that bias is eliminated, the frequency of function moves up from estimates like 1 in 10^77 (Axe et al 2004) or 1 in 10^140 (Durston 2016), by several tens of orders of magnitude to ranges around 1 in 10^11 to 1 in 10^15.

  37. It’s been a while since I linked this

    Just take a binary patterning algorithm that reliably folds into alpha helices, and a four-residue ‘turn’, and functional proteins are pulled out of that quasi-random space like plums. The total space of such peptides is vast – more, as I have said before, than you could get in a sphere 4 AU across. It would be so big it would ignite nuclear fusion … And the authors took just 1.6 million of them, a tiny sample, at random, less than you could get on a pixel in this full stop.

    This particular algorithm generates 4-helix bundles, but there are all sorts of other possibilities. It should be pretty easy to see how one of these structures could be generated from a much smaller space, and four plugged together.

    You only need a toe-hold in folding space. And it seems not to matter hugely where you start.
