How not to sample protein space

Mung has drawn our attention to a post by Kirk Durston at ENV. This is my initial reaction to his method to establish the likelihood of generating a protein with AA permease (amino acid membrane transport) capability.

Durston: “Hazen’s equation has two unknowns for protein families: I(Ex) and M(Ex). However, I have published a method to solve for a minimum value of I(Ex) using actual data from the Protein Family database (Pfam),

Translation: I have published a method to solve for a minimum value of I(Ex) among proteins that presently exist.

I downloaded 16,267 sequences from Pfam for the AA permease protein family. After stripping out the duplicates, 11,056 unique sequences for AA Permease remained.

Translation: I took some proteins that actually exist. I implicitly assume that they are a representative, unbiased sample of all the AA permeases that could exist.

the results showed that a minimum of 466 [think he means 433 – that’s the number he plugs in later anyway] bits of functional information are required to code for AA permease.

Translation: the results show that the smallest number of bits in this minuscule and biased sample of the entire space is 433.

Using Hazen’s equation to solve for M(Ex), we find that M(Ex)/N is less than 10^-140 where N = 20^433.

Translation: starting from my extremely tiny sample of protein space, multiplying up any distortions (e.g. those due to common origin or evolution) and ignoring redundancy, modularity, exaptation, site-specific variations in constraint and the possibility of anything more economically specified than an existing protein, the chance of hitting a 433-bit AA permease by a mechanism not actually known in biology is – ta-dah! – 1 in 10^140.
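
The arithmetic behind the quoted figures can be checked in a few lines. This is only a sketch: it assumes Hazen's equation has the form I(Ex) = -log2(M(Ex)/N), which the quoted passage does not spell out, and the function names are made up for illustration.

```python
import math

def fraction_exponent(bits: float) -> float:
    """log10 of M(Ex)/N, assuming I(Ex) = -log2(M(Ex)/N)."""
    return -bits * math.log10(2)

def log10_sequence_space(length: int, alphabet: int = 20) -> float:
    """log10 of N = alphabet**length (20 amino acids by default)."""
    return length * math.log10(alphabet)

# Both bit counts mentioned in the post.
for bits in (466, 433):
    print(f"{bits} bits -> M(Ex)/N ~ 10^{fraction_exponent(bits):.1f}")

# Size of the full space N = 20^433, and the implied M(Ex) at 466 bits.
log10_n = log10_sequence_space(433)
print(f"N = 20^433 ~ 10^{log10_n:.1f}")
print(f"M(Ex) at 466 bits ~ 10^{log10_n + fraction_exponent(466):.1f}")
```

On that assumed form, 466 bits corresponds to roughly 10^-140 and 433 bits to roughly 10^-130, so the quoted 10^-140 matches the 466-bit value, and the 433 in N = 20^433 may be a count of sites rather than bits. Either way the calculation inherits every bias of the sample that produced the bit count.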

441 thoughts on “How not to sample protein space”

  1. Adapa: ID for biological life has no methodology, no evidence, but it does have plenty of professional liars and con men.

    Not to mention quite a few amateurs!

  2. Frankie: Yes, actual selection is telic. However natural selection is really just a process of elimination. And as Mayr explained the two are very different

    I thought GA simulated directed evolution, why wouldn’t changing the parameters in nature in the real world be the same, isn’t that what we do in human selection?

  3. Frankie: Yes, actual selection is telic.

    So the designer’s job is to determine which of two mutations is better?

    Presumably we can thank your designer for the rise of drug resistant infections! The Intelligent Designer, selecting so your infections are more likely to kill you!

  4. Frankie: We will get to those questions once ID is accepted and has the full resources at its disposal.

    Once it is accepted by who? Who is the authority figure you look to? Who will decide when ID is accepted?

    What resources would you use? Why can’t you get started now? There is plenty of money available, all you have to do is ask!

  5. Seems like it’s catch 22 for ID then! People won’t be convinced until the research is done that will only be done when people are convinced!

    If only there was some way to break that cycle!

    But no, I guess there is not! If only some brave soul were to perform the required work. Perhaps in their own back yard! Using materials to hand, such as watermelons and ticks! Until such a hero appears we’ll have to simply wait….

  6. newton: I thought GA simulated directed evolution, why wouldn’t changing the parameters in nature in the real world be the same, isn’t that what we do in human selection?

    Humans perform selection according to a perceived need or goal, looking forward into the future. Natural selection is not supposed to be like that.

  7. Allan Miller,

    Not if they are purged (by selection) at a sufficient rate. If (at the extreme) all mutations to a given sequence are lethal, copies of that sequence can still be churned out ad lib, provided that N is not too small with respect to L. All the failures are dead. Not a problem for evolution.

    This is a good counter except it is not what is being observed or described. Since deleterious mutations are so prevalent the process is described as first being a gene duplication then mutation to function. We know that currently modern gene mutation is repaired through the cell cycle.

    I do believe evolution occurs with single mutations that benefit survival. Lenski did show this. This is, however, very different than the process described above. We see new species that look similar with different splicing sequences, gene expressions etc. The sequence dependence of DNA and layered complexity like splicing and gene expression makes the origin of all these species perhaps as difficult to explain as the origin of life.

  8. colewd: layered complexity

    Is that something you’d expect from design or evolution?

    Human designers strive for simplicity and separation of concerns. None of that is seen in biology.

    So while it may well be difficult to explain the origin of such systems, the evidence seems to be against design as their origin.

  9. colewd,

    This is a good counter except it is not what is being observed or described. Since deleterious mutations are so prevalent the process is described as first being a gene duplication then mutation to function.

    There are numerous different possible amendments. Obviously, when a ‘novel function’ first arises it is, in the general view, likely to be a rather coarse approximation of some more optimal version. That better version is attained by mutation of the actual sequence, not duplication and trying again. The space accessible from it has both beneficial and deleterious members. The former tend to outcompete the latter. But the closer to an available optimum the process gets, the more mutations from it are ‘downhill’ – the proportion of beneficial to deleterious changes as a result of selection. By blinding oneself to the implications of an evolutionary scenario, one develops a false picture of the historic proportions of beneficial to deleterious of that sequence.

    We know that currently modern gene mutation is repaired through the cell cycle.

    And yet there is a measurable mutation rate.

    The sequence dependence of DNA and layered complexity like splicing and gene expression makes the origin of all these species perhaps as difficult to explain as the origin of life.

    I think having a view of species as having an ‘origin’ is a common error. I know some bloke wrote a book with that title. But really, there is no day on which a ‘new species’ could be declared, when following a time series of changes, fixations and divergences. There is a day when a member of the last potentially intercompatible pair dies, in a divided sexual species, but you’d never know when it arrived, and it would hardly be an ‘origin’ – the new types are already well established and separate long before that.

  10. OMagain,

    Human designers strive for simplicity and separation of concerns. None of that is seen in biology.

    Interesting point. The layered complexity of this system DNA….Splicing….Gene expression are, among other things, methods to form different species. Alternative splicing makes the change in DNA sequences simpler i.e. you can get a different protein by shifting groups of DNA sequences (exons) so the architecture is more complex but it simplifies the change process. Interestingly enough the architecture of DNA, gene expression and splicing look like the English language. DNA generates words (exons). Alternative splicing generates sentences (genes micro RNA’s etc). Since we have never designed anything as complex as a single celled organism de novo it is hard to judge this but makes for an interesting discussion.

  11. colewd,

    Alternative splicing makes the change in DNA sequences simpler

    Isn’t it interesting that shuffling exons every which way produces functional sequence like that? You’d have thought such gross change was prohibited by the functional sparseness of protein space …

  12. Allan Miller,

    There are numerous different possible amendments. Obviously, when a ‘novel function’ first arises it is, in the general view, likely to be a rather coarse approximation of some more optimal version. That better version is attained by mutation of the actual sequence, not duplication and trying again.

    How does this process find a de novo gene?

  13. Frankie: Those questions come AFTER and are not required to determine design exists and then to study it.

    Once Design in biological life is determined what will study of the design tell us?

  14. colewd,

    How does this process find a de novo gene?

    What’s a de novo gene? How different from an existing sequence/function would it have to be to be de novo?

  15. Allan Miller,

    Isn’t it interesting that shuffling exons every which way produces functional sequence like that? You’d have thought such gross change was prohibited by the functional sparseness of protein space …

    Interesting point. Have you thought about the fact the exons create a secondary structure of sequences that are arranged by alternative splicing. As long as you know the code it is all quite straightforward but if you don’t you are unlikely to have positive change. If the code is known a new species is a chip shot. BTW did an English native win the Masters?

  16. newton: No Frankie, you are asking those questions now. You need to hold your theory to the same standards or every criticism you level is a criticism of ID. If evolution is untestable so is ID if natural causes must be eliminated to assume design. That is your non-testable hypothesis, right?

    Wow- your position claims to have step-by-step processes for producing the diversity of life down to the systems and subsystems. I am merely asking evos to tell us how to test that claim. ID doesn’t make such a claim so ID doesn’t need to respond to it

    ID claims to have a step-by-step methodology for determining intelligent design exists and we have shared it.

    We can eliminate evolutionism for the mere fact that it cannot be tested.

  17. OM:

    Human designers strive for simplicity and separation of concerns

    And yet we produce very complex and intricate designs all of the time

  18. Evolutionism for biological life has no methodology, no evidence, but it does have plenty of professional liars and con men. And plenty of amateurs- just look around

  19. OMagain: Human designers strive for simplicity and separation of concerns. None of that is seen in biology.

    In biology we find cell types and organs, and systems like the immune system. Within individual cells we likewise have separation of concerns.

    So yes, that is seen in biology.

  20. Frankie, once Design in biological life is determined what things will study of the design tell us? What possible new knowledge will we get?

    I’d appreciate an answer.

  21. Hey Adapa, I would appreciate it if you held your breath while you wait for an answer.

    Why does archaeology exist? Why does forensic science exist? You can answer your question by answering those

  22. Frankie:
    Hey Adapa, I would appreciate it if you held your breath while you wait for an answer.

    So there’s nothing more to be learned by studying the design? That’s your official position?

    Why does archaeology exist? Why does forensic science exist? You can answer your question by answering those

    Those sciences study phenomena to determine the answers to questions of when, where, how, by who, by what physical mechanisms did the events occur. But ID does none of that, right?

  23. LoL So when forensic science and archaeology can’t answer those questions does it mean there wasn’t a crime or artifact?

    ID isn’t about the who, how, when and where but post-ID will take those on if someone sees fit.

  24. Frankie:

    ID isn’t about the who, how, when and where but post-ID will take those on if someone sees fit.

    It’s been over 15 years since ID claimed to have detected the intelligent design of the bacterial flagellum.

    What has ID learned by studying the flagellum in those 15 years?

  25. It’s been decades since evos baldly declared all flagella evolved via stochastic processes and yet no one can say anything about the how nor the when.

    Focus on your own position, Adapa.

  26. Frankie:

    Focus on your own position, Adapa.

    I’m asking about your position Frankie. Looks like you are admitting ID is scientifically worthless and hasn’t learned a single thing from studying the “design” in over 15 years.

  27. colewd:
    Allan Miller,

    Interesting point. Have you thought about the fact the exons create a secondary structure of sequences that are arranged by alternative splicing. As long as you know the code it is all quite straightforward but if you don’t you are unlikely to have positive change. If the code is known a new species is a chip shot. BTW did an English native win the Masters?

    I would love to hear how stochastic processes can produce a process like alternative splicing, which obviously takes planning and forethought.

    How about you? Allan, can you help us with that or is that also a sore subject?

  28. Frankie:

    As I keep saying- It is very telling that when all you have to do to silence ID is to step up and find support for the claims of your position

    Science has no need to silence religiously motivated pseudoscience movements like ID which have nothing to offer but lies and empty rhetoric. By your own admission ID is scientifically worthless and can’t answer even the most basic scientific questions of who, how, when and where.

  29. Mung:
    Can Adapa tell us the right way to sample protein space?

    The way evolution does it seems to work just fine. Sample the space immediately around a working protein, keep the neutral or beneficial changes.

    Can Mung tell us how his Magic Designer knew in advance how to POOF the exact working proteins he wanted when designing them?

  30. colewd,

    Have you thought about the fact the exons create a secondary structure of sequences that are arranged by alternative splicing

    I don’t know what you mean by ‘secondary structure’, so probably not.

    But my point is that a bitwise approach would declare a successful reorganisation of any successful string impossible, or at least highly improbable. A 200-bit sequence has a particular probability (according to the bit-obsessed). Any substring reshuffle of that sequence visits a completely different point in that space, whose prior probability is the same. And yet both these punts work. Does that tell us anything about protein space, the appropriateness of just counting bits, and the importance of including rearrangements in our analyses, and not just point mutation?
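
The contrast between a point mutation and a rearrangement can be made concrete with a small sketch. The sequence below is an arbitrary, hypothetical string of amino-acid letters, and `swap_blocks` is only a crude stand-in for exon shuffling; nothing here models function, only how far each kind of change moves through sequence space.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))

def point_mutation(seq: str, pos: int, residue: str) -> str:
    """Replace the residue at `pos` with `residue`."""
    return seq[:pos] + residue + seq[pos + 1:]

def swap_blocks(seq: str, cut: int) -> str:
    """Move the first `cut` residues to the end (toy rearrangement)."""
    return seq[cut:] + seq[:cut]

original = "MKTAYIAKQRQISFVKSHFSRQ"  # hypothetical sequence, illustration only
mutant = point_mutation(original, 3, "W")
shuffled = swap_blocks(original, 11)

print("point mutation distance:", hamming(original, mutant))
print("rearrangement distance: ", hamming(original, shuffled))
```

A single substitution stays one step from the starting point, while the block swap lands many steps away despite reusing exactly the same residues. A per-site bit count treats that distant point as an independent draw from the space.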

  31. Frankie: I would love to hear how stochastic processes can produce a process like alternative splicing, which obviously takes planning and forethought.

    I would love to hear how Intelligent Design processes produced a process like alternative splicing.

  32. Frankie: It’s been decades since evos baldly declared all flagella evolved via stochastic processes and yet no one can say anything about the how nor the when.

    There are many papers published on this. There are zero papers published from an ID perspective on this.

    That you’ve not read them does not mean they don’t exist.

  33. colewd,

    De Novo
    You could look at any gene with a unique sequence that is identified in humans and not in chimps.

    Why are people so obsessed with chimps? Is there something special about them, as a species … ? 😉

    I think you need to provide an example. I’m not sure where you think the bits of a ‘de novo’ gene would come from, if not prior sequence, in an evolutionary scenario. Obviously, the ancestral sequence may no longer be identifiable, due to 2 lineages’ worth of change over 6-7 million years.

    There is a possibility of a random chunk of DNA being transcribed and translated. That’s not due to point mutation though, but rearrangements. There is also an issue of the cutoff parameter you use in your alignment – an apparent ORFan may prove not to be when you twiddle the knobs a bit.

  34. Mung,

    Can Adapa tell us the right way to sample protein space?

    It’s my way. Should I ever actually have a need to sample protein space, you can bet that will be the right way.

  35. Mung: What is a complex DNA sequence? Define your terms. Give examples.

    Oh give us a break. Dawkins claimed biology is the study of complicated things. Go ask him what he meant.

    I know what he meant, he was talking about organisms in general. But colewd was talking about individual DNA sequences. I want to know the difference between a complex and a non-complex DNA sequence and I want examples.

  36. colewd: Rumraket,

    How can I say this and still believe they evolved? What am I missing?

    They could have evolved, but does this raise your skepticism that the mechanism is the blind watchmaker?

    What does this even mean? We are talking about whether proteins can evolve despite the vast majority of mutations in them being deleterious. Evolutionary biologists have known the solution to this “problem” for over 50 years now at least. Darwin knew the solution at the population-level when he pointed out that many more organisms are born than can possibly survive. That’s basically the same problem. Most won’t make it, whether mutations or individual organisms. But there’s a way out of this conundrum: Have lots and lots and lots of offspring (all of whom would carry mutated versions of the protein) and be subject to natural selection.

  37. Ah, I’ve just twigged why I’m not seeing the latest instalments of Guano. I had a noise filter on! 😉

  38. colewd: I disagree here. With a large sequence space mutations will degrade the neighborhood. The ratio of deleterious mutations to helpful mutations causes the sequence to drift toward non function.

    … in the absence of natural selection, yes. But there is natural selection, so there.

    colewd: I have not seen an experiment that can falsify my hypothesis, have you?

    All experiments ever that include selection.

    colewd: Can you show how the Lenski experiment improved an enzyme function through sequence improvement of that enzyme?

    It isn’t known whether part of the fitness improving mutations in Lenski’s LTEE are due to changes in the coding region of enzymes, nor is it required to show that mutations can improve proteins under selection. Here’s an example of such an experiment where this is demonstrated:

    Experimental Rugged Fitness Landscape in Protein Sequence Space
    Yuuki Hayashi, Takuyo Aita, Hitoshi Toyota, Yuzuru Husimi, Itaru Urabe, Tetsuya Yomo
    “Abstract
    The fitness landscape in sequence space determines the process of biomolecular evolution. To plot the fitness landscape of protein function, we carried out in vitro molecular evolution beginning with a defective fd phage carrying a random polypeptide of 139 amino acids in place of the g3p minor coat protein D2 domain, which is essential for phage infection. After 20 cycles of random substitution at sites 12–130 of the initial random polypeptide and selection for infectivity, the selected phage showed a 1.7×10^4-fold increase in infectivity, defined as the number of infected cells per ml of phage suspension. Fitness was defined as the logarithm of infectivity, and we analyzed (1) the dependence of stationary fitness on library size, which increased gradually, and (2) the time course of changes in fitness in transitional phases, based on an original theory regarding the evolutionary dynamics in Kauffman’s n-k fitness landscape model. In the landscape model, single mutations at single sites among n sites affect the contribution of k other sites to fitness. Based on the results of these analyses, k was estimated to be 18–24. According to the estimated parameters, the landscape was plotted as a smooth surface up to a relative fitness of 0.4 of the global peak, whereas the landscape had a highly rugged surface with many local peaks above this relative fitness value. Based on the landscapes of these two different surfaces, it appears possible for adaptive walks with only random substitutions to climb with relative ease up to the middle region of the fitness landscape from any primordial or random sequence, whereas an enormous range of sequence diversity is required to climb further up the rugged surface above the middle region.”
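
For readers unfamiliar with the n-k model mentioned in the abstract, here is a toy sketch of an adaptive walk on such a landscape. It is not the paper's method or data: the values n = 20 and k = 3, the binary alphabet, the random contribution tables and the accept-if-not-worse rule are all assumptions chosen to keep the example small.

```python
import random

def make_nk_fitness(n: int, k: int, seed: int = 0):
    """Build a toy NK fitness function: each site's contribution depends
    on its own state and on the states of k other randomly chosen sites."""
    rng = random.Random(seed)
    neighbors = [[i] + rng.sample([j for j in range(n) if j != i], k)
                 for i in range(n)]
    tables = [{} for _ in range(n)]  # contribution tables, filled lazily

    def fitness(genome):
        total = 0.0
        for i in range(n):
            key = tuple(genome[j] for j in neighbors[i])
            if key not in tables[i]:
                tables[i][key] = rng.random()
            total += tables[i][key]
        return total / n  # mean contribution, between 0 and 1

    return fitness

def adaptive_walk(fitness, n: int, steps: int, alphabet: int = 2, seed: int = 1):
    """Random single-site substitutions; keep neutral or beneficial ones."""
    rng = random.Random(seed)
    genome = [rng.randrange(alphabet) for _ in range(n)]
    history = [fitness(genome)]
    for _ in range(steps):
        site = rng.randrange(n)
        mutant = list(genome)
        mutant[site] = rng.choice([a for a in range(alphabet) if a != genome[site]])
        candidate = fitness(mutant)
        if candidate >= history[-1]:
            genome = mutant
            history.append(candidate)
        else:
            history.append(history[-1])  # rejected: fitness unchanged
    return genome, history

fitness = make_nk_fitness(n=20, k=3)
_, history = adaptive_walk(fitness, n=20, steps=500)
print(f"start fitness {history[0]:.3f}, end fitness {history[-1]:.3f}")
```

Raising k in this sketch makes the landscape more rugged, so walks tend to stall on local peaks sooner, which is the qualitative behaviour the abstract describes above the middle region of the landscape.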

    Creationism is bunk. Mutations being guided by invisible designers simply isn’t needed.

  39. colewd: We know that currently modern gene mutation is repaired through the cell cycle.

    No, we know that mutations happen no matter what. The replication fidelity simply isn’t great enough to totally eliminate mutations.

    colewd: We see new species that look similar with different splicing sequences, gene expressions etc. The sequence dependence of DNA and layered complexity like splicing and gene expression makes the origin of all these species perhaps as difficult to explain as the origin of life.

    Why? Why can’t it happen due to mutations and natural selection?

  40. Mung: Can Adapa tell us the right way to sample protein space?

    Why are you asking for things that have already been answered? There isn’t currently any known way to sample protein space to determine the total density of all functional proteins.

  41. On the ‘if this is the wrong way what’s the right way’ question, Mung’s New Favourite Thing, it obviously depends on the application. If one is attempting to determine limits near the start of an evolutionary process, one would not do so by taking the fraction of the space occupied at the end of it.

  42. Rumraket: Mutations being guided by invisible designers simply isn’t needed.

    Hahaha. And that paper shows what the landscape of all of protein sequence space is like! OR, how not to sample functional protein sequence space fitness landscapes.

  43. Allan Miller,

    Any substring reshuffle of that sequence visits a completely different point in that space, whose prior probability is the same. And yet both these punts work.

    What mechanism is allowing these punts to work?

  44. Rumraket,

    Why? Why can’t it happen due to mutations and natural selection?

    The sequential space is too large and the density of solutions is too small by all the data I have looked at. Especially as the specificity of protein sequences goes up. As far as alternative splicing goes we don’t understand where the codes are coming from so it’s too early to make a call but because alternative splicing errors are so detrimental a trial and error process is highly unlikely to be the cause. There is an alternative answer to the designer and my favorite. We have no fn idea….better than a bunch of “just so” stories that are misleading to science.

  45. ID does not require a designer to initiate mutations. Genetic algorithms do not require programmers to intervene to make changes and drive them towards the solution.

  46. Mung: Hahaha. And that paper shows what the landscape of all of protein sequence space is like!

    No, it doesn’t. And it does not purport to. And that was not implied by anyone. You have to read the actual paper, not only the abstract.
