How not to sample protein space

Posted on April 6, 2016 by Allan Miller

Mung has drawn our attention to a post by Kirk Durston at ENV. This is my initial reaction to his method to establish the likelihood of generating a protein with AA permease (amino acid membrane transport) capability.

Durston: “Hazen’s equation has two unknowns for protein families: I(Ex) and M(Ex). However, I have published a method to solve for a minimum value of I(Ex) using actual data from the Protein Family database (Pfam),

Translation: I have published a method to solve for a minimum value of I(Ex) among proteins that presently exist.

I downloaded 16,267 sequences from Pfam for the AA permease protein family. After stripping out the duplicates, 11,056 unique sequences for AA Permease remained.

Translation: I took some proteins that actually exist. I implicitly assume that they are a representative, unbiased sample of all the AA permeases that could exist.

the results showed that a minimum of 466 [think he means 433 – that’s the number he plugs in later anyway] bits of functional information are required to code for AA permease.

Translation: the results show that the smallest number of bits in this minuscule and biased sample of the entire space is 433.

Using Hazen’s equation to solve for M(Ex), we find that M(Ex)/N is less than 10^-140 where N = 20^433.

Translation: starting from my extremely tiny sample of protein space, multiplying up any distortions (eg those due to common origin or evolution) and ignoring redundancy, modularity, exaptation, site-specific variations in constraint and the possibility of anything more economically specified than an existing protein, the chance of hitting a 433-bit AA permease by a mechanism not actually known in biology is – ta-dah! – 1 in 10^140.

441 thoughts on “How not to sample protein space”

Rumraket on April 20, 2016 at 2:55 pm said:

colewd: I am looking at the range he quoted to 10^10 to 10^64 for a 100 AA enzyme. These numbers make current evolution highly improbable

Prove it. There is nothing in his essay about this. You keep making this assertion but provide nothing that backs it up.
Kirk on April 20, 2016 at 11:11 pm said:

Allan Miller,

I can appreciate all the reasons you have given as to why we should be surprised that evolution has had sufficient time to sample folding sequence space for a given protein family. I had the same misgivings.

I think we can all agree that the larger the sample size, the closer we will get to the true value which, in this case, is X = M(Ex)/N. In reality, we do not know what X is; all we have is X’ for different sample sizes. In that case, we can measure the difference between different sample values (e.g., X’ and X”) as we increase sample size. We already know that as our sample size increases (both X’ and X”), X’-X” will decrease. As X’ approaches X, (X – X’) and (X’-X”) will both approach zero. At that point, there is no significant difference between X’ and X, and any further increase in sample size is not going to provide any significant additional information. There is an important condition, however, the individual samples must all be unique to avoid running the risk of repeatedly sampling the same subset. I don’t think there is anything controversial here. This is simply the reality that increased sample sizes results in a decreasing difference between the true value and X’, as well as X’and X”.

What is surprising is that evolution has had time to adequately sample the full extent of stable, folding functional sequence space for some protein families. It strongly suggests that such space is highly constrained. I must confess that the first time I plotted I(Ex) vs sample size, I was surprised that X’-X” began to converge to zero after only a few thousand unique sequences. But the mere fact that we all share the same incredulity due to our a priori assumptions does not entail that there is a flaw with something so basic as X’-X” approaching zero. Objective testing of results from real data suggests that if there is a flaw, it is in our assumptions. Our assumptions should be testable and when they are tested, at least one of our assumptions seems to be badly flawed.
Mung on April 20, 2016 at 11:37 pm said:

Allan Miller: It’s been a while since I linked this

What is it a link to?
cubist on April 21, 2016 at 7:08 am said:

Kirk:
Allan Miller,

I can appreciate all the reasons you have given as to why we should be surprised that evolution has had sufficient time to sample folding sequence space for a given protein family. I had the same misgivings.

I think we can all agree that the larger the sample size, the closer we will get to the true value which, in this case, is X = M(Ex)/N. In reality, we do not know what X is; all we have is X’ for different sample sizes. In that case, we can measure the difference between different sample values (e.g., X’ and X”) as we increase sample size. We already know that as our sample size increases (both X’ and X”), X’-X” will decrease. As X’ approaches X, (X – X’) and (X’-X”) will both approach zero. At that point…

At the point where X’ EQUALS X, yes…

…there is no significant difference between X’ and X, and any further increase in sample size is not going to provide any significant additional information.

Sounds legit. Now all you have to do is explain how you know how close your value-of-X’ actually is to the unknown value-of-X. Because you wouldn’t want to just make the bald, unsupported assumption that your sample size was sufficient for the value of (X – your X’) to be essentially equal to zero, right? Because, you know, if you did make the bald, unsupported assumption that your sample size was sufficient for the value of (X – your X’) to be essentially equal to zero, that would be, like, assuming your conclusion.

I don’t think there is anything controversial here. This is simply the reality that increased sample sizes results in a decreasing difference between the true value and X’, as well as X’and X”.

Very true. Now, what sample size do you need to have before the value of (X – your X’) is essentially equal to zero?
Allan Miller on April 21, 2016 at 8:56 am said:

Mung,

What is it a link to?

Sorry, this

Something was screwing with my edits yesterday.
Allan Miller on April 21, 2016 at 9:21 am said:

Kirk,

I can appreciate all the reasons you have given as to why we should be surprised that evolution has had sufficient time to sample folding sequence space for a given protein family. I had the same misgivings.

No, the problem is not time, it’s bias. Give evolution an eternity from now and it won’t find any of the ‘permease localities’ elsewhere in space, unless it can get there from where it is. Because it already has a working one. What reason would it have to go elsewhere? It is wandering round the space accessible from the one or few ur-permeases, themselves likely cobbled from parts of previously existing folded proteins.

If we always restrict ourselves to extant life during this extended process, we’d just have an approximate steady state of 10-20 thousand extant sequences. Sequences visited on the way would be dropped (to be consistent with your method).

I think we can all agree that the larger the sample size, the closer we will get to the true value

Only for unbiased sampling. This sample – unique sequences carried by extant life – cannot realistically be anywhere near representative. It is constrained by common descent, filtered by selection, and ‘sample’ size is dependent purely on the extent of fixation, cladogenesis and extinction. There are several sources of systematic bias.

If we had every such sequence that ever existed, we’d have a better sample – but even that would not represent the population of possible sequences adequately to give an estimate of the prior probability available at the start.

What we have here is a biased sample of a biased sample. The data we want is outside of where life went, if we want to know how likely it was a priori that life would find that function somewhere. You can’t find that probability by looking round the region it hit, but outside it.
david on April 21, 2016 at 9:41 am said:

Kirk,

The question of whether life has completely sampled sequence space is distinct from the question of whether your set of sequences has completely sampled the diversity in extant life. The fact that I(Ex) bottoms out as the number of sequences you use from the Pfam database increases only demonstrates the latter. Consider the example you gave earlier, where a protein family began with a single sequence and has been expanding outward in sequence space. Let’s assume expansion hasn’t reached its functional limits. We can sequence genes much faster than evolution can make new variants. As you sequence more and more members of your protein family, you will reach a point where you’ve sampled all significant point mutations in extant organisms. More sequences won’t change the frequency of each aa at each site, and I(Ex) won’t change. Now wait a million years, or whatever the relevant time scale is for drift of this protein sequence: evolutionary expansion will have added many new variants, so if you add more sequences, I(Ex) starts dropping again. Things looked static, but only from a human time scale.

You could, instead, look at how rates of divergence differ with sequence distance. If at some critical distance, protein sequences don’t diverge any further, then evolution really has reached the limits of functional sequence space. Turns out Povolotskaya & Kondrashov did exactly this (Nature 465, 922-926 (2010)), looking at how recent (on evolutionary timescales) substitution mutations tended to move a sequence toward or away from its more distant relatives, as a function of that distance. They found that for proteins dating back to LUCA, new mutations still tend to increase their sequence distance to even their most distant relatives. Thus most protein folds are still expanding in sequence space, even after 3 billion years of evolution.
Mung on April 21, 2016 at 2:46 pm said:

Allan Miller: Something was screwing with my edits yesterday.

Is petrushka the one who keeps claiming it’s not possible to design a protein?
Mung on April 21, 2016 at 2:49 pm said:

Allan Miller: What reason would it have to go elsewhere?

Evolution isn’t supposed to be based on reasoning. There’s nothing out there thinking, “I don’t need to go there, already been there.”
Allan Miller on April 21, 2016 at 2:52 pm said:

Mung,

Evolution isn’t supposed to be based on reasoning. There’s nothing out there thinking, “I don’t need to go there, already been there.”

‘Reason’, in that sentence, was merely a way of expressing ’cause’. As, I suspect, you know full well.

“What would cause it to go elsewhere?”. OK now?
Mung on April 21, 2016 at 2:53 pm said:

Allan Miller: The data we want is outside of where life went, if we want to know how likely it was a priori that life would find that function somewhere. You can’t find that probability by looking round the region it hit, but outside it.

Did it hit that region by magic, or did it search the surrounding area?

And if it did search the surroundings before hitting on this one, isn’t that evidence that nothing was found? To paraphrase you, if it found something that worked, why would it need to look elsewhere.
Mung on April 21, 2016 at 3:00 pm said:

Allan Miller: “What would cause it to go elsewhere?”. OK now?

But it’s still the same questionable premise Allan. That once evolution finds something it stops looking for anything else.

But to answer your question, the same cause that led it to where it is currently. That cause doesn’t just cease to operate.
Allan Miller on April 21, 2016 at 3:03 pm said:

Mung,

Did it hit that region by magic, or did it search the surrounding area?

Those aren’t the only options. Random (or pseudo-random) peptide sequences either have function or they don’t. They hit on it ‘by chance’, in the first instance.

And if it did search the surroundings before hitting on this one, isn’t that evidence that nothing was found?

No. That’s a strange way of looking at it – if it runs with the first one found, that’s all there was!

To paraphrase you, if it found something that worked, why would it need to look elsewhere.

Well … yeah. Imagine a ‘true search’ in your terms: a constantly shifting random sequence that has no function, but is casting around for some. It eventually hits upon the portion of space marked ‘permease’. Does that mean there was nowhere else in the entire space also marked ‘permease’? Where it hit was the only place it could go? I’d say not. But still, the permease it hit was unlikely to be the best available in that area. Enter selective tuning.
Allan Miller on April 21, 2016 at 3:08 pm said:

Mung,

But it’s still the same questionable premise Allan. That once evolution finds something it stops looking for anything else.

Evolution isn’t looking for anything. But once a viable permease has been located, there is no significant advantage accruing to a sequence that offers another. When you have none, it’s a different matter.

But to answer your question, the same cause that led it to where it is currently. That cause doesn’t just cease to operate.

It becomes superseded by selection. Once you have selective advantage by performing a function, there is no market for another change that performs the same function about as well or worse. Once you have a tuned permease, crap ones no longer attract a premium.
Mung on April 21, 2016 at 3:27 pm said:

Allan Miller: Once you have a tuned permease, crap ones no longer attract a premium.

That’s not very Darwinian of you. And as you well know, given your own example, it doesn’t need to attract a premium. Evolution is downright godlike in it’s infinite ability to find even the slightest advantageous improvement.

By the way, how is it that permease must have been at all useful when it was first discovered? Why could it not have been downright harmful?
Rumraket on April 21, 2016 at 4:11 pm said:

Mung: Did it hit that region by magic, or did it search the surrounding area?

Evolution wasn’t looking for the probability of finding a functional protein in amino acid sequence space. You keep confusing the purpose of our discussion with what evolution did. Evolution wasn’t trying to estimate the density of function in protein sequence space, evolution would have no need for such a thing, since it is an entirely esoteric exercise. Evolution isn’t writing papers, or assays, or trying to participate in discussion on an internet forum.

You are so profoundly confused.
Mung on April 21, 2016 at 4:21 pm said:

Rumraket: You are so profoundly confused.

Is that better than, worse than, or the same as, being ignorant?
Rumraket on April 21, 2016 at 4:22 pm said:

Mung: And if it did search the surroundings before hitting on this one, isn’t that evidence that nothing was found?

That no other amino acid permeases were found in the sequence space immediately surrounding the space occupied known amino acid permeases? Yes, it is. Does that lend enough weight to believe the claim that there ARE no other amino acid permeases in all of sequence space? No.

Mung: To paraphrase you, if it found something that worked, why would it need to look elsewhere.

Right! It found functional amino acid permeases, all subsequent life inherited them and they slightly drifted in sequence ever since. Does that mean mutation and selection didn’t still happen? That evolution was done “exploring” amino acid sequence space? Of course not!

You can ask, but why didn’t it find even more amino acid permeases? Hard to say, perhaps it even did, but the previous one had been already tuned by selection to a relatively high level of performance and subsequently discovered ones were just outcompeted by an already existing one. Pure speculation of course, but we can neither accept it nor rule it out. We just don’t know.
Rumraket on April 21, 2016 at 4:33 pm said:

Mung: Is that better than, worse than, or the same as, being ignorant?

As long as neither are deliberate then it’s equal to.
colewd on April 21, 2016 at 5:44 pm said:

Allan Miller,

Well … yeah. Imagine a ‘true search’ in your terms: a constantly shifting random sequence that has no function, but is casting around for some. It eventually hits upon the portion of space marked ‘permease’. Does that mean there was nowhere else in the entire space also marked ‘permease’? Where it hit was the only place it could go? I’d say not. But still, the permease it hit was unlikely to be the best available in that area. Enter selective tuning.

So in the case of a duplicated gene, After it searches the space and stumbles on function, how are the transcriptional proteins assembled to get this new gene produced at the right time? How are the splicing codes formed to remove the introns? Do we need a new histone produced for this gene to wrap around during cell division? How about a chaperone that helps it fold. Will any old chaperone do or does its partner need to evolve at the same time? If this is a nuclear protein, does it have the proper code so it cannot be blocked by the nuclear pore complex. So that code needs to evolve first?
OMagain on April 21, 2016 at 6:16 pm said:

colewd: ?

If you are so sure that unguided evolution as-is cannot possibly work then logically you already know all of the answers to these questions.

I’d suggest you do some actual research of your own towards answering these questions – why should people spoonfeed you? And let’s face it, whatever answers you are given simply produce more objections of a similar nature. Every answer produces two new gaps for you.
Rumraket on April 21, 2016 at 6:48 pm said:

colewd:
So in the case of a duplicated gene, After it searches the space and stumbles on function, how are the transcriptional proteins assembled to get this new gene produced at the right time?

Duplications are usually placed under transcriptional control of other already existing genes. Or alternatively they are sometime even duplicated complete with an accompanying transcription factor.

colewd: How are the splicing codes formed to remove the introns?

If it’s a duplicated gene, it would just contain already existing splicing regions.The spliceosome would be able to edit these out already.

colewd: Do we need a new histone produced for this gene to wrap around during cell division? How about a chaperone that helps it fold. Will any old chaperone do or does its partner need to evolve at the same time? If this is a nuclear protein, does it have the proper code so it cannot be blocked by the nuclear pore complex.So that code needs to evolve first?

No, it’s a duplicate, remember? It’s identical to an already existing gene.

And all of this is only relevant to eukaryotes anyway.
Mung on April 21, 2016 at 7:02 pm said:

OMagain: Every answer produces two new gaps for you.

So you just stop looking for answers?
Kirk on April 21, 2016 at 7:59 pm said:

I must say that I am enjoying this discussion. It is better than thinking by oneself alone in an office with no one to question anything.

Known sample bias: The genetic code biases the search and is what I use for the ground state in my analysis of protein families. So right off the bat, the genetic code strongly influences where a starting sequence will land in sequence space. Then there are a variety of random events that can influence genetic drift. Certain mutations are more probable than others, which also skews the evolutionary search. Eventually, however, I suspect that the biggest bias of the entire search, which tends to eventually wash out the other biases, is function. The protein must be able to satisfy a certain threshold of function demanded by the cell.

Detecting sample bias: This is where X’-X’’ comes in but it is also important to consider what constitutes X such that if there is a bias, X’-X’’ will be significant for two large samples that each contain unique sequences. For example, X’ and X’’ could represent I(Ex) for the same protein family but for two different phyla.

Defining ‘functional’: When I think of ‘function’, I think of a stable, repeatable 3D structure, first and foremost (for globular proteins, setting aside intrinsically unstructured proteins). I tend to think of physics as providing the rules within which life must operate. When I think of a protein family, I think, first and foremost, of similar 3D structure. Each family may provide a set of biological functions within the cell, and for any particular function, there may be more than one 3D structure (i.e., protein family) that could satisfy that functional requirement.

How vast is functional folding sequence space?: We all agree that sequence space is ‘vast’, but reading the various comments, it seems that some here assume that functional folding sequence space is also vast, at least too vast to get a representative sampling. If we see sequence space as a two dimensional ocean extending light years beyond the horizon, then it makes sense to talk about sequences that are ‘far’ away in functional sequence space. But early in my research, I realized that thinking in terms of a 2D space badly misrepresents sequence space.

21-dimensional sequence space: It seems that a 21 dimensional cylindrical sequence space is a better way to map sequence space, with the sequence sites plotted along the axis of the cylinder and, at each site, 20 dimensions radiating from the axis. (a limitation of visualizing 21 dimensions in 3D cylindrical space). It only takes 20 mutations to completely explore sequence space for a given site. A stable 3D structure would be represented by multiple threads or ribbons (if multiple amino acids are permitted at a site) twisting and braiding along the axis, converging into fine threads at highly conserved sites (or ‘nodes’) that are critical for achieving the overall structure, and diverging elsewhere, all reflecting the pairwise and higher order relationships within the 3D structure for a given family. If we could then delete the 21-D reference frame, and ‘pick up’ the resulting intricate, fine twisting, braided, threaded curves, and hold it up before our eyes, we would have a more accurate idea of what functional folding sequence space ‘looks’ like for a given protein family.

Rapid exploration of 21-D functional sequence space: Seen this way, randomly searching for one of those thread shapes, that is a member of a protein family, may take eons but once found, it would not take long at all to search out the limits of functional folding sequence space for that protein family. No area of 21-D functional sequence space is ‘far’ away. The ‘nodes’ (highly conserved sites) would very rapidly be revealed, then sub-molecular dependencies would follow next because a 3D structure is not one, large homogeneous relationship … I have shown here that the 3D structure of a protein family domain can be broken down into smaller units. These smaller units can be much more easily discovered in an evolutionary search, massively simplifying the search. All this arises out of two ideas: a) sequence space is 21 dimensional and, b) a given 3D structural domain is composed of smaller sub-molecular components that can be easily and rapidly mapped out by an evolutionary search within the confines of a 21 dimensional sequence space.

Conclusion: This may explain why it appears (at least so I argue on the basis of my own analysis) that evolution has had sufficient trials to give us a pretty good idea of the boundaries of 21 dimensional sequence space when we look at the frequency distribution of amino acids along the axis of the sequence space for that protein.
Rumraket on April 21, 2016 at 9:33 pm said:

I very much appreciate that you’re taking the time to discuss these matters Kirk. Even if we disagree I respect that you’re not above stepping in the mud, so to speak 🙂
colewd on April 21, 2016 at 11:44 pm said:

Kirk,

Rapid exploration of 21-D functional sequence space: Seen this way, randomly searching for one of those thread shapes, that is a member of a protein family, may take eons but once found, it would not take long at all to search out the limits of functional folding sequence space for that protein family. No area of 21-D functional sequence space is ‘far’ away. The ‘nodes’ (highly conserved sites) would very rapidly be revealed, then sub-molecular dependencies would follow next because a 3D structure is not one, large homogeneous relationship … I have shown here that the 3D structure of a protein family domain can be broken down into smaller units. These smaller units can be much more easily discovered in an evolutionary search, massively simplifying the search. All this arises out of two ideas: a) sequence space is 21 dimensional and, b) a given 3D structural domain is composed of smaller sub-molecular components that can be easily and rapidly mapped out by an evolutionary search within the confines of a 21 dimensional sequence space.

I am having trouble visualizing this. Do you have any pictures to share? Thanks for the very interesting post:-)
Taylor Kessinger on April 22, 2016 at 4:47 am said:

Kirk, good to see you here. I see that Allan and others in this thread have raised some of the same objections I’ve leveled in other fora where we’ve met. Namely, the pfam sequences you use are not random samples of functional sequence space. They are correlated due to common descent. The fact that the “information” values you obtain from this (biased) sample converge does not mean it is not biased. There are huge swaths of protein sequence space that are totally unexplored.

In response, you argue that protein sequence space is essentially not all that “vast”. Rather, upon finding a functional protein family, evolution will rapidly explore the nearby sequence space, meaning that it will representatively sample function sequences:

Rapid exploration of 21-D functional sequence space: Seen this way, randomly searching for one of those thread shapes, that is a member of a protein family, may take eons but once found, it would not take long at all to search out the limits of functional folding sequence space for that protein family. No area of 21-D functional sequence space is ‘far’ away. The ‘nodes’ (highly conserved sites) would very rapidly be revealed, then sub-molecular dependencies would follow next because a 3D structure is not one, large homogeneous relationship … I have shown here that the 3D structure of a protein family domain can be broken down into smaller units. These smaller units can be much more easily discovered in an evolutionary search, massively simplifying the search. All this arises out of two ideas: a) sequence space is 21 dimensional and, b) a given 3D structural domain is composed of smaller sub-molecular components that can be easily and rapidly mapped out by an evolutionary search within the confines of a 21 dimensional sequence space.

I see several issues with this response.

First, it is useful to consider exactly what a protein family is. A protein family is not the entire set of proteins capable of performing a specific function. It is a set of proteins that are homologous, i.e., related (or at least apparently related) by common descent. Your answer thereby presupposes that, for any given function, there is effectively only one protein family, in the entirety of sequence space, capable of performing that function. It is difficult to see how this could possibly be correct. Even “random” peptides often show some affinity for some function.

Second, for many functional proteins, it will definitely be the case that some single point mutations are not functional (or exhibit reduced function), but a double mutant or higher-order mutant will restore function. This effectively means the protein sequence must “cross a fitness valley”. Now, we have very good reason to think crossing a fitness valley is often something that can happen on plausible time scales (see any of Dan Weissman’s papers on fitness valley crossing), but you seem to think evolution is so good at crossing fitness valleys that it will rapidly explore an entire functional landscape, crossing numerous valleys in the process, with no problems.

In other words it seems you, like many creationists, endorse a form of “super-evolution” far in excess of what most evolutionary biologists would willingly accept.
Allan Miller on April 22, 2016 at 8:15 am said:

Rumraket,

I very much appreciate that you’re taking the time to discuss these matters Kirk. Even if we disagree I respect that you’re not above stepping in the mud, so to speak

Seconded.
cubist on April 22, 2016 at 9:18 am said:

Taylor Kessinger:

In other words it seems [Kirk Durston], like many creationists, endorse a form of “super-evolution” far in excess of what most evolutionary biologists would willingly accept.

Attacking an “evolutionary” position which no “evolutionist” actually subscribes to is a very common Creationist tactic. When the “evolutionary” position in question is an unsupportable over-extrapolation of a position which actually is subscribed to by real scientists, it demonstrates that the Creationist who came up with said over-extrapolated position must have some comprehension of what the actual science is, in spite of the fact that they abuse that understanding in (what they consider to be) the service of their Lord.

Basically, it’s the same process of distortion which other Creationists indulge in when they read what “evolutionists” have to say about evolution, extract oh-so-carefully-selected passages from what they’ve read, and present those carefully-extracted passages as “evidence” that those dirty rotten “evolutionists” know how wrong evolution really is.
Allan Miller on April 22, 2016 at 10:04 am said:

Mung,

That’s not very Darwinian of you.

Who’s trying to be Darwinian anyway? But I submit that you only dimly comprehend natural selection, if you think my sketch contradicts it in any way.

And as you well know, given your own example, it doesn’t need to attract a premium. Evolution is downright godlike in it’s infinite ability to find even the slightest advantageous improvement.

Only from where it is, not wherever improvement may lie. Once it has scoured the local area and found the best, it has no means of scouring anywhere else. All amendments will be selectively ‘downhill’, and any novel sequences will be unlikely to be as good as the selectively tuned one it already has, and hence unlikely to replace it.

By the way, how is it that permease must have been at all useful when it was first discovered? Why could it not have been downright harmful?

Coulda woulda shoulda. If it was downright harmful it would not have been fixed by selection. So its existence is at least supportive of the hypothesis that it was not. I’m illustrating general principles. Unless you are pinning your hopes on the idea that everything is downright harmful when it first arises, evolution proceeds via the ones that aren’t.
Allan Miller on April 22, 2016 at 10:12 am said:

colewd,

So in the case of a duplicated gene, After it searches the space and stumbles on function, how are the transcriptional proteins assembled to get this new gene produced at the right time?

Promoters and repressors are also stumbling around. Genes that aren’t produced at ‘the right time’ trouble us no further.

How are the splicing codes formed to remove the introns?

Why would it have introns? No new genes are possible because they come complete with introns! 🙂 But if it does and they are self-splicing, problem solved. [eta – or Rumraket’s answer]

Do we need a new histone produced for this gene to wrap around during cell division?

No.

How about a chaperone that helps it fold.

Not essential. ‘Correct’ folding can happen in a test tube with no chaperones or anything else much.

Will any old chaperone do or does its partner need to evolve at the same time?

There are generic chaperones – HSPs.

If this is a nuclear protein, does it have the proper code so it cannot be blocked by the nuclear pore complex. So that code needs to evolve first?

I dunno. You seem determined to construct a roadblock out of anything you can find lying around. Quick Mung, help colewd with that mattress!
Allan Miller on April 22, 2016 at 10:28 am said:

Thinking more on introns (I’m supposed to be painting …), their sequence can contribute to the ‘exploration’ process. If the gene is currently nonfunctional, it hardly matters that the intron-exon boundaries are not surgically maintained. If it subsequently (after a bit of flailing) produces a useful product, that product is post- whatever mechanism of exon stitching has been applied. Even if that splicing would screw it up if it were the original gene in the pseudogenisation scenario. It didn’t screw this one up – on the contrary, it made it what it is.
Mung on April 22, 2016 at 2:48 pm said:

cubist: Attacking an “evolutionary” position which no “evolutionist” actually subscribes to is a very common Creationist tactic.

Where has Kirk done this?
Kirk on April 22, 2016 at 2:57 pm said:

Colewd: I don’t have a diagram but it can be visualized by imagining a pipe-cleaner (a long wire with a series of bristles coming out at equally spaced points along the wire). So for a 100 aa protein, you would have 100 points where the bristles radiate out from the wire, and at each point where the bristles radiate out, you have exactly 20 individual bristles. I might group the bristles according to properties (e.g., nonpolar, semipolar, polar, negative charge, positive charge, etc.).

Taylor Kessinger:

A) You are correct that a protein family is not defined by a specific function. I disagree, however, with your reason as to why the set of proteins in a protein family is homologous. You suggest the homology is due to common descent. I hold that the homology is due to common tertiary structure which, in turn, is defined by physics. The set of all sequences producing a similar 3D structure defines the extreme boundaries of a structural protein family. The set of functional sequences is likely to be a subset of of the larger, structural protein family. Structure is physics dependent, function is contingent on the system. Thus, you can have the same function possibly satisfied by two or more entirely different protein families, which is why it would be a mistake, as you point out, to define protein families in terms of function. Finally, a given protein family may be capable of performing a variety functions.

B) I think it is risky to invoke common descent as an explanation. Although one can always marshal evidence that verifies common descent, falsification trumps verification and there are two essential predictions of the theory of common descent that I think are falsified by a rapidly growing body of evidence. That, however, is another discussion for another time. Suffice it to say that if we think in terms of physics and sampling, we are on very solid ground, as it will cover both the theory of common descent as well as the theory of multiple OOL events.

C) Crossing fitness valleys: The double mutant example you mention is an example of a pairwise dependency. I agree that it is more difficult to explore the full range of pairwise dependencies, and even more difficult to explore higher order dependencies. But we do not need to fully explore all of functional sequence space to adequately sample its boundaries. We only need to sample it sufficiently to tell us where those higher order relationships are, and whether the MSA has sufficient samples to give you that information. I have developed and published a method that does exactly that; it shows where the higher order dependencies are, and they start emerging after only a few hundred unique functional sequences in a MSA. Incidentally, I have found that the most important higher order dependencies are usually 3rd and 4th order.

D) The effect of higher order dependencies on the estimated value of I(Ex) and the solved value of M(Ex)/N: The existence of higher order dependencies greatly increases the amount of I(Ex) required to code for a protein family. They also greatly decrease, by numerous orders of magnitude, the probability M(Ex)/N of finding a member of that family. The method I use to estimate I(Ex) semi-ignores these higher order relationships. I say ‘semi-ignores’ them because the MSA still contains the consequences of them, even if the computational method I use computes I(Ex) by individual sites. So the result of ignoring these pairwise relationships gives a very optimistic (by many orders of magnitude) estimate for the probability M(Ex)/N. I can give a simple toy example if necessary for a 3-residue protein that permits all 20 amino acids at each site with equal probability, yet has only 20 functional sequences, not 20^3.

Four tests to see if the sampling of functional sequence space is sufficiently broad:

In this discussion I have frequently observed the untested assumption that functional sequence space for a protein family is so huge that a few thousand unique sequences are not sufficient to give us a sampling of its boundaries. I have provided one way to test this and all that has been provided by way of protest are theories as to why functional sequence space might not have been adequately sampled. Those theories need to be tested, so let me suggest four ways to test the breadth of sampling:

First Method: As I’ve stated already, see how X’-X’’ changes with increased sample size, where X is a variable sensitive to bias. To test whether you have an appropriate variable, X’-X’’ should yield large differences for very small sample size and approach zero with very large sample size. For example, if you wanted to plot language vs sample size and you started in downtown Beijing and observed no change in X’-X’’ between samples sizes of 1, 10 and 100, you should be concerned. It may be the case that 100% of the world’s population speaks Mandarin, or you may be repeatedly sampling the same group; you don’t know at that point. A solution in statistics is to choose an independent variable for X’-X’’ that will tell you whether you are sampling the same area or not. This is a general approach; different types of problems require different ways to independently verify whether the sampling is sufficient or not.

Second Method: For protein families, are deleterious mutations ever observed? If so, then the search is already contacting a boundary. If this method is used as an independent check on Method (1), it increases our confidence that we might be adequately sampling within the functional boundaries of the protein family.

Third Method: For protein family MSA’s do we observed highly conserved sites as well as other sites that permit all, or nearly all, 20 amino acids? Even if the size of functional sequence space is a complete unknown, these symptoms indicate that a) there has been enough time to try a lot of options at each site but b) something is constraining the options at certain sites (i.e., the functional boundary). Combining methods (1), (2) and (3) increases our confidence still further that sampling is across the breadth of functional sequence space, even if there are lots of space between the ‘dots’ that do the mapping. We should not confuse space between the dots with the distribution of the dots within functional sequence space.

Fourth Method: Look at the final result to see if the functional sequence space, represented by M(Ex) looks too small, a sign of inadequate sample coverage. For example, for AA permease, our estimated M(Ex) is 10^97. Does anyone seriously think that there are 10^97 functional sequences for AA permease? If anything, it looks like a huge overestimate rather than being too small.

So is the huge M(Ex) a method problem or a sampling problem? Tests (1), (2) and (3) all suggest that large MSA’s of a few thousand sequences are encountering the functional boundaries (i.e., limits to structural variability that still provide the function), so the large M(Ex) is not due to a sampling problem, it is a problem with the methodology but that is not necessarily a bad thing, depending upon the objective of the experiment. If the objective is to provide a ‘best case’ probability M(Ex)/N, then we want an overestimate of M(Ex). The primary problem is that my method semi-ignores higher order dependencies which greatly reduce M(Ex).

Why do I ignore the higher order dependencies? The main reason is that although the information regarding higher order dependencies is embedded within the large MSA, it is an enormous amount of work to extract. I’ve only done it for two protein families, ubiquitin and transthyretin and it took months of work for each one, whereas I can run a large MSA through my software and obtain an upper estimate for M(Ex) in minutes. I can radically speed up the process for obtaining the higher order dependencies, but I need to add a major module to the software which would probably take a couple months to write and debug. I don’t have that kind of time. Too many ideas for exploration and implementation, only one life. So I content myself with finding the upper limit for M(Ex) and a best case probability M(Ex)/N. That, right there, is valuable information.

Gone for the weekend I got to get to work now, and I have plans for the weekend. The soonest I can get back here will be next Tuesday.
Rumraket on April 22, 2016 at 5:01 pm said:

Kirk: In this discussion I have frequently observed the untested assumption that functional sequence space for a protein family is so huge that a few thousand unique sequences are not sufficient to give us a sampling of its boundaries.

But Kirk that is just not correct. People have been talking about function, not whether a particular protein family (defined by sequence and structural similarity) is possibly bigger than what evolution has had time to sample. It is entirely possible that for some protein superfamilies, a significant fraction of the space of functional variants of proteins that would be recognized as belonging to that family is represented by proteins in extant life, yet the particular functions done by that protein family also exist elsewhere in sequence space, represented by structures and sequences that would not be detected as homologous (and therefore not part of the family).
Allan Miller on April 22, 2016 at 5:21 pm said:

Kirk,

In this discussion I have frequently observed the untested assumption that functional sequence space for a protein family is so huge that a few thousand unique sequences are not sufficient to give us a sampling of its boundaries.

Yes, just to add my 2c, I do not recognise that assumption, nor do I accept that I make it unknowingly! The problem is that if evolution (common descent) is the cause of homology your process cannot be an unbiased sample. If you are assuming your conclusion, that’s fine and not uncommon, but if you are investigating the limits of evolution, you have to take evolution on board.

If there were but a single ancestor sequence and all members of the family were modified descendants of that, you would end up with N ‘unique’ sequences. But it would hardly be an unbiased sample of protein space, however large N was. The constraints of physics would not come into it. If, as you seem to be saying, convergence rather than common descent is the cause of homology … well, I don’t know why you’d say that. Common descent is, after all, pretty reasonable as a cause of common sequence. Convergence happens, but then it’s just another evolutionary mechanism, so if you prefer one over the other you still haven’t definitively reached a limit to evolution.

Either way, it is pretty obvious that an evolutionary process cannot thoroughly explore protein space, because it lodges on something that works and tunes it towards an adaptive peak. The rest of space is ignored (to speak loosely). So to declare that evolution should have had enough time to do so is, I think, quite fundamentally to misunderstand it.
Allan Miller on April 22, 2016 at 5:41 pm said:

One exercise Kirk could try is to break his sample down by taxonomic group. I predict that the fraction of the space apparently ‘sampled’ by evolution would increase with broader taxonomic rank, exactly as one would expect from a process of common descent. If there is convergence due to physical constraint, it acts (as do many other genomic features) to give the precise pattern common descent would predict, and opposite to that one might expect on convergence.
John Harshman on April 22, 2016 at 5:51 pm said:

Allan Miller: The problem is that if evolution (common descent) is the cause of homology your process cannot be an unbiased sample.

Ah, but Kirk assumes that common descent is not the cause of homology. He appears to believe that every species (correct me if I’ve chosen the wrong level here) was separately created, and thus protein families are not families at all, just inexplicably varied versions of the same design.
Allan Miller on April 22, 2016 at 5:58 pm said:

Kirk,

For protein families, are deleterious mutations ever observed? If so, then the search is already contacting a boundary.

Not really. A deleterious mutation means that its possessors have fewer offspring on the mean than rivals. If 3 sequences A, B and C are sequentially produced that progressively ‘improve’ a function (measured in those terms), B is beneficial when it first arrives in a population fixed for A. But when C arrives, B becomes detrimental and it is eliminated in its turn. Mutations to B are detrimental where once they were advantageous. This is not characteristic of a boundary. It’s the ‘adaptive peak’ issue I have been referring to.
colewd on April 22, 2016 at 5:59 pm said:

Allan Miller,

Promoters and repressors are also stumbling around. Genes that aren’t produced at ‘the right time’ trouble us no further.

Under this scenario is the copied gene transcribed during the mutational process as long as the copied promoter region is in tact?
colewd on April 22, 2016 at 6:02 pm said:

Kirk,

Colewd: I don’t have a diagram but it can be visualized by imagining a pipe-cleaner (a long wire with a series of bristles coming out at equally spaced points along the wire). So for a 100 aa protein, you would have 100 points where the bristles radiate out from the wire, and at each point where the bristles radiate out, you have exactly 20 individual bristles. I might group the bristles according to properties (e.g., non polar, semipolar, polar, negative charge, positive charge, etc.).

This helps a lot. Thanks
Allan Miller on April 22, 2016 at 6:04 pm said:

John Harshman,

inexplicably varied versions of the same design.

With said variation clustering hierarchically on taxonomic rank, yes … it’s a slippery argument. If one is trying to argue limits on evolution, assuming it did not happen in the first place (rather than as a conclusion from your data having assumed it for the argument) seems a dubious way to go.
Allan Miller on April 22, 2016 at 6:08 pm said:

colewd,

Under this scenario is the copied gene transcribed during the mutational process as long as the copied promoter region is in tact?

It could be. But if it produces supernumary, or dud, copies of the original product, it can become transcriptionally silent, since selection will not protect it from substitution.
Mung on April 22, 2016 at 6:10 pm said:

Why assume that proteins with similar function share a common ancestor? What justifies that assumption?
Allan Miller on April 22, 2016 at 6:29 pm said:

Mung,

Why assume that proteins with similar function share a common ancestor? What justifies that assumption?

The assumption is based on similarity of sequence (sometimes structure), not of function. Proteins of different function are assumed to share a common ancestor if they share sufficient sequence similarity.

Protein families also cluster into superfamilies. Again, it’s sequence not function.
Mung on April 22, 2016 at 6:29 pm said:

Allan Miller: Either way, it is pretty obvious that an evolutionary process cannot thoroughly explore protein space, because it lodges on something that works and tunes it towards an adaptive peak. The rest of space is ignored (to speak loosely). So to declare that evolution should have had enough time to do so is, I think, quite fundamentally to misunderstand it.

This is misguided, as I have pointed out previously. Further, you have argued the exact opposite in trying to explain how to overcome fitness valleys/saddles.
Allan Miller on April 22, 2016 at 6:38 pm said:

Mung,

This is misguided, as I have pointed out previously.

In what way? Or can you point to a comment where you ‘point this out’?

Further, you have argued the exact opposite in trying to explain how to overcome fitness valleys/saddles.

That isn’t ‘the exact opposite’. What is it with you people and dichotomies? If I argue that selection tends to push sequences to adaptive peaks, this is in no way contradictory to the observation that there are forces other than selection which will knock them off again. Evolution is not all one thing.

Nonetheless, a maladaptive valley will generally not take one into a completely unconnected region of protein space, by the very nature of the mutational process. Crossing an adaptive valley is no more a strike for novelty than climbing a hill in the first place.
Allan Miller on April 22, 2016 at 6:47 pm said:

Mung,

This is misguided, as I have pointed out previously.

Just to be sure, you are saying it is ‘misguided’ to point out that evolution, having found and tuned a particular functional sequence that works, is far less likely to find a completely unrelated novel sequence which functionally replaces the first. On what grounds do you say this?

Rather than just ‘pointing it out’, what is your argument against?
colewd on April 22, 2016 at 7:22 pm said:

Rumraket,

colewd: I am looking at the range he quoted to 10^10 to 10^64 for a 100 AA enzyme. These numbers make current evolution highly improbable

Prove it. There is nothing in his essay about this. You keep making this assertion but provide nothing that backs it up

In multicellular life average protein size is around 500 aa. So given this probability that would create a range of 10^50 to 10^320. The total evolutionary resources available since the beginning of the earth is less than 10^50. Allan has said that you can cut these odds by substructures like helixes but we have no identified mechanism how the genome would create this organization on its own.
Mung on April 22, 2016 at 7:32 pm said:

Allan Miller: Rather than just ‘pointing it out’, what is your argument against?

That evolution stops because it’s found an adaptive peak for some protein function. More explicitly, that exploration of sequence space ceases. One of your objections to Kirk is based on this misguided notion.

The concept of “a completely unrelated novel sequence which functionally replaces the first” is a straw-man. If a gene is duplicated and subsequently diverges from the original it doesn’t have to functionally replace the first. Would you say this is not a continuation of the exploration of sequence space?