Here, one of my brilliant MD PhD students and I study one of the “information” arguments against evolution. What do you think of our study?
I recently put this preprint on bioRxiv. To be clear, this study is not yet peer-reviewed, and I do not want anyone to miss this point. This is an “experiment” too: I’m curious to see if these types of studies are publishable. If they are, you might see more from me. Currently it is under review at a very good journal, so it might actually turn the corner and get out there. And a parallel question: do you think this type of work should be published?
I’m curious what the community thinks. I hope it is clear enough for non-experts to follow too. We went to great lengths to make the source code for the simulations available in an easy-to-read and annotated format. My hope is that a college-level student could follow the details. And even if you can’t, you can still weigh in on whether the scientific community should publish this type of work.
Functional Information and Evolution
http://www.biorxiv.org/content/early/2017/03/06/114132
“Functional Information”—estimated from the mutual information of protein sequence alignments—has been proposed as a reliable way of estimating the number of proteins with a specified function and the consequent difficulty of evolving a new function. The fantastic rarity of functional proteins computed by this approach emboldens some to argue that evolution is impossible. Random searches, it seems, would have no hope of finding new functions. Here, we use simulations to demonstrate that sequence alignments are a poor estimate of functional information. The mutual information of sequence alignments fantastically underestimates the true number of functional proteins. In addition to functional constraints, mutual information is also strongly influenced by a family’s history, mutational bias, and selection. Regardless, even if functional information could be reliably calculated, it tells us nothing about the difficulty of evolving new functions, because it does not estimate the distance between a new function and existing functions. Moreover, the pervasive observation of multifunctional proteins suggests that functions are actually very close to one another and abundant. Multifunctional proteins would be impossible if the FI argument against evolution were true.
Given that evolution is changes in gene frequencies in a population, your claims make no sense.
This is not my claim Mung. This is the claim that several ID and YEC leaders make. If you do not think it makes sense, you should go tell them.
Actually it is your claim. You referred me to your supplementary data. Do you mean the article by Kirk Durston? Is it your claim that Kirk Durston claims that evolution is impossible? Because that would be a poor reading indeed of what he wrote.
How would you explain his position? An “evolutionary search” for functional proteins is impossible? An “evolutionary process” that produces functional proteins is impossible? Or maybe this quote?
“The entire evolutionary capacity of the universe, if it were nothing but proteins recombining every second, is still more than 150 orders of magnitude too inadequate to expect to produce any of the 10^62 possible functional sequences for RecA.”
Or maybe this one?
“A more formal way to evaluate the hypothesis that biological proteins were obtained in a blind evolutionary process is to apply the Universal Plausibility Metric (UPM), where an hypothesis is definitively operationally falsified if the UPM for that hypothesis is less than 1.[11] For RecA occurring somewhere within the universe during its history to date, the UPM = 10^-142, which means that the hypothesis that it could be located by an evolutionary process is definitively falsified.”
How would you read these statements? I think “evolution is impossible” is a pretty close approximation.
What a waste of time you are, Mung. This kind of crap is why people don’t respect you.
We’ve got a scientist calling for discussion of his research. This post should be featured. We should leave our fighting to other threads.
^^^ Seconded!
Dr. Swamidass,
Kirk is a friend of mine. But that said, I think the calculations in the paper are worth discussing.
Take for example an orphan protein/gene in a species. Since there is nothing to align it to, what would its MISA be? Suppose the protein in question is an enzyme with an enzymatic active site and allosteric sites, with a good number of positions changeable. It would seem FI and MISA in this case would clearly not be the same. Is that correct?
Next, suppose a protein/gene is orphan to only two or a small number of species, say primate specific proteins. Suppose further they are 100% identical in all the species they are found in. As you said, this would report a high MISA if the possibility of recent common ancestry is not incorporated into the MISA calculation. MISA would leave out a necessary correction for similarity due to common ancestry rather than similarity due to function.
If what I just said makes sense, and if indeed you accurately represent the ideas from the ID side, then in that sense I think you raise valid concerns. But, as I’ve now said for years, ID proponents need to consider dropping information theory arguments. As an electrical engineer in a former life, I studied Shannon’s theorems in graduate communication classes. I’ve not seen good justification for incorporating Shannon-like theories into arguments for or against evolution. It just makes a mess of things. I’ve also pointed out what looks like the futility of estimating the active information in a selective search such as Scott Minnich’s faster evolution of the Cit+ mutant in 19 days vs. Lenski’s Cit+ in 15 years. Can any ID proponent give a credible measure of the active information in Minnich’s 19-day search relative to Lenski’s 15-year LTEE? Even if they did, would it yield any insights above what we already know?
That said, I don’t think evolutionary theory is a successful theory at all. A successful biological theory is something like Mitchell’s chemiosmotic hypothesis for ATP synthesis or Krebs’s discovery of the TCA cycle, both of which got Nobel prizes. Noticing the similarity between species has been around since the creationist Linnaeus. Similarity proves similarity; it doesn’t prove the feasibility of evolving complex function. One can assume common descent because of similarity, but it doesn’t mean novel features arise easily.
MISA does seem to identify the strength of conservation if the protein appears in all life forms.
Nice to see you, and God bless you.
In fairness to my ID colleagues, it would seem a charitable reading would allow for the assumption that they are analyzing deeply conserved proteins rather than orphans. In the case of orphans, cassette mutagenesis experiments like Sauer’s would act as a reasonable facsimile to create MISA measurements.
That said, I would argue that adding the layer of Shannon calculation is superfluous: a gratuitous and unnecessary insertion of information theory where it adds only confusion rather than clarity.
The concern about proteins with essentially the same function and similar tertiary structure but very different sequences seems valid. I seem to recall one study said proteins with the same function and tertiary structure can have as little as 12% sequence homology! So these proteins may not get picked up in a BLAST.
I discovered it while arguing with Rumraket and Allan Miller over Lysyl Oxidase in a thread about Stephen Meyer’s book, Darwin’s Doubt. I’m sorry now that I’ve lost the reference to the study! If I get around to it I may re-visit that discussion.
All the forms of the Lysyl Oxidase molecule won’t show up by blasting one gene or the protein sequence of one species, because of the lack of sequence homology, even though there is tertiary homology.
Thanks for the kind reception.
I’ve talked to Kirk too. He is a nice guy and none of this is directed at him personally. My issue is with the math, and the high certainty some have about him “disproving evolution”. I’m a theistic evolutionist, but I am fine with someone debunking evolution if it really is false. But if it really is false, it should not take incorrectly applied math to debunk it.
You are right, almost.
First, your calculations are exactly right. The reduced sampling will increase the measured MISA, and it will be much higher than FI. However, care is taken to make sure that enough sequences are sampled so that the MISA levels off, so this example would be thrown out. What I show in this paper is that when sequences have shared history (common descent) and we are just sampling extant sequences, then even if MISA levels off it is not a good estimate of FI. E.g. it takes hundreds of millions of years before the “memory” of the starting point is forgotten in mammalian evolution.
Essentially “history” is another source of shared information that is not accounted for if one just assumes that extant converged MISA = FI.
Once again, almost. Here too there are too few sequences for MISA to converge (see the figures in the paper), so these cases would be thrown out as well. One would have to choose proteins that are found in many more organisms.
MISA does measure strength of conservation (i.e. selection), but it also measures shared history and mutational deviations from uniformity. And the point at which selection (conservation) occurs for important proteins (e.g. DNA and protein replication) is most definitely NOT at the boundary of function/non-function. Rather it is at the boundary of high/low function (qualitatively speaking).
For all these reasons, and more, MISA is a really bad way to estimate FI. It just does not work. We need some way of subtracting out the strength of these other sources of information, but this appears to be a nearly unsolvable problem.
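To make the “history” effect concrete, here is a toy simulation sketch (my own illustration, far simpler than the simulations in the paper; the sequence length, mutation count, and family size are arbitrary choices). There is no function and no selection anywhere in it, yet the resulting alignment reports hundreds of bits of MISA, purely because the descendants still remember their common ancestor:

```python
import math
import random
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def misa(alignment):
    """MISA in bits: L * log2(20) minus the summed per-column entropy of P(a at i)."""
    L, n = len(alignment[0]), len(alignment)
    h_pssm = 0.0
    for i in range(L):
        for count in Counter(seq[i] for seq in alignment).values():
            p = count / n               # P(a at i): normalized residue count at column i
            h_pssm -= p * math.log2(p)
    return L * math.log2(20) - h_pssm

def mutate(seq, n_mutations):
    """Random point mutations; no selection and no functional constraint at all."""
    seq = list(seq)
    for _ in range(n_mutations):
        seq[random.randrange(len(seq))] = random.choice(AMINO_ACIDS)
    return "".join(seq)

random.seed(0)
L = 100
ancestor = "".join(random.choice(AMINO_ACIDS) for _ in range(L))

# A "family" of 200 extant sequences, each only a few mutations from one ancestor.
family = [mutate(ancestor, n_mutations=5) for _ in range(200)]
print(round(misa(family)), "bits")  # several hundred bits, from shared history alone
```

Sampling uniformly at random from sequence space instead of from a family would drive MISA to roughly zero, which is the sense in which shared history masquerades as “functional” information.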
One more contribution I noticed. I wonder if anyone else has noticed this too.
It seems that there is an attempt to coin this new term FSC. Elsewhere, FSC is equated with ASC. http://robertmarks.org/REPRINTS/2014_AlgorithmicSpecifiedComplexity.pdf
The part I find interesting is that ASC (as defined by Marks) and FSC (as defined in the paper I critique) are identical to mutual information. Usually, they compute mutual information under a uniform probability model, even though this is an incorrect model of evolutionary searches. There is a remarkable illusion that results from this proliferation of new terms to define old ideas. It seems like they are talking about something new, but they are not. Mutual information (even of the Kolmogorov sort) is a very well understood idea with a great deal of theory behind it.
All this theory ends up dismantling many of these information-based arguments. But by naming things differently and adjusting the variables, the connections are not obvious unless one is already very familiar with the formulas. For example, Marks argues that because Kolmogorov complexity is an upper bound (true!) his ASC estimate is a lower bound (false!). Hidden in his formulation is the critical importance of the background distribution, which they assume in biology is always uniform. In evolution this background distribution (sampling around existing sequences) is very hard to model analytically (at its core it is non-parametric), but it is easily going to be much, much more efficient than uniformly random searches in protein space.
In the standard formulation for mutual information (which is algebraically equivalent to ASC under conditions I’ll describe later if people are interested), the error he makes in asserting ASC is a lower bound becomes obvious. But because he has redefined the term and changed the notation, this is no longer obvious. It becomes clear that ASC is actually just an upper bound that is reduced as the sampling distribution is understood, and is therefore not useful for his argument.
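For those who want the algebra, the definition from the Marks paper linked above is, as I read it (restated in their notation; the gloss is mine):

[latex]\mathrm{ASC}(x, C, P) = -\log_2 P(x) - K(x \mid C)[/latex]

A compression-based estimate of K(x|C) is indeed an upper bound on K, so for a fixed P that subtraction yields a lower bound on ASC. But under the uniform assumption the first term, -log2 P(x) = L log2 20 for a length-L protein, is as large as it can possibly be. Any background distribution that reflects how evolution actually samples (concentrating probability near existing sequences) makes P(x) far larger than 20^-L and shrinks that term. So the computed number is not a lower bound on anything evolutionarily meaningful once the background distribution is in doubt.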
On that “16 days vs 15 years citrate evolution” stuff, here’s Zachary Blount:
https://telliamedrevisited.wordpress.com/2016/02/20/on-the-evolution-of-citrate-use/
Nobody does that. One concludes common descent due to nesting hierarchical patterns of similarity of entities known to imperfectly reproduce themselves (this directly predicts the pattern), not mere similarity.
One inherently yields a tree structure, and a tree structure is only predicted to exist from a stochastic evolutionary process of descent with independent modification; there’s no a priori reason to expect such a pattern under independent creation.
Would it though?
One could compute a MISA but this still would not equal FI.
Mutating a single protein would only sample from one family, not all the families with the function. There is no known way of estimating the number of additional families that should be sampled or for estimating their average size or their maximum size.
Also, how would one make sure the threshold for function was correct? Minimal function is very hard to detect experimentally but might be enough to initiate a positive-selection-driven optimization.
Hi Dr Swamidass, you may be interested in a post I made last year discussing these issues (actually, being a bit sarcastic about them). Kirk Durston was kind enough to comment (and made me regret my tone somewhat!). But the main thrust is the unreliability of using a set of surviving and commonly descended sequences as if it were an unbiased sample of an entire space.
ISTM that evolution would have to be untrue, and proteins neither genetically related nor subject to selection, before this method of doubting evolution could gain traction.
stcordova,
And one valid interpretation of this fact of ‘conservation’ of structure while losing all sequence signal is the enormous interconnectedness of protein space. Because amino acids are not 20 completely different things, bitwise substitution can proceed to the extreme of random alignment while the structure stays intact throughout, a Cheshire Cat smile constraining substitution, but not absolutely.
Yup that is exactly right. Good post. And Kirk is a nice guy. I’ve felt bad for my tone with him more than once.
Good luck with that. Sal ignored me twice when I pointed to those replay experiments; he just keeps repeating the same crap.
Sal is confused by word ladder?
Thanks for pointing to that thread; it gave Kirk a chance to respond. Kirk’s a friend, and while I think Dr. Swamidass raises important points, I don’t think Kirk’s (and the rest of the IDists’) ideas are irredeemable.
I think trying to evaluate an entire protein at once is challenging. It might be better to decompose the problem into individual domains and any necessary secondary structures.
The individual domains are probably a more tractable route.
Furthermore, I think sampling the tertiary space (at least computationally) is far more important than the sequence (primary) space. Then some probabilities might be more accurately arrived at.
One example of a domain:
https://en.wikipedia.org/wiki/Zinc_finger
Would we say that we need at least 4 residues (2 Cysteine and 2 Histidine) in the right place in a tertiary structure to make a zinc finger? That’s at least 8 bits (though I cringe having to convert it to bits, since that is superfluous and gratuitous). That’s not 150 bits, but OK, this is a start.
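As a quick sanity check on that figure (my arithmetic, under the crude assumption that each of the four coordinating residues must be exactly one specified amino acid out of 20, ignoring where they fall in the chain):

[latex]4 \times \log_2 20 \approx 4 \times 4.32 \approx 17.3 \text{ bits}[/latex]

So “at least 8 bits” is a safe lower bound, and allowing conservative substitutions at those positions would pull the figure back down toward it.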
stcordova,
It would be both much better and not nearly good enough! When you have domains, and a mechanism by which a domain may migrate within a protein, or be donated to a completely different one, you aren’t really any closer to having a sensible handle on the origin of the domain. For a given domain in a given peptide, its origin was almost certainly outside of the protein in which you are currently looking.
Any method you use by ‘sampling’ spaces or subspaces using extant sequences suffers from the same problem – the strong biases imparted by selection and common descent on your data set.
Yep, using bits is a bit fingers-on-a-blackboard for me. We can do so because these are constructed as linear polymers, so have the potential to be analysed as strings. But the strong interactions between ‘bits’, and the way changes affect the lability of subsequent amendments, makes bit-analysis a long way from satisfactory as a representation of possibility. Even worse when people insist that 20 acids == 20 different symbols (valid for Cys, Pro, maybe His, but not much else).
Agreed. Thank you for your response.
Joshua,
I’m stuck on Equation 2. Is P(a at i) a conditional probability, or is it a joint probability?
1. I don’t see the sense in treating the sequence index as a random quantity.
2a. If P(a at i) is the conditional probability of amino acid a given that the index is i, then the first term on the right-hand side has got to be the entropy H(A) of the random amino acid, which I’m denoting A. But I don’t get it.
2b. If P(a at i) is the joint probability of amino acid a and index i, then the first term has got to be the sum of entropy values H(A) and H(I), where I denotes the random index. But I don’t get it.
Thanks for looking at the math Tom.
P(a at i) is neither the conditional probability nor the joint probability. It should be read as “the probability that we find amino acid a at position i in the alignment.” This is just the normalized count of the amino acid at that position in the alignment.
The double summation (with the right sign) is just the entropy of the PSSM. The log term outside the summation is the entropy of a uniform distribution of amino acids in the PSSM (i.e. the maximum entropy possible). So we have maxent entropy minus measured entropy = mutual information according to a uniform distribution. And this is just copying the original formula they use in a notation that is easier (for me) to follow. You can see the relationship between MI and these entropies (H) here: https://en.wikipedia.org/wiki/Mutual_information#Relation_to_other_quantities
Though looking closer at your definition, this might be equivalent to how you are defining conditional probability. Though I do not usually think of it that way, because i is not really a random variable; we store the PSSM in an array that just lets us index it.
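A tiny worked example, with toy numbers of my own: take the three-sequence alignment AC, AC, AD (so L = 2). Column 1 is fully conserved and contributes zero entropy; column 2 has P(C at 2) = 2/3 and P(D at 2) = 1/3. Then

[latex]H_{\text{measured}} = -\tfrac{2}{3}\log_2\tfrac{2}{3} - \tfrac{1}{3}\log_2\tfrac{1}{3} \approx 0.92 \text{ bits}, \qquad \mathrm{MISA} = 2\log_2 20 - 0.92 \approx 7.72 \text{ bits}[/latex]

Exactly the maxent-minus-measured-entropy recipe above.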
Sorry. I’m familiar with information theory, but I can’t make sense of this. I see now that L log 20 is the entropy of a uniform distribution on the space of all length-L sequences over the set of 20 amino acids. Let’s write U for the random variable with uniform distribution of probability on that space. Then we’ve got H(U) minus what? It looks like H(X), where X = (X_1, …, X_L) is a random sequence over the amino acids, and the random variables X_1, …, X_L are mutually independent. Now, the relative entropy (Kullback-Leibler divergence) is
D(X || U) = H(U) - H(X).
This seems to match your Equation 2. I do not see a match to any of the following expressions of mutual information:
(1) I(X; Y) = H(X) - H(X | Y) = H(Y) - H(Y | X) = H(X) + H(Y) - H(X, Y)
(Note: I put the LaTeX source between tags [latex] and [/latex]. Also, to suppress markdown, put backquotes before and after the [latex] region. It’s hard to fix problems in the comment editor, so I compose in my usual environment and copy-and-paste to the comment.)
Joshua,
I’ve now taken a quick look at Durston et al. I don’t find references to mutual information, relative entropy, or divergence in the paper. It seems that they stick with “change in functional uncertainty.” So I claim that when the “null state” distribution is uniform, the difference in entropy is the relative entropy (Kullback-Leibler divergence).
I’d read them in context. And I’d try to accurately represent his actual views. And if I had doubts I might even send him an email and ask.
Durston himself admits novel functions can be achieved by an evolutionary process. To put it another way, he doesn’t claim that evolution is impossible.
By the way, isn’t that last subscript a typo? Looking at what comes immediately before, I think the last term should have a different subscript.
Tom, thanks for looking at this.
I should emphasize two things. (1) There are multiple ways of deriving this formula. (2) They do not use the term “mutual information” but invent the term FSC, even though the formula is the same. So I am just pointing out that the formula is exactly the same as MI, and this is the term I use.
To derive this, I am using the starting identity:
I(PSSM, FUNC) = H(PSSM) – H(PSSM | FUNC)
Where PSSM is the position-specific scoring matrix, and FUNC is the presence of a specified function. Now I just compute the entropy of these things assuming the PSSM follows a uniform distribution when not conditioned on function.
H(PSSM) = -L * 20 * (1/20) * log(1/20)
= -L log(1/20)
And then we compute the entropy of the PSSM of sequences with a given function.
H(PSSM | FUNC) = – sum_i sum_a P(a at i) log P(a at i)
And now this gives the MI of a sequence alignment, MISA…
I(PSSM, FUNC) = -L log(1/20) + sum_i sum_a P(a at i) log P(a at i)
So that is what they are computing when they compute FSC. It is algebraically equivalent to MI with a maxent prior distribution, and requires no new term or theory to describe.
This computation computes the number of “bits” of information the members of the family share in common. Because they all have common function, the intuition is that this common information is what defines the function. The intuition makes sense, but is only correct if (1) there is only one cluster of functional sequences and (2) we are uniformly randomly sampling sequences. Both these assumptions are false, so MISA is not a good estimate of FI.
As I demonstrate in the simulation, MISA also encodes information about the sampling distribution, which is not uniform, and is strongly shaped by common ancestry, selection, and mutational bias.
Does that make sense?
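Two limiting cases (a sanity check of my own) show the formula behaving as intended:

[latex]H(\mathrm{PSSM} \mid \mathrm{FUNC}) = 0 \;\Rightarrow\; \mathrm{MISA} = L \log_2 20 \quad \text{(perfectly conserved alignment)}[/latex]

[latex]H(\mathrm{PSSM} \mid \mathrm{FUNC}) = L \log_2 20 \;\Rightarrow\; \mathrm{MISA} = 0 \quad \text{(columns uniform over all 20 residues)}[/latex]

The trouble is everything between these extremes, where history, mutational bias, and selection contribute alongside function.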
I should also clarify that this, essentially, is what they assert.
But I show in the simulations that they are instead computing:
H(PSSM) – H(PSSM | FUNC, MUTATIONAL_BIAS, HISTORY, SELECTION)
Or rather,
I(PSSM; FUNC, MUTATIONAL_BIAS, HISTORY, SELECTION)
They have to parse out each of these independent contributions to get an estimate of I(PSSM, FUNC), but they do not seem to realize this is even an issue.
This is one reason I strongly disfavor the term “FSC” to describe MISA. That tries to settle the scientific question with a definition. It makes the whole conversation confusing. That is why I fall back to the neutral term MISA, and ask if MISA is a good estimate of FI. It is not.
And one final thought. One step forward would be to estimate:
I(PSSM, FUNC) ≈ H(PSSM | MUTATIONAL_BIAS, HISTORY, SELECTION) - H(PSSM | FUNC, MUTATIONAL_BIAS, HISTORY, SELECTION)
To do this, one would have to explicitly model mutation, history, and selection, and their influence on the entropy of the PSSM, and then compare this with the observed PSSM. I present one way of doing that to account for history. However, doing this for selection appears intractable. And this comes with its own problems: assuming that local sampling gives a good estimate of the global space (which is false) and that there is only one family (which is also false).
So while that is the way forward, I do not think it is possible to tractably compute this. Regardless, FI tells us nothing about evolvability anyway.
Which I have done! I have demonstrated here that mindless natural processes produce arbitrarily high MISA, which is his way of computing FI. There are only two ways forward: (1) I have demonstrated FI is easily generated by mindless processes, or (2) MISA is a horrible estimate of FI. Pick your poison.
swamidass,
I’m not agreeing with Durston. Just pointing out that Mung is missing Durston’s obvious point — a point to which you are ably responding.
I know. I was mainly “responding” to Durston’s challenge.
I’ll have to come back to this later — probably not until tomorrow. But scanning over what you’ve written, I have a question: Is your Equation 2 equivalent to their Equation 6, or is there something else going on?
Moved a comment to guano. Noyau is available for insulting.
ETA: and a comment quoting the guano’d comment, for continuity.
The guanoed comments can be found here.
keiths,
And here
swamidass:
Andreas Wagner’s book Arrival of the Fittest discredits those two assumptions in a very readable way, which is why I frequently recommend it to ID proponents.
For example, here Wagner discusses how proteins having the same fold and function form vast networks rather than small clusters:
This is a bit tricky to relate. The real question is whether all these globins can be aligned. They might, and then this story would not invalidate the MISA approach (on its own). As nice a qualitative picture as it paints, it is not quantitative.
An important point here is that “small” clusters is very relative. One of the “small” clusters that Kirk calculates is “just” about 10^60 members. That is a gigantic number. The thing is that this has to be compared with a much, much larger sequence space: 20^150 (about 10^195) for proteins 150 aa long. Given this, it is entirely possible that a “small” cluster can produce the expansive network that Wagner describes.
What I like about Kirk’s work is that he is moving beyond these sorts of stories to try and compute something from the data. I affirm that entirely. The problem, though, is that his number is not telling him what he thinks it is.
I should add that PZ Myers totally failed in his debate with Kirk for this reason, and also in the follow-up blog post. As smart a guy as he is, he did not understand the math here.
swamidass,
A longer reply later, but for now I’ll note that a network extending 90% of the way across sequence space isn’t “a small cluster”, regardless of how many members it contains relative to the whole.
In other words, “small” is a function of extent as well as volume. There’s an enormous evolutionary difference between 10^60 members crammed into a hypersphere vs 10^60 members spread out across 90% of the diameter of sequence space.
If sampling were random and uniform, then extent wouldn’t matter; only volume would. But as you point out, the assumption of random uniform sampling is entirely bogus. Evolution doesn’t work that way.
I often ask people like KF, when they talk about this sort of thing, what process is doing the sampling. So far I’ve not yet had an answer, obviously because when they stop and think about it they realize there is no “random sampler” module that pokes about in sequence space randomly on the off chance of a hit.
Yet, somehow, that has never stopped their claims. Hopefully they’ll read this thread 🙂
OMagain,
Yes, my own response to the generalised, simplified v^n analysis of protein space has crystallised into wondering how people imagine such an implicit independent-probability sampler to be implemented. KF is big on the ASCII gambit – as if all string spaces have the same properties.
You got it. Nothing else going on.
I’m using the same formula, but using the standard name for it. There is no reason to coin a new term, so I do not use FSC. Like I said, the term FSC (Functional Sequence Complexity) sort of assumes the conclusion by using the word “Functional”. However we label it, this is just a number arrived at by a formula. We actually have to test if it measures what we think it does. The theoretical analysis alone is not enough. So the real question: does MISA (i.e. FSC) approximate FI? The answer is “no”.
Here are a couple more quotes from the article I am critiquing, for those that doubt my representation of the claims being made here…
QUOTE: “The idea that a Darwinian process can produce the kind of functional information required to code for the average functional protein family, not to mention all of biological life, is a popular one, but when actually tested against real data, is definitively falsified.”
Let’s leave aside that “Darwinian processes” is a strange limitation. The key thing is that a whole range of processes, including neutral evolution and biophysics, produces this amount of FI (as measured by MISA).
QUOTE: “For RecA occurring somewhere within the universe during its history to date, the UPM = 10^-142, which means that the hypothesis that it could be located by an evolutionary process is definitively falsified. On the other hand, RecA requires only 832 Fits (Functional Bits) of information to sequence, a quantity of functional information that intelligence can easily generate.”
That last comment I find really interesting. Where is the data that demonstrates that intelligence can produce complex functional proteins from scratch? This is a major open problem in science that no mind we know of can solve. The fact that we cannot solve this problem with our intelligent effort (and the closest we get is with random screening and massively parallel simulations) should make us doubt that logic greatly.
Though to be clear, I am a theistic evolutionist. I believe God created us through evolution, so in an ultimate sense I do think this is a product of His “Mind.” Nor do I think that science can rule out that God proximately produced this. Though, I do not think it makes much sense to reduce God to “intelligence.” I put this here so that hopefully my critics will see that I am not arguing against God at all. I’m just pointing out what looks, to me, like incorrect math. If God created us, which I believe He did, I am not sure what need He has of incorrect math.
swamidass,
How would you claim that neutral mutation or biophysics can help work through the large sequence space and find function?
I think your argument that MISA dramatically understates the amount of function available is what needs to be validated.
If there is more available function, is there enough that the law of information entropy won’t rapidly move the sequence to non-function? Right now the DNA repair mechanism is what avoids this problem in living organisms. Maybe it should be part of the model.
This is a very interesting comment. The question is whether the origin of biological information can be understood based on scientific principles, or whether it is like gravity, electromagnetism, and the strong and weak nuclear forces, where the cause appears to be out of reach of science. Humans can create sequences, but as you state, not of the complexity of many biological sequences.
I’m sure Kirk would agree with you that “without God, evolution is impossible.”
Why?
One of the notes I made on your manuscript was that you needed to add some brief quotations, if there were any, to show that your interpretation of Durston et al. is correct.
Dr. Swamidass, thanks for making your very nice paper available. Sorry to be late to this discussion — was busy with my “day job”.
As you note here, a good part of your argument is showing that a sequence alignment of a gene family that all carry out a function is not the same as a random sample from all proteins that can carry out that function as well. This is for the obvious reason, discussed by you and your co-author, that the sequences in the alignment are relatives, found by local “exploration” and not by a global search for all functional proteins.
Functional Information is related to Leslie Orgel’s “Specified Information” and to the original use of Complex Specified Information by William Dembski. As FI is used by Hazen et al., it is, IIRC, -log2(F), where F is the fraction of all sequences that achieve as good a function, or better.
Your formulas for FI use the fraction of all sequences that have the function. Later you distinguish between lower- and higher-functioning sequences. I would think that there should be, in your paper, a little more discussion of how much function is being regarded as “function”. Since sequences can evolve by natural selection from less to more function, if the threshold for “function” is set high, then even if there were some way of assessing FI, it would give the mistaken impression that the function cannot be achieved by natural evolutionary processes.
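To make the threshold point concrete in Hazen et al.’s notation (restating the definition above):

[latex]\mathrm{FI}(E_x) = -\log_2 F(E_x)[/latex]

Since the fraction F(E_x) of sequences achieving activity E_x or better can only shrink as the threshold E_x is raised, FI grows monotonically with the threshold. A weak but selectable starting function can therefore carry far fewer bits than the mature function used in the calculation.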