Here, one of my brilliant MD PhD students and I study one of the “information” arguments against evolution. What do you think of our study?
I recently put this preprint on bioRxiv. To be clear, this study is not yet peer-reviewed, and I do not want anyone to miss this point. This is an “experiment” too: I’m curious to see if these types of studies are publishable. If they are, you might see more from me. Currently it is under review at a very good journal, so it might actually turn the corner and get out there. And a parallel question: do you think this type of work should be published?
I’m curious what the community thinks. I hope it is clear enough for non-experts to follow too. We went to great lengths to make the source code for the simulations available in an easy-to-read and annotated format. My hope is that a college-level student could follow the details. And even if you can’t, you can still weigh in on whether the scientific community should publish this type of work.
Functional Information and Evolution
http://www.biorxiv.org/content/early/2017/03/06/114132
“Functional Information”—estimated from the mutual information of protein sequence alignments—has been proposed as a reliable way of estimating the number of proteins with a specified function and the consequent difficulty of evolving a new function. The fantastic rarity of functional proteins computed by this approach emboldens some to argue that evolution is impossible. Random searches, it seems, would have no hope of finding new functions. Here, we use simulations to demonstrate that sequence alignments are a poor estimate of functional information. The mutual information of sequence alignments fantastically underestimates the true number of functional proteins. In addition to functional constraints, mutual information is also strongly influenced by a family’s history, mutational bias, and selection. Regardless, even if functional information could be reliably calculated, it tells us nothing about the difficulty of evolving new functions, because it does not estimate the distance between a new function and existing functions. Moreover, the pervasive observation of multifunctional proteins suggests that functions are actually very close to one another and abundant. Multifunctional proteins would be impossible if the FI argument against evolution were true.
Given that evolution is changes in gene frequencies in a population, your claims make no sense.
This is not my claim Mung. This is the claim that several ID and YEC leaders make. If you do not think it makes sense, you should go tell them.
Actually it is your claim. You referred me to your supplementary data. Do you mean the article by Kirk Durston? Is it your claim that Kirk Durston claims that evolution is impossible? Because that would be a poor reading indeed of what he wrote.
How would you explain his position? An “evolutionary search” for functional proteins is impossible? An “evolutionary process” that produces functional proteins is impossible? Or maybe this quote?
“The entire evolutionary capacity of the universe, if it were nothing but proteins recombining every second, is still more than 150 orders of magnitude too inadequate to expect to produce any of the 10^62 possible functional sequences for RecA.”
Or maybe this one?
“A more formal way to evaluate the hypothesis that biological proteins were obtained in a blind evolutionary process is to apply the Universal Plausibility Metric (UPM), where an hypothesis is definitively operationally falsified if the UPM for that hypothesis is less than 1.[11] For RecA occurring somewhere within the universe during its history to date, the UPM = 10^-142, which means that the hypothesis that it could be located by an evolutionary process is definitively falsified.”
How would you read these statements? I think “evolution is impossible” is a pretty close approximation.
What a waste of time you are, Mung. This kind of crap is why people don’t respect you.
We’ve got a scientist calling for discussion of his research. This post should be featured. We should leave our fighting to other threads.
^^^ Seconded!
Dr. Swamidass,
Kirk is a friend of mine. But that said, I think the calculations in the paper are worth discussing.
Take for example an orphan protein/gene in a species. Since there is nothing to align it to, what would its MISA be? Suppose the protein in question is an enzyme with an enzymatic active site and allosteric sites, with a good number of positions changeable. It would seem FI and MISA in this case would clearly not be the same. Is that correct?
Next, suppose a protein/gene is orphan to only two or a small number of species, say primate specific proteins. Suppose further they are 100% identical in all the species they are found in. As you said, this would report a high MISA if the possibility of recent common ancestry is not incorporated into the MISA calculation. MISA would leave out a necessary correction for similarity due to common ancestry rather than similarity due to function.
If what I just said makes sense, and if indeed you accurately represent the ideas from the ID side, then in that sense I think you raise valid concerns. But, as I’ve now said for years, ID proponents need to consider dropping information theory arguments. As an electrical engineer in a former life, I studied Shannon’s theorems in graduate communication classes. I’ve not seen good justification for incorporating Shannon-like theories into arguments for or against evolution. It just makes a mess of things. I’ve also pointed out what looks like the futility of estimating the active information in a selective search such as Scott Minnich’s faster evolution of the Cit+ mutant in 19 days vs. Lenski’s Cit+ in 15 years. Can any ID proponent give a credible measure of the active information in Minnich’s 19-day search relative to Lenski’s 15-year LTEE? Even if they did, would it yield any insights above what we already know?
That said, I don’t think evolutionary theory is a successful theory at all. A successful biological theory is something like Mitchell’s chemiosmotic hypothesis for ATP synthesis or Krebs’s discovery of the TCA cycle, both of which got Nobel prizes. Noticing the similarity between species has been around since the creationist Linnaeus. Similarity proves similarity; it doesn’t prove the feasibility of evolving complex function. One can assume common descent because of similarity, but it doesn’t mean novel features arise easily.
MISA does seem to identify the strength of conservation if the protein appears in all life forms.
Nice to see you, and God bless you.
In fairness to my ID colleagues, it would seem a charitable reading would allow for the assumption that they are analyzing deeply conserved proteins rather than orphans. In the case of orphans, cassette mutagenesis experiments like Sauer’s would act as a reasonable facsimile to create MISA measurements.
That said, I would argue that adding the layer of Shannon calculation is superfluous: a gratuitous and unnecessary insertion of information theory where it adds only confusion rather than clarity.
The concern about proteins with essentially the same function and similar tertiary structure but very different sequences seems valid. I seem to recall one study said proteins with the same function and tertiary structure can have as little as 12% sequence homology! So these proteins may not get picked up in a BLAST.
I discovered it while arguing with Rumraket and Allan Miller over Lysyl Oxidase in a thread about Stephen Meyer’s book, Darwin’s Doubt. I’m sorry now that I’ve lost the reference to the study! If I get around to it I may re-visit that discussion.
All the forms of the Lysyl Oxidase molecule won’t show up by blasting one gene or the protein sequence of one species, because of the lack of sequence homology, even though there is tertiary homology.
Thanks for the kind reception.
I’ve talked to Kirk too. He is a nice guy and none of this is directed at him personally. My issue is with the math, and the high certainty some have about him “disproving evolution”. I’m a theistic evolutionist, but I am fine with someone debunking evolution if it really is false. But if it really is false, it should not take incorrectly applied math to debunk it.
You are right, almost.
First, your calculations are exactly right. The reduced sampling will increase the measured MISA, and it will be much higher than FI. However, care is taken to make sure that enough sequences are sampled so that the MISA levels off, so this example would be thrown out. What I show in this paper is that when sequences have shared history (common descent) and we are just sampling extant sequences, then even if MISA levels off it is not a good estimate of FI. E.g. it takes hundreds of millions of years before the “memory” of the starting point is forgotten in mammalian evolution.
Essentially “history” is another source of shared information that is not accounted for if one just assumes that extant converged MISA = FI.
Once again, almost. Here too there are too few sequences for MISA to converge (see the figures in the paper), so these cases would be thrown out as well. One would have to choose proteins that are found in many more organisms.
MISA does measure strength of conservation (i.e. selection), but it also measures shared history and mutational deviations from uniformity. And the point at which selection (conservation) occurs for important proteins (e.g. DNA and protein replication) is most definitely NOT at the boundary of function/non-function. Rather it is at the boundary of high/low function (qualitatively speaking).
For all these reasons, and more, MISA is a really bad way to estimate FI. It just does not work. We need some way of subtracting out the strength of these other sources of information, but this appears to be a nearly unsolvable problem.
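To make the “history” effect concrete, here is a toy simulation sketch (my own illustration, far simpler than the simulations in the paper; the sequence length, mutation count, and family size are arbitrary choices). There is no function and no selection anywhere in it, yet the resulting alignment reports hundreds of bits of MISA, purely because the descendants still remember their common ancestor:

```python
import math
import random
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def misa(alignment):
    """MISA in bits: L * log2(20) minus the summed per-column entropy of P(a at i)."""
    L, n = len(alignment[0]), len(alignment)
    h_pssm = 0.0
    for i in range(L):
        for count in Counter(seq[i] for seq in alignment).values():
            p = count / n               # P(a at i): normalized residue count at column i
            h_pssm -= p * math.log2(p)
    return L * math.log2(20) - h_pssm

def mutate(seq, n_mutations):
    """Random point mutations; no selection and no functional constraint at all."""
    seq = list(seq)
    for _ in range(n_mutations):
        seq[random.randrange(len(seq))] = random.choice(AMINO_ACIDS)
    return "".join(seq)

random.seed(0)
L = 100
ancestor = "".join(random.choice(AMINO_ACIDS) for _ in range(L))

# A "family" of 200 extant sequences, each only a few mutations from one ancestor.
family = [mutate(ancestor, n_mutations=5) for _ in range(200)]
print(round(misa(family)), "bits")  # several hundred bits, from shared history alone
```

Sampling uniformly at random from sequence space instead of from a family would drive MISA to roughly zero, which is the sense in which shared history masquerades as “functional” information.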
One more contribution I noticed. I wonder if anyone else has noticed this too.
It seems that there is an attempt to coin this new term FSC. Elsewhere, FSC is equated with ASC. http://robertmarks.org/REPRINTS/2014_AlgorithmicSpecifiedComplexity.pdf
The part I find interesting is that ASC (as defined by Marks) and FSC (as defined in the paper I critique) are identical to mutual information. Usually, they compute mutual information under a uniform probability model, even though this is an incorrect model of evolutionary searches. There is a remarkable illusion that results from this proliferation of new terms to define old ideas. It seems like they are talking about something new, but they are not. Mutual information (even of the Kolmogorov sort) is a very well understood idea with a great deal of theory behind it.
All this theory ends up dismantling many of these information-based arguments. But by naming things differently and adjusting the variables, the connections are not obvious unless one is already very familiar with the formulas. For example, Marks argues that because Kolmogorov complexity is an upper bound (true!) his ASC estimate is a lower bound (false!). Hidden in his formulation is the critical importance of the background distribution, which they assume in biology is always uniform. In evolution this background distribution (sampling around existing sequences) is very hard to model analytically (at its core it is non-parametric), but it is easily going to be much, much more efficient than uniformly random searches in protein space.
In the standard formulation for mutual information (which is algebraically equivalent to ASC under conditions I’ll describe later if people are interested), the error he makes in asserting ASC is a lower bound becomes obvious. But because he has redefined the term and changed the notation, this is no longer obvious. It becomes clear that ASC is actually just an upper bound that is reduced as the sampling distribution is understood, and is therefore not useful for his argument.
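For those who want the algebra, the definition from the Marks paper linked above is, as I read it (restated in their notation; the gloss is mine):

[latex]\mathrm{ASC}(x, C, P) = -\log_2 P(x) - K(x \mid C)[/latex]

A compression-based estimate of K(x|C) is indeed an upper bound on K, so for a fixed P that subtraction yields a lower bound on ASC. But under the uniform assumption the first term, -log2 P(x) = L log2 20 for a length-L protein, is as large as it can possibly be. Any background distribution that reflects how evolution actually samples (concentrating probability near existing sequences) makes P(x) far larger than 20^-L and shrinks that term. So the computed number is not a lower bound on anything evolutionarily meaningful once the background distribution is in doubt.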
On that “16 days vs 15 years citrate evolution” stuff, here’s Zachary Blount:
https://telliamedrevisited.wordpress.com/2016/02/20/on-the-evolution-of-citrate-use/
Nobody does that. One concludes common descent due to nesting hierarchical patterns of similarity of entities known to imperfectly reproduce themselves (this directly predicts the pattern), not mere similarity.
One inherently yields a tree structure, and a tree structure is only predicted to exist from a stochastic evolutionary process of descent with independent modification; there’s no a priori reason to expect such a pattern under independent creation.
Would it though?
One could compute a MISA but this still would not equal FI.
Mutating a single protein would only sample from one family, not all the families with the function. There is no known way of estimating the number of additional families that should be sampled or for estimating their average size or their maximum size.
Also, how would one make sure the threshold for function was correct? Minimal function is very hard to detect experimentally but might be enough to initiate a positive-selection-driven optimization.
Hi Dr Swamidass, you may be interested in a post I made last year discussing these issues (actually, being a bit sarcastic about them). Kirk Durston was kind enough to comment (and made me regret my tone somewhat!). But the main thrust is the unreliability of using a set of surviving and commonly descended sequences as if it were an unbiased sample of an entire space.
ISTM that evolution would have to be untrue, and proteins neither genetically related nor subject to selection, before this method of doubting evolution could gain traction.
stcordova,
And one valid interpretation of this fact of ‘conservation’ of structure while losing all sequence signal is the enormous interconnectedness of protein space. Because amino acids are not 20 completely different things, bitwise substitution can proceed to the extreme of random alignment while the structure stays intact throughout, a Cheshire Cat smile constraining substitution, but not absolutely.
Yup that is exactly right. Good post. And Kirk is a nice guy. I’ve felt bad for my tone with him more than once.
Good luck with that. Sal ignored me twice when I pointed to those replay experiments; he just keeps repeating the same crap.
Sal is confused by word ladder?
Thanks for pointing to that thread; it gave Kirk a chance to respond. Kirk’s a friend, and while I think Dr. Swamidass raises important points, I don’t think Kirk’s (and the rest of the IDists’) ideas are irredeemable.
I think trying to evaluate an entire protein at once is challenging. It might be better to decompose the problem into individual domains and any necessary secondary structures.
The individual domains are probably a more tractable route.
Furthermore, I think sampling the tertiary space (at least computationally) is far more important than the sequence (primary) space. Then some probabilities might be more accurately arrived at.
One example of a domain:
https://en.wikipedia.org/wiki/Zinc_finger
Would we say that we need at least 4 residues (2 Cysteine and 2 Histidine) in the right place in a tertiary structure to make a zinc finger? That’s at least 8 bits (though I cringe having to convert it to bits, since that is superfluous and gratuitous). That’s not 150 bits, but OK, this is a start.
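As a quick sanity check on that figure (my arithmetic, under the crude assumption that each of the four coordinating residues must be exactly one specified amino acid out of 20, ignoring where they fall in the chain):

[latex]4 \times \log_2 20 \approx 4 \times 4.32 \approx 17.3 \text{ bits}[/latex]

So “at least 8 bits” is a safe lower bound, and allowing conservative substitutions at those positions would pull the figure back down toward it.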
stcordova,
It would be both much better and not nearly good enough! When you have domains, and a mechanism by which a domain may migrate within a protein, or be donated to a completely different one, you aren’t really any closer to having a sensible handle on the origin of the domain. For a given domain in a given peptide, its origin was almost certainly outside of the protein in which you are currently looking.
Any method you use by ‘sampling’ spaces or subspaces using extant sequences suffers from the same problem – the strong biases imparted by selection and common descent on your data set.
Yep, using bits is a bit fingers-on-a-blackboard for me. We can do so because these are constructed as linear polymers, so have the potential to be analysed as strings. But the strong interactions between ‘bits’, and the way changes affect the lability of subsequent amendments, makes bit-analysis a long way from satisfactory as a representation of possibility. Even worse when people insist that 20 acids == 20 different symbols (valid for Cys, Pro, maybe His, but not much else).
Agreed. Thank you for your response.
Joshua,
I’m stuck on Equation 2. Is P(a at i) a conditional probability, or is it a joint probability?
1. I don’t see the sense in treating the sequence index as a random quantity.
2a. If P(a at i) is the conditional probability of amino acid a given that the index is i, then the first term on the right-hand side has got to be the entropy H(A) of the random amino acid, which I’m denoting A. But I don’t get it.
2b. If P(a at i) is the joint probability of amino acid a and index i, then the first term has got to be the sum of entropy values H(A) and H(I), where I denotes the random index. But I don’t get it.
Thanks for looking at the math Tom.
P(a at i) is neither the conditional probability nor the joint probability. It should be read as “the probability that we find amino acid a at position i in the alignment.” This is just the normalized count of the amino acid at that position in the alignment.
The double summation (with the right sign) is just the entropy of the PSSM. The log term outside the summation is the entropy of a uniform distribution of amino acids in the PSSM (i.e. the maximum entropy possible). So we have maxent entropy minus measured entropy = mutual information according to a uniform distribution. And this is just copying the original formula they use in a notation that is easier (for me) to follow. You can see the relationship between MI and these entropies (H) here: https://en.wikipedia.org/wiki/Mutual_information#Relation_to_other_quantities
Though looking closer at your definition, this might be equivalent to how you are defining conditional probability. Though I do not usually think of it that way, because i is not really a random variable; we store the PSSM in an array that just lets us index it.
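A tiny worked example, with toy numbers of my own: take the three-sequence alignment AC, AC, AD (so L = 2). Column 1 is fully conserved and contributes zero entropy; column 2 has P(C at 2) = 2/3 and P(D at 2) = 1/3. Then

[latex]H_{\text{measured}} = -\tfrac{2}{3}\log_2\tfrac{2}{3} - \tfrac{1}{3}\log_2\tfrac{1}{3} \approx 0.92 \text{ bits}, \qquad \mathrm{MISA} = 2\log_2 20 - 0.92 \approx 7.72 \text{ bits}[/latex]

Exactly the maxent-minus-measured-entropy recipe above.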
Sorry. I’m familiar with information theory, but I can’t make sense of this. I see now that L log 20 is the entropy of a uniform distribution on the space of all length-L sequences over the set of 20 amino acids. Let’s write U for the random variable with uniform distribution of probability on that space. Then we’ve got H(U) minus what? It looks like H(X), where X = (X_1, …, X_L) is a random sequence over the amino acids, and the random variables X_1, …, X_L are mutually independent. Now, the relative entropy (Kullback-Leibler divergence) is
D(X || U) = H(U) - H(X).
This seems to match your Equation 2. I do not see a match to any of the following expressions of mutual information:
(1) I(X; Y) = H(X) - H(X | Y) = H(Y) - H(Y | X) = H(X) + H(Y) - H(X, Y)
(Note: I put the LaTeX source between tags [latex] and [/latex]. Also, to suppress markdown, put backquotes before and after the [latex] region. It’s hard to fix problems in the comment editor, so I compose in my usual environment and copy-and-paste to the comment.)
Joshua,
I’ve now taken a quick look at Durston et al. I don’t find references to mutual information, relative entropy, or divergence in the paper. It seems that they stick with “change in functional uncertainty.” So I claim that when the “null state” distribution is uniform, the difference in entropy is the relative entropy (Kullback-Leibler divergence).
I’d read them in context. And I’d try to accurately represent his actual views. And if I had doubts I might even send him an email and ask.
Durston himself admits novel functions can be achieved by an evolutionary process. To put it another way, he doesn’t claim that evolution is impossible.
By the way, isn’t that last subscript a typo? Looking at what comes immediately before, I think the last term should have a different subscript.
Tom, thanks for looking at this.
I should emphasize two things. (1) There are multiple ways of deriving this formula. (2) They do not use the term “mutual information” but invent the term FSC, even though the formula is the same. So I am just pointing out that the formula is exactly the same as MI, and this is the term I use.
To derive this, I am using the starting identity:
I(PSSM, FUNC) = H(PSSM) – H(PSSM | FUNC)
Where PSSM is the position-specific scoring matrix, and FUNC is the presence of a specified function. Now I just compute the entropy of these things assuming the PSSM follows a uniform distribution when not conditioned on function.
H(PSSM) = -L * 20 * (1/20) * log(1/20)
= -L log(1/20)
And then we compute the entropy of the PSSM of sequences with a given function.
H(PSSM | FUNC) = – sum_i sum_a P(a at i) log P(a at i)
And now this gives the MI of a sequence alignment, MISA…
I(PSSM, FUNC) = -L log(1/20) + sum_i sum_a P(a at i) log P(a at i)
So that is what they are computing when they compute FSC. It is algebraically equivalent to MI with a maxent prior distribution, and requires no new term or theory to describe.
This computation computes the number of “bits” of information the members of the family share in common. Because they all have common function, the intuition is that this common information is what defines the function. The intuition makes sense, but is only correct if (1) there is only one cluster of functional sequences and (2) we are uniformly randomly sampling sequences. Both these assumptions are false, so MISA is not a good estimate of FI.
As I demonstrate in the simulation, MISA also encodes information about the sampling distribution, which is not uniform, and is strongly shaped by common ancestry, selection, and mutational bias.
Does that make sense?
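Two limiting cases (a sanity check of my own) show the formula behaving as intended:

[latex]H(\mathrm{PSSM} \mid \mathrm{FUNC}) = 0 \;\Rightarrow\; \mathrm{MISA} = L \log_2 20 \quad \text{(perfectly conserved alignment)}[/latex]

[latex]H(\mathrm{PSSM} \mid \mathrm{FUNC}) = L \log_2 20 \;\Rightarrow\; \mathrm{MISA} = 0 \quad \text{(columns uniform over all 20 residues)}[/latex]

The trouble is everything between these extremes, where history, mutational bias, and selection contribute alongside function.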
I should also clarify that this, essentially, is what they assert.
But I show in the simulations that they are instead computing:
H(PSSM) – H(PSSM | FUNC, MUTATIONAL_BIAS, HISTORY, SELECTION)
Or rather,
I(PSSM; FUNC, MUTATIONAL_BIAS, HISTORY, SELECTION)
They have to parse out each of these independent contributions to get an estimate of I(PSSM, FUNC), but they do not seem to realize this is even an issue.
This is one reason I strongly disfavor the term “FSC” to describe MISA. That tries to settle the scientific question with a definition. It makes the whole conversation confusing. That is why I fall back to the neutral term MISA, and ask if MISA is a good estimate of FI. It is not.
And one final thought. One step forward would be to estimate:
I(PSSM, FUNC) ≈ H(PSSM | MUTATIONAL_BIAS, HISTORY, SELECTION) - H(PSSM | FUNC, MUTATIONAL_BIAS, HISTORY, SELECTION)
To do this, one would have to explicitly model mutation, history, and selection, and their influence on the entropy of the PSSM, and then compare this with the observed PSSM. I present one way of doing that to account for history. However, doing this for selection appears intractable. And this comes with its own problems: assuming that local sampling gives a good estimate of the global space (which is false) and that there is only one family (which is also false).
So while that is the way forward, I do not think it is possible to tractably compute this. Regardless, FI tells us nothing about evolvability anyway.
Which I have done! I have demonstrated here that mindless natural processes produce arbitrarily high MISA, which is his way of computing FI. There are only two ways forward: (1) I have demonstrated FI is easily generated by mindless processes, or (2) MISA is a horrible estimate of FI. Pick your poison.
swamidass,
I’m not agreeing with Durston. Just pointing out that Mung is missing Durston’s obvious point — a point to which you are ably responding.
I know. I was mainly “responding” to Durston’s challenge.
I’ll have to come back to this later — probably not until tomorrow. But scanning over what you’ve written, I have a question: Is your Equation 2 equivalent to their Equation 6, or is there something else going on?
Moved a comment to guano. Noyau is available for insulting.
ETA: and a comment quoting the guano’d comment, for continuity.
The guanoed comments can be found here.
keiths,
And here
swamidass:
Andreas Wagner’s book Arrival of the Fittest discredits those two assumptions in a very readable way, which is why I frequently recommend it to ID proponents.
For example, here Wagner discusses how proteins having the same fold and function form vast networks rather than small clusters:
This is a bit tricky to relate. The real question is whether all these globins can be aligned. They might, and then this story would not invalidate the MISA approach (on its own). As nice a qualitative picture as it paints, it is not quantitative.
An important point here is that “small” clusters is very relative. One of the “small” clusters that Kirk calculates is “just” about 10^60 members. That is a gigantic number. The thing is that this has to be compared with a much, much larger sequence space: 20^150 (about 10^195) for proteins 150 aa long. Given this, it is entirely possible that a “small” cluster can produce the expansive network that Wagner describes.
What I like about Kirk’s work is that he is moving beyond these sorts of stories to try and compute something from the data. I affirm that entirely. The problem, though, is that his number is not telling him what he thinks it is.
I should add that PZ Myers totally failed in his debate with Kirk for this reason, and also in the follow-up blog post. As smart a guy as he is, he did not understand the math here.
swamidass,
A longer reply later, but for now I’ll note that a network extending 90% of the way across sequence space isn’t “a small cluster”, regardless of how many members it contains relative to the whole.
In other words, “small” is a function of extent as well as volume. There’s an enormous evolutionary difference between 10^60 members crammed into a hypersphere vs 10^60 members spread out across 90% of the diameter of sequence space.
If sampling were random and uniform, then extent wouldn’t matter; only volume would. But as you point out, the assumption of random uniform sampling is entirely bogus. Evolution doesn’t work that way.
I often ask people like KF, when they talk about this sort of thing, what process is doing the sampling. So far I’ve not yet had an answer, obviously because when they stop and think about it they realize there is no “random sampler” module that pokes about in sequence space randomly on the off chance of a hit.
Yet, somehow, that has never stopped their claims. Hopefully they’ll read this thread 🙂
OMagain,
Yes, my own response to the generalised, simplified v^n analysis of protein space has crystallised into wondering how people imagine such an implicit independent-probability sampler to be implemented. KF is big on the ASCII gambit – as if all string spaces have the same properties.
You got it. Nothing else going on.
I’m using the same formula, but using the standard name for it. There is no reason to coin a new term, so I do not use FSC. Like I said, the term FSC (Functional Sequence Complexity) sort of assumes the conclusion by using the word “Functional”. However we label it, this is just a number arrived at by a formula. We actually have to test if it measures what we think it does. The theoretical analysis alone is not enough. So the real question: does MISA (i.e. FSC) approximate FI? The answer is “no”.
Here are a couple more quotes from the article I am critiquing, for those that doubt my representation of the claims being made here…
QUOTE: “The idea that a Darwinian process can produce the kind of functional information required to code for the average functional protein family, not to mention all of biological life, is a popular one, but when actually tested against real data, is definitively falsified.”
Let’s leave aside that “Darwinian processes” is a strange limitation. The key thing is that a whole range of processes, including neutral evolution and biophysics, produces this amount of FI (as measured by MISA).
QUOTE: “For RecA occurring somewhere within the universe during its history to date, the UPM = 10^-142, which means that the hypothesis that it could be located by an evolutionary process is definitively falsified. On the other hand, RecA requires only 832 Fits (Functional Bits) of information to sequence, a quantity of functional information that intelligence can easily generate.”
That last comment I find really interesting. Where is the data that demonstrates that intelligence can produce complex functional proteins from scratch? This is a major open problem in science that no mind we know of can solve. The fact that we cannot solve this problem with our intelligent effort (and the closest we get is with random screening and massively parallel simulations) should make us doubt that logic greatly.
Though to be clear, I am a theistic evolutionist. I believe God created us through evolution, so in an ultimate sense I do think this is a product of His “Mind.” Nor do I think that science can rule out that God proximately produced this. Though, I do not think it makes much sense to reduce God to “intelligence.” I put this here so that hopefully my critics will see that I am not arguing against God at all. I’m just pointing out what looks, to me, like incorrect math. If God created us, which I believe He did, I am not sure what need He has of incorrect math.
swamidass,
How would you claim that neutral mutation or biophysics can help work through the large sequence space and find function?
I think your argument that MISA dramatically understates the amount of function available is what needs to be validated.
If there is more available function, is there enough that the law of information entropy won’t rapidly move the sequence to non-function? Right now the DNA repair mechanism is what avoids this problem in living organisms. Maybe it should be part of the model.
This is a very interesting comment. The question is whether the origin of biological information can be understood based on scientific principles, or whether it is like gravity, electromagnetism, and the strong and weak nuclear forces, where the cause appears to be out of reach of science. Humans can create sequences, but as you state, not of the complexity of many biological sequences.
I’m sure Kirk would agree with you that “without God, evolution is impossible.”
Why?
One of the notes I made on your manuscript was that you needed to add some brief quotations, if there were any, to show that your interpretation of Durston et al. is correct.
Dr. Swamidass, thanks for making your very nice paper available. Sorry to be late to this discussion — was busy with my “day job”.
As you note here, a good part of your argument is showing that a sequence alignment of a gene family that all carry out a function is not the same as a random sample from all proteins that can carry out that function as well. This is for the obvious reason, discussed by you and your co-author, that the sequences in the alignment are relatives, found by local “exploration” and not by a global search for all functional proteins.
Functional Information is related to Leslie Orgel’s “Specified Information” and to the original use of Complex Specified Information by William Dembski. As FI is used by Hazen et al., it is, IIRC, -log2(F), where F is the fraction of all sequences that achieve as good a function, or better.
Your formulas for FI use the fraction of all sequences that have the function. Later you distinguish between lower- and higher-functioning sequences. I would think that there should be, in your paper, a little more discussion of how much function is being regarded as “function”. Since sequences can evolve by natural selection from less to more function, if the threshold for “function” is set high, then even if there were some way of assessing FI, it would give the mistaken impression that the function cannot be achieved by natural evolutionary processes.
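To make the threshold point concrete in Hazen et al.’s notation (restating the definition above):

[latex]\mathrm{FI}(E_x) = -\log_2 F(E_x)[/latex]

Since the fraction F(E_x) of sequences achieving activity E_x or better can only shrink as the threshold E_x is raised, FI grows monotonically with the threshold. A weak but selectable starting function can therefore carry far fewer bits than the mature function used in the calculation.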