Evolution and Functional Information

Here, one of my brilliant MD PhD students and I study one of the “information” arguments against evolution. What do you think of our study?

I recently put this preprint on bioRxiv. To be clear, this study is not yet peer-reviewed, and I do not want anyone to miss this point. This is an “experiment” too. I’m curious to see if these types of studies are publishable. If they are, you might see more from me. Currently it is under review at a very good journal. So it might actually turn the corner and get out there. And a parallel question: do you think this type of work should be published?

 

I’m curious what the community thinks. I hope it is clear enough for non-experts to follow too. We went to great lengths to make the source code for the simulations available in an easy-to-read, annotated format. My hope is that a college-level student could follow the details. And even if you can’t, you can weigh in on whether the scientific community should publish this type of work.

Functional Information and Evolution

http://www.biorxiv.org/content/early/2017/03/06/114132

“Functional Information”, estimated from the mutual information of protein sequence alignments, has been proposed as a reliable way of estimating the number of proteins with a specified function and the consequent difficulty of evolving a new function. The fantastic rarity of functional proteins computed by this approach emboldens some to argue that evolution is impossible. Random searches, it seems, would have no hope of finding new functions. Here, we use simulations to demonstrate that sequence alignments are a poor estimate of functional information. The mutual information of sequence alignments fantastically underestimates the true number of functional proteins. In addition to functional constraints, mutual information is also strongly influenced by a family’s history, mutational bias, and selection. Regardless, even if functional information could be reliably calculated, it tells us nothing about the difficulty of evolving new functions, because it does not estimate the distance between a new function and existing functions. Moreover, the pervasive observation of multifunctional proteins suggests that functions are actually very close to one another and abundant. Multifunctional proteins would be impossible if the FI argument against evolution were true.

216 thoughts on “Evolution and Functional Information”

  1. swamidass,

    He claims to have “falsified” evolution with his FI argument. It is the other way around, I have falsified his argument.

    What argument of Kirk’s did you falsify? What was the method you used to falsify it?

  2. Mung: Evolutionary models to date point strongly to the necessity of design. Indeed, all current models of evolution require information from an external designer in order to work. All current evolutionary models simply do not work without tapping into an external information source.

    – Marks, Dembski, Ewert, and Humble

    In this brilliant insight, Dembski, Marks, Ewert and Humble have discovered that all simulations of evolution that were programmed on computers, were intentionally programmed to be simulations of evolution.

    It should be needless to say, but what matters is of course the actual process that takes place in the simulation, not the fact that the simulation was programmed (all simulations are, that’s unavoidably the case with simulations, otherwise they wouldn’t be simulations).

    Or as others have pointed out, what they’re basically saying is that for evolution to work, there has to be a world of some sort with laws of physics and so on (these laws are then what Dembski et al. claim is the “information” that has been put there beforehand). But then it becomes obvious that their argument is begging the question.

    People who fall for their argument and can’t instantly see through it, are idiots. Or so blinded by confirmation bias and tribalism it borders on mental illness.

  3. This dozy notion has been done to death. Simulations of evolution can be built without the programmer ever knowing what the run time parameters causing differential genotype retention are going to be. You can get a dog to pick ’em, or – as Dawkins mused – bees.

  4. colewd: What argument of Kirk’s did you falsify? What was the method you used to falsify it?

    He claims that the generation of sequences with high amounts of MISA (he calls it FSC) is a unique capability of a mind. I demonstrate that simulations of several mindless processes can generate sequences with very high MISA. Therefore, generation of sequences with high MISA is not a unique capability of a mind.

  5. Dr. Swamidass:

    Glad to see that you’re still checking in. I have higher-level issues to discuss with you (in my own sweet time).

    The biggest defect in your paper, and also your notebook, is very big indeed: you do not specify biological models. For whom are you writing? I say that you should write first for biologists, and second for scientists in general. If you do a good job of the latter, then you will reach a large number of scientifically literate laypeople (I’m thinking of some papers I have read in PNAS). You should never sacrifice the accuracy of a scientific paper to “readability.” I put “readability” in scare quotes because misreporting your work, in order to give general readers the impression that they understand it, is not increasing the readability of a report of your work. You have in fact misreported your work, and I prefer to suppose that you and your brilliant student did so in a misguided attempt to make the paper “readable,” rather than by incompetence.

    There should have been no need for me to work at inferring a model. You’ve provided a fine example of how poorly code serves as a specification of a model. (This matter is much on my mind, because the fundamental error of Marks, Dembski, and Ewert is to mistake the simulation of an evolutionary process for the evolutionary process itself.)

    In the first simulation, as mutation rate increases to several events per amino acid, the substitution rate approaches 100% and the MISA of the extant sequences approaches zero bits. This is only true because the simulation uses a mutational model that uniformly substitutes amino acids.

    Taking this as a characterization of your model, as I of course should have done, I observed that your implementation of the model is incorrect. Rather than apologize for your misreporting — no, the passage is not just poorly worded — you turned defensive and evasive. You have since told me about a part of the model you have in mind, and have spoken vaguely of something or another not making much of a difference. The exchange we’ve had is repugnant to me, and I regret my involvement. I’m hoping for better in the future.

    Under the disgusting assumption that the code implements whatever model it is that you have in mind, the amino acid substitution rate approaches 95% from below, as the parameter \lambda of the Poisson distribution (the expected number of mutational events per amino acid?) goes to infinity. When the expected number of base mutations is \lambda=3, the substitution rate is only 90%. Please explain the biological relevance to me. What process are you modeling? What is the model? Please do not tell me how simulators commonly work. Please do not tell me about an approximation without describing (1) the quantity that you’re approximating, and (2) the circumstances under which the approximation is good.
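The two figures quoted above (a limit of 95%, and about 90% at \lambda=3) can be checked with a short calculation. This is a minimal sketch, assuming the substitution rate is 19/20 * (1 - exp(-rate)), the formula that appears in the code docstring posted later in this thread. The function name `substitution_rate` is illustrative only.

```python
import math

N_ACIDS = 20

def substitution_rate(lam):
    """Probability that a site differs from the ancestral amino acid.

    Assumes at least one Poisson(lam) mutational event must hit the site,
    and that the outcome is then uniform over all 20 amino acids, so it
    differs from the ancestral one with probability 19/20.
    """
    return (N_ACIDS - 1) / N_ACIDS * (1.0 - math.exp(-lam))

for lam in (0.5, 1.0, 3.0, 10.0):
    print(f"lambda={lam:5.1f}  substitution rate={substitution_rate(lam):.4f}")
```

Under this assumption the rate can never exceed 19/20, however large the expected number of mutational events becomes.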

  6. colewd: What argument of Kirk’s did you falsify? What was the method you used to falsify it?

    First he devised a scientific method to distinguish a guided process from an unguided process. He should have published that paper first as it lays the foundation for the present paper. Cart before horse and all that.

  7. swamidass: I demonstrate that simulations of several mindless processes can generate sequences with very high MISA.

    How do you measure “mindlessness” or “mindedness”, scientifically?

    IOW, how did you establish that the processes you are simulating are in fact mindless processes? Just assume it? Do they take place in parts of the universe where God isn’t present?

  8. swamidass: He claims that the generation of sequences with high amounts of MISA (he calls it FSC) is a unique capability of a mind. I demonstrate that simulations of several mindless processes can generate sequences with very high MISA.

    In the model Durston et al. define, evolution increases MISA. (Whether they abandon the model, later in the paper, I don’t yet know.) In whatever model it is that you have in mind, drift (the term you use in your notebook, but not in your paper) decreases MISA.

    I’ve brought this up before. I see no sense in referring to reduction in MISA as generation of MISA.

  9. Tom,

    Now you might check to see what degrees UC-Irvine awards.

    Why? You’re the one asking the question, not me.

  10. Mung: Is this the same Dr. Swamidass that thinks that information = entropy?

    I don’t think much of your ignoring his qualifying remarks. But I think even less of this:

    [Swamidass:] As a quick introduction, my field of study is computational biology, with specific expertise in applying information theory to biology. My PhD is specifically in “Information and Computer Science”, with emphasis (in my case) on information.

  11. Tom,

    You were replying to me when you wrote this:

    Now you might check to see what degrees UC-Irvine awards, as I did after checking his vita.

    Hence my response:

    Why? You’re the one asking the question, not me.

    This is not difficult.

    Not sure why you think I am defensive. I’m thankful for your comments and happy to answer your questions. You have already identified a couple of confusing points and implementation details I plan to change. That is the opposite of defensive =).

    Tom English: Under the disgusting assumption that the code implements whatever model it is that you have in mind, the amino acid substitution rate approaches 95%

    I see your point now. That is an error in the text. My error. It should be the number of amino acids mutated goes to 100%. Thanks for that catch.

    Tom English: In the model Durston et al. define, evolution increases MISA. (Whether they abandon the model, later in the paper, I don’t yet know.) In whatever model it is that you have in mind, drift (the term you use in your notebook, but not in your paper) decreases MISA.

    I’ve brought this up before. I see no sense in referring to reduction in MISA as generation of MISA.

    I think you misunderstand the point of the simulation. This is probably my fault, as you are clearly reading the paper carefully (thanks for that). I’ll do some work to rewrite that part. Let me try here…

    Durston is calculating the MISA of extant sequences. This is important. So, we can ask how much MISA is computed from the extant sequences in a generative model of sequences that approximates evolution.
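For readers who want to see what "the MISA of extant sequences" amounts to in practice, here is a minimal sketch. It assumes MISA is the sum over alignment columns of log2(20) minus the empirical entropy of the column, which is what the code posted later in this thread computes. Gaps and sequence weighting are ignored, and the names `column_entropy` and `misa` are illustrative.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
H_MAX = math.log2(len(AMINO_ACIDS))  # about 4.32 bits per site

def column_entropy(column):
    """Empirical Shannon entropy (bits) of one alignment column."""
    n = len(column)
    return -sum(c / n * math.log2(c / n) for c in Counter(column).values())

def misa(alignment):
    """Sum over columns of (maximum entropy - empirical entropy), in bits."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all sequences must have the same length")
    return sum(H_MAX - column_entropy([seq[i] for seq in alignment])
               for i in range(length))

# Toy alignment: two fully conserved columns, two columns split 50/50.
alignment = ["ACDA", "ACEA", "ACDG", "ACEG"]
print(misa(alignment))
```

A conserved column contributes the full log2(20) bits, and a column split evenly between two residues contributes one bit less.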

    What I am modeling is several simplified models of evolution, to generate extant sequences. Each one tests the effect of different details of evolution. That is why they increase in complexity in a stepwise way. I am not claiming these are 100% accurate simulations, but that they are good approximations. Anyone who doubts this could implement the less efficient but more accurate simulations to see the results are nearly equivalent.

    And yes, in the most simple model (uniform) after enough time has gone by, MISA equals zero. But how much time is that? Quite a bit. But that is not the situation we are dealing with here with Durston’s calculation. He is not computing the MISA of extant sequences as a function of time. He is just taking the sequences we have now and using those. Each of these families has a different age, which corresponds (loosely) to a certain amount of mutation.

    But in the more complicated models, MISA does not go to zero with time anyway. So even if there is a very old family, there is still going to be a non-zero MISA.

    You are seeing MISA go down as a function of time, and I agree. However, that is not the question at hand. The question is whether MISA can be substantially high at any point in the simulation. It can.

    Tom English: Is your terminal degree the Ph.D. in Informatics, awarded by UC-Irvine?

    Not sure what to make of this line of questions. As you can see, my CV is on the web freely available for all to see. I hope this is just an innocent question and not accusatory.

    Mung: Is this the same Dr. Swamidass that thinks that information = entropy?

    Yup that is me. Though I cannot take credit for that insight. That is straight from Shannon.

  13. swamidass: It should be the number of amino acids mutated goes to 100%. Thanks for that catch.

    I cannot make any sense of this. Did you mean to write “bases” instead of “amino acids”? If so, I ask if there’s any case in which mutation of all three bases in a codon yields a synonymous codon. My (perhaps faulty) recollection is that there is not.

    More importantly, when the amino acid substitution rate is 1, the MISA of the set of extant proteins does not converge to zero as the size of the set increases. The substitution rate of 19/20 is indeed the rate that maximizes the empirical entropy, and hence minimizes the MISA, in the limit.
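The claim about 19/20 can be checked numerically. The sketch below assumes a single site whose ancestral amino acid survives with probability 1 - s and is otherwise replaced by one of the 19 others uniformly at random; `site_entropy_bits` is an illustrative name, not something from the paper or notebook.

```python
import numpy as np

N_ACIDS = 20

def site_entropy_bits(s):
    """Entropy (bits) of a site: ancestral with prob 1 - s, others s / 19 each."""
    p = np.full(N_ACIDS, s / (N_ACIDS - 1))
    p[0] = 1.0 - s
    p = p[p > 0]  # drop zero-probability outcomes before taking logs
    return float(-np.sum(p * np.log2(p)))

# Scan substitution rates on a fine grid and locate the entropy maximum.
rates = np.linspace(0.0, 1.0, 2001)
entropies = [site_entropy_bits(s) for s in rates]
best = rates[int(np.argmax(entropies))]

print("rate maximizing entropy:", best)
print("entropy at that rate   :", site_entropy_bits(best))
print("entropy at rate 1      :", site_entropy_bits(1.0))
```

At s = 19/20 the site distribution is uniform over all 20 amino acids, so the entropy reaches log2(20). At s = 1 the ancestral amino acid never appears, the entropy is only log2(19), and the per-site MISA stays above zero.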

    You distinguished sampling with and without replacement, in a comment above. I got the impression that you were saying that there could be multiple mutations of a single base in a triplet. That doesn’t make sense in reproduction of an organism. But I’m sure that you’re modeling something else, though I don’t know what that something else is.

  14. swamidass: Durston is calculating the MISA of extant sequences. This is important. So, we can ask how much MISA is computed from the extant sequences in a generative model of sequences that approximates evolution.

    Sequences are generated. MISA is not generated.

  15. swamidass: What I am modeling is several simplified models of evolution, to generate extant sequences. Each one tests the effect of different details of evolution. That is why they increase in complexity in a stepwise way. I am not claiming these are 100% accurate simulations, but that they are good approximations. Anyone who doubts this could implement the less efficient but more accurate simulations to see the results are nearly equivalent.

    Would you please identify the “more accurate simulations.” Perhaps then I could answer the question simulation of what? I’m looking for an answer a bit less broad than evolution.

    If your objective was to produce an efficient simulation, then you failed. The first thing I tossed off is vastly more efficient than what you’re using. When I thought a bit about what I was doing, I sped my code up by a factor of 50. The reason I bothered was that I wanted to compare Bernoulli and Poisson mutations for a bunch of runs (do statistical tests) to make sure that I had gotten things right — which I had. The trick is to represent amino acids as unit vectors. Then a sum of amino acids (vectors) in position i in the alignment set is the frequency distribution of amino acids in that position. If you’re going to do more simulations of this sort in Python, then it’s a very good trick to know.

    time m=sample1_hacked(1, 150, 10**4)
    CPU times: user 848 ms, sys: 43.5 ms, total: 892 ms
    Wall time: 892 ms
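The unit-vector trick described above can be illustrated in isolation. This is a small sketch under the assumption that amino acids are stored as integer codes 0..19 in an array of shape (length, sample_size), as in the code later in the thread. The variable names are illustrative.

```python
import numpy as np

N_ACIDS = 20
AA_VECTORS = np.eye(N_ACIDS, dtype=int)  # row k is the unit vector for amino acid k

rng = np.random.default_rng(0)
length, sample_size = 5, 1000
sample = rng.integers(0, N_ACIDS, size=(length, sample_size))

# Fancy indexing replaces every code with its unit vector:
# the result has shape (length, sample_size, N_ACIDS).
one_hot = AA_VECTORS[sample]

# Summing the unit vectors over the sequences gives per-position counts.
counts = one_hot.sum(axis=1)

# A cumulative sum gives the counts for every prefix of the sample at once,
# that is, for samples of size 1, 2, ..., sample_size.
running_counts = one_hot.cumsum(axis=1)

# Cross-check against a straightforward per-position tally.
expected = np.array([np.bincount(row, minlength=N_ACIDS) for row in sample])
print(np.array_equal(counts, expected))
```

The cumulative sum is what makes it cheap to get MISA for every sample size in one pass, at the cost of holding a (length, sample_size, 20) array in memory.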

  16. Look let’s start from the beginning here on the uniform simulation.

    Imagine a single ancestral sequence at a point in time when 1,000 species diverge. We are going to take that single protein sequence and see how it drifts in each of these 1,000 species.

    First pick (1) an ancestral sequence of a certain length L, (2) a single amount of time T and (3) a mutation rate per amino acid per unit time M. The average number of mutations we expect over this time is L * T * M.

    Now, to generate an extant sequence for the ancestral sequence…

    1. Initialize the extant sequence to the ancestral protein sequence.
    2. Let X be one sampled value from a Poisson distribution with rate L*T*M to determine how many total mutations will happen in the history of this sequence.
    3. Now, uniformly at random, pick one amino acid in the sequence, and mutate it to another amino acid. You can ensure the mutation is to a new amino acid. But you must make sure that reversion to the ancestral amino acid is possible. An amino acid must be able to be picked multiple times, and after it is mutated once, it can be mutated back to the first amino acid again.
    4. Repeat #3 a total of X times.
    5. Return the extant sequence.

    Now repeat this 1,000 times to generate a large number of extant sequences. Compute the MISA of this collection of sequences. Now, if MISA is a good estimate of FI, this number should always be very close to zero. In this simulation, there are no functional constraints. All sequences would work. You will find that it is not.
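The procedure above can be sketched directly, without any of the shortcuts discussed elsewhere in this thread. This is a minimal illustration, not the code from the paper's notebook. It reads the steps as "sample X once, then apply X single-site mutations", encodes amino acids as integers 0..19, and uses illustrative names (`evolve`, `misa_bits`) and parameter values.

```python
import numpy as np

N_ACIDS = 20
H_MAX = np.log2(N_ACIDS)

def evolve(ancestor, expected_mutations, rng):
    """Apply a Poisson-distributed number of single-site mutations to a copy."""
    seq = ancestor.copy()
    n_mutations = rng.poisson(expected_mutations)
    for _ in range(n_mutations):
        pos = rng.integers(len(seq))  # any position, possibly one hit before
        # Shifting by 1..19 modulo 20 always yields a different amino acid;
        # a later hit on the same site can still restore the ancestral one.
        seq[pos] = (seq[pos] + rng.integers(1, N_ACIDS)) % N_ACIDS
    return seq

def misa_bits(sequences):
    """Sum over columns of log2(20) minus the empirical column entropy."""
    total = 0.0
    for column in sequences.T:
        freq = np.bincount(column, minlength=N_ACIDS) / len(column)
        freq = freq[freq > 0]
        total += H_MAX + np.sum(freq * np.log2(freq))
    return total

rng = np.random.default_rng(0)
L, T, M = 150, 1.0, 1.0  # illustrative values only
ancestor = rng.integers(0, N_ACIDS, size=L)
extant = np.array([evolve(ancestor, L * T * M, rng) for _ in range(1000)])
print(misa_bits(extant))
```

Since no functional constraint appears anywhere in this sketch, any MISA well above zero comes from shared ancestry alone.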

    It is clear that you do not trust the approximations we make in the supplementary data. You can build your own simulation that directly implements this algorithm if you want. But the key thing is that the Bernoulli-based simulation you propose does not approximate this algorithm.

    To be clear, this is just an approximation of neutral drift for several reasons.

    1. 1,000 species do not usually separate at once like this. Usually they will diverge in a somewhat tree like structure. Can you tell the effect using a tree will have on MISA? You could even build the simulation this way if you wanted.
    2. It assumes that amino acids mutate with a uniform distribution. But we cover this in later sections. You can build that behavior in if you want, just as we have. You will see the results.
    3. It assumes mutation rate is identical in all lines. You can try modifying this to see its effect too.

    I could go on and on. There are other simplifications here. This is far from a fully accurate simulation of neutral drift. For example, it does not include indels or duplications, it does not explicitly model DNA, it does not treat transitions and transversions differently, it does not maintain GC content, etc. That is not the point though. Rather we are modeling a single feature of neutral drift (a common ancestral sequence) to isolate the effect on MISA of this mechanism.

    Hope that makes sense, Tom. I cannot promise I will be able to continue the discussion here much longer. If you have additional questions please email me.

    Peace.

  17. This is fun. I called the function 40 thousand times to generate the 40 thousand data points plotted below. The total running time was 28 seconds.


    from math import log

    import numpy as np
    import numpy.random as rand
    from scipy.stats import entropy

    N_ACIDS = 20
    H_MAX = log(N_ACIDS, 2)

    def misa1(rate, length=150, sample_size=10**4):
        """
        Return MISA for a sample of amino acid sequences of given length.
        """
        p = np.ones(N_ACIDS) * (1 - np.exp(-rate)) / N_ACIDS
        p[0] += 1 - np.sum(p)
        counts = rand.multinomial(sample_size, p, length)
        return length * H_MAX - np.sum(entropy(np.transpose(counts), base=2))

  18. Mung:

    swamidass: I demonstrate that simulations of several mindless processes can generate sequences with very high MISA.

    How do you measure “mindlessness” or “mindedness”, scientifically?

    IOW, how did you establish that the processes you are simulating are in fact mindless processes? Just assume it? Do they take place in parts of the universe where God isn’t present?

    I know you miss JoeG/Frankie, Mung, but that doesn’t mean you have to act like him.

  19. Tom English:
    This is fun. I called the function 40 thousand times to generate the 40 thousand data points plotted below. The total running time was 28 seconds.

    That works! Though I would have to think a bit more carefully to ensure your first two lines in the function are exactly right.

    Though I do notice that this uses the same “sampling with replacement” trick we were using in our code. That appears to be a necessary trick to get optimal efficiency.

    The reason why we do not use your version is because we wanted to explicitly generate the extant sequences so it is more clear to readers. Also, as you start to add more and more analytic computations, it looks less like a simulation, and more like a prediction of what the simulation will produce. In this case, interestingly enough, there is a closed form analytic solution to the results of the simulation. Your simulation is starting to approach that analytic solution.

    Regardless, at least you see what I meant by this. Thanks for your engagement Tom.
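One possible reading of the "closed form" mentioned above is the large-sample limit, in which the empirical column entropy converges to the entropy of the per-site distribution. The sketch below is an assumption about what was meant, not an expression taken from the paper. It uses the same per-site probabilities as `misa1`, and the name `limiting_misa` is illustrative.

```python
import numpy as np

N_ACIDS = 20
H_MAX = np.log2(N_ACIDS)

def limiting_misa(rate, length=150):
    """MISA (bits) in the limit of infinitely many extant sequences.

    Per-site model: with probability exp(-rate) the site is untouched and
    stays ancestral; otherwise it is uniform over all 20 amino acids.
    """
    p = np.full(N_ACIDS, (1.0 - np.exp(-rate)) / N_ACIDS)
    p[0] += np.exp(-rate)
    p = p[p > 0]  # guard against log2(0) when rate is 0
    site_entropy = -np.sum(p * np.log2(p))
    return float(length * (H_MAX - site_entropy))

for rate in (0.1, 0.5, 1.0, 3.0, 10.0):
    print(rate, round(limiting_misa(rate), 2))
```

With finite samples the empirical entropy is biased downward, so simulated MISA values would be expected to sit somewhat above this limit.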

  20. swamidass: But the key thing is that the Bernoulli-based simulation you propose does not approximate this algorithm.

    swamidass: I cannot promise I will be able to continue the discussion here much longer.

    [Joshua, see the note below before reading this.]

    You keep coming back with that pronouncement of yours, and you keep neglecting to provide details. I insist that you stick around long enough to resolve this one point. I derived an equivalence mathematically. You might have shown where I’d gotten the distribution of your sample wrong. You did not. You might have shown where the derivation was wrong. You did not. We should not even be talking about code. But you might have run the code that I supplied (multiple times, to obtain mean values of MISA), and compared the results to yours. You did not, though you seem to grasp code better than math. Now I will resort to asking you yes-no questions about code for sampling the space of amino acid sequences.

    The amino acids are stored in a rectangular matrix, with one row for each position in a protein, and one column for each sampled protein. That is, the array dimensions are (length, sample_size), where length is equivalent to your L, and sample_size is the number of lineages diverging from the ancestor.

    1. Do you agree that the following selects amino acids in all positions uniformly at random?

    mutation_outcomes = rand.choice(AMINO_ACIDS, (length, sample_size))

    2. Do you agree that the following compares Poisson random variates, one for each AA in the sample, to 0, and stores True for AAs that are mutated (at least once), and False for AAs that are not mutated at all?

    where_mutated = rand.poisson(rate, (length, sample_size)) > 0

    3. Do you agree that the following yields uniformly random amino acids in positions where mutation occurs (at least once), and the ancestral amino acid (arbitrarily taken to be AMINO_ACIDS[0]) in positions where there is no mutation?

    sample = np.where(where_mutated, mutation_outcomes, AMINO_ACIDS[0])

    That is your sample: Uniformly random in positions where mutation occurs at least once, and identical to the ancestral protein in positions where there is no mutation. Please, please, please tell me we can agree on that.

    I am not falling yet again into the trap of allowing you simply to pronounce on the equivalence that I derived formally. You can copy-and-paste the following code, and then execute compare(). You can do that. And you can wait two minutes (probably less on your machine) for a comparison of 1001 runs sampling as described above, and 1001 runs sampling as I have described for several days now. There is a t-test for each sample size in {2, 3, … 1000}. Of course, larger samples contain smaller samples, and what I’m doing has to be interpreted with some care. But it is a reasonable way to reject your unsupported claim that the rates of convergence are different.


    from math import log
    from scipy.stats import entropy, poisson, ttest_ind
    import numpy as np
    import numpy.random as rand

    N_ACIDS = 20
    H_MAX = log(N_ACIDS, 2)
    AMINO_ACIDS = np.arange(0, N_ACIDS)
    AA_VECTORS = np.zeros((N_ACIDS, N_ACIDS), dtype=int)
    np.fill_diagonal(AA_VECTORS, 1)

    def sample1_hacked(rate, length=150, sample_size=10**4, poisson=False):
        """
        Return MISA values for samples of size 1, 2, ..., `sample_size`.

        The random number of base mutations to an AA is Poisson(rate)-distributed.

        Assume that for 1 or more mutations to bases in an AA, the result is on
        average uniformly distributed on the set of all AAs. That is, the
        probability is 1/20 that the codon of a mutant is synonymous with the codon
        for the ancestral amino acid. Under this assumption, substitutions are
        i.i.d. Bernoulli, occurring at the rate 19/20 * (1 - exp(-rate)).
        """
        if poisson:
            # where_mutated = poisson(rate).rvs((length, sample_size)) > 0
            where_mutated = rand.poisson(rate, (length, sample_size)) > 0
            mutation_outcomes = rand.choice(AMINO_ACIDS, (length, sample_size))
            sample = np.where(where_mutated, mutation_outcomes, AMINO_ACIDS[0])
        else:
            substitution_rate = (N_ACIDS - 1.0) / N_ACIDS * (1 - np.exp(-rate))
            where_substituted = rand.sample((length, sample_size)) < substitution_rate
            substitutions = rand.choice(AMINO_ACIDS[1:], (length, sample_size))
            sample = np.where(where_substituted, substitutions, AMINO_ACIDS[0])
        # Cumulative sums of unit vectors give running AA counts per position;
        # entropy() normalizes each column of counts and returns nats.
        h = [ entropy(np.transpose(np.cumsum(AA_VECTORS[sample[i]], axis=0)))
                for i in range(length) ]
        return length * H_MAX - np.sum(h, axis=0) / log(2)

    def compare(rate=0.5, length=100, size=10**3, nruns=1001):
        """
        Run two versions of sampling, do `size-1` t-tests on mean MISA values.

        Return fraction of p-values less than .05, along with `size-1` p-values.
        """
        bernoulli = [ sample1_hacked(rate, length, size) for _ in range(nruns) ]
        poisson = [ sample1_hacked(rate, length, size, True) for _ in range(nruns) ]
        # After transposing, row k holds the `nruns` MISA values for sample size k+1.
        # Row 0 is skipped: a sample of size 1 always has MISA = length * H_MAX.
        pairs = zip(np.transpose(bernoulli)[1:], np.transpose(poisson)[1:])
        pvalues = np.array([ ttest_ind(b, p)[1] for b, p in pairs ])
        return np.mean(pvalues < .05), pvalues

  21. swamidass: That works! Though I would have to think a bit more carefully to ensure your first two lines in the function are exactly right.

    I was composing when you posted. The thing to understand now is that I got to the multinomial distribution by way of the binomial, and to the binomial by way of Bernoulli substitution events.
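The chain from Bernoulli to binomial to multinomial can be written out as follows. This is a sketch under the assumptions stated in the `sample1_hacked` docstring, with $\lambda$ standing for `rate`, $n$ for the number of extant sequences, and $N_k$ for the count of amino acid $k$ at one position (amino acid 0 being ancestral). The symbols are introduced here for illustration.

```latex
% Per-sequence substitution event at a fixed position (Bernoulli):
s = \tfrac{19}{20}\left(1 - e^{-\lambda}\right), \qquad
X_j \sim \mathrm{Bernoulli}(s), \quad j = 1, \dots, n .

% Number of substituted sequences at that position (binomial):
K = \sum_{j=1}^{n} X_j \sim \mathrm{Binomial}(n, s) .

% Each substitution picks one of the 19 non-ancestral amino acids uniformly,
% so the vector of counts is multinomial:
(N_0, N_1, \dots, N_{19}) \sim
  \mathrm{Multinomial}\!\left(n;\; 1 - s,\; \tfrac{s}{19},\; \dots,\; \tfrac{s}{19}\right).

% These are the probabilities used in misa1:
\tfrac{s}{19} = \tfrac{1 - e^{-\lambda}}{20}, \qquad
1 - s = e^{-\lambda} + \tfrac{1 - e^{-\lambda}}{20} .
```

Positions are independent under these assumptions, so one multinomial draw per position is enough to reproduce the counts that MISA depends on.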

    I think some of the explanation you wrote for me belongs in the paper (with some polish added, of course). This has been frustrating for me, and I know also for you. But I genuinely do intend to move to higher-level concerns. And, as you have noticed, I go long periods without posting. So it’s not going to take a huge amount of your time to check in.

    Things will be a lot easier for you if you do not repeat what’s in your paper. I really have read it. I’m trying to get you to write stuff that’s not in the paper, and not in the notebook. I have not read the note by Durston that you’ve included as a supplement. Perhaps I’ll see in that what I don’t see in Durston et al.

  22. Mung,

    IOW, how did you establish that the processes you are simulating are in fact mindless processes? Just assume it? Do they take place in parts of the universe where God isn’t present?

    If his program is generating random sequences, how does he know the generated sequences are functional? Kirk’s thesis is that FI requires a mind. Dr Swamidass is changing that argument to large levels of MISA require a mind.

    It appears that the paper is based on a straw-man argument. Kirk is using MISA only to estimate FI or SSFI (Statistically significant).

    If he looks at a historical functional protein then he knows that that variation works in a group of living organisms. If he generates random variation from a functional protein he does not.

  23. colewd: how does he know the generated sequences are functional

    You’ve evidently read a different paper than I have. Matlock and Swamidass say repeatedly that the sequences are all nonfunctional, and my question is how that can be.

  24. Dr. Swamidass:

    I apologize sincerely for my behavior last night. You were annoying, but only mildly so. I unloaded on you instead of the actual irritant. That’s not an excuse. I say it in hope that you’ll believe that I won’t be doing anything like that again. What I did was categorically wrong.

  25. Tom English,

    Yes, you are right.

    2.1 Uniform evolution of non-functional proteins
    This first and most important simulation demonstrates that mindless evolutionary processes can produce sequences with very high MISA. This simulation samples several extant sequences a fixed number of mutational events away from a single ancestral sequence. There are no functional constraints, so the FI by definition equals zero (Equation 1). If MISA is a good estimate of FI, it should converge to zero bits too.

    Is he really generating MISA here in the same way that Kirk is using it to measure FI? Kirk is looking at functional sequences. Do we have apples to oranges comparison here? If the sequence is non functional what does this mean?

    I would concede the point that we can generate non function randomly.

  26. Tom English: I apologize sincerely for my behavior last night. You were annoying, but only mildly so. I unloaded on you instead of the actual irritant. That’s not an excuse. I say it in hope that you’ll believe that I won’t be doing anything like that again. What I did was categorically wrong.

    Apology accepted. Thanks.

    colewd: If his program is generating random sequences, how does he know the generated sequences are functional? Kirk’s thesis is that FI requires a mind. Dr Swamidass is changing that argument to large levels of MISA require a mind.

    You are missing his thesis. He is making several claims that ALL have to be true for him to “definitively falsify” evolution.

    1. First, he claims that FSC (which is just MISA) can reliably compute FI from extant sequences. He finds that FI (computed by MISA) is high in protein families.

    2. Second, he claims that FI is a unique capability of minds. Of course, without a definition of FI that is objective (as FSC is), this has no meaning.

    Therefore, combining #1 and #2, he claims to falsify evolution.

    Of course #2 has its own problems. He does not actually demonstrate this. He just posits this as a fact with no evidence. We can let that go for now.

    My main point is that FSC is not a good estimate of FI. Clearly, processes that have nothing to do with function, and so should produce an FSC of 0, actually produce a much higher FSC. That means he has two options to resolve this data:

    1) Point #1 is false, and FSC is not a good estimate of FI.
    2) Point #2 is false, and mindless processes can produce sequences with high FSC.

    If either is false, his argument falls apart. That is what these simulations show. There is no change of argument. Rather it is a direct falsification of at least one of his critical premises.

    There are a few other points we touch on. For example, there is no reason to think FI is a good measure of evolvability anyway. And there is good reason to think a sequence alignment cannot give us FI either. These are all true, and point to additional reasons we have to doubt both #1 and #2.

  27. colewd: Is he really generating MISA here in the same way that Kirk is using it to measure FI? Kirk is looking at functional sequences. Do we have apples to oranges comparison here? If the sequence is non functional what does this mean?

    If a sequence is non-functional it means that it has zero functional constraints. This means that FI is 0.

    His method (if it works) should reliably estimate an FSC, FI, MISA of zero, thereby computing that all the proteins meet the criteria and that there are no functional constraints.

    If you object to this, a later experiment adds functional constraints.
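The point above can be illustrated with a minimal sketch. It assumes the measure is computed as the sum over alignment columns of log2(20) minus the observed column entropy, and it uses a deliberately simple neutral model (one random ancestor, independent descendants, no selection). The function names, the star-shaped family, and all parameter values are illustrative assumptions, not the preprint's actual simulation code.

```python
import math
import random
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def column_entropy(column):
    """Shannon entropy (bits) of the residues observed in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def fsc_bits(alignment):
    """Sum over columns of log2(20) minus the observed column entropy."""
    h_ground = math.log2(len(AMINO_ACIDS))
    return sum(h_ground - column_entropy(col) for col in zip(*alignment))


def neutral_family(length, n_sequences, substitution_prob, rng):
    """Descendants of one random ancestor with no functional constraint.

    Each site of each descendant independently changes, with probability
    substitution_prob, to a different residue chosen uniformly at random.
    """
    ancestor = [rng.choice(AMINO_ACIDS) for _ in range(length)]
    family = []
    for _ in range(n_sequences):
        seq = []
        for res in ancestor:
            if rng.random() < substitution_prob:
                res = rng.choice([a for a in AMINO_ACIDS if a != res])
            seq.append(res)
        family.append("".join(seq))
    return family


rng = random.Random(0)
family = neutral_family(length=100, n_sequences=500, substitution_prob=0.3, rng=rng)
neutral_fsc = fsc_bits(family)
print(f"Measure for an unconstrained family: {neutral_fsc:.1f} bits over 100 sites")
```

Nothing in this toy family is selected for function, yet the shared ancestry alone leaves the columns far from uniform, so the computed value is well above zero.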

    Tom English: substitution_rate = (N_ACIDS - 1.0) / N_ACIDS * (1 - np.exp(-rate))

    This is the line I was missing in your prior explanation. I thought you were using…

    substitution_rate = (1 - np.exp(-rate))

    With that addition (which is what your code actually does), I think we are doing the same simulation. Just in different ways.
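One reading under which the two formulas differ, offered as a sketch rather than as a description of either person's code: if each mutation event redraws the residue uniformly from all 20 amino acids (the original included), then a site that has been hit at least once still matches its ancestor 1 time in 20. The helper names and the Monte Carlo check below are assumptions for illustration, and the sketch uses the standard library's `math.exp` in place of `np.exp`.

```python
import math
import random

N_ACIDS = 20


def observed_substitution_rate(rate, n_acids=N_ACIDS):
    """Chance a site differs from its ancestor when each mutation event
    redraws the residue uniformly from all n_acids (original included)."""
    return (n_acids - 1.0) / n_acids * (1 - math.exp(-rate))


def simulate_site_change(rate, trials, rng, n_acids=N_ACIDS):
    """Monte Carlo check of the formula above.

    With probability 1 - exp(-rate) at least one mutation event hits the
    site, after which its residue is uniform over all n_acids.
    """
    changed = 0
    ancestor = 0  # label the ancestral residue 0
    for _ in range(trials):
        if rng.random() < 1 - math.exp(-rate):
            residue = rng.randrange(n_acids)
        else:
            residue = ancestor
        changed += residue != ancestor
    return changed / trials


rng = random.Random(1)
rate = 0.5
analytic = observed_substitution_rate(rate)
simulated = simulate_site_change(rate, trials=200_000, rng=rng)
naive = 1 - math.exp(-rate)
print(f"analytic={analytic:.4f} simulated={simulated:.4f} without_factor={naive:.4f}")
```

Under this assumption the `(N_ACIDS - 1.0) / N_ACIDS` factor is the correction for events that happen to redraw the ancestral residue.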

  28. The MISA woman is a modern-day bohemian who spends her life traveling the world, leaping from one exotic location to another.

  29. Dr. Swamidass:

    The way (I think) I’m making sense of your paper now is to add “even if we ignore” intensifiers, as in “even if we ignore selection.” I needed to know early on that your general approach was to stack the deck in favor of Durston, by eliminating factors that stand only to decrease the MISA of a family of extant proteins. You say, later on, that one might add this, that, and the other factors to the simulation. I’d have followed you better if you had described a realistic process up front, identified what constrains the diversification of proteins, and said, “OK, we’re going to strip this process down to mutation only, which is about as far as we can go in eliminating constraints on the process, and show that, even then, there is still generally high MISA in the protein family that results.”

    While I’m thinking about it, I’ll mention that the description of the simulation’s sequential sampling process was part of what confused me. That’s not what’s going on in the evolutionary process. I’d recommend, in particular, that you change the x-axis labels of Figures 1 and 3. That’s not the “Number of Samples,” but instead the number of extant sequences. I’ve been trying to make sense of references to the sequences as proteins, as in, how can drifting, nonfunctional AA sequences continue to be proteins?

  30. Tom English: by eliminating factors that stand only to decrease the MISA of a family of extant proteins

    What factors (other than increasing mutation rate) decrease MISA? Turns out every factor increases MISA because every factor imparts patterns on the data distinct from a uniform distribution.
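A small sketch of why a departure from uniformity raises the measure, assuming the per-site contribution is log2(20) minus the column entropy. Entropy is maximal for the uniform distribution, so any skew in residue frequencies gives a positive contribution even when sequences are otherwise unconstrained. The bias below (five residues each four times as likely as the rest) is a made-up example, not a measured mutational bias.

```python
import math

N_ACIDS = 20


def entropy_bits(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def per_site_excess(probs):
    """log2(20) minus the column entropy: one site's contribution to the measure."""
    return math.log2(N_ACIDS) - entropy_bits(probs)


uniform = [1 / N_ACIDS] * N_ACIDS

# Hypothetical mutational bias: five residues are each four times as
# likely as each of the other fifteen.
weights = [4.0] * 5 + [1.0] * 15
biased = [w / sum(weights) for w in weights]

uniform_excess = per_site_excess(uniform)
biased_excess = per_site_excess(biased)
print(f"uniform background: {uniform_excess:.3f} bits/site")
print(f"biased background:  {biased_excess:.3f} bits/site")
```

Summed over a few hundred sites, even a modest per-site excess of this kind adds up to a large total with no functional constraint involved.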

    Your points on clarity are helpful. I think the paper and the figures do need some better explanations. It is really clear and obvious to me, but obviously not to people outside this area. And it is important that this paper is understandable, so I will be doing some rewriting.

  31. swamidass,

    His method (if it works) should reliably estimate an FSC, FI, MISA of zero, thereby computing that all the proteins meet the criteria and that there are no functional constraints.

    If you object to this, a later experiment adds functional constraints.

    MISA is based on historical evidence of functional sequences related to a single protein. The statement above is not relevant because you are not simulating MISA. Since it is historical biological evidence, I am not sure that you can simulate it with any accuracy.

    A possible way to debate Kirk’s hypothesis would be to do experiments that would show biological function of sequences that were in addition to what the historical information (MISA) shows.

    If you could show that the historical evidence was only a small fraction of the available function then you would have an argument.

  32. colewd: MISA is based on historical evidence of functional sequences related to a single protein.

    What exactly do you mean by this? He is using the pfam database, which most certainly is not the “historical evidence of functional sequences related to a single protein”. It is not historical at all. It just looks at extant (i.e. current day) proteins (see the plural?) that align with one another.

  33. Dr. Swamidass:

    Publicly discussing a paper you have under review is a difficult thing to do, in any case. This particular case is more difficult than most, because you’re indicating that you’ve delivered a smackdown in a cultural battle. Your tweet to PZ Myers suggests to me that you’re quite sure of yourself. Anyone in your position would have a difficult time entertaining the possibility of fundamental errors in his paper.

    1. Should the default position be that an unreviewed preprint is correct, or that it is incorrect?
    2. Are you open to the possibility that your paper is severely flawed?
    3. Do you agree that you must simulate what Durston et al. have modeled, or your simulations are irrelevant to their models?
    4. Should reviewers not call on you to demonstrate that you have simulated processes modeled by Durston et al.?

  34. Dr. Swamidass:

    I find 4 matches for mind, and 15 for intel, in the unpublished polemic by Durston (unreferenced in your paper), and none at all in the published article by Durston, Chiu, Abel, and Trevors. It is imperative, not a matter of judgment, that you attribute the talk about minds and intelligence to Durston specifically.

  35. Dr. Swamidass:

    Now I see the careful wording of your introduction (emphasis added):

    In this societal context, it is important to correct faulty arguments against evolution [7, 25]. This is particularly important when components of these arguments appear in peer reviewed journals. To this end, this study aims to explain the modeling errors on which the “Functional Information” (FI) argument against evolution depends. Components of this argument are published in several peer reviewed articles [8, 13], and are elaborated in public debates, rebuttals and counter rebuttals (see SI). Science, of course, is not litigated through popular debates, but in careful studies. So it is important the faulty scientific claims in peer-reviewed publications are corrected, which is what we aim to do here.

    [It’s clear that you address the unpublished argument of Durston, but quite unclear that the last clause is true.] Please identify the faulty scientific claims of Durston, Chiu, Abel, and Trevors, and [explain] how you have corrected them.

    ETA: This is not a personal request from me. It is what any careful reviewer should require of you.

  36. Of course it is possible I am wrong. I am a fallible human that makes mistakes. If I saw a fundamental error in my work, I would retract it. The fact that this is not yet peer-reviewed makes retraction even easier.

    I’ve been very clear this is not yet peer-reviewed. Take it for what you think it is worth. A lot of good scientific work is not yet peer-reviewed, and a lot of bad work is peer-reviewed. In the end, it is up to scientists to assess the quality of studies based on their own expertise, and to trust the expertise of others outside their field.

    I find this specific study to be an interesting question and wanted to experiment with biorxiv. So that is why we are talking now.

    Though I am about as close to 100% certain as possible that Durston has not definitively falsified evolution.

    Tom English:
    Dr. Swamidass:

    Now I see the careful wording of your introduction (emphasis added):

    [It’s clear that you address the unpublished argument of Durston, but quite unclear that the last clause is true.] Please identify the faulty scientific claims of Durston, Chiu, Abel, and Trevors, and [explain] how you have corrected them.

    ETA: This is not a personal request from me. It is what any careful reviewer should require of you.

    Of course it is a personal request, and that is fine.

    The specific claim in the literature is that the FSC of extant sequences is a good estimate of FI. The claim is that applying the MI formula to a sequence alignment of proteins with shared function can give a good estimate of the number of sequences with that function.
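To make the link between bits and sequence counts concrete, here is a hedged arithmetic sketch. It assumes the usual definition FI = -log2(M / N), where M is the number of functional sequences and N = 20^L is the size of sequence space for length L. The function name and the numbers (a 100-residue protein, 300 bits) are hypothetical, chosen only to show the conversion.

```python
import math


def implied_log10_functional_count(length, fi_bits, n_acids=20):
    """log10 of the number of functional sequences M implied by
    FI = -log2(M / N), with N = n_acids ** length."""
    log2_space = length * math.log2(n_acids)
    return (log2_space - fi_bits) * math.log10(2)


# Hypothetical example: a 100-residue family assigned 300 bits.
length = 100
estimate = implied_log10_functional_count(length, fi_bits=300)
# If the true value were 100 bits lower, the implied count would be 2**100 times larger.
corrected = implied_log10_functional_count(length, fi_bits=200)

print(f"log10(M) at 300 bits: {estimate:.1f}")
print(f"log10(M) at 200 bits: {corrected:.1f}")
print(f"orders of magnitude gained: {corrected - estimate:.1f}")
```

Because the count depends exponentially on the bit value, any inflation of the estimate from non-functional sources translates directly into an underestimate of the number of functional sequences.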

    I disagree with this claim. It may not be a familiar way of making an argument, but the simulation approach I use here to test this claim is very consistent with how computational claims like this are tested. I have published several papers using the same strategy, and there are many many more in the literature that do the same.

    I’m seeing from this helpful exchange that more explanation is required for people outside computational biology / computer science to follow the reasoning. That is very valuable insight. The paper needs a better explanation of this. I’ll probably take a crack at that in the coming weeks and let you know.

  37. swamidass,

    What exactly do you mean by this? He is using the pfam database, which most certainly is not the “historical evidence of functional sequences related to a single protein”. It is not historical at all. It just looks at extant (i.e. current day) proteins (see the plural?) that align with one another.

    If common descent is not an a priori assumption then I agree with you. All pfam proteins are, however, the result of some level of common descent within species.

    The main point is Durston is looking at the result of real biological processes where the sequences are functional. He then appears to be claiming that evolution has explored all the functional space and you can find all the possible sequences by looking at extant proteins of the animals that share a common ancestor.

    I agree that this claim is open for debate. What I don’t agree with is that exploring non-functional sequences tests this claim. MISA is the product of functional proteins.

    Kirk’s use of mind as a conclusion about the origin of the protein sequences is based on abductive reasoning, which is a method used by the historical sciences.

  38. Dr. Swamidass:

    Thank you for answering two of my questions. I would appreciate answers to the others.

    Tom English:
    3. Do you agree that you must simulate what Durston et al. have modeled, or your simulations are irrelevant to their models?
    4. Should reviewers not call on you to demonstrate that you have simulated processes modeled by Durston et al.?

    (In their paper, Equations 6 and 8 correspond to different models.)

  39. Tom, I really respect what you’re doing in this thread. Some would probably take it as giving aid and comfort to the enemy. But that would just mean that they don’t know you.

    You’ve no doubt noticed my acquiescence with your request to not go tit for tat with keiths. My pleasure. 🙂

  40. Patrick: I know you miss JoeG/Frankie, Mung, but that doesn’t mean you have to act like him.

    It would be stretching the truth to say that you’ve made any meaningful contribution to this thread.

  41. swamidass: The specific claim in the literature is that the FSC of extant sequences is a good estimate of FI. The claim is that applying the MI formula to a sequence alignment of proteins with shared function can give a good estimate of the number of sequences with that function.

    Would you please quote Durston, Chiu, Abel, and Trevors?

  42. Mung: Some would probably take it as giving aid and comfort to the enemy.

    I’m a defender of the two-way wall of separation of church and state, not of evolution. I was taught, growing up in Southern Baptist churches, to value the secular state as a guarantor of religious freedom to all. (Of course, the Southern Baptist Convention is something very different today than it was in the Sixties and Seventies.) In all honesty, I thought as a kid that keeping religion out of public institutions was an application of the Golden Rule. I realized that what would be offensive to me, if I were in a religious minority, I should not do when in the religious majority. (By the way, the most important thing that Christians in America can do for the Christians persecuted in Islamic countries is to treat Muslims in America wonderfully.)

    I also believe that the society of scientists should be secular. Swamidass has spoken of that value, here at TSZ. Yet he’s now trying to smuggle an internecine conflict of Christians into the scientific literature. I’m supposed to side with him because he comes down on the side of science. But that’s not the most important issue for me.
