Wright, Fisher, and the Weasel

Richard Dawkins’s Weasel program is a computer simulation that explores how long it takes a 28-letter phrase to evolve into the phrase “Methinks it is like a weasel”. The program keeps a single copy of the phrase, which produces a number of offspring. Each letter of each offspring is subject to mutation among 27 possible characters: the 26 letters A-Z and a space. The offspring that is closest to the target replaces the single parent. The purpose of the program is to show that creationist orators who argue that evolutionary biology explains adaptations by “chance” are misleading their audiences. Pure random mutation without any selection would lead to a random sequence of 28-letter phrases. There are 27^{28} possible 28-letter phrases, so it should take about 10^{40} different phrases before we find the target. That is without arranging that the phrase that replaces the parent is the one closest to the target. Once that highly nonrandom condition is imposed, the number of generations to success drops dramatically, from 10^{40} to mere thousands.
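
As a rough illustration of the procedure just described, here is a minimal Python sketch of a Weasel-style run. It is not Dawkins’s original code: the number of offspring (n_offspring), the mutation probability u, the rule that a mutated letter changes to one of the other 26 characters, and the tie-breaking (the first of the best offspring wins) are all assumptions made for the example.

```python
import random

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ "   # 26 letters plus a space
TARGET = "METHINKS IT IS LIKE A WEASEL"    # 28 characters


def matches(phrase, target=TARGET):
    """Count the positions at which phrase agrees with the target."""
    return sum(a == b for a, b in zip(phrase, target))


def mutate(phrase, u, rng):
    """Copy phrase, changing each letter with probability u to one of the other 26 characters."""
    out = []
    for ch in phrase:
        if rng.random() < u:
            out.append(rng.choice([c for c in ALPHABET if c != ch]))
        else:
            out.append(ch)
    return "".join(out)


def weasel(n_offspring=100, u=0.05, seed=1, max_generations=100_000):
    """Run until the target is reached; return the number of generations used."""
    rng = random.Random(seed)
    parent = "".join(rng.choice(ALPHABET) for _ in TARGET)
    generation = 0
    while parent != TARGET and generation < max_generations:
        offspring = [mutate(parent, u, rng) for _ in range(n_offspring)]
        # The offspring closest to the target replaces the single parent.
        parent = max(offspring, key=matches)
        generation += 1
    return generation


if __name__ == "__main__":
    print("generations needed:", weasel())
```

Changing n_offspring or u changes how quickly the run finishes, but in every case the choice of the best offspring is what separates this from a blind search through 27^{28} phrases.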

Although Dawkins’s Weasel algorithm is a dramatic success at making clear the difference between pure “chance” and selection, it differs from standard evolutionary models. It has only one haploid adult in each generation, and since the offspring that is most fit is always chosen, the strength of selection is in effect infinite. How does this compare to the standard Wright-Fisher model of theoretical population genetics?

The Wright-Fisher model, developed 85 years ago, independently, by two founders of theoretical population genetics, is somewhat similar. But it has an infinite number of newborns, precisely N of whom survive to adulthood. In its simplest form, with asexual haploids, the adult survivors each produce an infinitely large, but equal, number of offspring. In that infinitely large offspring population, mutations occur exactly in the expected frequencies. Each genotype has a fitness, and the proportions of the genotypes change according to those fitnesses, exactly as expected. Since there are an infinite number of offspring, there is no genetic drift at that stage. Then finally a random sample of N of the survivors is chosen to become the adults of the new generation.
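
One generation of this scheme can be sketched in Python as follows. This is only an illustration under assumptions of mine: genotypes are indexed 0, 1, ..., the table mutation[i][j] gives the probability that an offspring of genotype i is of genotype j, and the function names are hypothetical.

```python
import random


def offspring_frequencies(adult_counts, mutation, fitness):
    """Genotype frequencies in the infinite offspring pool after mutation and selection."""
    n = sum(adult_counts)
    g = len(adult_counts)
    freq = [c / n for c in adult_counts]
    # Mutation acts deterministically because the offspring pool is infinite.
    after_mutation = [sum(freq[i] * mutation[i][j] for i in range(g)) for j in range(g)]
    # Selection reweights each genotype by its (relative) fitness.
    weighted = [f * w for f, w in zip(after_mutation, fitness)]
    total = sum(weighted)
    return [x / total for x in weighted]


def wright_fisher_generation(adult_counts, mutation, fitness, rng):
    """Return the genotype counts among the N adults of the next generation."""
    n = sum(adult_counts)
    pool = offspring_frequencies(adult_counts, mutation, fitness)
    # Genetic drift enters only here, when N survivors are sampled at random.
    survivors = rng.choices(range(len(pool)), weights=pool, k=n)
    return [survivors.count(j) for j in range(len(pool))]
```

The split into two functions mirrors the two stages of the model: a deterministic stage (mutation and selection in an infinite pool) followed by a random one (sampling N adults).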

In one case we get fairly close to the Weasel. That is when N = 1, so that there is only one adult individual. Now for some theory, part of which I explained before, in a comment here. We first assume that a match of one letter to the target multiplies the fitness by a factor of 1+s, and that this holds no matter how many other letters match. This is the fitness scheme in the genetic algorithm program by keiths. A phrase that has k matches to the target has fitness (1+s)^k. We also assume that mutation occurs independently in each letter of each offspring with probability u, and produces one of the other 26 possibilities, chosen uniformly at random. This too is what happens in keiths’s program. If a letter matches the target, there is a probability u that it will cease to match after mutation. If it does not match, there is a probability u/26 that it will match after mutation.

The interesting property of these assumptions is that, in the Wright-Fisher model with N = 1, each site evolves independently because neither its mutational process nor its selection coefficients depend on the state of the other sites. So we can simply ask about the evolution of one site.

If the site matches, after mutation 1-u of the offspring will still match, and u of them will not. Selection results in a ratio of (1+s)(1-u)\,:\,u of matching to nonmatching letters. Dividing by the sum of these, the probability of ending up with a nonmatch is then u divided by the sum, or

    \[ q \ = \ u\big/\left((1+s)(1-u)+u\right) \ = \ u \big/ (1 + s(1-u)) \]

Similarly, if a letter does not match, after mutation the ratio of matches to nonmatches is \frac{u}{26}\,:\,1-\frac{u}{26}. After selection the ratio is (1+s)\frac{u}{26}\,:\,1-\frac{u}{26}. Dividing by the sum of these, the probability of ending up with a match is

    \[ p \ = \ \frac{u}{26}(1+s) \bigg/\left( \frac{u}{26}(1+s) + \left(1 - \frac{u}{26}\right)\right) \ = \ u(1+s) \big/ (26 + us) \]
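
The two simplifications can be checked numerically. The Python sketch below computes q and p both from the unsimplified ratios and from the simplified right-hand sides; the function names are my own and the parameter values are arbitrary.

```python
def q_lose_match(u, s):
    """Probability that a matching site is a nonmatch one generation later."""
    return u / (1 + s * (1 - u))


def p_gain_match(u, s):
    """Probability that a nonmatching site is a match one generation later."""
    return u * (1 + s) / (26 + u * s)


def q_unsimplified(u, s):
    """q written directly from the ratio (1+s)(1-u) : u."""
    return u / ((1 + s) * (1 - u) + u)


def p_unsimplified(u, s):
    """p written directly from the ratio (1+s)(u/26) : 1 - u/26."""
    m = u / 26  # chance that mutation turns a nonmatch into the target letter
    return m * (1 + s) / (m * (1 + s) + (1 - m))


if __name__ == "__main__":
    for u, s in [(0.01, 0.0), (0.05, 0.44), (0.2, 3.0)]:
        print(u, s, q_lose_match(u, s), q_unsimplified(u, s),
              p_gain_match(u, s), p_unsimplified(u, s))
```

With s = 0 these reduce to q = u and p = u/26, the pure-mutation rates.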

Thus, what we see at one position in the 28-letter phrase is that if it is not matched, it has probability p of changing to a match, and if it is matched, it has probability q of changing to not being matched. This is a simple 2-state Markov process. I’m going to leave out the further analysis — I can explain any point in the comments. Suffice it to say that

1. The probability, in the long run, of a position matching is p/(p+q).

2. This is independent in each of the 28 positions. So the number of matches will, in the long run, be a Binomial variate with 28 trials each with probability p/(p+q).

3. The “relaxation time” of this system, the number of generations to approach the long-term equilibrium, will be roughly 1/(p+q).

4. At what value of s do we expect selection to overcome genetic drift? There is no value where drift does not occur, but we can roughly say that selection is having a noticeable effect when the equilibrium frequency p/(p+q) is twice as great as the no-selection value of 1/27. The equation for this value is a quadratic equation in s whose coefficients involve u as well. If u is much smaller than s, the condition reduces to approximately (1+s)^2 = 52/25, so this value of s is close to 0.44.

5. The probability of a match at a particular position, one which started out not matching, will be after t generations

    \[ P(t) \ = \ \frac{p}{p+q} (1\:-\:(1-p-q)^t) \]

and the probability of a nonmatch given that one started out matching is

    \[ Q(t) \ = \ \frac{q}{p+q} (1\:-\:(1-p-q)^t) \]
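
The five points above can be put into a few lines of Python. This is a sketch with function names of my own choosing; the threshold in point 4 is found by bisection rather than by solving the quadratic, which relies on the equilibrium match probability increasing with s.

```python
def transition_probs(u, s):
    """Per-generation probabilities (p, q) of gaining and of losing a match."""
    p = u * (1 + s) / (26 + u * s)
    q = u / (1 + s * (1 - u))
    return p, q


def equilibrium_match(u, s):
    """Long-run probability that a site matches: p/(p+q)."""
    p, q = transition_probs(u, s)
    return p / (p + q)


def relaxation_time(u, s):
    """Rough number of generations needed to approach equilibrium: 1/(p+q)."""
    p, q = transition_probs(u, s)
    return 1 / (p + q)


def prob_match_after(t, u, s):
    """P(t): probability of a match after t generations, starting from a nonmatch."""
    p, q = transition_probs(u, s)
    return p / (p + q) * (1 - (1 - p - q) ** t)


def prob_nonmatch_after(t, u, s):
    """Q(t): probability of a nonmatch after t generations, starting from a match."""
    p, q = transition_probs(u, s)
    return q / (p + q) * (1 - (1 - p - q) ** t)


def threshold_s(u, target=2 / 27, lo=0.0, hi=10.0, steps=100):
    """Bisect for the s at which the equilibrium match probability equals target."""
    for _ in range(steps):
        mid = (lo + hi) / 2
        if equilibrium_match(u, mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


if __name__ == "__main__":
    u = 0.01
    for s in (0.0, 0.44, 1.0, 5.0):
        # Expected number of matching letters out of 28, and the relaxation time.
        print(s, 28 * equilibrium_match(u, s), relaxation_time(u, s))
    print("threshold s for small u:", threshold_s(1e-6))
```

Because the 28 sites are independent, the expected number of matches at equilibrium is simply 28 times the per-site value.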

I don’t have the energy to see whether the runs of keiths’s program verify this, but I hope others will.
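
Short of running keiths’s program, the formula for P(t) can at least be checked against a direct simulation of the one-site process. The Python sketch below does only that: it tests the Markov-chain algebra, not the independence of sites in a full 28-letter run. The parameter values, function names, and seed are arbitrary choices for illustration.

```python
import random


def transition_probs(u, s):
    """Per-generation probabilities (p, q) of gaining and of losing a match."""
    p = u * (1 + s) / (26 + u * s)
    q = u / (1 + s * (1 - u))
    return p, q


def simulate_site(t, u, s, rng):
    """Start from a nonmatch; return True if the site matches after t generations."""
    p, q = transition_probs(u, s)
    match = False
    for _ in range(t):
        if match:
            if rng.random() < q:
                match = False
        elif rng.random() < p:
            match = True
    return match


def compare(t=50, u=0.05, s=1.0, replicates=20000, seed=0):
    """Return (simulated, predicted) probability of a match after t generations."""
    rng = random.Random(seed)
    hits = sum(simulate_site(t, u, s, rng) for _ in range(replicates))
    p, q = transition_probs(u, s)
    predicted = p / (p + q) * (1 - (1 - p - q) ** t)
    return hits / replicates, predicted


if __name__ == "__main__":
    simulated, predicted = compare()
    print("simulated:", simulated, "predicted:", predicted)
```

The two numbers should agree to within sampling error, which shrinks as replicates grows.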

291 thoughts on “Wright, Fisher, and the Weasel”

  1. Nobody is willfully misleading audiences about evolution’s reliance on chance.
    AMEN. If evolutionism admits and surrenders to the FACT that chance CAN NOT produce the biology we all love then progress is made.
    Historically evolutionists did say chance alone created biology.
    Or we picked it up somewhere.
    HMMM.
    Something about these man-made programs showing how “operation” can produce sentences concerning weasels. HMMM.
    There might be a hidden direction concept going on. Does it include extinctions?
    Anyways I don’t think math has anything to do with understanding biology origins.

  2. Robert Byers: creationist debaters do, a lot, mislead their audiences into thinking that evolutionary biologists invoke only chance in their theory.

    But evolutionary biologists do not invoke only chance.

    I and thousands of biologists do mathematical modeling of evolutionary processes and find that it is worth doing.

    So you are wrong in all three of your points. So there.

  3. Joe, to Robert:

    So you are wrong in all three of your points. So there.

    Why do you hate God, Joe?

  4. For s = 0.44

    and we have 27 matches, the individual will have the following number of kids

    (1+0.44) ^27 = 18,870 offspring

    Is that right?

  5. Actually Joe, evolutionary biology does explain adaptions by chance. The misdirection is all yours (pl).

    How? When extended reproduction provides the driving force for random variations which are weeded out by selection, supports of non-teleological evolution claim that those variations have no explanation. They just happen for whatever reason but there is no teleology behind it.

    But we know that is not true. How? Variations that occur which are detrimental to a genome (via copying errors, radiation, etc), are detected and eliminated. So logically, if a variation is screened and deemed not a threat to the genome, then it is allowed. The environment will determine which one of the allowed variations will match best with the environmental condition currently in play.

    But it is the intelligence (ie the capacity to analyse, sort, eliminate possibilities) embedded within the genome that drives the process. The genome both produces variations and experiences variations beyond its control. Again, those that are not produced internally but happen due to uncontrollable conditions and thus deemed dangerous are eliminated. Those that are not, stay.

    Again, excess reproduction allows a sufficient amount of variation (whether through internal production or external influence) which selection sorts through and allows the variation that best fits the conditions at the time.

    This is more logical and reasonable explanation than the evolutionary meme of survival of the fittest since we know that the biosphere is not a competition for limited resources but in fact is an inter-dependent, interactive, cooperative endeavor between all organisms.

    Weasel and your math simply don’t speak to the elephant in the room, which is the pre-existing machinery not only of a replicating genome, but of a replicating genome that is inherently capable of variation by its sufficiently complex starter kit.

    Weasel is just a distraction. It speaks to 2% while the 98% is still waiting for a non-teleological explanation.

    So who is misleading who here?

  6. Steve:
    Actually Joe, evolutionary biology does explain adaptions by chance. The misdirection is all yours (pl).

    How? When extended reproduction provides the driving force for random variations which are weeded out by selection, supports of non-teleological evolution claim that those variations have no explanation. They just happen for whatever reason but there is no teleology behind it.

    But we know that is not true. How? Variations that occur which are detrimental to a genome (via copying errors, radiation, etc), are detected and eliminated. So logically, if a variation is screened and deemed not a threat to the genome, then it is allowed. The environment will determine which one of the allowed variations will match best with the environmental condition currently in play.

    But it is the intelligence (ie the capacity to analyse, sort, eliminate possibilities) embedded within the genome that drives the process. The genome both produces variations and experiences variations beyond its control. Again, those that are not produced internally but happen due to uncontrollable conditions and thus deemed dangerous are eliminated. Those that are not, stay.

    Again, excess reproduction allows a sufficient amount of variation (whether through internal production or external influence) which selection sorts through and allows the variation that best fits the conditions at the time.

    This is more logical and reasonable explanation than the evolutionary meme of survival of the fittest since we know that the biosphere is not a competition for limited resources but in fact is an inter-dependent, interactive, cooperative endeavor between all organisms.

    Weasel and your math simply don’t speak to the elephant in the room, which is the pre-existing machinery not only of a replicating genome, but of a replicating genome that is inherently capable of variation by its sufficiently complex starter kit.

    Weasel is just a distraction. It speaks to 2% while the 98% is still waiting for a non-teleological explanation.

    So who is misleading who here?

    You are, because your entire post reduces to a question about the origin of life, not its diversification through evolution.

  7. stcordova:
    For s = 0.44

    and we have 27 matches, the individual will have the following number of kids

    (1+0.44) ^27 = 18,870 offspring

    Is that right?

    It would be if these were absolute fitnesses. Actually they are relative fitnesses. All that is needed is that the fitnesses fall into a geometric series 1 : 1+s : (1+s)^2 : … for them to cause the changes of genotype frequencies. They could all be multiplied by the same constant C and they would still cause the same changes of frequencies of genotypes.
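
    To spell out why the constant drops out (a sketch in notation not used above: x_k is the frequency, before selection, of phrases with k matches, and x_k' is the frequency after selection):

        \[ x_k' \ = \ \frac{x_k\,C(1+s)^k}{\sum_j x_j\,C(1+s)^j} \ = \ \frac{x_k\,(1+s)^k}{\sum_j x_j\,(1+s)^j} \]

    The factor C appears in every term of both numerator and denominator, so it cancels and the new frequencies depend only on the ratios of the fitnesses.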

  8. Steve,

    Your view of mutation and selection is very strange, and involves lots of assumptions by you. Mutation occurs (it has to occur because there is no such thing as a perfect replication machinery). You imagine an internal machinery that sorts among mutations, discarding those that are deleterious and allowing those that aren’t.

    There is simply no evidence of such machinery in the cell. If it was there we’d see it operating. But what seems to happen is that mutations make random changes in the phenotype, and these interact in various ways with the environment, and that results in organisms having different fitnesses. We see that happening and it is no surprise.

    Genetic leprechauns are not seen. If you think they are there, show them to us. Also front-loaded complex machinery is inherently implausible — the machinery to make, say, the elephant’s trunk would be eroded by mutation during the hundreds of millions of years that the elephant lineage was evolving before it got to the first elephant.

  9. Joe:

    It would be if these were absolute fitnesses. Actually they are relative fitnesses.

    Thanks Joe, I was just checking my understanding since the number seemed so big — the worst from the best was different by a factor of 18,870.

    Would the result of s=0.44 be valid independent of the genome size? That is, could we model the problem with a short phrase, say “test”?

    That might be computationally feasible and we can make the population size large without bogging the computer down. A phrase with 28 letters might be challenging.

    The way I read the model, the only variables influencing the outcome are u and s. Is that correct?

    Thanks again.

  10. Steve: But it is the intelligence (ie the capacity to analyse, sort, eliminate possibilities) embedded within the genome that drives the process.

    No it’s not. Unless you have some, you know, evidence for that?

  11. This really is great stuff. It’s gonna take me a while to figure out why a geometrical fitness function (1+s)^k and not some other like k(1+s) (funny how that looks a lot like keiths in leet speak hahaha).

    I take it a geometrical function like that is what makes each site independent and allows the following mathematical analysis. I’ll need to do some googling on Markov processes.

    Thanks so much for this Prof. Felsenstein

  12. dazz,

    I take it a geometrical function like that is what makes each site independent

    No, they are just independent! There is no epistasis, with selection – the fitness of one site is not boosted or diminished by what is at another.

  13. Allan Miller:
    dazz,

    No, they are just independent! There is no epistasis, with selection – the fitness of one site is not boosted or diminished by what is at another.

    I know, I know Allan. I’m just trying to wrap my head around the model. I was referring to this in particular:

    The interesting property of these assumptions is that, in the Wright-Fisher model with N = 1, each site evolves independently because neither its mutational process nor its selection coefficients depend on the state of the other sites. So we can simply ask about the evolution of one site.

  14. dazz,

    This really is great stuff. It’s gonna take me a while to figure out why a geometrical fitness function (1+s)^k and not some other like k(1+s) (funny how that looks a lot like keiths in leet speak hahaha).

    🙂

  15. dazz,

    Here’s how I think about it.

    Imagine we have two genotypes that are identical except for one locus, let’s say site number 23. If neither genotype matches the target at site 23, then we want the genotypes to have identical fitness. If one of them matches the target at site 23, we want it to have higher fitness than the other. Specifically, we want it to be higher by a factor of 1+s.

    But we don’t just want this to be true for two otherwise identical genotypes that differ at site 23, we also want it to be true for two otherwise identical genotypes that differ at any single site.

    The way to achieve this is to have each site independently increase the overall fitness of the genotype by a factor of 1+s if it matches. In other words, the overall fitness should be (1+s)^k, where k is the number of matches.

  16. keiths:
    dazz,

    Here’s how I think about it.

    Imagine we have two genotypes that are identical except for one locus, let’s say site number 23. If neither genotype matches the target at site 23, then we want the genotypes to have identical fitness. If one of them matches the target at site 23, we want it to have higher fitness than the other. Specifically, we want it to be higher by a factor of 1+s.

    But we don’t just want this to be true for two otherwise identical genotypes that differ at site 23, we also want it to be true for two otherwise identical genotypes that differ at any single site.

    The way to achieve this is to have each site independently increase the overall fitness of the genotype by a factor of 1+s if it matches. In other words, the overall fitness should be (1+s)^k, where k is the number of matches.

    Thanks keiths. I understand why, if each match increases fitness by a factor of (1 + s), we get (1 + s)^k. What I don’t understand is why pick “by a factor of (1 + s)” and not “add (1 + s) to fitness for each match” for example, which would result in a fitness function of k(1 + s).

    I suspect that’s important but I’m not sure why at this point

  17. Steve:

    When extended reproduction provides the driving force for random variations which are weeded out by selection, supports of non-teleological evolution claim that those variations have no explanation.

    Bullshit. What is claimed, by “supports of non-teleological evolution”, is that the explanation for said variations doesn’t happen to involve teleology. A non-teleological explanation is still an explanation… except maybe in the eyes of someone who believes that all explanations just plain are teleological, I guess…

    They just happen for whatever reason but there is no teleology behind it.

    Is it your position that all explanations just plain are teleological, Steve?

    Variations that occur which are detrimental to a genome (via copying errors, radiation, etc), are detected and eliminated.

    “detected”, how? By what means and/or agency?

    “eliminated”, how? By what means and/or agency?

    So logically, if a variation is screened and deemed not a threat to the genome, then it is allowed.

    “screened”, how? By what means and/or agency?

    “deemed not a threat”, how? By what means and/or agency?

    The environment will determine which one of the allowed variations will match best with the environmental condition currently in play.

    Yep. Since the environment is (in mainstream biology, at least) what’s doing the “allow[ing]” in the first place, one can indeed say that the environment “determine[s]” which variations are “allowed”, no teleology need apply. Is this a problem?

  18. dazz,

    what I don’t understand is why pick “by a factor of (1 + s)” and not “add (1 + s) to fitness for each match” for example, which would result in a fitness function of k(1 + s).

    If you used k(1+s) instead of (1+s)^k, the fitnesses wouldn’t have the desired ratios in the cases I described.

    If one genotype matched at k sites and the other matched at k+1 sites, then the ratio of fitnesses would be (k+1)(1+s)/k(1+s), or (k+1)/k. The s terms would cancel out, which is not what you want.

    You want the ratio to be (1+s) for any two otherwise identical genotypes that differ at a single site, with one of the two being a match at that site. You get that with a (1+s)^k fitness function but not with k(1+s).

  19. keiths,

    Keiths, don’t get me wrong, I appreciate that, but I think I didn’t make myself clear enough: I know that with that geometric fitness function the fitness increases by a factor of 1+s for each match, independently of where it is, I know that’s not the case for any other function, what I don’t know is why one would pick that condition and not another like an arithmetic one.

    I know that the selection coefficient cancels out in the ratio of any two genotypes with the arithmetic fitness function, what I don’t understand is what are the implications of that to the model, and why the geometric function is picked instead.

  20. OK, I think I get it now:

    The interesting property of these assumptions is that, in the Wright-Fisher model with N = 1, each site evolves independently because neither its mutational process nor its selection coefficients depend on the state of the other sites. So we can simply ask about the evolution of one site

    With an arithmetic fitness function, if there are say 10 matches, adding another match increases its chances of being selected by a smaller amount than if there were for example 5.

  21. keiths: You want the ratio to be (1+s) for any two otherwise identical genotypes that differ at a single site, with one of the two being a match at that site

    Am I dumb or what, that’s exactly what you were saying there. My apologies

  22. With an arithmetic fitness function, if there are say 10 matches, adding another match increases its chances of being selected by a smaller amount than if there were for example 5.

    My understanding is that there is an additive fitness model and then a multiplicative fitness model. Both appear in the literature. The one in Joe’s OP is a multiplicative model.

  23. stcordova: My understanding is that there is an additive fitness model and then a multiplicative fitness model. Both appear in the literature. The one in Joe’s OP is a multiplicative model.

    I was wondering which one, if any, best models what’s going on in RL

  24. BTW, I guess another reason a fitness function of k(1+s) doesn’t make sense is because for s=0 genotypes with more matches would still have a greater chance of being selected when it should be fully random/equiprobable. In any case it should be k*s

    But of course as keiths pointed out, s cancels out when comparing relative fitness: it doesn’t matter what s is, only the number of matches has an impact on the probability of being selected. A descendant with 10 matches would have twice the chance of being selected as a descendant with 5 matches, irrespective of s.

    I remember struggling with this when I was trying to make Sal’s weasel algo take s into account for all descendants, and not only the fittest one: intuitively the solution seems to be to make it exponential: s^k for s in [0:infinity)

    But for both s=0 and s=1, k doesn’t affect relative fitness and both are full blown drift, hence (1+s)^k is used instead so that only s=0 implies total drift

  25. I see that while I was away (sleeping) the additive-versus-multiplicative debate has raged. keiths gave a good explanation. A few points:

    1. The additive model uses the formula 1+ks, not k(1+s).

    2. One reason for preferring the multiplicative model is that when independent sources of mortality are affected by different loci, one gets multiplicativity. If you have a 30% chance of being eaten by a predator, and a 30% chance of dying of a disease, your probability of surviving both is not 40%, but 49%.

    3. The population genetics literature on effects of multiplicativity includes

    Felsenstein, J. 1965 The effect of linkage on directional selection. Genetics 52: 349-363. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1210855/

    Eshel, I. and M. W. Feldman. 1970. On the evolutionary effect of recombination. Theoretical Population Biology 1: 88-100. http://dynamics.org/~altenber/LIBRARY/REPRINTS/Feldman+Eshel_recombination_TPB.1970.pdf

    Maynard Smith, J. 1974. Recombination and the rate of evolution. Genetics 78: 299-304. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1213188/

  26. dazz,

    BTW, how do you embed those img from quicklatex in your posts please?

    Put the latex code inside a [latex] [/latex] pair, like this:

    [latex]
    e^{ \pm i\theta } = \cos \theta \pm i\sin \theta
    [/latex]

  27. One thing I haven’t figured out yet is how to successfully edit a comment containing latex code.

    The editing software seems to mangle latex, though the initial comment posting software does not.

    Joe, Tom, how do you guys do it?

  28. Okay, maybe I have it figured out. The editing software strips out backslashes, so when editing a comment (as opposed to posting it the first time) you need to use double backslashes, like so:

    [latex]
    e^{ \\pm i\\theta } = \\cos \\theta \\pm i\\sin \\theta
    [/latex]

     e^{ \pm i\theta } = \cos \theta \pm i\sin \theta

  29. keiths:
    One thing I haven’t figured out yet is how to successfully edit a comment containing latex code.

    The editing software seems to mangle latex, though the initial comment posting software does not.

    Joe, Tom, how do you guys do it?

    We keep it a secret.

    Actually, I had the same problem, and finally noticed that the Save command from the editing box that appears when you click on the Edit link below your comment has the unfortunate habit of stripping out backslashes. I have to make each backslash a double-backslash, each time I edit. Then it strips out only one of the pair and I am OK. Until the next edit when I have to put the extra backslashes back in. While privately expressing my lack of gratitude to whoever designed that feature.

  30. keiths: Joe, Tom, how do you guys do it?

    I haven’t found a way to do it in threads I didn’t open. (Use the verboten edit for comments in your own thread.) I usually have a \LaTeX (did I get “LaTeX” right?) document open. So I usually prepare my comment there, and copy-and-paste it here.

  31. dazz:

    Alternatively, instead of having the fitness function be

        \[ 1/(1-s')^k \]

    You could multiply all these by (1-s')^L, with L the number of sites (28 in our case). Then the fitness function for k matches would be

        \[ (1-s')^{L-k} \]

    so that the fitnesses for k = 0, 1, 2, \dots, would be (1-s')^{28}, (1-s')^{27}, (1-s')^{26}, \dots, 1-s', 1. The relation between s' and s would be the one you have suggested. The fitnesses here are relative fitnesses because you can multiply them by any constant and have them have the same effect.

  32. Joe Felsenstein: Actually, I had the same problem, and finally noticed that the Save command from the editing box that appears when you click on the Edit link below your comment has the unfortunate habit of stripping out backslashes.

    You should try coding SQL Server Stored Procedures.

  33. OK Folks. Just use dollar signs. Single dollar signs surrounding LaTeX code should format into LaTeX, and double dollar signs around LaTeX code put it on a line by itself.

  34. BTW LaTeX is enabled for the site, posts and comments. No need to precede LaTeX with [latex]

  35. dazz,

    You might find this link helpful. Pavel Holoborodko wrote the plugin that this site is using.

    ETA

    \LaTeX in single dollar signs.

        \[\LaTeX\]

    in double dollar signs.

    ETA2

    Oops tried to use HTML “code” tags to show what I typed. Didn’t work!
