Wright, Fisher, and the Weasel

Richard Dawkins’s Weasel program is a computer simulation that explores how long it takes a 28-letter phrase to evolve into the target phrase “Methinks it is like a weasel”. The program keeps a single copy of the phrase, which produces a number of offspring; each letter of each offspring is subject to mutation, with 27 possible letters, the 26 letters A-Z and a space. The offspring that is closest to the target replaces the single parent. The purpose of the program is to show that creationist orators who argue that evolutionary biology explains adaptations by “chance” are misleading their audiences. Pure random mutation without any selection would lead to a random sequence of 28-letter phrases. There are 27^{28} possible 28-letter phrases, so it would take about 10^{40} different phrases before we found the target. That is without arranging for the phrase that replaces the parent to be the one closest to the target. Once that highly nonrandom condition is imposed, the number of generations to success drops dramatically, from 10^{40} to mere thousands.
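
A quick check of that arithmetic: since \log_{10} 27 \approx 1.43,

    \[ 27^{28} \ = \ 10^{28\,\log_{10} 27} \ \approx \ 10^{40.1} \]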

Although Dawkins’s Weasel algorithm is a dramatic success at making clear the difference between pure “chance” and selection, it differs from standard evolutionary models. It has only one haploid adult in each generation, and since the offspring that is most fit is always chosen, the strength of selection is in effect infinite. How does this compare to the standard Wright-Fisher model of theoretical population genetics?

The Wright-Fisher model, developed 85 years ago, independently, by two founders of theoretical population genetics, is somewhat similar. But it has an infinite number of newborns, precisely N of whom survive to adulthood. In its simplest form, with asexual haploids, the adult survivors each produce an infinitely large, but equal, number of offspring. In that infinitely large offspring population, mutations occur exactly at their expected frequencies. Each genotype has a fitness, and the proportions of the genotypes change according to those fitnesses, exactly as expected. Since there are an infinite number of offspring, there is no genetic drift at that stage. Finally, a random sample of N of these offspring is chosen to become the adults of the new generation.
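
To make that cycle concrete, here is a minimal sketch of one generation, assuming Python with NumPy and a toy two-genotype case (all names and numbers here are illustrative, not part of any program discussed below):

    import numpy as np

    def wf_generation(counts, fitness, mut, rng):
        """One haploid Wright-Fisher generation: deterministic mutation and
        selection in an effectively infinite offspring pool, then a random
        draw of N adults (the only place where drift enters)."""
        N = counts.sum()
        freq = counts / N
        freq = freq @ mut            # mutation, exactly at expected frequencies
        freq = freq * fitness        # selection, proportions change exactly
        freq = freq / freq.sum()
        return rng.multinomial(N, freq)

    rng = np.random.default_rng(0)
    counts = np.array([990, 10])     # N = 1000 adults of two genotypes
    fitness = np.array([1.0, 1.1])
    mut = np.array([[0.99, 0.01],    # mut[i][j] = Pr(offspring of i is type j)
                    [0.01, 0.99]])
    for _ in range(100):
        counts = wf_generation(counts, fitness, mut, rng)
    print(counts)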

In one case we get fairly close to the Weasel. That is when N = 1, so that there is only one adult individual. Now for some theory, part of which I explained before, in a comment here. We first assume that a match of one letter to the target multiplies the fitness by a factor of 1+s, and that this holds no matter how many other letters match. This is the fitness scheme in the genetic algorithm program by keiths. A phrase that has k matches to the target has fitness (1+s)^k. We also assume that mutation occurs independently in each letter of each offspring with probability u, and produces one of the other 26 possibilities, chosen uniformly at random. This too is what happens in keiths’s program. If a letter matches the target, there is a probability u that it will cease to match after mutation. If it does not match, there is a probability u/26 that it will match after mutation.
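
As a concrete sketch of this fitness scheme and mutation model (in Python; the identifiers are illustrative, not taken from keiths’s program):

    import random

    ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ "   # 27 characters: A-Z and space
    TARGET = "METHINKS IT IS LIKE A WEASEL"

    def fitness(phrase, s):
        """A phrase with k letters matching the target has fitness (1+s)**k."""
        k = sum(a == b for a, b in zip(phrase, TARGET))
        return (1 + s) ** k

    def mutate(phrase, u):
        """Each letter independently mutates with probability u, becoming one
        of the other 26 characters chosen uniformly at random."""
        letters = []
        for c in phrase:
            if random.random() < u:
                c = random.choice([x for x in ALPHABET if x != c])
            letters.append(c)
        return "".join(letters)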

The interesting property of these assumptions is that, in the Wright-Fisher model with N = 1, each site evolves independently, because neither its mutational process nor its selection coefficients depend on the state of the other sites. So we can simply ask about the evolution of one site.

If the site matches, after mutation 1-u of the offspring will still match, and u of them will not. Selection results in a ratio of (1+s)(1-u)\,:\,u of matching to nonmatching letters. Dividing by the sum of these, the probability of ending up with a nonmatch is then u divided by the sum, or

    \[ q \ = \ u\big/\left((1+s)(1-u)+u\right) \ = \ u \big/ (1 + s(1-u)) \]

Similarly, if a letter does not match, after mutation the ratio of matches to nonmatches is \frac{u}{26}\,:\,1-\frac{u}{26}. After selection the ratio is (1+s)\frac{u}{26}\,:\,1-\frac{u}{26}. Dividing by the sum of these, the probability of ending up with a match is

    \[ p \ = \ \frac{u}{26}(1+s) \bigg/\left( \frac{u}{26}(1+s) + \left(1 - \frac{u}{26}\right)\right) \ = \ u(1+s) \big/ (26 + us) \]

Thus, what we see at one position in the 28-letter phrase is that if it is not matched, it has probability p of changing to a match, and if it is matched, it has probability q of changing to not being matched. This is a simple 2-state Markov process. I’m going to leave out the further analysis — I can explain any point in the comments. Suffice it to say that

1. The probability, in the long run, of a position matching is p/(p+q).

2. This is independent in each of the 28 positions. So the number of matches will, in the long run, be a Binomial variate with 28 trials each with probability p/(p+q).

3. The “relaxation time” of this system, the number of generations to approach the long-term equilibrium, will be roughly 1/(p+q).

4. At what value of s do we expect selection to overcome genetic drift? There is no value at which drift does not occur, but we can roughly say that selection is having a noticeable effect once the equilibrium frequency p/(p+q) is twice as great as the no-selection value of 1/27. The equation for this value is a quadratic in s whose coefficients involve u as well. If u is much smaller than s, the equilibrium frequency is approximately (1+s)^2/((1+s)^2+26), and setting that equal to 2/27 gives a value of s close to 0.44. (The sketch after this list checks this numerically.)

5. The probability of a match at a particular position, one which started out not matching, will be after t generations

    \[ P(t) \ = \ \frac{p}{p+q} (1\:-\:(1-p-q)^t) \]

and the probability of a nonmatch given that one started out matching is

    \[ Q(t) \ = \ \frac{q}{p+q} (1\:-\:(1-p-q)^t) \]
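
Here is a short numerical check of points 1, 3, 4, and 5, iterating the two-state chain directly (a sketch in Python, with example parameter values; it does not run keiths’s program):

    import random

    A = 27                 # alphabet size: 26 letters plus space
    u, s = 0.01, 1.0       # example mutation rate and selection parameter

    p = u * (1 + s) / (A - 1 + u * s)   # nonmatch -> match, derived above
    q = u / (1 + s * (1 - u))           # match -> nonmatch, derived above

    print("equilibrium match probability p/(p+q):", p / (p + q))
    print("relaxation time, roughly 1/(p+q):", 1 / (p + q))

    # Point 5: probability of a match after t generations, starting unmatched.
    t = 100
    print("P(t):", (p / (p + q)) * (1 - (1 - p - q) ** t))

    # Simulate one site for many generations and compare with point 1.
    match, hits, gens = False, 0, 1_000_000
    for _ in range(gens):
        match = (random.random() < p) if not match else (random.random() >= q)
        hits += match
    print("simulated long-run match frequency:", hits / gens)

    # Point 4, small-u limit: the equilibrium is about (1+s)^2/((1+s)^2+26);
    # setting that equal to 2/27 gives s = sqrt(52/25) - 1, about 0.44.
    print("threshold s:", (52 / 25) ** 0.5 - 1)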

I don’t have the energy to see whether the runs of keiths’s program verify this, but I hope others will.

291 thoughts on “Wright, Fisher, and the Weasel”

  1. For a computer simulation with a survivor population of size N, we could use the brute-force method, which is somewhat easy conceptually but probably slow computationally.

    The difficulty is that it is desirable to have an integer number of individuals, not fractional ones.

    Say the surviving population is N=1000

    So we have an individual with 27 matches. We give him 18,870 kids for s=0.44, since (1.44)^{27} is about 18,870. We do the same calculation for the others and come up with a fairly sizable birth population. Then we do a random draw out of that birth population to form the surviving population of N=1000.

    That’s how I envision the algorithm working without any tricks; it just may be a little slow.
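
    A sketch of that brute-force generation (in Python; the parameter names are illustrative):

        import random

        def brute_force_generation(adults, target, s, u, alphabet, N):
            """Give each adult round((1+s)**matches) offspring, mutate each
            letter with probability u, then draw N survivors uniformly."""
            births = []
            for phrase in adults:
                k = sum(a == b for a, b in zip(phrase, target))
                for _ in range(round((1 + s) ** k)):   # 1.44**27 ~ 18,870
                    kid = "".join(
                        c if random.random() >= u
                        else random.choice([x for x in alphabet if x != c])
                        for c in phrase)
                    births.append(kid)
            return random.sample(births, N)            # uniform random draw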

  2. Choosing a finite number of adults out of the large, but finite, number of offspring is difficult. keiths did it one way. If one weights the individual offspring by their fitnesses, one can choose by sampling proportional to weights. However, the total weight of each genotype then changes somewhat after each individual is drawn.

    As the number of offspring gets very large, one approaches the infinite case, where sampling an individual causes no change in the mix of individuals available to sample.
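
    A sketch of that weighted draw, with the pool shrinking after each pick (in Python; this is not keiths’s implementation):

        import random

        def sample_by_fitness(offspring, fitnesses, N):
            """Draw N survivors without replacement, each pick proportional to
            fitness; each removal changes the remaining total weight."""
            pool = list(zip(offspring, fitnesses))
            survivors = []
            for _ in range(N):
                total = sum(w for _, w in pool)
                r = random.uniform(0, total)
                acc = 0.0
                for i, (ind, w) in enumerate(pool):
                    acc += w
                    if r <= acc:
                        survivors.append(ind)
                        del pool[i]
                        break
            return survivors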

  3. Sal:

    For a computer simulation with a survivor population of size N, we could use the brute-force method, which is somewhat easy conceptually but probably slow computationally.

    Why not just set NUM_SURVIVORS to N in my program, while making sure that POPULATION_SIZE is much, much greater?

    Joe:

    Choosing a finite number of adults out of the large, but finite, number of offspring is difficult. keiths did it one way. If one weights the individual offspring by their fitnesses, one can choose by sampling proportional to weights. However, the total weight of each genotype then changes somewhat after each individual is drawn.

    I thought about that issue when I was writing the code. You could avoid it by segregating the survivor slots from the offspring slots. Then, when sampling offspring, you would copy them to the survivor slots but leave them in the offspring pool where they could potentially be resampled.

    In the end, I decided that it didn’t make sense. For a given memory footprint, I think you’re better off dedicating as much of it to offspring as possible, so that the law of large numbers can work its magic.

  4. I am among the most math deficient in this discussion, but the consistent appearance of pretty bell curves makes me smile. Some regular process is afoot. I leave it to joe f to tell me if the logic is correct.

  5. It appears that the effect of s > 0 is attenuated when the number of offspring is not large.

    ETA: The sentence length is L = 100.

    selection: 0
    **************************************************************
    u, s = 0.01 0.0
    p, q = 0.005 0.01
    Long-run probability of match at site is p/(p+q) = 0.333333333333
    Long-run expected fitness is Lp/(p+q) = 33.3333333333
    Observed long-run fitness is 33.459282302
    Number of offspring is 10
    **************************************************************
    selection: 1
    **************************************************************
    u, s = 0.01 1.0
    p, q = 0.00995024875622 0.00502512562814
    Long-run probability of match at site is p/(p+q) = 0.664440734558
    Long-run expected fitness is Lp/(p+q) = 66.4440734558
    Observed long-run fitness is 62.8474613932
    Number of offspring is 10
    **************************************************************
    selection: 2
    **************************************************************
    u, s = 0.01 2.0
    p, q = 0.0148514851485 0.00335570469799
    Long-run probability of match at site is p/(p+q) = 0.815693430657
    Long-run expected fitness is Lp/(p+q) = 81.5693430657
    Observed long-run fitness is 75.3551827575
    Number of offspring is 10
    **************************************************************
    selection: 4
    **************************************************************
    u, s = 0.01 4.0
    p, q = 0.0245098039216 0.00201612903226
    Long-run probability of match at site is p/(p+q) = 0.923994038748
    Long-run expected fitness is Lp/(p+q) = 92.3994038748
    Observed long-run fitness is 84.9353405177
    Number of offspring is 10
    **************************************************************
    selection: 8
    **************************************************************
    u, s = 0.01 8.0
    p, q = 0.0432692307692 0.00112107623318
    Long-run probability of match at site is p/(p+q) = 0.974745021855
    Long-run expected fitness is Lp/(p+q) = 97.4745021855
    Observed long-run fitness is 91.5322741918
    Number of offspring is 10
    **************************************************************

  6. Corresponding to the plot I just posted:

    selection: 0
    **************************************************************
    u, s = 0.01 0.0
    p, q = 0.005 0.01
    Long-run probability of match at site is p/(p+q) = 0.333333333333
    Long-run expected fitness is Lp/(p+q) = 33.3333333333
    Observed long-run fitness is 33.157204755
    Number of offspring is 20
    **************************************************************
    selection: 1
    **************************************************************
    u, s = 0.01 1.0
    p, q = 0.00995024875622 0.00502512562814
    Long-run probability of match at site is p/(p+q) = 0.664440734558
    Long-run expected fitness is Lp/(p+q) = 66.4440734558
    Observed long-run fitness is 65.4541717587
    Number of offspring is 20
    **************************************************************
    selection: 2
    **************************************************************
    u, s = 0.01 2.0
    p, q = 0.0148514851485 0.00335570469799
    Long-run probability of match at site is p/(p+q) = 0.815693430657
    Long-run expected fitness is Lp/(p+q) = 81.5693430657
    Observed long-run fitness is 78.2395289412
    Number of offspring is 20
    **************************************************************
    selection: 4
    **************************************************************
    u, s = 0.01 4.0
    p, q = 0.0245098039216 0.00201612903226
    Long-run probability of match at site is p/(p+q) = 0.923994038748
    Long-run expected fitness is Lp/(p+q) = 92.3994038748
    Observed long-run fitness is 88.8115764915
    Number of offspring is 20
    **************************************************************
    selection: 8
    **************************************************************
    u, s = 0.01 8.0
    p, q = 0.0432692307692 0.00112107623318
    Long-run probability of match at site is p/(p+q) = 0.974745021855
    Long-run expected fitness is Lp/(p+q) = 97.4745021855
    Observed long-run fitness is 94.9761137651
    Number of offspring is 20
    **************************************************************

  7. The long-run fitness is just as predicted, for larger numbers of offspring. But, holding s = 4, and letting the number of offspring range over {8, 16, 32}:

    n_offspring: 8
    **************************************************************
    u, s = 0.01 4.0
    p, q = 0.0245098039216 0.00201612903226
    Long-run probability of match at site is p/(p+q) = 0.923994038748
    Long-run expected fitness is Lp/(p+q) = 92.3994038748
    Observed long-run fitness is 82.8873458505
    Number of offspring is 8
    **************************************************************
    n_offspring: 16
    **************************************************************
    u, s = 0.01 4.0
    p, q = 0.0245098039216 0.00201612903226
    Long-run probability of match at site is p/(p+q) = 0.923994038748
    Long-run expected fitness is Lp/(p+q) = 92.3994038748
    Observed long-run fitness is 87.9172314187
    Number of offspring is 16
    **************************************************************
    n_offspring: 32
    **************************************************************
    u, s = 0.01 4.0
    p, q = 0.0245098039216 0.00201612903226
    Long-run probability of match at site is p/(p+q) = 0.923994038748
    Long-run expected fitness is Lp/(p+q) = 92.3994038748
    Observed long-run fitness is 89.634596156
    Number of offspring is 32
    **************************************************************

    Here the alphabet size is still 3, and the sentence length is still 100. Assuming I haven’t fouled up the implementation of the model, it’s kind of interesting to see small samples attenuate the effect of increasing s.

  8. Some Weasel results.

    I’ve held the mutation rate at u=1/28, and number of offspring at 1000. I throw out the first 2000 generations in calculation of mean (“observed long-run”) fitness. I’ve added corresponding standard deviations at the bottom of this comment.

    selection (s): 0
    **************************************************************
    u, s = 0.01 0.0
    p, q = 0.000384615384615 0.01
    Long-run probability of match at site is p/(p+q) = 0.037037037037
    Long-run expected fitness is Lp/(p+q) = 1.03703703704
    Observed long-run fitness is 1.04495872491
    Number of offspring is 1000
    Sentence length is 28
    Alphabet size is 27
    **************************************************************
    selection (s): 2
    **************************************************************
    u, s = 0.01 2.0
    p, q = 0.00115295926211 0.00335570469799
    Long-run probability of match at site is p/(p+q) = 0.255720823799
    Long-run expected fitness is Lp/(p+q) = 7.16018306636
    Observed long-run fitness is 7.2933745574
    Number of offspring is 1000
    Sentence length is 28
    Alphabet size is 27
    **************************************************************
    selection (s): 4
    **************************************************************
    u, s = 0.01 4.0
    p, q = 0.00192012288786 0.00201612903226
    Long-run probability of match at site is p/(p+q) = 0.487804878049
    Long-run expected fitness is Lp/(p+q) = 13.6585365854
    Observed long-run fitness is 13.2219977347
    Number of offspring is 1000
    Sentence length is 28
    Alphabet size is 27
    **************************************************************
    selection (s): 16
    **************************************************************
    u, s = 0.01 16.0
    p, q = 0.00649847094801 0.000593824228029
    Long-run probability of match at site is p/(p+q) = 0.916271924209
    Long-run expected fitness is Lp/(p+q) = 25.6556138779
    Observed long-run fitness is 25.57621861
    Number of offspring is 1000
    Sentence length is 28
    Alphabet size is 27
    **************************************************************

    Addition. Here are standard deviations of fitness for the four runs, throwing out the first 2000 generations (as I did for the means, i.e., “observed long term fitness,” above).

    s=0, std dev=1.0249881097232614,
    s=2, std dev=2.2326912542112818,
    s=4, std dev=2.8600378859108697,
    s=16, std dev=1.5048244544570109

  9. So it converges to the expected fitness within a few % points in just 2000 generations? Damn, blind chance is sort of predictable, huh?

  10. dazz:
    So it converges to the expected fitness within a few % points in just 2000 generations? Damn, blind chance is sort of predictable, huh?

    In what way is this blind chance being predictable?

    Can someone explain to me, why is all of this not just a model of directed evolution? Surely it is set up to be directed towards a target?

  11. CharlieM: In what way is this blind chance being predictable?

    Can someone explain to me, why is all of this not just a model of directed evolution? Surely it is set up to be directed towards a target?

    I’m sorry Charlie, we’ve been discussing GAs for months and that nonsense has been dealt with time and again. Do your homework and read the previous threads on GAs.

  12. CharlieM: In what way is this blind chance being predictable?

    In much the same way that radioactive decay is predictable in the aggregate while being very much unpredictable as far as individual atoms are concerned, or in much the same way that casinos can reliably make a steady income off of games that are solidly founded on blind chance.

  13. I’ve posted my Python code — all 100 lines, including an example of how to use it — at my blog, Bounded Science. There is still a big hole in my testing. I haven’t yet varied the mutation rate, while holding all other parameters constant. But I’m pretty comfortable sharing the code with you now.

    Here are plots of another four runs, with a horizontal line added to each to show the mean fitness (first 5000 generations excluded). I’d add four other lines to show the Wright-Fisher predictions, if not for the fact that three of them are on top of the lines you see. There’s a discernible deviation for the red plot, which should not be surprising, because you can see that there’s greater variability in it than in the others.

  14. Tom English:
    I’ve posted my Python code — all 100 lines, including an example of how to use it — at my blog, Bounded Science. There is still a big hole in my testing. I haven’t yet varied the mutation rate, while holding all other parameters constant. But I’m pretty comfortable sharing the code with you now.

    Here are plots of another four runs, with a horizontal line added to each to show the mean fitness (first 5000 generations excluded). I’d add four other lines to show the Wright-Fisher predictions, if not for the fact that three of them are on top of the lines you see. There’s a discernible deviation for the red plot, which should not be surprising, because you can see that there’s greater variability in it than in the others.

    Thanks for sharing the code Tom.
    Maybe our creationist friends can answer how it can calculate the fitness of each generation without a single reference to the target string in it, LOL

  15. Tom, your testing seems very thorough. One phrase I’d modify is “long-run expected fitness”. The quantity you are computing is the long-run expected number of matches. Since the fitness is a nonlinear function of the number of matches the expectation of the fitness is different from that.

    In the theory with an infinite number of offspring and one surviving adult, the matches at one position are independent of the matches at another. At each position the fitness is 1+s if there is a match, 1 if there is not. So the expected fitness is 1+sp_1. The random variables at each position being independent, the expected value of the product of fitnesses across positions is the product of the expectations. So the expected fitness is (1+sp_1)^{28}.

    When the number of offspring is not infinite, there will be small random departures from independent distribution across positions. They do not persist from one generation to another. The expected fitness overall will still be the product of expectations at the individual positions, but those are no longer exactly 1+sp_1.
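
    Spelling out the infinite-offspring step above: writing X_i for the indicator of a match at position i, independence gives

        \[ E\left[\prod_{i=1}^{28}(1+s\,X_i)\right] \ = \ \prod_{i=1}^{28}\left(1+s\,E[X_i]\right) \ = \ (1+s\,p_1)^{28} \]

    and since (1+s)^k is convex in k, this expected fitness is larger than the fitness at the expected number of matches, (1+s)^{28p_1}.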

  16. I also hope that at some point the other prediction of the theory will be tested: that the number of matches will follow a binomial distribution. At long-run equilibrium this will be a binomial distribution with 28 trials and probability of success equal to the long-term expectation. With an infinite number of offspring this probability will be p_1.

    However the outcomes will not be independent from generation to generation. That depends on the mutation rate — when u is small successive generations are autocorrelated. So testing fit of the histograms may be hard to do unless one chooses samples well-separated in time.
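
    One way to run that test (a sketch in Python; the gap of roughly 5/(p+q) generations between samples is a guess at “well-separated,” and the chain below is the theory’s two-state process, not keiths’s program):

        import random
        from collections import Counter
        from math import comb

        L, A, u, s = 28, 27, 1/28, 5.0
        p = u * (1 + s) / (A - 1 + u * s)
        q = u / (1 + s * (1 - u))

        gap = int(5 / (p + q))           # spacing to reduce autocorrelation
        sites = [False] * L              # L independent two-state sites
        counts = Counter()
        for gen in range(2000 + 500 * gap):
            sites = [(random.random() < p) if not m else (random.random() >= q)
                     for m in sites]
            if gen > 2000 and gen % gap == 0:
                counts[sum(sites)] += 1  # number of matches in this sample

        # Compare observed counts of k matches with Binomial(L, p/(p+q)).
        P = p / (p + q)
        n = sum(counts.values())
        for k in sorted(counts):
            print(k, counts[k], round(n * comb(L, k) * P**k * (1 - P)**(L - k), 1))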

  17. I have an unrelated question for the experts. It’s again about additive vs geometric fitness functions. A few posts back Prof. Felsenstein pointed out that the correct fitness function for the additive case is 1+ks.
    The question is, with this function, as s tends towards infinity, the relative fitness of two genotypes with k and k' matches, tends towards \frac{k}{k'} (s cancels out as noted by keiths). It seems mathematically impossible to have an additive fitness model accurately deal with a selection coefficient if the best that one can do is \frac{k}{k'}. What use is this to model selection?

    I’m probably missing something obvious again

  18. Joe Felsenstein: The quantity you are computing is the long-run expected number of matches. Since the fitness is a nonlinear function of the number of matches the expectation of the fitness is different from that.

    Got it. You can see from the label on the y-axis of my last plot that I was trying to keep this straight. I’m accustomed to thinking of this somewhat differently, and I’m muddying the waters here. I’ve got a warning up at my blog, not to copy the code right now. I’ll change some identifiers (hopefully without breaking anything).

  19. CharlieM, the whole point of the Weasel algorithm is to show that evolutionary theory is not a theory of evolution “by chance”. dazz’s remark is sarcastic, which you missed.

    The Weasel refutes the statements in many lectures by creationists, statements that assure their audiences that evolutionary biologists explain adaptations as occurring “by chance”. If their audience believes that, they will of course immediately reject evolutionary biology as nonsense. After all, random genotypes would have so small a probability of producing exquisitely adapted phenotypes that we would never see anything like the adaptations we do see. The Weasel is a teaching demonstration showing that the processes evolutionary biologists envision are very, very far from random, and that the statements of the creationist debaters are massively misleading.

    You seem to have countered with the objection that the Weasel model is of processes that are very, very far from random. Well, um yes, of course.

  20. dazz: It seems mathematically impossible to have an additive fitness model accurately deal with a selection coefficient if the best that one can do is \frac{k}{k'}. What use is this to model selection?

    Yes, that is a problem with the additive model. The multiplicative model does not have that problem, as that ratio is then (1+s)^{k-k'} which becomes infinite as s becomes infinite, if k > k'.

  21. Joe Felsenstein: Yes, that is a problem with the additive model. The multiplicative model does not have that problem, as that ratio is then (1+s)^{k-k'} which becomes infinite as s becomes infinite, if k > k'.

    Thank you, and also thanks to keiths again for putting me on the right track

    petrushka: This seems to match keiths’ results. With s=5, he gets a mean of approximately 12.

    So keiths’ algo modeling the process matches Tom’s algo running the mathematical model. Sweet.

    Now to test the binomial, wouldn’t it be enough to use a large L, and keep running keiths’ algo to see if the number of matches converges to pL ?

  22. Joe Felsenstein: So the expected fitness is (1+sp_1)^{28}.

    Joe Felsenstein: With an infinite number of offspring this probability will be p_1.

    Woops, notational bungle. In my own private notes I called the expected probability of matches p_1, but in this thread I called the long-term expectation p/(p+q). The expectation after time t was called P(t). Those quantities should replace p_1.

  23. Hi Tom,

    Any restrictions on the use of your code or do you consider it public domain?

    It would be nice if we could all agree on using the same language here, even if it is Python. 😉

    No problems compiling Python, R or Ruby on Windows.

  24. petrushka: This seems to match keiths’ results. With s=5, he gets a mean of approximately 12.

    I ran with s=5, and with a mutation rate of 1/28. The mean number of matching sites is about 16.

    ************************************************************************
    selection (s): 5
    ************************************************************************
    Sentence length L : 28
    Alphabet size A : 27
    Number of offspring : 1000
    Mutation rate u : 0.0357142857143
    Selection parameter s : 5.0
    p = u * (1 + s) / (A-1 + u * s) : 0.00818553888131
    q = u / (1 + s * (1 - u)) : 0.00613496932515
    Assume equilibrium at generation: 5000
    Theoretical prob. of match at site p/(p+q) : 0.571595558153
    Expected number of matching sites Lp/(p+q) : 16.0046756283
    Mean of observed numbers of matching sites : 15.8920642941
    Std. dev. of observed nums. of matching sites: 2.61846152285
    ************************************************************************

  25. CharlieM,

    Can someone explain to me, why is all of this not just a model of directed evolution? Surely it is set up to be directed towards a target?

    It gets towards the target by means of variation and selection of genotypes in the current population. The programmer does not direct it towards the target. Indeed, there would be no point in writing GAs for problem solution if it were simply a matter of specifying a target and directing the program to find it.

    Consider a small modification – instead of distance from target, evaluate fitness by adding up the ASCII values. Those with the greater sum are fitter than those with a lesser sum. There is no mention of a distant target – although it is clear that the program will converge on a string of all Zs, the program doesn’t know this. It is not drawn by, nor directed towards, that target. It is simply doing generational evaluation of fitter and less fit genotypes in the current population.
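
    A sketch of that alternative fitness function (in Python):

        def ascii_fitness(phrase):
            # No target string appears anywhere; a string of all Zs just
            # happens to be where the sum of character codes is maximal.
            return sum(ord(c) for c in phrase)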

  26. Mung:
    Hi Tom,

    Any restrictions on the use of your code or do you consider it public domain?

    It would be nice if we could all agree on using the same language here, even if it is Python.

    No problems compiling Python, R or Ruby on Windows.

    Well, you shouldn’t post it somewhere else. But I intend for you to use it. However, I’m not totally satisfied with my testing yet. It would be good if Keith matches my results with his radically different approach. What I really want to do is some mathematical analysis that covers the small-sample case. I’ll believe in my code and my math when I get them to agree.

  27. Generation: 5090 Number of survivors per generation: 1

    Selection: ON Coefficient: 5.00 Mutation rate: 0.03

    Organism 0 Fitness 1.31e+010 Hamming distance 15

    MEZHJYBSJIT IQDTGWE J OCASGD

    Histogram of Hamming distances:

    0: 0
    1: 0
    2: 0
    3: 0
    4: 0
    5: 16 X
    6: 32 XX
    7: 118 XXXXXXXXX
    8: 283 XXXXXXXXXXXXXXXXXXXXXXX
    9: 313 XXXXXXXXXXXXXXXXXXXXXXXXXX
    10: 411 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    11: 534 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    12: 442 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    13: 488 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    14: 718 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    15: 634 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    16: 354 XXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    17: 230 XXXXXXXXXXXXXXXXXXX
    18: 180 XXXXXXXXXXXXXXX
    19: 129 XXXXXXXXXX
    20: 74 XXXXXX
    21: 37 XXX
    22: 6
    23: 11
    24: 17 X
    25: 24 XX
    26: 2
    27: 5
    28: 32 XX

  28. Keiths’ program rounds off mutation rate. It’s either .03 or .04.

    He said he was working on a test version that would cycle through the parameters and report only the mean and SD.

  29. Yes, running the numbers

    p=0.00818553888130968622100954979536
    q=0.00613496932515337423312883435583

    P(match) = p/(p+q) = 0.57159555815312682641729982466394

    P(match) * 28 = 16.00467562828755113968439509059

  30. petrushka:
    Keiths’ program rounds off mutation rate. It’s either .03 or .04.

    He said he was working on a test version that would cycle through the parameters and report only the mean and SD.

    I’m not sure that’s the issue. With u=.03 I get 16.04571429

  31. keiths’ program does not seem particularly sensitive to mutation rate. I think that’s to be expected.

    So a Hamming distance of 12 is the same as 16 matches. Does that mean Tom’s and keiths’ programs are reporting the same results?

  32. Tom English: Well, you shouldn’t post it somewhere else.

    I’m thinking more along the lines of converting it into another language. I do have Python installed, I should probably just bite the bullet and use it. If I did convert it to another language I can’t see myself posting it anywhere but here. Just thought I’d ask first.

  33. petrushka:
    keiths’ program does not seem particularly sensitive to mutation rate. I think that’s to be expected.

    So a Hamming distance of 12 is the same as 16 matches. Does that mean Tom’s and keiths’ programs are reporting the same results?

    Could be, I think I was missing a key aspect that should have been obvious: even if the long term average tends towards p/(p+q) the number of matches doesn’t really converge. I guess it’s going to drift and hover around the expected value, so sampling is necessary.

    keiths’ algo being at 13, 14 or 15 matches at any point in time is irrelevant, it’s the long term average that should converge… unless I’m making some other mistake again.

    And yes, the mutation rate is almost irrelevant below a certain point, if my spreadsheet doesn’t lie. The number of matches tends to 16.25806452 as u tends to zero. At u=0.1 it’s 15.52.

  34. petrushka,

    Did I reverse it correctly? What I’ve loaded into the Python interpreter sums to 5090, which matches the generation you gave me. So it’s a problem, I believe, that you’re not discarding the values from before the system reaches equilibrium. That’s how I account for the extra mass on the left (the earlier values are lower than the later ones). I’ve always thrown out at least the first 2000 generations, just to be on the safe side.

    I’ll see, in a moment, if I get something that looks like this for the first 5090 data points.

    [ 32, 5, 2, 24, 17, 11, 6, 37, 74, 129, 180, 230, 354,
    634, 718, 488, 442, 534, 411, 313, 283, 118, 32, 16, 0, 0,
    0, 0, 0]

  35. Joe,

    I get bell-shaped histograms for intermediate values of s. But I don’t know what to make of the extremes.

    ETA: I fouled up the title. It should say u=1/L.

  36. Does this mean that the number of matches for s=5 hovers about 16 matches and doesn’t keep evolving towards the target?

  37. dazz:
    Does this mean that the number of matches for s=5 hovers about 16 matches and doesn’t keep evolving towards the target?

    You can’t see it in the red histogram (s=5), but there are 5 occurrences of 26. In 100 thousand generations, there were always 2 or more non-matching sites among the 28 sites.

  38. Generation: 5070 Number of survivors per generation: 1
    Selection: ON Coefficient: 5.00 Mutation rate: 0.036
    Mean fitness: 14.93 Standard Deviation: 2.594715

    Not throwing away first 2000.

    I added the following (plus declarations)

    if (generation > 2000) {
        /* sample organism 0's match count after a 2000-generation burn-in */
        X = (double)genome_array[0].matches;
        X_sum += X;
        /* running mean of the samples so far */
        M = X_sum / ((double)generation - 2000);
        X -= M;
        X_M_sqr += X * X;
        /* running (approximate) standard deviation */
        S_dev = sqrt(X_M_sqr / ((double)generation - 2000));
    }
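
    The running mean in the snippet above only approximates the standard deviation; Welford’s online algorithm tracks the mean and variance exactly in one pass. A sketch, in Python to match the other code posted in this thread:

        def welford(values):
            """One-pass exact mean and population standard deviation."""
            n, mean, m2 = 0, 0.0, 0.0
            for x in values:
                n += 1
                delta = x - mean
                mean += delta / n
                m2 += delta * (x - mean)   # mean before and after the update
            return mean, (m2 / n) ** 0.5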

  39. Tom English: You can’t see it in the red histogram (s=5), but there are 5 occurrences of 26. In 100 thousand generations, there were always 2 or more non-matching sites among the 28 sites.

    Thanks, I’m trying to install dateutils to run your script and see it in action

  40. Throwing away the first 2000:

    Generation: 7010 Number of survivors per generation: 1

    Selection: ON Coefficient: 5.00 Mutation rate: 0.036

    Mean fitness: 16.04 Standard Deviation: 2.70

  41. petrushka,

    Here’s the histogram for the first 5090 generations of the red run (s=5) pictured above. The mean is just 15.14, contrasting with 15.97 when I throw out the first 5000 generations. My histogram is smoother, which leads me to ask also how many offspring there are per generation in your run. I’ve gone with 1000 here.
