A selection-vs-drift version of Weasel

Posted on February 22, 2016 by keiths

In his endless pursuit of that wascally Weasel, Mung made the following silly claim:

GAs are often used to demonstrate “the power of cumulative selection.” Given small population sizes drift ought to dominate yet in GAs drift does not dominate.

That is clearly false, but for the benefit of Mung (and his cousin Elmer) I have modified my Weasel program to incorporate both drift and selection. They can now see for themselves that small population sizes are insufficient to guarantee that drift dominates selection.

The code is here. Compile it under Linux using “gcc -std=gnu99 -lm weasel.c -o weasel”.

Run the program and type ‘h’ to see a list of interactive commands:

c – clear the histogram data
f – change the selection coefficient
h – print this help message
m – change the mutation rate
p – pause until a key is pressed
q – quit the program
s – toggle selection on/off
t – change the target phrase

The program generates and updates a scaled histogram showing the number of generations spent at each possible Hamming distance from the target.

222 thoughts on “A selection-vs-drift version of Weasel”

keiths on February 25, 2016 at 12:33 am said:

petrushka,

You could make an outer loop that runs through the permutations and prints out just the parameters and mean fitness.

Yes, that’s what I plan to do.

Just to make mung happy.

I’m not worried about that. We’ve proven him wrong. Any continued denial on his part is a psychological issue, not an empirical one.
John Harshman on February 25, 2016 at 12:56 am said:

keiths,

Still don’t understand. What is the next generation for the organism 0 that’s printed? The one that happened to be the next one when the histogram is printed?

It would be clearer if the starting sequence were printed somewhere.
keiths on February 25, 2016 at 1:17 am said:

John:

Still don’t understand. What is the next generation for the organism 0 that’s printed? The one that happened to be the next one when the histogram is printed?

It would be clearer if the starting sequence were printed somewhere.

This will all make more sense if you actually run the program. Petrushka posted a Windows executable here, assuming you have access to a Windows machine.

The entire display gets updated as the program runs. That includes the Organism 0 genome.

It will start out looking something like this…

Organism 0 Fitness 1.30e+03 Hamming distance 24

JJYEISJSHHONVBXKNTVZBJWTSOE

…and then improve through selection to something like this…

Organism 0 Fitness 6.09e+14 Hamming distance 9

MEGOIGKS ITNQS LILE D WRZSEL

…and, if the parameters are set appropriately, to something like this…

Organism 0 Fitness 1.85e+40 Hamming distance 1

METHINKS IT IS LIUE A WEASEL

or even a perfect match:

Organism 0 Fitness 5.73e+41 Hamming distance 0

METHINKS IT IS LIKE A WEASEL
John Harshman on February 25, 2016 at 5:23 am said:

keiths,

Might I suggest that as an aid to comprehension that the histogram include a symbol labeling the starting sequence’s hamming distance?

And someone might try to discover what combinations of parameters will end up with a stationary mode at the position of the starting sequence.
DNA_Jock on February 25, 2016 at 2:04 pm said:

I think you can model the behavior of this program (for most parameter settings) reasonably accurately as a Markov Chain: for each Hamming distance, you can calculate the probability that the next generation will be one better p(-1), the same p(0), or one worse p(+1).
Masochists and pedants can include p(+2), p(-2), etc, etc…
Then you can use Monte-Carlo to find the equilibrium distribution.
After the initial burn-in has occurred, the distribution is not sensitive to the starting position, thanks to the simplicity of the fitness surface.
I think this is what DiEb did for the Weasels. Joe F does this kind of analysis for a living…
ETA: I suspect that, for multiple-survivor situations, running keiths’s model will be easier that figuring out the transition probabilities…
petrushka on February 25, 2016 at 2:12 pm said:

The program is kind of a visual aid for the math impaired. It’s useful to watch it happening. It seems to quiet down the genetic meltdown crowd.
Allan Miller on February 25, 2016 at 2:28 pm said:

Hardly a major point, but the population takes a short while to settle into equilibrium given its start variance. When there is only one survivor, this takes 1 generation, so it’s even more trivial, but when you bump up the number of survivors, you’d need a short period of drift from a fully random start population (where all are about the same Hamming distance from each other, let alone the target) to get a population that is generating and fixing mutants at a rate consistent with its mutation rate, before releasing the hare … I mean weasel.

This affects virtually nothing; just an observation.
Joe Felsenstein on February 25, 2016 at 3:05 pm said:

Allan Miller,

Alan, for the relevant equation see chapter VII of my online text Theoretical Evolutionary Genetics, section VII.2 on “Drift versis mutation”, the subsection on “Finite numbers of alleles”. The probability called F there is the probability for one position in the phrase that two randomly chosen individuals among the survivors have the same letter, The recurrence equation, the first one given in that subsection, simply needs us to set K=27. (The theory there is for diploids, so the quantity called 2N is the number of survivors).

It is a simple linear recurrence equation in F, and we can work out from that what is the rate of approach to the equilibrium value of F.

By the way, I am working on a largish post on the mathematics of the N=1 case and should have it up soon.
keiths on February 25, 2016 at 5:04 pm said:

Joe,

By the way, I am working on a largish post on the mathematics of the N=1 case and should have it up soon.

I look forward to it!
Mung on February 25, 2016 at 5:38 pm said:

Result: METHINKS IT IS LIKE A WEASEL

Which of those letters became fixed as a result of genetic drift?
Mung on February 25, 2016 at 5:46 pm said:

keiths: The genotypes are initialized randomly when the program starts…

Just like in real biological populations.
Mung on February 25, 2016 at 5:51 pm said:

keiths: We’ve proven him wrong. Any continued denial on his part is a psychological issue, not an empirical one.

You’re not fooling anyone, keiths.
Allan Miller on February 25, 2016 at 6:05 pm said:

Mung,

Just like in real biological populations.

I must be psychic. I suggested above a small possible modification, to permit a population to equilibrate first. I didn’t say, but at the back of my mind was the Creationist carping: “but you have created an unrealistically wide set of start genomes”.

Give it a whirl.

Then try setting the start population up as all X’s – all 28 Hamming units away; as far away as you can get, and with no initial variation. Yes, I know, real genomes don’t consist of letters, and are more than 28 bits long.
Mung on February 25, 2016 at 6:13 pm said:

Even you Allan, in spite of your “skepticism”, must admit that generating genomes at random starts each genome in a randomly selected area of the search space.

Does that really matter? Does that really change the ability of the program to find the target phrase? Would the result be any different if you started them all out the same or with only a single letter difference between them?
petrushka on February 25, 2016 at 6:16 pm said:

Living things are, by definition, already in an equilibrium state.

Are you going to OOL us?

ETA:

It would probably be useful to think of weasel modelling a duplicated gene — not necessary — but available to acquire a new function.
Allan Miller on February 25, 2016 at 6:29 pm said:

Mung,

Even you Allan, in spite of your “skepticism”, must admit that generating genomes at random starts each genome in a randomly selected area of the search space.

F’God’s sake. I addressed this point, both before you posted and after it.

Does that really matter? Does that really change the ability of the program to find the target phrase? Would the result be any different if you started them all out the same or with only a single letter difference between them?

And this.

Try all X’s. Try anything you like. Why are all you coders so coding-shy?
Allan Miller on February 25, 2016 at 6:31 pm said:

petrushka,

Living things are, by definition, already in an equilibrium state.

True, but I was thinking of a population-genetic equilibrium – allowing a few generations of drift.
petrushka on February 25, 2016 at 7:29 pm said:

I think the question of fixation in weasel might be interesting, but there is a problem.

There are no neutral alleles in weasel, no synonymous or nearly synonymous substitutions.

It might be fun to allow upper and lowercase letters.
Joe Felsenstein on February 25, 2016 at 8:14 pm said:

petrushka, in my equations for the Wright-Fisher model that comes close to being Weasel, there is a selection coefficient s, and one could make that zero.

Also, at any position where the letter does not match the target, any mutation to another nonmatching letter will behave as if neutral, even though in reality both are deleterious.
petrushka on February 25, 2016 at 8:21 pm said:

I think the word “fix” might be misleading. It implies something permanent or unchangeable.

But what could it mean in a population that always bottlenecks to one parent?

I suspect Mung wants to draw attention away from what the program does model, and toward anything it fails to model.
Mung on February 25, 2016 at 8:41 pm said:

petrushka: Living things are, by definition, already in an equilibrium state.

If by this you mean that if you calculate the Hamming distance between two members of a population it’s going to be pretty much the same regardless of which two members of the population are chosedn, maybe keiths can answer that.
Mung on February 25, 2016 at 8:44 pm said:

petrushka: I suspect Mung wants to draw attention away from what the program does model, and toward anything it fails to model.

You mean like genetic drift?
Mung on February 25, 2016 at 8:45 pm said:

Allan Miller: Why are all you coders so coding-shy?

Ahah! So you do know what a rhetorical question is!
petrushka on February 25, 2016 at 9:02 pm said:

Mung: You mean like genetic drift?

It can model drift just fine if you set selection coefficient to 0.
petrushka on February 25, 2016 at 9:27 pm said:

Here ya go mung.

20 parents, 10,000 generations, zero selection

14: 0
15: 0
16: 0
17: 0
18: 0
19: 0
20: 0
21: 3
22: 4
23: 54 X
24: 113 XXX
25: 274 XXXXXXXX
26: 1144 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
27: 1731 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
28: 1960 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
petrushka on February 25, 2016 at 9:32 pm said:

Here’s the population. I see at least one position fixed
QTSYAGUWJGYROICWTYYMDMTTJDKT QTKYAGUPKGYQOOCWTYYMDTTTJWKT QTSYAGUWJGLROICWTYYMDMTTJDKT PTKYDKUWGGGAOICWTYYIXMTTJTKU QTKYVGUWGGYHOICWTNYMXMTKJTOU VEKYAPUWGOYUQBCSIXEMWMTTJTKU QTKYAGUPKGYQOOCWTYYMDTTTJDKT CTSYAGUWJGYROICWTYYMDMTTJDKT VEKYAPUWGOYUQBCSIXEMWSTEJTKU QTKYAGUWGGKHOICWTNYMXMTKJTKU PTKYDKUWGGGAOICWTYYIXMTTVTKU QTKYAGUWGGKHOICWTJYMXMTKJTKU QTKYAGUWGGKHOICWTJYMXMTKJTKU QTKYAGUWGGYQOICWTYYMDMTTBTKT QTKYAGUWGGKHOICWTNYMCMTKJTKU VEKYAPUWGOYUQBCSIXEMWSTTJTKU QTKYAGUWGGKHOICWTNYMCMTKJTKU QTKYVGUWGGYHOICWTNYMXMTKJTOU QTKYAGUPKGYQOOCWTYYMDTTTJDKT PTKYDKUWGGGAOICWTYYIXMTTJTKU
keiths on February 25, 2016 at 11:01 pm said:

Mung:

Even you Allan, in spite of your “skepticism”, must admit that generating genomes at random starts each genome in a randomly selected area of the search space.

Does that really matter? Does that really change the ability of the program to find the target phrase? Would the result be any different if you started them all out the same or with only a single letter difference between them?

Mung,

If you understood how the program works, you would already know the answer to your questions.

And as Allan points out, you could easily modify the program to initialize the population however you desire. You claim to be a coder. Do some coding for a change!

Third, you can create the desired scenario without even modifying the code, by a judicious application of the commands available to you. See if you can figure out how to do it.

Less carping and more thinking, please.
keiths on February 25, 2016 at 11:17 pm said:

petrushka:

I suspect Mung wants to draw attention away from what the program does model, and toward anything it fails to model.

Mung:

You mean like genetic drift?

petrushka:

It can model drift just fine if you set selection coefficient to 0.

It models drift just fine even with nonzero selection coefficients. The equilibrium Hamming distances are where the effects of drift and selection net out to zero on average.

Poor Mung is going in circles with Weasel. He thinks he’s chasing it, but in reality it is chasing him.
Tom English on February 26, 2016 at 1:46 am said:

DNA_Jock: I think you can model the behavior of this program (for most parameter settings)reasonably accurately as a Markov Chain: for each Hamming distance, you can calculate the probability that the next generation will be one better p(-1), the same p(0), or one worse p(+1).
Masochists and pedants can include p(+2), p(-2), etc, etc…
Then you can use Monte-Carlo to find the equilibrium distribution.
After the initial burn-in has occurred, the distribution is not sensitive to the starting position, thanks to the simplicity of the fitness surface.
I think this is what DiEb did for the Weasels. Joe F does this kind of analysis for a living…

DiEb didn’t simulate. He solved for the expected value of the first hitting time. The hard thing is to acquire a genuine understanding of what that means, not to use an environment like R to do the calculation.

The random number of characters going from correct (incorrect) in the parent to incorrect (correct) in the offspring follows a binomial distribution. That’s the stuff of a first course in probability, so I won’t accept that it’s pedantic.

DNA_Jock: I suspect that, for multiple-survivor situations, running keiths’s model will be easier that figuring out the transition probabilities…

Now I’ll come off as pedantic. Keith has not presented a model. He has distributed a program, and made much of the fact that Mung (who’s laughed for years at people who take seriously someone who calls himself “Mung”) cannot infer the intended model from the source code of the program.

I figured out the model that Keith intended to implement by reading a program that failed to implement it. Keith evidently did not learn the lesson that he should have, because he keeps telling us to run his program to see what he really means. Mung and his fellow UDders are enjoying the spectacle, at After the Milkbar Closes. He’s come through with appropriate software-engineering responses, which I doubt he would have generated on his own.
John Harshman on February 26, 2016 at 1:58 am said:

keiths: The equilibrium Hamming distances are where the effects of drift and selection net out to zero on average.

Shouldn’t this also depend on the mutation rate?
keiths on February 26, 2016 at 2:10 am said:

John,

Shouldn’t this also depend on the mutation rate?

The mutation rate already factors into the magnitude of both drift and selection. See Joe’s new OP.
keiths on February 26, 2016 at 2:13 am said:

Tom,

I figured out the model that Keith intended to implement by reading a program that failed to implement it.

Then by all means point out the flaws so I can fix them!

Mung and his fellow UDders are enjoying the spectacle, at After the Milkbar Closes.

Huh?
John Harshman on February 26, 2016 at 3:04 am said:

keiths:
John,

The mutation rate already factors into the magnitude of both drift and selection.See Joe’s new OP.

Are you saying that the mutation rate cancels out of the calculation of the equilibrium state?
keiths on February 26, 2016 at 3:25 am said:

John,

Are you saying that the mutation rate cancels out of the calculation of the equilibrium state?

No, I’m saying that at equilibrium, the total rate at which letters are changing from matching to mismatching equals the total rate at which they are changing from mismatching to matching. Both of those rates already depend on the mutation rate.
Tom English on February 26, 2016 at 5:04 am said:

petrushka: There are no neutral alleles in weasel, no synonymous or nearly synonymous substitutions.

Joe Felsenstein: petrushka, in my equations for the Wright-Fisher model that comes close to being Weasel, there is a selection coefficient s, and one could make that zero.

I didn’t follow the previous discussion closely. Would you do me the favor of providing (or pointing me to) a terse summary of the model?

It’s bothered me that Keith uses the selection coefficient to scale the overall fitness, instead of the fitness contributions of individual alleles. The latter is what I see in the tutorial that Mung brought to the table.

Joe Felsenstein: Also, at any position where the letter does not match the target, any mutation to another nonmatching letter will behave as if neutral, even though in reality both are deleterious.

Emphasis added to what I don’t understand. The fitness function as a special case of a pseudo-boolean function. That is, for real-valued weights $w_1, w_2, \dotsc, w_n,$ and target sentence $t_1, t_2, \dotsc, t_n,$

$f_\mathbf{w}(x) = \sum_{i=1}^n w_i \cdot I_{t_i}(x_i),$

where

$I_{t}(a) = \begin{cases} 1 & \text{if $a \in \{t\}$} \\ 0 & \text{otherwise.} \end{cases}$

is the indicator function for set $\{t\}.$ I use the indicator function because there’s an easy generalization to sets $T_i$ at the $n$ loci. So an allele $x_i \in T_i$ is beneficial if $w_i$ is positive, and deleterious if $w_i$ is negative. An allele $x_i \notin T_i$ is neutral. Weasel is a special case in which all sets are of size 1, and all weights are equal to 1, i.e., alleles are either beneficial or neutral, never deleterious.

Of course, I’d never argue with you about theoretical biology. I’m drawing on the theory of evolutionary computation (with no claim of expertise). For various evolutionary algorithms, with mutation rate $1/n,$ the expected number of function evaluations to maximize a pseudo-boolean function is on the order of $n \ln n.$ This is an important result, for folks like me. It happens also to provide another way of shooting down the notion that the fitness function must be carefully designed for the algorithm to hit the target. All that matters is the signs of the (nonzero) weights, not their magnitudes.
Tom English on February 26, 2016 at 5:21 am said:

Tom English: I didn’t follow the previous discussion closely. Would you do me the favor of providing (or pointing me to) a terse summary of the model?

From premier scientist to mind reader.
Mung on February 26, 2016 at 5:44 pm said:

keiths: Huh?

Tom is wrong, of course, but even I got it.
keiths on February 26, 2016 at 5:50 pm said:

If you’re talking about the “UDders” pun, I’m sure everyone got it.

The question is why Tom would think you had something to laugh about. Your attacks on Weasel have all failed.
Mung on February 26, 2016 at 5:56 pm said:

No Tom, no one is coaching me.
Mung on February 26, 2016 at 5:59 pm said:

keiths: The question is why Tom would think you had something to laugh about. Your attacks on Weasel have all failed.

Tom is being nice to you. You should be nice to Tom. Taking him seriously would help. You simply cannot be right all the time without fail.
petrushka on February 26, 2016 at 6:06 pm said:

Mung, Tom has moved on to Joe Felsenstein’s thread.
keiths on February 26, 2016 at 6:13 pm said:

Mung,

I do take Tom seriously. Taking someone seriously doesn’t mean passively agreeing with everything he says.

You simply cannot be right all the time without fail.

Of course not. So?
keiths on February 27, 2016 at 1:20 am said:

petrushka,

I installed lcc-win, but my makes are failing due to undefined symbols (screenshot below). The tool doesn’t seem to be able to figure out which libraries it needs, and the “Add libs” function doesn’t work. Specifying the full path to crtdll.lib didn’t help.

Are you using the IDE? If so, could you look under Project->Configuration->Linker and tell me which libraries are being used in your setup?

Or if you’re using the command line, could you give me the specific commands you’re using?

Thanks.
petrushka on February 27, 2016 at 1:36 am said:

I installed in the default directory and use:
\lcc\bin\lc xxx.c
keiths on February 27, 2016 at 2:02 am said:

I tried that too, but it gave me the same results as for the IDE.

Uninstalling and reinstalling didn’t help.
petrushka on February 27, 2016 at 2:11 am said:

It seems to be one bad library. I’ll take a look.
petrushka on February 27, 2016 at 2:25 am said:

look for crtdll.exp in the buildlib folder. Open it with notepad or a text editor and have a look.

crtdll.lib should be in the lib folder
petrushka on February 27, 2016 at 2:31 am said:

From “LCC-Win32 User’s Manual”, pages 55-58, by Jacob Navia.

The Linker

The general format of a link command line is as follows:

lcclnk [options] object-files library-files

The linker will default to the following import libraries:

LIBC.LIB A few C functions are in here
KERNEL32.LIB Kernel function calls
COMDLG32.LIB Common dialog boxes library
USER32.LIB Windows ‘user’ functions
GDI32.LIB GDI functions
ADVAPI32.LIB More Windows functions
COMCTL32.LIB Library for the common controls of Win32
CRTDLL.LIB Library for the ‘C’ runtime functions

These libraries do not need to be specified in the command line because the
linker will search them anyway.
petrushka on February 27, 2016 at 2:32 am said:

my crtdll.lib is 436kb
keiths on February 27, 2016 at 3:33 am said:

I moved to an older machine, Win 7 instead of Win 10, and now everything is working.