Does gpuccio’s argument that 500 bits of Functional Information implies Design work?

On Uncommon Descent, poster gpuccio has been discussing “functional information”. Most of gpuccio’s argument is a conventional “islands of function” argument. Not being very knowledgeable about biochemistry, I’ll happily leave that argument to others.

But I have been intrigued by gpuccio’s use of Functional Information, in particular gpuccio’s assertion that if we observe 500 bits of it, this is a reliable indicator of Design, as here, at about the 11th sentence of point (a):

… the idea is that if we observe any object that exhibits complex functional information (for example, more than 500 bits of functional information ) for an explicitly defined function (whatever it is) we can safely infer design.

I wonder how this general method works. As far as I can see, it doesn’t. There would seem to be three possible ways of arguing for it, and in the end, two don’t work and one is just plain silly. Which of these is the basis for gpuccio’s statement? Let’s investigate …

A quick summary

Let me list the three ways, briefly.

(1) The first is the argument using William Dembski’s (2002) Law of Conservation of Complex Specified Information. I have argued (2007) that this is formulated in such a way as to compare apples to oranges, and thus is not able to reject normal evolutionary processes as explanations for the “complex” functional information.  In any case, I see little sign that gpuccio is using the LCCSI.

(2) The second is the argument that the functional information indicates that only an extremely small fraction of genotypes have the desired function, and the rest are all alike in totally lacking any of this function.  This would prevent natural selection from following any path of increasing fitness to the function, and the rareness of the genotypes that have nonzero function would prevent mutational processes from finding them. This is, as far as I can tell, gpuccio’s islands-of-function argument. If such cases can be found, then explaining them by natural evolutionary processes would indeed be difficult. That is gpuccio’s main argument, and I leave it to others to argue with its application in the cases where gpuccio uses it. I am concerned here, not with the islands-of-function argument itself, but with whether the design inference from 500 bits of functional information is generally valid.

We are asking here whether, in general, observation of more than 500 bits of functional information is “a reliable indicator of design”. And gpuccio’s definition of functional information is not confined to cases of islands of function, but also includes cases where there is a path along which function increases. In such cases, on seeing 500 bits of functional information we cannot conclude that it is extremely unlikely to have arisen by normal evolutionary processes. So the general rule that gpuccio gives fails, as it is not reliable.

(3) The third possibility adds a further condition to the design inference. Rather than defining “complex functional information” simply as a case where we can set a level of function that makes the probability of the set of genotypes less than 2^{-500}, it declares that unless the set of genotypes is effectively unreachable by normal evolutionary processes, we don’t call the pattern “complex functional information”. That additional condition allows us to conclude that normal evolutionary forces can be dismissed, but only by definition. And it leaves the reader to do the heavy lifting: the reader must first determine that the set of genotypes has an extremely low probability of being reached. Once they have done that, the additional step of declaring that the genotypes have “complex functional information” adds nothing to our knowledge. CFI becomes a useless add-on that sounds deep and mysterious but tells you nothing except what you already know. And there is some indication that gpuccio does use this additional condition.

Let us go over these three possibilities in some detail. First, what is the connection of gpuccio’s “functional information” to Jack Szostak’s quantity of the same name?

Is gpuccio’s Functional Information the same as Szostak’s Functional Information?

gpuccio acknowledges that gpuccio’s definition of Functional Information is closely connected to Jack Szostak’s definition of it. gpuccio notes here:

Please, not[e] the definition of functional information as:

“the fraction of all possible configurations of the system that possess a degree of function >=
Ex.”

which is identical to my definition, in particular my definition of functional information as the
upper tail of the observed function, that was so much criticized by DNA_Jock.

(I have corrected gpuccio’s typo of “not” to “note”, JF)

We shall see later that there may be some ways in which gpuccio’s definition
is modified from Szostak’s. Jack Szostak and his co-authors never attempted any use of his definition to infer Design. Nor did Leslie Orgel, whose Specified Information (in his 1973 book The Origins of Life) preceded Szostak’s. So the part about design inference must come from somewhere else.

gpuccio seems to be making one of three possible arguments:

Possibility #1 That there is some mathematical theorem that proves that ordinary evolutionary processes cannot result in an adaptation that has 500 bits of Functional Information.

Use of such a theorem was attempted by William Dembski, in his Law of Conservation of Complex Specified Information, explained in Dembski’s book No Free Lunch: Why Specified Complexity Cannot Be Purchased without Intelligence (2002). But Dembski’s LCCSI theorem did not do what Dembski needed it to do. I have explained why in my own article on Dembski’s arguments (here). Dembski’s LCCSI changes the specification between before and after the evolutionary process, and so it compares apples to oranges.

In any case, as far as I can see gpuccio has not attempted to derive gpuccio’s argument from Dembski’s, and gpuccio has not directly invoked the LCCSI, or provided a theorem to replace it.  gpuccio said in a response to a comment of mine at TSZ,

Look, I will not enter the specifics of your criticism to Dembski. I agre with Dembski in most things, but not in all, and my arguments are however more focused on empirical science and in particular biology.

While thus disclaiming that the argument is Dembski’s, on the other hand gpuccio does associate the argument with Dembski here by saying that

Of course, Dembski, Abel, Durston and many others are the absolute references for any discussion about functional information. I think and hope that my ideas are absolutely derived from theirs. My only purpose is to detail some aspects of the problem.

and by saying elsewhere that

No generation of more than 500 bits has ever been observed to arise in a non design system (as you know, this is the fundamental idea in ID).

That figure being Dembski’s, this leaves it unclear whether gpuccio is or is not basing the argument on Dembski’s. But gpuccio does not directly invoke the LCCSI, or try to come up with some mathematical theorem that replaces it.

So possibility #1 can be safely ruled out.

Possibility #2. That the target region in the computation of Functional Information consists of all of the sequences that have nonzero function, while all other sequences have zero function. As there is no function elsewhere, natural selection for this function then cannot favor sequences closer and closer to the target region.

Such cases are possible, and usually gpuccio is talking about cases like this. But gpuccio does not require them in order to have Functional Information. gpuccio does not rule out that the region could be defined by a high level of function, with lower levels of function in sequences outside of the region, so that there could be paths allowing evolution to reach the target region of sequences.

An example in which gpuccio recognizes that lower levels of function can exist outside the target region is found here, where gpuccio is discussing natural and artificial selection:

Then you can ask: why have I spent a lot of time discussing how NS (and AS) can in some cases add some functional information to a sequence (see my posts #284, #285 and #287)

There is a very good reason for that, IMO.

I am arguing that:

1) It is possible for NS to add some functional information to a sequence, in a few very specific cases, but:

2) Those cases are extremely rare exceptions, with very specific features, and:

3) If we understand well what are the feature that allow, in those exceptional cases, those limited “successes” of NS, we can easily demonstrate that:

4) Because of those same features that allow the intervention of NS, those scenarios can never, never be steps to complex functional information.

Jack Szostak defined functional information by having us choose a cutoff level of function, defining the set of sequences whose function exceeds that level, without any condition that the other sequences have zero function. Durston’s definition imposes no such condition either. And as we’ve seen, gpuccio associates his argument with theirs.
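To make Szostak’s definition concrete, here is a toy sketch of my own (the alphabet, sequence length, and “degree of function” are all invented for illustration; nothing here comes from gpuccio’s calculations). Note that sequences below the threshold can have plenty of nonzero function:

```python
import math
from itertools import product

ALPHABET = "ACGT"
L = 6                 # small enough to enumerate all 4**6 = 4096 sequences
REFERENCE = "ACGTAC"  # a hypothetical maximally functional sequence

def degree_of_function(seq):
    """Toy graded function: number of positions matching the reference."""
    return sum(a == b for a, b in zip(seq, REFERENCE))

def functional_information(threshold):
    """Szostak/Hazen FI: log2 of (all sequences / sequences with
    degree of function >= threshold). There is no requirement that
    sequences below the threshold have zero function."""
    hits = sum(1 for seq in product(ALPHABET, repeat=L)
               if degree_of_function(seq) >= threshold)
    return math.log2(len(ALPHABET) ** L / hits)

for th in range(L + 1):
    print(th, round(functional_information(th), 2))
```

The lowest threshold gives FI = 0 (every sequence qualifies), and the highest gives FI = log2(4096) = 12 bits (only the reference sequence qualifies); raising the threshold can only raise the FI.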

So this second possibility could not be the source of gpuccio’s general assertion about 500 bits of functional information being a reliable indicator of design, however much gpuccio concentrates on such cases.
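To see concretely why more than 500 bits of Szostak-style functional information is not out of reach of cumulative selection when paths of increasing function exist, here is a minimal Weasel-style hill climb (my own toy, with an arbitrary 115-character target; the population size and mutation scheme are choices of convenience, not anyone’s model of a real population). The full-match function level has FI of 115 × log2 27, about 547 bits, yet mutation plus selection reaches it quickly because every intermediate level of function is favored:

```python
import math
import random

random.seed(0)

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ "                 # 27 symbols
TARGET = ("METHINKS IT IS LIKE A WEASEL " * 4).rstrip()  # 115 characters

def matches(s):
    """Degree of function: number of positions agreeing with the target."""
    return sum(a == b for a, b in zip(s, TARGET))

def evolve(pop_size=150, max_gens=20000):
    """Cumulative selection: each generation, keep the best of
    pop_size single-letter mutants (an elitist hill climb)."""
    current = "".join(random.choice(ALPHABET) for _ in TARGET)
    fit = matches(current)
    for gen in range(max_gens):
        if fit == len(TARGET):
            return current, gen
        for _ in range(pop_size):
            i = random.randrange(len(TARGET))
            child = current[:i] + random.choice(ALPHABET) + current[i + 1:]
            cfit = matches(child)
            if cfit > fit:
                current, fit = child, cfit
    return current, max_gens

final, gens = evolve()
fi_bits = len(TARGET) * math.log2(len(ALPHABET))  # FI of the full-match level
print(f"full function reached in {gens} generations; FI = {fi_bits:.1f} bits")
```

This does not model natural populations, of course; it only shows that the 500-bit figure by itself does not separate “designed” from “evolvable” when intermediate function levels exist.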

Possibility #3. That there is an additional condition in gpuccio’s Functional Information, one that does not allow us to declare it to be present if there is a way for evolutionary processes to achieve that high a level of function. In short, if we see 500 bits of Szostak’s functional information, and if it can be put into the genome by natural evolutionary processes such as natural selection then for that reason we declare that it is not really Functional Information. If gpuccio is doing this, then gpuccio’s Functional Information is really a very different animal than Szostak’s functional information.

Is gpuccio doing that? gpuccio does associate his argument with William Dembski’s, at least in some of his statements. And William Dembski has defined his Complex Specified Information in just this way, adding the condition that it is not really CSI unless it is sufficiently improbable that it be achieved by natural evolutionary forces (see my discussion of this here, in the section on “Dembski’s revised CSI argument”, which refers to Dembski’s statements here). And Dembski’s added condition renders use of his CSI a useless afterthought to the design inference.

gpuccio does seem to be imposing a similar condition. Dembski’s added condition comes in via the calculation of the “probability” of each genotype. In Szostak’s definition, the probabilities of sequences are simply their frequencies among all possible sequences, with each counted equally. In Dembski’s CSI calculation, we are instead supposed to compute the probability of the sequence given all evolutionary processes, including natural selection.

gpuccio has a similar condition in the requirements for concluding that complex functional information is present. We can see it at step (6) here:

If our conclusion is yes, we must still do one thing. We observe carefully the object and what we know of the system, and we ask if there is any known and credible algorithmic explanation of the sequence in that system. Usually, that is easily done by excluding regularity, which is easily done for functional specification. However, as in the particular case of functional proteins a special algorithm has been proposed, neo darwininism, which is intended to explain non regular functional sequences by a mix of chance and regularity, for this special case we must show that such an explanation is not credible, and that it is not supported by facts. That is a part which I have not yet discussed in detail here. The necessity part of the algorithm (NS) is not analyzed by dFSCI alone, but by other approaches and considerations. dFSCI is essential to evaluate the random part of the algorithm (RV). However, the short conclusion is that neo darwinism is not a known and credible algorithm which can explain the origin of even one protein superfamily. It is neither known nor credible. And I am not aware of any other algorithm ever proposed to explain (without design) the origin of functional, non regular sequences.

In other words, you, the user of the concept, are on your own. You have to rule out that natural selection (and other evolutionary processes) could reach the target sequences. And once you have ruled it out, you have no real need for the declaration that complex functional information is present.

I have gone on long enough. I conclude that the rule that observation of 500 bits of functional information allows us to conclude in favor of Design (or at any rate, to rule out normal evolutionary processes as the source of the adaptation) simply does not exist. Or if it does exist, it is as a useless add-on to an argument that draws that conclusion for some other reason, leaving the really hard work to the user.

Let’s end by asking gpuccio some questions:
1. Is your “functional information” the same as Szostak’s?
2. Or does it add the requirement that there be no function in sequences that are outside of the target set?
3. Does it also require us to compute the probability that the sequence arises as a result of normal evolutionary processes?

1,971 thoughts on “Does gpuccio’s argument that 500 bits of Functional Information implies Design work?”

  1. In Szostak’s (and Hazen et al.’s) definitions of FI there is no “target”. Any value of function can be used to set a threshold, and the value of FI is a result of that.

    If we want to calculate the FI of a string, we use its function value as the threshold. Using the Hazen et al. “greater than or equal to” definition, every string has an FI, and the FI of the string that has the lowest function value is 0.

  2. What Joe said.

    Mung, you are attempting to convert the “function”, which Szostak and Hazen (and everybody else except Mung) view as continuous, to a function that can only take two values.

    Even gpuccio gets this.

  3. DNA_Jock: Mung, you are attempting to convert the “function”, which Szostak and Hazen (and everybody else except Mung) view as continuous, to a function that can only take two values.

    I probably just don’t understand what you’re saying, but you seem to be equivocating over the term “function.”

  4. It is important to emphasize that functional information, unlike previous complexity measures, is based on a statistical property of an entire system of numerous agent configurations (e.g., sequences of letters, RNA oligonucleotides, or a collection of sand grains) with respect to a specific function. To quantify the functional information of any given configuration, we need to know both the degree of function of that specific configuration and the distribution of function for all possible configurations in the system. This distribution must be derived from the statistical properties of the system as a whole [as opposed, for example, to the statistical properties of populations evolving in a fitness landscape (37)]. Any analysis of the functional information of a specific functional sequence or object, therefore, requires a deep understanding of the system’s agents and their various interactions.

  5. The “function” associated with each string is a number. (It is how good it is at something, and is to be distinguished from a mathematical “function”). It could be on a continuous scale (such as the nonnegative reals). It could even be just 0 or 1.

    The scale on which function is measured can be continuous, but of course with only a finite number of possible strings, there will not be a continuum of values of the function.

  6. see this for what I mean by ‘function’.
    Looks continuous to me.

    Your quote

    This distribution must be derived from the statistical properties of the system as a whole [as opposed, for example, to the statistical properties of populations evolving in a fitness landscape (37)]

    does not have the import you appear to think it does.
    To calculate FI one needs to know the proportion of sequences above the threshold. The point your quote is making is, the denominator is ALL sequences in the sequence space, as opposed to all sequences observed.

  7. Joe Felsenstein: Any value of function can be used to set a threshold, and the value of FI is a result of that.

    Yes, I know. Thanks.

    Mung: FI is not something that is “produced” except by the mind of the person deciding what the function is and what the threshold is. One can choose any function and any threshold for that function.

  8. DNA_Jock: To calculate FI one needs to know the proportion of sequences above the threshold.

    Isn’t that what I have been saying?

  9. DNA_Jock: Looks continuous to me.

    That is an “artist’s conception” showing a continuum of values. However with constant-length strings there can be only a finite number of actual values.

  10. Mung: FI is not something that is “produced” except by the mind of the person deciding what the function is and what the threshold is. One can choose any function and any threshold for that function.

    One can also “‘produce’ in the mind” of the investigator the notion that it is the activity of ATP synthase, and then go out and measure that number for each string.

    I sense some semantic wrangle coming over the word “produce”. In my discussions of FI that word does not come up.

  11. Corneel: It does not properly model natural selection, as the algorithm simply picks a single best matching string every generation, instead of having differential survival and reproduction among individuals within a population.

    You’ve just appealed to the algorithm in explanation of the model, when the first question, always, is whether the algorithm correctly implements the model. Computers and programs are just tools for investigation. When you find yourself speaking in terms of the tools, you generally should stop yourself, and switch to speaking in terms of what you are using the tools to investigate. I don’t see how there could be argument on this point. I do know that it’s sometimes convenient to speak in terms of the algorithm. But you can do that only with people who already have a pretty good understanding of the model. The “convenience” is no convenience when it contributes to all sorts of misapprehensions and misrepresentations.

    Treating programs as though they, themselves, were the models is essential to the “evolutionary informatics” branch of ID. It leads to a terribly confused conclusion, “The model works only because the programmer informed the computer what to do.”

    You’re one of my favorites in the Zone, and nothing that I’m saying is directed particularly at you. I’ve worked hard at understanding the misunderstandings of others, and I’m pretty sure that I’ve gotten at a primary source of confusion. Conflation of the simulator and the simuland is central to ID, and I believe that we should do all that we can to maintain a clear distinction of the two.

  12. Mung:

    I take that to mean that FI is a tool meant to characterize the fitness landscape, not to model evolutionary processes. What’s your point?

  13. Joe Felsenstein: I sense some semantic wrangle coming over the word “produce”. In my discussions of FI that word does not come up.

    On the ID reading of the Gospel of Dawkins, everyone who “believes in” evolution must believe that evolution creates information (when it is of course only minds [called intelligences for no better reason than that Philip Johnson latched onto the term intelligent design] that create information).

    Whatever notion of “information” an IDist latches onto, you can count on him (or half-a-her) to pin on scientists a claim that mindless processes create it. However, you cannot count on them to use the word create, sensitive as they are to claims that ID is creationism in a cheap tuxedo.

    It does seem as though Mung is saying that a claim that has not appeared in the literature, and does not appear in your post, is incoherent.

  14. Joe Felsenstein,

    We agree. As originally formulated by Szostak and Hazen, the function that has functional information is continuous. In the case of TBW Weasel, however, the Hamming distance can only have integer values between 0 and 28.
    Each of these Hamming distances has an associated FI, from 0 to 133.1, so in Weasel, ‘greater than observed value’ produces goofiness that ‘greater than or equal to observed value’ does not. For a continuous function, not so much of an issue…
    Mung is attempting a side-track in making the ‘function’ take only two values.
    He’s equivocating between the value of the function, and the value of the fn >= th inequality. It’s boring.
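DNA_Jock’s figures can be checked directly. For a 28-character target over a 27-symbol alphabet (the usual 26 letters plus space), the fraction of all strings within Hamming distance d of the target gives the FI at that level of function; a short sketch:

```python
import math
from math import comb

L = 28  # length of "METHINKS IT IS LIKE A WEASEL"
A = 27  # 26 letters plus the space character

def fi_at_hamming(d):
    """FI with threshold 'Hamming distance to the target <= d':
    log2 of (all strings / strings within distance d)."""
    within = sum(comb(L, k) * (A - 1) ** k for k in range(d + 1))
    return math.log2(A ** L / within)

print(round(fi_at_hamming(0), 1))   # perfect match: 28 * log2(27) ≈ 133.1
print(round(fi_at_hamming(28), 1))  # every string qualifies, so FI is 0
```

This reproduces the range 0 to 133.1 bits quoted above: the perfect match carries 133.1 bits, and each relaxation of the threshold lowers the FI.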

  15. In passing, I will note that when Mung writes

    One can choose any function and any threshold for that function.

    he is absolutely correct.
    the FI depends on the function chosen, and the threshold of activity chosen. The convention is to set the threshold equal to some observed level of function. The kcat of Rubisco. Given a definition of ‘function’ (e.g. Hamming distance to a Hamlet quote) and an associated threshold, one can in theory calculate the FI. gpuccio has a problem in that he cannot calculate the FI in practice, whether for strings having “good meaning in English”, or for proteins. Hey, in the case of English phrases, he did at least try. He also has a problem in that FI does not have the import that he thinks it does. Finally he has a problem in that his choice of function and of threshold are examples of TSS.
    So in that regard, at least, Mung has a point.

  16. Hi RoyLT

    I would propose a different example from a more recent film. In ‘Interstellar’, a worm-hole is found near Saturn. The worm-hole allows instantaneous travel to different star systems. Ultimately, the conclusion is drawn that humans from the distant future placed the worm-hole as a means for humans to escape Earth.

    If such an anomaly were detected in real life, how would we decide whether to draw a design inference? We have no experience of ‘building’ worm-holes and the physics of such things are theoretical in the extreme. Without clear evidence of who/what could have made it and what their motive was, I suspect that we would have no way of determining whether the anomaly was naturally occurring or designed.

    My design inference in the case of the lunar monolith had nothing to do with the fact that we happen to have experience of intelligent beings (humans) making monoliths. Rather, it was made on the basis of the striking mathematical properties of the monolith’s dimensions – properties which were unlikely to be accidentally duplicated by a natural process.

    I agree with you that if we discovered a worm-hole, we’d have no way of knowing whether it was designed or not, unless it had some very interesting mathematical properties. Of course, if we discovered lots of wormholes, and they were all situated near life-bearing planets, and if such planets were very rare in the cosmos, then we might infer design.

  17. DNA_Jock: the function that has functional information is continuous. In the case of TBW Weasel, however, the Hamming distance can only have integer values between 0 and 28.

    Let’s distinguish between the potential values of the function being continuous, and the values that actually occur being a continuum on that scale. Which they aren’t.

    We have enough trouble as is with the notions of biological function and of a mathematical function.

  18. It’s not trivial to observe that if you treat the threshold level of function as a variable rather than as a constant, then you have parted ways with classical (discrete) information theory, in which probability is distributed over events in a partition of the sample space. That is, you end up associating quantities of information with events that are proper subsets of other events with which you associate quantities of information. If you were to do the same in the context of communication theory, then most of the sense of regarding the quantities as information would be lost.

    Whatever you mean by information, there has got to be more to it than expressing the improbability 1/P(E) of an event E on a logarithmic scale. Transforming the scale on which a quantity is expressed does not change the answer to the question Quantity of what?

  19. vjtorley: Rather, it was made on the basis of the striking mathematical properties of the monolith’s dimensions – properties which were unlikely to be accidentally duplicated by a natural process.

    All one could say is that there’s no *known* process, natural or not, capable of explaining the monolith. And why “accidentally”? Why would the alternative to design need to be a natural accident? It’s always the same false dichotomy with you guys: “design” or else, serendipity, and both lack explanatory power.

    Seems to me that to explain something one would need to find some regularity that actually predicts the observed findings.

  20. Tom English: It’s not trivial to observe that if you treat the threshold level of function as a variable rather than as a constant, then you have parted ways with classical (discrete) information theory, in which probability is distributed over events in a partition of the sample space. That is, you end up associating quantities of information with events that are proper subsets of other events with which you associate quantities of information.

    I’m not sure I agree. Sure, as we narrow down on a smaller (and better and better) subset of possibilities, those subsets are nested. But each allows us to compute the information for choosing that subset out of the whole set. And when we choose a smaller subset out of a larger one, we can calculate the information needed to choose the smaller one from the larger one. And the sum of those information gains will equal the total, since of course

    1/P_3 = (1/P_1) (P_1/P_2) (P_2/P_3)

    so that when logs are taken, the product becomes a sum.
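The additivity is easy to verify numerically with made-up nested probabilities (any values with P1 > P2 > P3 will do):

```python
import math

# Hypothetical probabilities of three nested subsets, P1 > P2 > P3
P1, P2, P3 = 0.5, 0.05, 0.001

total = math.log2(1 / P3)  # information to pick the smallest subset directly
stepwise = math.log2(1 / P1) + math.log2(P1 / P2) + math.log2(P2 / P3)
print(math.isclose(total, stepwise))  # the stepwise gains sum to the total
```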

  21. Tom English: You’ve just appealed to the algorithm in explanation of the model, when the first question, always, is whether the algorithm correctly implements the model. Computers and programs are just tools for investigation. When you find yourself speaking in terms of the tools, you generally should stop yourself, and switch to speaking in terms of what you are using the tools to investigate.

    OK, I think I see what you mean. I got lured into a fruitless discussion about the nuts and bolts of the weasel program, whereas a simple reminder that in natural populations differential survival and reproduction are bringing information into the system as well would have sufficed. That about right?

    Tom English: Treating programs as though they, themselves, were the models is essential to the “evolutionary informatics” branch of ID. It leads to a terribly confused conclusion, “The model works only because the programmer informed the computer what to do.”

    Heh. Now you point it out, I can see the parallels between the Designer and the Programmer.

  22. colewd: The fact is in order to model this “cumulative selection” you need to have knowledge of the sequence is a very telling demonstration of the challenges that sequences face when there is any random change involved.

    No, you do not need knowledge of the optimal sequence. What you do need is information about the relative performance of the sequences in your current population. If your function of interest is associated with variation in fitness, this information is readily available. The selection process will then automatically enrich the next population with sequences that have a high degree of function.

    ETA: assuming that high degree of function = high fitness

  23. vjtorley: I agree with you that if we discovered a worm-hole, we’d have no way of knowing whether it was designed or not, unless it had some very interesting mathematical properties.

    I don’t understand your caveat. Unless we had some knowledge of how such things are made, the mathematics would not be very informative no matter how interesting. And if you’ll forgive me a momentary lapse into cantankerous whining, ‘interesting mathematical properties’ would immediately put it under the umbrella of Fine-Tuning which has IMHO even less explanatory power than the design inference.

  24. vjtorley: Of course, if we discovered lots of wormholes, and they were all situated near life-bearing planets, and if such planets were very rare in the cosmos, then we might infer design.

    Why would those conditions lead you to consider a design inference when a single wormhole would not?

  25. Joe Felsenstein: I sense some semantic wrangle coming over the word “produce”. In my discussions of FI that word does not come up.

    Then you missed my point entirely. My point is that I have been saying for some time the same things that you all are now saying.

  26. Tom English: It does seem as though Mung is saying that a claim that has not appeared in the literature, and does not appear in your post, is incoherent.

    In my own inimitable way, I was agreeing with Joe.

    Any value of function can be used to set a threshold, and the value of FI is a result of that.

    To calculate the FI you have to set the threshold. You can set the threshold to whatever you like. Where you set the threshold affects the value you will get for FI.

    Is everybody on board with that now?

  27. DNA_Jock: So in that regard, at least, Mung has a point.

    I have my moments.

    I disagree that I am in any way attempting a side track. Certainly not intentionally.

  28. DNA_Jock: Joe’s point is that there is a path from just below the peak of Ben Nevis to the peak of Ben Nevis (which has over 500 bits of FI…)

    If I understand gpuccio, it needs to be a 500 bit leap, not 500 1 bit steps.

  29. dazz: I take that to mean that FI is a tool meant to characterize the fitness landscape, not to model evolutionary processes.

    Where do you get that idea from?

  30. Joe Felsenstein: so that when logs are taken, the product becomes a sum

    That is one of the “side-tracks” I was attempting to cover in my conversation with Corneel. I wanted to know the “information gain” at each step and add them up to see what we get. Part of the reason I keep asking for actual numbers and examples.

    In your own example it seems we went from 782 to 794. But how did we get to 782 and can you do it without changing how you have defined the threshold of function?

    I say you can’t. And I say if you have to redefine the threshold of function at each step of the process to make it do what you need it to do it’s all ad-hoccery.

    Compare that to what DNA_Jock has proposed as degree of function for the WEASEL program. It doesn’t change.

  31. Mung: To calculate the FI you have to set the threshold. You can set the threshold to whatever you like. Where you set the threshold affects the value you will get for FI.

    Is everybody on board with that now?

    But you were proposing to calculate FI based on the degree of function of a sequence not in the ensemble, which doesn’t make any sense to me.

    I think the crux of the matter is that setting the threshold determines the amount of functional information necessary to achieve that level of function in that particular ensemble, so all sequences with THAT level of function can be said in a way to “possess” that much FI. Setting an arbitrary threshold tells you nothing about the FI of sequences with different degree of function / fitness, as far as I can tell

  32. Mung: Where do you get that idea from?

    From the fact that (I think) FI is not about explaining evolutionary processes. For what FI is concerned, it doesn’t matter if a certain sequence is more or less probable to be found by RV+NS.

    So it makes no sense to apply it the way gpuccio does, assuming equiprobability of all sequences under RV+NS.

  33. DNA_Jock: In the case of TBW Weasel, however, the Hamming distance can only have integer values between 0 and 28.
    Each of these Hamming distances has an associated FI, from 0 to 133.1…

    So if the threshold (degree of function) is set to any Hamming distance > 0 (dazz doesn’t seem to think that makes any sense, but maybe he is wrong), then what is the FI? What proportion of strings meet or exceed that threshold? Or to put it in your own words, “the proportion of the sequences that have function level X or above.”
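DNA_Jock’s figures can be checked directly. For a 28-character target over the usual 27-character Weasel alphabet (26 letters plus space — an assumption; the thread doesn’t spell out the alphabet), each Hamming-distance threshold has an FI given by the binomial tail of the match count:

```python
import math

L, A = 28, 27        # Weasel target length; alphabet size (26 letters + space)
p = 1 / A            # chance that a random character matches the target

def fi_for_threshold(min_matches):
    """FI = -log2(P(a uniform random string has >= min_matches matches))."""
    tail = sum(math.comb(L, m) * p**m * (1 - p)**(L - m)
               for m in range(min_matches, L + 1))
    return -math.log2(tail)

# Hamming distance 0 (all 28 positions match): DNA_Jock's 133.1 figure
print(round(fi_for_threshold(28), 1))      # 133.1
# Hamming distance 27 (at least one match): a little over half a bit
print(round(fi_for_threshold(1), 2))       # ~0.62
# Hamming distance 28 (any string at all qualifies): FI is zero
print(round(abs(fi_for_threshold(0)), 1))  # 0.0
```

The 133.1 comes out as 28 × log2(27) ≈ 133.14 bits, since only one string in 27^28 matches at every position.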

  34. Mung: So if the threshold (degree of function) is set to any Hamming distance > 0 (dazz doesn’t seem to think that makes any sense, but maybe he is wrong),

    Where did I say that?

  35. dazz: From the fact that (I think) FI is not about explaining evolutionary processes. For what FI is concerned, it doesn’t matter if a certain sequence is more or less probable to be found by RV+NS.

    Well, I think you are partially correct.

    But if it’s not about explaining evolutionary processes what does a “fitness landscape” have to do with it?

    I take that to mean that FI is a tool meant to characterize the fitness landscape…

    Oh, and this whole exercise Joe is proposing is based on RV+NS. Perhaps you could tell him how misguided you think he is?

  36. gpuccio@UD

    Me: Anyway, I am perfectly aware that the target string is being evaluated to find the number of matches, but I don’t see why that would prevent us from calculating the functional information of the resulting strings, which is what the whole exercise is about.

    gpuccio: Because the whole exercise is stupid. And I am being very generous here.

    If the string is already in the system, the simplest way to get it into a random string is to substitute each letter in the initial random string with the right letter from the target. That requires only as many substitutions as the string is long. Or, if you have a printer at hand, you can simply print the target with a single click.

    Does Corneel really believe that if I have a file with a Shakespeare sonnet, and I print 10 copies of it, I am generating new functional information, indeed 10 times the original functional information? If he really believes that, he is completely out of reach, and cannot be saved in any way.

    Well, the idea was not mine. And I still don’t see why we can’t calculate the functional information of weasel strings. It is not really different from calculating the functional information of a coherent English paragraph. At least I arrived at the correct answer 🙂

  37. dazz: Where did I say that?

    Really?

    This post:

    dazz: the probability that a random sequence will have a greater degree of function than 0 in that case is 1. log2(1) = 0. So FI=0

  38. Corneel,

    The selection process will then automatically enrich the next population with sequences that have a high degree of function.

    High degree of function compared to what? Enrich means that the sequences are better than the previous ones. Where did these enriched sequences come from? Can you make a real-world example of how this works?

  39. Mung: Really?

    This post:

    That was the answer to the question of how much FI there is for the only sequence with zero function in an ensemble

  40. Mung: In your own example it seems we went from 782 to 794. But how did we get to 782 and can you do it without changing how you have defined the threshold of function?

    I say you can’t. And I say if you have to redefine the threshold of function at each step of the process to make it do what you need it to do it’s all ad-hoccery.

    No, it’s not.

    How did we get to 782? The Weasel got there, without FI being calculated in any part of it. But I think you meant “how do we get there in our analysis of FI?”

    The threshold is changed. Each generation we can use the current string, define the threshold as its number of matches, and ask how large P is with that threshold (using, I’d hope, the revised “greater than or equal to” definition of Hazen et al.), then take minus the log to the base 2.

    It’s simply us asking how far the Weasel has got out into the tails of a monkeys-with-typewriters distribution of matches.

    A legitimate thing for us to ask, so no ad-hocery.
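Joe’s procedure can be sketched end-to-end. Below is a minimal Weasel in Python — not Dawkins’s original code; the population size, mutation rate, and random seed are arbitrary choices — which each generation sets the threshold to the current string’s match count and computes FI as minus log2 of the monkeys-with-typewriters tail:

```python
import math
import random

TARGET = "METHINKS IT IS LIKE A WEASEL"
ALPHABET = " ABCDEFGHIJKLMNOPQRSTUVWXYZ"   # 27 characters: space + A-Z
L, p = len(TARGET), 1 / len(ALPHABET)

def fi(min_matches):
    """-log2 of P(a uniform random string has >= min_matches matches)."""
    tail = sum(math.comb(L, m) * p**m * (1 - p)**(L - m)
               for m in range(min_matches, L + 1))
    return -math.log2(tail)

def matches(s):
    return sum(a == b for a, b in zip(s, TARGET))

random.seed(0)
current = "".join(random.choice(ALPHABET) for _ in range(L))
for gen in range(500):
    # 100 offspring, each position mutated with probability 0.05;
    # selection keeps the best string seen so far.
    offspring = ["".join(random.choice(ALPHABET) if random.random() < 0.05 else c
                         for c in current) for _ in range(100)]
    current = max(offspring + [current], key=matches)
    if matches(current) == L:
        break

# Joe's per-generation question: with threshold = current match count,
# how far out in the tail has the Weasel got?
print(gen + 1, matches(current), round(fi(matches(current)), 1))
```

Nothing in the run itself uses FI; the FI is computed afterwards, from the threshold the Weasel happens to have reached — which is Joe’s point that this is analysis, not ad-hocery.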

  41. Joe Felsenstein: A legitimate thing for us to ask, so no ad-hocery.

    My bad. It’s side-trackery! 😉

    But Joe, asking how many matches is asking a question, not setting a threshold. You have to say how many matches are required to meet the minimum degree of function threshold.

    Is it one match (DAWKINS)? Is it all but one match (FELSENSTEIN)?

    Are you going to change the number of matches required at each iteration of the algorithm? That’s not how WEASEL works.

  42. If there are multiple sequences with a given activity, then the corresponding functional information will always be less than the amount of information required to specify any particular sequence. It is important to note that functional information is not a property of any one molecule [or string or letter sequence – Mung], but of the ensemble of all possible sequences, ranked by activity

    Why don’t people get this?
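The quoted point — that FI belongs to the ensemble, not to any one sequence — can be put in numbers. A sketch using the 28-character, 27-letter setup from earlier in the thread; the count of 1000 functional sequences is invented purely for illustration:

```python
import math

# Specifying ONE particular 28-character string over a 27-letter
# alphabet takes log2(27^28) ~ 133.1 bits.
L, A = 28, 27
bits_to_specify_one = L * math.log2(A)

# But if many distinct sequences all meet the activity threshold, the FI
# of that function is smaller. Suppose (hypothetically) 1000 of them do:
n_functional = 1000
fi = math.log2(A**L) - math.log2(n_functional)   # = -log2(n / A^L)

print(round(bits_to_specify_one, 1))   # 133.1
print(round(fi, 1))                    # ~123.2 -- less than 133.1
```

Every additional sequence that meets the threshold lowers the FI, which is why FI for a function is always at most — and usually below — the information needed to pin down one particular sequence.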

  43. Mung: So if the threshold (degree of function) is set to any Hamming distance > 0 (dazz doesn’t seem to think that makes any sense, but maybe he is wrong), then what is the FI? What proportion of strings meet or exceed that threshold? Or to put it in your own words, “the proportion of the sequences that have function level X or above.”

    This has already been covered, Mung.
    The FI is -log2(proportion of sequence space where fn >= th).
    See the nifty plot here.

  44. dazz: That was the answer to the question of how much FI there is for the only sequence with zero function in an ensemble

    Then you don’t understand FI and you misunderstood my post.

    FI is calculated for the proportion that has function above a threshold. And my post defined the minimum degree of function required and it was not “zero function.”

  45. Mung: If I understand gpuccio, it needs to be a 500 bit leap, not 500 1 bit steps.

    Wrong. gpuccio states that it need only be an absolute value of 500. He defends his position by asserting that steps do not exist.

  46. Mung: Then you don’t understand FI and you misunderstood my post.

    FI is calculated for the proportion that has function above a threshold. And my post defined the minimum degree of function required and it was not “zero function.”

    Then it’s you who doesn’t understand FI, because it makes no sense to calculate FI for a degree of function using a threshold of a different degree of function.

    “What’s the FI of 0 function for a function threshold >1” is nonsensical

  47. DNA_Jock: This has already been covered, Mung.

    It either has not been covered or I fail to understand how it has been covered.

    Can we start over from the beginning with a simplified example?

  48. DNA_Jock: Wrong. gpuccio states that it need only be an absolute value of 500. He defends his position by asserting that steps do not exist.

    Yeah, and as already pointed out, if the “target” sits in an island of function surrounded by a vast ocean of non-function, there’s no way to get there incrementally, so it’s *poof*!, 500 bits of protein out of the blue
