Evolution and Functional Information

Posted on March 8, 2017 by swamidass

Here, one of my brilliant MD PhD students and I study one of the “information” arguments against evolution. What do you think of our study?

I recently posted this preprint on bioRxiv. To be clear, this study is not yet peer-reviewed, and I do not want anyone to miss this point. This is an “experiment” too. I’m curious to see if these types of studies are publishable. If they are, you might see more from me. Currently it is under review at a very good journal. So it might actually turn the corner and get out there. And a parallel question: do you think this type of work should be published?

I’m curious what the community thinks. I hope it is clear enough for non-experts to follow too. We went to great lengths to make the source code for the simulations available in an easy-to-read, annotated format. My hope is that a college-level student could follow the details. And even if you can’t, you can weigh in on whether the scientific community should publish this type of work.

Functional Information and Evolution

http://www.biorxiv.org/content/early/2017/03/06/114132

“Functional Information”—estimated from the mutual information of protein sequence alignments—has been proposed as a reliable way of estimating the number of proteins with a specified function and the consequent difficulty of evolving a new function. The fantastic rarity of functional proteins computed by this approach emboldens some to argue that evolution is impossible. Random searches, it seems, would have no hope of finding new functions. Here, we use simulations to demonstrate that sequence alignments are a poor estimate of functional information. The mutual information of sequence alignments fantastically underestimates the true number of functional proteins. In addition to functional constraints, mutual information is also strongly influenced by a family’s history, mutational bias, and selection. Regardless, even if functional information could be reliably calculated, it tells us nothing about the difficulty of evolving new functions, because it does not estimate the distance between a new function and existing functions. Moreover, the pervasive observation of multifunctional proteins suggests that functions are actually very close to one another and abundant. Multifunctional proteins would be impossible if the FI argument against evolution were true.

216 thoughts on “Evolution and Functional Information”

Tom English on March 10, 2017 at 5:34 am said:

Tom English: Is your Equation 2 equivalent to their Equation 6, or is there something else going on?

swamidass: You got it. Nothing else going on.

Having read your paper, I see a huge difference. You’ve dropped their time indices. They assume maximum entropy in the past, and consider the decrease in entropy over time. You assume minimum entropy in the past, and consider the increase in entropy over time. For you, maximum entropy is merely a standard of comparison. For them, maximum entropy is a reality of the past. You’re tacitly saying that their model is grossly inappropriate — and it is, of course. In my opinion, you need to say it outright. You’re misleading readers by detaching the measure from their model.
swamidass on March 10, 2017 at 7:00 am said:

Tom English: Having read your paper, I see a huge difference. You’ve dropped their time indices. They assume maximum entropy in the past, and consider the decrease in entropy over time. You assume minimum entropy in the past, and consider the increase in entropy over time. For you, maximum entropy is merely a standard of comparison. For them, maximum entropy is a reality of the past. You’re tacitly saying that their model is grossly inappropriate — and it is, of course. In my opinion, you need to say it outright. You’re misleading readers by detaching the measure from their model.

I do not understand your objection.

Do you agree I am computing exactly the same number as them? It is algebraically equivalent. I’m not making any assumptions here to derive this except that I have correctly implemented the computation of their formula. Of course, it is always possible an error is made, but I made the code available. As I said earlier, there are many ways to arrive at this equation, and they actually have clearer derivations in the cited literature. From that final formula, I note that it is identical to the MISA of extant sequences. If you actually think the number I am computing is different, that would be worth pointing out, but I do not see it. The derivation is beside the point, and I do not even present a derivation.

Also, they do not actually measure Eqn 6 over time. If you read the referenced papers, they just sample extant sequences. So t just equals “now”. I think the inclusion of the t is a holdover from the other papers by Hazen they are referencing.

From that starting point (that I correctly implement their formula), I just ask what types of sequences are expected from a single ancestral sequence with neutral drift. Of course, they do not consider this case, but that is exactly my point. It turns out this simple model of neutral drift produces extant sequences with high MISA. So now I have demonstrated that there is a way of generating sequences with high MISA from a “mindless” process. So either neutral drift can produce sequences with high FI, or MISA (their formula) is a horrible estimate of FI. It is a “proof by example.”
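
The “proof by example” argument can be illustrated with a minimal sketch. This is not the Matlock-Swamidass simulation: it assumes a star phylogeny in which every extant sequence descends independently from one ancestor, and each site is independently replaced by a uniformly random amino acid with probability change_prob (a hypothetical parameter). All function names are illustrative. No functional constraint is modelled anywhere, yet the per-site MISA stays well above zero until the sequences are nearly randomized.

```python
import math
import random

ALPHABET_SIZE = 20                      # amino acids, coded 0..19
H_MAX = math.log2(ALPHABET_SIZE)        # maximum per-site entropy in bits


def column_entropy(column):
    """Shannon entropy (bits) of the residue frequencies in one alignment column."""
    counts = {}
    for aa in column:
        counts[aa] = counts.get(aa, 0) + 1
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def misa(alignment):
    """Sum over sites of (H_MAX - column entropy), as in the code posted in this thread."""
    length = len(alignment[0])
    return sum(H_MAX - column_entropy([seq[j] for seq in alignment])
               for j in range(length))


def drift_sample(ancestor, change_prob, sample_size, rng):
    """Assumed model: each site is redrawn uniformly with probability change_prob."""
    return [[rng.randrange(ALPHABET_SIZE) if rng.random() < change_prob else aa
             for aa in ancestor]
            for _ in range(sample_size)]


rng = random.Random(0)
ancestor = [0] * 150
per_site = {}
for p in (0.1, 0.5, 0.9):
    alignment = drift_sample(ancestor, p, 2000, rng)
    per_site[p] = misa(alignment) / len(ancestor)
    print(p, round(per_site[p], 2))
```

The per-site value falls as change_prob rises, so under these assumptions the number reflects how long the family has been drifting from its ancestor, not how constrained its function is.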

So I dispute your point. I do not “assume” minimum entropy (and zero functional constraints). I just show that a simulation with a starting point and a random process that is physically possible generates sequences with arbitrarily high MISA. And this, by the way, is close to the generative model for evolution.

Anyway, do you think there is a clearer way to explain this? Or do you still think I am miscommunicating it?
swamidass on March 10, 2017 at 7:06 am said:

Joe Felsenstein: Your formulas for FI use the fraction of all sequences that have the function. Later you distinguish between lower and higher-functioning sequences. I would think that there should be, in your paper, a little more discussion of how much function is being regarded as “function”. Since sequences can evolve by natural selection from less to more function, if the threshold for “function” is set high, then if there were some way of assessing FI, a mistaken impression would be made that this cannot be achieved by natural evolutionary processes.

That is a good point.

I will think about it and perhaps make a change. I chose that simulation because most people agree that natural selection can “fine tune” function, and improve it once it is already there. Though the actual simulation is pretty simplistic, I think everyone agrees the basic idea is possible.

I suppose by “low function” I mean the minimal function necessary for stepwise improvement to be possible that yields selective advantage. By “high function” I mean, the point at which further improvements are not possible or not beneficial enough to be selected. So perhaps, I need to rethink those labels, or at least define them more clearly.

Good point. Thanks for looking at it.
Tom English on March 10, 2017 at 9:05 am said:

swamidass: I made the code available

Aargh! I’d missed that tab. Let me look at that before continuing. (The Jupyter Notebook output looks great. I’m going to have to stop putting off learning how to use it.)

For what it’s worth, here’s what happens with i.i.d. mutation of amino acids (not equivalent to mutation of bases). The per-AA MISA values for the full samples are 2.45363737635, 1.20048003939, 0.326272993445, 0.0742663743184.

    from math import log
    import numpy as np
    import numpy.random as rand
    import matplotlib.pyplot as plt

    H_MAX = log(20, 2)  # maximum per-site entropy in bits

    def H(dist):
        # Shannon entropy (bits) of a probability distribution
        assert sum(dist) > .9999 and sum(dist) < 1.0001
        result = 0.0
        for p in dist:
            if p > 0.0:
                result += p * log(p, 2)
        return -result

    def sample1_iid(mutation_rate, length=150, sample_size=10**4):
        # Ancestor is all zeros; each site mutates i.i.d. to one of 1..19
        mutated = rand.sample((sample_size, length)) < mutation_rate
        mutants = rand.choice(range(1, 20), (sample_size, length))
        sample = np.where(mutated, mutants, 0)
        sample[0, :] = 0
        entropy = np.zeros(sample_size)
        # range replaces the Python 2 xrange of the original post
        for i in range(sample_size):
            for j in range(length):
                entropy[i] += H(np.bincount(sample[:i+1, j], minlength=20) / float(i+1))
        return sample, length * H_MAX - entropy
keiths on March 10, 2017 at 9:17 am said:

swamidass:

Of course, it is always possible an error is made, but I made the code available.

Tom:

Aargh! I’d missed that tab.

He also mentioned it in the OP:

We went to great lengths to make the source code for the simulations available in an easy to read and annotated format.
Tom English on March 10, 2017 at 11:20 am said:

keiths,

Thank you for catching my error. Perhaps you could mention something I got right. You’re good at information theory, and I’m sure you recognized that I correctly identified MISA as relative entropy.
Tom English on March 10, 2017 at 11:33 am said:

Aargh! I checked NumPy for the entropy function, but forgot to check SciPy. I always look for readymade code. As it happens, scipy.stats.entropy calculates relative entropy, if you supply it with two distributions instead of one. Here I’ve modified my code to calculate “MISA” as relative entropy. The results are unchanged. I’ll try doing the same with the Matlock-Swamidass code.

    from math import log
    from scipy.stats import entropy
    import numpy as np
    import numpy.random as rand
    import matplotlib.pyplot as plt

    H_MAX = log(20, 2)
    UNIFORM = np.ones(20) / 20.0

    def sample1_iid(mutation_rate, length=150, sample_size=10**4):
        mutated = rand.sample((sample_size, length)) < mutation_rate
        mutants = rand.choice(range(1, 20), (sample_size, length))
        sample = np.where(mutated, mutants, 0)
        sample[0, :] = 0
        misa = np.zeros(sample_size)
        counts = np.zeros((length, 20), dtype=int)
        # range replaces the Python 2 xrange of the original post
        for n in range(sample_size):
            for i in range(length):
                # running residue counts at site i for the first n+1 sequences
                counts[i] += np.bincount(sample[n:n+1, i], minlength=20)
                # relative entropy (bits) of the site distribution vs. uniform
                misa[n] += entropy(counts[i], UNIFORM, base=2)
        return sample, misa
        #return sample, length * H_MAX - entropy
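
The reason the results are unchanged can be checked for a single site using only the standard library. The sketch below is an illustration, not part of either code base: the helper names and the example site distribution are made up. It confirms numerically that the relative entropy of a site distribution against the uniform distribution over 20 amino acids equals log2(20) minus the site’s Shannon entropy.

```python
import math


def entropy_bits(dist):
    """Shannon entropy of a probability distribution, in bits."""
    return -sum(p * math.log2(p) for p in dist if p > 0)


def relative_entropy_bits(p, q):
    """Kullback-Leibler divergence D(p || q), in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


uniform = [1 / 20] * 20
site = [0.905] + [0.005] * 19   # hypothetical site: mostly the ancestral residue

lhs = relative_entropy_bits(site, uniform)
rhs = math.log2(20) - entropy_bits(site)
print(abs(lhs - rhs) < 1e-9)
```

The identity holds for any distribution that sums to one, which is why summing it over sites gives the same total as length * H_MAX minus the summed entropies.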
stcordova on March 10, 2017 at 3:01 pm said:

I really want to thank Dr. Swamidass for this paper, even though I feel bad for having to disagree with my friend Kirk on some points. I don’t think Kirk’s work will be in vain; it just needs a different application.

I have to agree that MISA will overestimate the improbability of forming a given function. Computational biophysics that looks at the functional requirements in a 3D scenario and then tries to compute what residues as a matter of principle have to be where is about the best way of arriving at an estimate of improbability of forming a specific domain. I think trying to analyze a protein will not be very rewarding, but rather individual discrete components as is already done in industry, such as domains and secondary structures (like alpha helices).

I mentioned earlier the zinc-finger domain, and there are so many other domains out there, as well as a catalog of well known secondary structures like alpha-helices and beta sheets, etc. Though an alpha helix is low information content since so many given amino acids can form it, for a very long alpha helix that must exclude proline as a matter of principle, there is a small amount of information content. Also, such as in the case of forming integral or transmembrane proteins, a certain minimum amount of amphipathic 3D separation of hydrophobic and hydrophilic residues must be present before it can even function well enough for selection to even possibly optimize it. So there is some information content there as well. This is the approach that can be taken to estimate functional improbability (I cringe at using the word “information”). Behe used this decompositional approach in Edge of Evolution.

The 3D computational biophysics analysis is a better approach to estimating functional improbability, but Kirk’s is fundamentally 1-dimensional, looking only at the primary structure (sequences) of the proteins. So I don’t think it can be effective, and hence I’m forced to agree with the problem Dr. Swamidass posed regarding tertiary homology even though there is essentially no sequence homology.

However, what MISA is measuring does provide something valuable to IDists and structural biologists. For a long time, deeply conserved sequences have indeed been a pointer to likely critical biological structures in proteins. A well-known example is that deep conservation is practically a road-sign to active catalytic sites in enzymes. It would not surprise me if it is a pointer to other extremely important structural characteristics of proteins. This potentially provides a tool for Bill Dembski’s vision of steganography.

FWIW, the ENCODE project, or any project that relies heavily on measuring sequence conservation to identify function, is implicitly measuring the same thing Kirk is measuring. Kirk uses the quantity in a way I don’t think will achieve what he’s aiming at (namely functional improbability), but I think it could be used in another way that is highly productive. The reason I believe that is because it has been done implicitly for decades by those in the field that is now known as structural biology.
keiths on March 10, 2017 at 3:22 pm said:

It only takes a minute or two to read the OP, Tom.
stcordova on March 10, 2017 at 3:24 pm said:

To give a more detailed example of how I might try to flip around Kirk’s MISA measure, consider the lipoyl domain. Notice the K (lysine) that is common to so many species. It just stands out there and screams, and is highlighted as a yellow column of “K” (lysines) in the link below (scroll to the bottom).

https://www.ncbi.nlm.nih.gov/Structure/cdd/cddsrv.cgi?uid=cd06849

Of course some will say that’s solely the result of selection, but I’m skeptical of that. But that is not relevant immediately as it points to an astonishing conserved feature.

Sure enough, I looked to see if there is any critical residue in the lipoyl domain, and TADA it is that very position where the co-factor lipoate amide linkage is connected in the pyruvate dehydrogenase (PDH) complex. It’s like God practically saying, “look here”!

In the link they refer to this high powered MISA site as “feature 1”, and they say it thusly:

Feature 1:lipoyl attachment site [posttranslational modification site]

Evidence:
Structure:2PNR; Lipoamide-containing lipoyl domain 2 (L2) of human pyruvate dehydrogenase kinase 3


Unfortunately, Kirk’s score would only give a measly 2 bit MISA FI, but maybe the metric has to be reworked, and then the correct way to apply it will become apparent, as is so clearly shown by the yellow highlight!

Some will say, “that only proves selection.” No it doesn’t if there is insufficient population resources to enable the formation and preservation of such conservation (exactly the current conflict between ENCODE and the evolutionary biologists).
stcordova on March 10, 2017 at 4:47 pm said:

Unfortunately, Kirk’s score would only give a measly 2 bit MISA FI

Correction, I think it’s 4.32 bits [ -log2(1/20) ]. 2 bits per position is for nucleotides [ -log2(1/4) ]; it’s 4.32 bits per residue for amino acids. Apologies to the readers.
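
The arithmetic in this correction can be reproduced in a few lines. This is only a sketch of the calculation quoted above; the function name is made up for illustration.

```python
import math


def max_bits_per_position(alphabet_size):
    """Bits assigned to a fully conserved position: -log2(1 / alphabet_size)."""
    return -math.log2(1 / alphabet_size)


nucleotide_bits = max_bits_per_position(4)    # 4 possible bases
amino_acid_bits = max_bits_per_position(20)   # 20 possible amino acids
print(round(nucleotide_bits, 2), round(amino_acid_bits, 2))
```

The alphabet size is the only input, so a single perfectly conserved residue can never contribute more than this ceiling to the total.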
swamidass on March 10, 2017 at 5:39 pm said:

stcordova: The 3D computational biophysics analysis is a better approach to estimating functional improbability,

Unfortunately, this has even more problems than Kirk’s approach. It is subject to many of the same limitations as MISA, and also its own set of problems. Until we solve the protein-folding problem and the structure-function problem for all of sequence space, we cannot compute the number of functional sequences. There would be SEVERAL Nobel prizes for the people who solve this problem. I think it is intractable.

Besides, this misses the point anyway. The number of functional sequences tells us NOTHING about evolvability. The much more important question is how close a new function is to extant sequences. We find that new functions are close to extant sequences. These cases require very low input of information. Because they start from the wrong framework, that new functions MUST require high amounts of information, this is all dismissed as “not explaining how high amounts of FI are arrived at, just examples of microevolution”.

But that is exactly the point. We have no evidence that evolution needs anything more than “micro”evolution to work, if this is what they mean by microevolution.
swamidass on March 10, 2017 at 6:06 pm said:

stcordova: I really want to thank Dr. Swamidass for this paper, even though I feel bad for having to disagree with my friend Kirk on some points, I don’t think Kirk’s work will be in vain, it just needs a different application.

His work is important. It demonstrates that there is a very high amount of information in biological sequences. He misidentifies the reason for the information (by not considering any other sources except functional constraints).

I’ve shown here that a large amount of that information is a record of history. That information is what gives us such strong evidence for common ancestry, and this is the core claim of evolution. This information content, and its patterns across biology, is actually strong evidence for evolution.
colewd on March 10, 2017 at 6:26 pm said:

swamidass,

Besides, this misses the point anyway. The number of functional sequences tells us NOTHING about evolvability. The much more important question is how close a new function is to extant sequences. We find that new functions are close to extant sequences. These cases require very low input of information. Because they start from the wrong framework, that new functions MUST require high amounts of information, this is all dismissed as “not explaining how high amounts of FI are arrived at, just examples of microevolution”.

I agree with this. Do you think you can make a case for universal common descent? There are many transitions where lots of molecular novelty appears, like the prokaryotic to eukaryotic transition or the first multicellular life.
Allan Miller on March 10, 2017 at 7:01 pm said:

colewd,

I agree with this. Do you think you can make a case for universal common descent? There are many transitions where lots of molecular novelty appears, like the prokaryotic to eukaryotic transition or the first multicellular life.

But there is no particular reason to suppose that this did not occur by the rather routine probing of protein neighbourhoods, by mechanisms that include domain transposition as well as point mutation. The extinction of contemporary lineages is all that is required to make a ‘transition’ appear sudden and rich in change.
Tom English on March 10, 2017 at 7:30 pm said:

Allan Miller: The extinction of contemporary lineages is all that is required to make a ‘transition’ appear sudden and rich in change.

Well said.
Tom English on March 10, 2017 at 7:48 pm said:

I changed my code very little (see the italics in two consecutive lines below) to replicate the first simulation experiment of Matlock and Swamidass. There is no way to tell from the text that (1) mutations are i.i.d. Poisson [actually, there’s a “mutation” whenever the variate is greater than 0] and (2) “mutation” does not always change an amino acid. The caption of Figure 1 describes the process poorly: “Here, extant sequences 150 amino acids long, and a fixed number of mutations away from a single ancestral sequence are sampled…”

    from math import log
    from scipy.stats import entropy
    import numpy as np
    import numpy.random as rand
    import matplotlib.pyplot as plt

    H_MAX = log(20, 2)
    UNIFORM = np.ones(20) / 20.0

    def sample1(mutation_rate, length=150, sample_size=10**4):
        # changed: a site is "mutated" whenever the Poisson variate is > 0
        mutated = rand.poisson(mutation_rate, (sample_size, length)) > 0
        # changed: mutants drawn from all 20 residues, including ancestral 0
        mutants = rand.choice(range(0, 20), (sample_size, length))
        sample = np.where(mutated, mutants, 0)
        sample[0, :] = 0
        entropies = np.zeros(sample_size)
        counts = np.zeros((length, 20), dtype=int)
        # range replaces the Python 2 xrange of the original post
        for n in range(sample_size):
            for i in range(length):
                counts[i] += np.bincount(sample[n:n+1, i], minlength=20)
                entropies[n] += entropy(counts[i], base=2)
        return sample, length * H_MAX - entropies
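
The two points about the mutation process can be made concrete with a short calculation. The sketch below follows the replication code above, not the paper’s text, and the function name is made up: a site counts as “mutated” when a Poisson variate with the given rate is greater than zero, and the replacement residue is drawn uniformly from all 20 amino acids, so it matches the ancestral residue one time in twenty.

```python
import math


def prob_site_changes(rate, alphabet_size=20):
    """Probability that a site ends up different from the ancestral residue."""
    p_event = 1.0 - math.exp(-rate)                   # P(Poisson(rate) > 0)
    p_different = (alphabet_size - 1) / alphabet_size  # redraw is not the ancestor
    return p_event * p_different


for rate in (0.1, 1.0, 5.0):
    print(rate, round(prob_site_changes(rate), 3))
```

Under this reading the fraction of changed sites saturates at 19/20 however large the rate becomes, which is one way in which the process differs from “a fixed number of mutations away from a single ancestral sequence”.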
GlenDavidson on March 10, 2017 at 8:27 pm said:

colewd: Do you think you can make a case for universal common descent?

Do you think you can make a case against it?

More importantly, do you think you could make a decent case for design? Without using the fallacy of the false dilemma?

It would be interesting if you could, and even if you could acknowledge that you need actual evidence in favor of design in order to conclude design. No science exists without evidence for the model or theory, which is why ID is no science.

Glen Davidson
J-Mac on March 10, 2017 at 11:29 pm said:

Who determines which information is valid?

If the information is not valid according to the current standards, who does the job belong to, we would like to know?
keiths on March 11, 2017 at 12:40 am said:

swamidass,

Let me expand on my earlier comment about how volume is far less important than extent when assessing the “size” of genotype networks.

You and Matlock write:

Evolving a specific function may be easy if a large number of proteins and protein families are just a few mutational steps away from the new function.

Wagner’s lab has investigated this question. From Arrival of the Fittest:

Evandro focused on enzymes because they are an extremely diverse group of proteins— no surprise, since they catalyze more than five thousand different chemical reactions. They are also especially well studied: Thousands of them scattered throughout the library have been mapped. Their locations are precisely known, and we can use computers to analyze them. Evandro asked his computer to choose a pair of proteins with the same fold, but in different places on the same genotype network. He then explored a small neighborhood around the first protein, and listed all known proteins in it, together with their function. After that, he explored the neighborhood of the second protein, and listed all known proteins and their functions in its neighborhood. Finally, he compared these lists, asking simply whether they were different, whether proteins in the two neighborhoods had different functions. He then chose another protein pair, yet another pair, and so on, asking the same question for them, until he had explored hundreds of pairs and their neighborhoods.

The final answer was simple. The neighborhoods of two proteins contain mostly different functions, even if the two proteins are close together in the library. For instance, even proteins that differ in fewer than 20 percent of their amino acids have neighborhoods whose proteins differ in most of their functions. The protein library has neighborhoods that are highly diverse, just like the metabolic library. And just as with metabolism, this diversity makes vast genotype networks ideal for exploring the library, helping populations to discover texts with new meaning while preserving old and useful meaning.
keiths on March 11, 2017 at 12:44 am said:

Here’s their paper:

Evolutionary Innovations and the Organization of Protein Functions in Genotype Space

Abstract

The organization of protein structures in protein genotype space is well studied. The same does not hold for protein functions, whose organization is important to understand how novel protein functions can arise through blind evolutionary searches of sequence space. In systems other than proteins, two organizational features of genotype space facilitate phenotypic innovation. The first is that genotypes with the same phenotype form vast and connected genotype networks. The second is that different neighborhoods in this space contain different novel phenotypes. We here characterize the organization of enzymatic functions in protein genotype space, using a data set of more than 30,000 proteins with known structure and function. We show that different neighborhoods of genotype space contain proteins with very different functions. This property both facilitates evolutionary innovation through exploration of a genotype network, and it constrains the evolution of novel phenotypes. The phenotypic diversity of different neighborhoods is caused by the fact that some functions can be carried out by multiple structures. We show that the space of protein functions is not homogeneous, and different genotype neighborhoods tend to contain a different spectrum of functions, whose diversity increases with increasing distance of these neighborhoods in sequence space. Whether a protein with a given function can evolve specific new functions is thus determined by the protein’s location in sequence space.
Mung on March 11, 2017 at 1:48 am said:

The same does not hold for protein functions, whose organization is important to understand how novel protein functions can arise through blind evolutionary searches of sequence space.

I’m going to pretend I didn’t see that.
keiths on March 11, 2017 at 3:04 am said:

Why?
AhmedKiaan on March 11, 2017 at 3:11 am said:

“colewd: Do you think you can make a case for universal common descent?”

derp.

you ID Creationist guys talk a lot about biology but you never seem to crack a textbook.
swamidass on March 11, 2017 at 4:18 am said:

keiths: Here’s their paper:

Evolutionary Innovations and the Organization of Protein Functions in Genotype Space

Great find. Thanks for that. Revision of paper will include this reference for sure.
swamidass on March 11, 2017 at 4:23 am said:

Tom English: I changed my code very little (see the italics in two consecutive lines below) to replicate the first simulation experiment of Matlock and Swamidass. There is no way to tell from the text that (1) mutations are i.i.d. Poisson [actually, there’s a “mutation” whenever the variate is greater than 0] and (2) “mutation” does not always change an amino acid. The caption of Figure 1 describes the process poorly: “Here, extant sequences 150 amino acids long, and a fixed number of mutations away from a single ancestral sequence are sampled…”

This is a good point. We need to clarify the text or change the simulation to match more clearly. This does not actually affect the simulation much, and mainly just speeds up the computation. Regardless, it is a slight difference, so we should bring that in line. It just alters a little bit how quickly things converge.

Very glad you looked at it and raised that to my attention again. I forget these details sometimes when I am not the one implementing the code.
stcordova on March 11, 2017 at 9:48 am said:

swamidass:

We find that new functions are close to extant sequences.

Only sometimes, but not in the case of orphans. You probably were aware of Paul Nelson and Richard Buggs’s publication in Next Generation Systematics.

http://www.cambridge.org/catalogue/catalogue.asp?isbn=9781107028586

Buggs also published a paper in Nature just last year that shows specific orphans:

https://natureecoevocommunity.nature.com/users/24561-richard-buggs/posts/14227-the-unsolved-evolutionary-conundrum-of-orphan-genes

This week in Nature I and my co-authors published the ash tree genome. Within it we found 38,852 protein-coding genes. Of these one quarter (9,604) were unique to ash. On the basis of our research so far, I cannot suggest shared evolutionary ancestry for these genes …..

Orphan genes are found every time a new genome is sequenced. Their ubiquity has been one of the biggest surprises of genomics over the last 20 years. Many researchers had hypothesised that the number of orphan genes found would steadily diminish as more and more genomes were sequenced – but this is not the case. Orphan genes continue to comprise a sizeable proportion of each new genome sequenced. I and Paul Nelson reviewed this topic in a chapter of a book published this year by Cambridge University Press: “Next Generation Systematics”.

Orphan genes are “the hard problem” for evolutionary genomics. Because we can’t find other genes similar to them in other species, we can’t build family trees for them
OMagain on March 11, 2017 at 10:16 am said:

stcordova: Only sometimes, but not in the case for orphans

And therefore….
Rumraket on March 11, 2017 at 12:10 pm said:

Their nucleotide sequences are almost never unique, and I only say almost because there might be exceptions to what I know. I don’t know of a single case of a protein coding gene in any species that does not have a homologous DNA sequence in other species. The amino acid sequences might be unique and lack sequence similarity to any other proteins, but that doesn’t mean what Sal is trying to imply. He obviously wants to insinuate these proteins could not have evolved, but he forgets to consider whether they evolved from noncoding DNA, rather than from other proteins.

And then there’s the fact that these putative orfan genes are usually identified by searching for open reading frames in DNA sequence (but then homologies are searched for using predicted amino acid sequence), but upon further analysis turn out to not actually be genes since they’re never translated.

Further still, it also often turns out that even if they’re both transcribed and translated, they’re the products of spurious transcription and translation, exist at extremely low levels and actually aren’t functional proteins. Or at least, it isn’t known whether they ever get translated at a physiologically significant level that would indicate they’re functional.

All these hurdles need to be passed before we can conclude they’re truly ORFan, functional protein coding genes.

You’ll never hear Sal giving such a list of caveats. He’s only interested in advertising for a particular conclusion, and things that might cast doubt on that conclusion never merit a mention from him. It’s a form of dishonesty. It’s actually also a fallacy in inductive logic (fallacy of exclusion) since it violates the principle of total evidence.

Fact or fiction: updates on how protein-coding genes might emerge de novo from previously non-coding DNA.
Schmitz JF, Bornberg-Bauer E.

Abstract
Over the last few years, there has been an increasing amount of evidence for the de novo emergence of protein-coding genes, i.e. out of non-coding DNA. Here, we review the current literature and summarize the state of the field. We focus specifically on open questions and challenges in the study of de novo protein-coding genes such as the identification and verification of de novo-emerged genes. The greatest obstacle to date is the lack of high-quality genomic data with very short divergence times which could help precisely pin down the location of origin of a de novo gene. We conclude that, while there is plenty of evidence from a genetics perspective, there is a lack of functional studies of bona fide de novo genes and almost no knowledge about protein structures and how they come about during the emergence of de novo protein-coding genes. We suggest that future studies should concentrate on the functional and structural characterization of de novo protein-coding genes as well as the detailed study of the emergence of functional de novo protein-coding genes.
Mung on March 11, 2017 at 3:57 pm said:

Durston, quite clearly, is talking about a particular kind of evolutionary process. He’s also clear about what he means by impossible. There is no need to misrepresent him other than a lack of good will.

And am I the only one who finds it odd that swamidass, as a theistic evolutionist, is arguing that a blind and mindless evolutionary process can do what God cannot do?

Only at TSZ. 🙂
stcordova on March 11, 2017 at 4:51 pm said:

I don’t know of a single case of a protein coding gene in any species that does not have a homologous DNA sequence in other species.

The problem is that they are homologous in ways that don’t help resolving major macro-evolutionary transitions. Insulin is a characteristic of vertebrates, it is homologous in a lot of vertebrates, but we don’t see them in bacteria or yeast. So that sort of homology doesn’t help the case for evolvability, it actually hurts it since it looks even more like a poof out of nowhere because of which organisms share the homology and which don’t.

So insulin just sort of “poofs” on the scene with all the machinery like cell receptors and all sorts of regulatory apparatus to support it. We know that if insulin is misregulated on the excess side, it is fatal. So the homology argument and emphasizing similarity in extant species doesn’t help much if the homology is in the wrong places.

But going back to Dr. Swamidass’s paper, I think it is an excellent critique of MISA and FI. I hope Tom’s comments can improve the paper beyond where it is already.

That said, Kirk’s MISA measure is already implicitly used in identifying functional domains.

https://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml

It’s not like Kirk is going into territory that is totally foreign. MISA formalizes something that is already informally accepted. I think the issue is whether MISA describes improbability of forming function, or functional information. I have my reservations about that, but it could be a more appropriate tool for another application such as already being practiced by industry as attested by the Conserved Domain databases.
Allan Miller on March 11, 2017 at 5:28 pm said:

stcordova,

Insulin is a characteristic of vertebrates, it is homologous in a lot of vertebrates, but we don’t see them in bacteria or yeast. So that sort of homology doesn’t help the case for evolvability, it actually hurts it since it looks even more like a poof out of nowhere because of which organisms share the homology and which don’t.

It doesn’t hurt it at all. There is a simple mechanism by which an accumulation of small changes can look like a poof out of nowhere when it isn’t. You need to exclude that.

Further, ‘looking like a poof out of nowhere’, when you find homology in the DNA sequences, looks like a poof out of nowhere only to someone who is determined to find something they can say looks like a poof out of nowhere.

Nor is this concentration on the noise and ignoring the signal much use to YEC, which sees entire organisms poofing out of nowhere. The genomes say no.
colewd on March 11, 2017 at 6:18 pm said:

swamidass,

Moreover, the pervasive observation of multifunctional proteins suggests that functions are actually very close to one another and abundant. Multifunctional proteins would be impossible if the FI argument against evolution were true.

This is taken as an assumption based on observation. There must be differences in sequences in order to bind consistently to the right proteins. Your claim is “very close”. Does that mean alignment except for a few AAs? Do you think if we look at nuclear proteins that facilitate the cell cycle this is what we will observe?
Rumraket on March 11, 2017 at 6:21 pm said:

stcordova: The problem is that they are homologous in ways that don’t help resolving major macro-evolutionary transitions. Insulin is a characteristic of vertebrates, it is homologous in a lot of vertebrates, but we don’t see them in bacteria or yeast.

Maybe so, but that might be because they’re so distantly related there’s no sequence similarity left to go by and as such, we need to solve crystal structures to detect distant homologs.

They go beyond vertebrates though: http://pfam.xfam.org/family/PF00049.16

The insulin/IGF/relaxin family is a group of evolutionary related proteins which possess a variety of hormonal activities.[1] Family members in human include two subfamilies:
1) insulin and insulin-like growth factors[2]
2) relaxin family peptides:
relaxins 1 and 2
relaxin 3
Leydig cell-specific insulin-like peptide (gene INSL3)[3]
early placenta insulin-like peptide (ELIP) (gene INSL4)[4]
insulin-like peptide 5 (gene INSL5)
insulin-like peptide 6

Structure
These proteins are characterized by having three disulfide bonds in a characteristic motif. Some family members have an additional disulfide bond also in a conserved location. All of these proteins have a helical segment (corresponding to B chain in insulin) followed by a variable-length chain, followed by a domain (A chain in insulin) with two helices pinned against each other via a disulfide bond. These two regions are linked by two or three disulfide bonds.

Amongst the different proteins in the family, very little of the sequence is conserved except for the disulfide bonds. The variable-length chains may exhibit large inter-species variation even when the remainder of the sequence is highly conserved; and as is in the case of insulin, sometimes the variable length chain is cleaved out by secretory endoproteases, leaving a two-chain protein held together by disulfide bonds.

Click the InterPro tab and you get this:

InterPro entry IPR016179

The insulin family of proteins groups together several evolutionarily related active peptides [PUBMED:6107857]: these include insulin [PUBMED:6243748, PUBMED:503234], relaxin [PUBMED:10601981, PUBMED:8735594], insect prothoracicotropic hormone (bombyxin) [PUBMED:8683595], insulin-like growth factors (IGF1 and IGF2) [PUBMED:2036417, PUBMED:1319992], mammalian Leydig cell-specific insulin-like peptide (gene INSL3), early placenta insulin-like peptide (ELIP) (gene INSL4), locust insulin-related peptide (LIRP), molluscan insulin-related peptides (MIP), and Caenorhabditis elegans insulin-like peptides. The 3D structures of a number of family members have been determined [PUBMED:2036417, PUBMED:1319992, PUBMED:9141131]. The fold comprises two polypeptide chains (A and B) linked by two disulphide bonds: all share a conserved arrangement of 4 cysteines in their A chain, the first of which is linked by a disulphide bond to the third, while the second and fourth are linked by interchain disulphide bonds to cysteines in the B chain.

Insects, molluscs, C. elegans? To someone like you that might just push the problem further back, but eventually you have to concede that at those distances, the similarity signal is expected to be lost anyway, either because the entire sequence gets rewritten or because none of the lineages carrying homologues have survived to the present. All of those options are consistent with macroevolution and common descent.
colewd on March 11, 2017 at 6:23 pm said:

Allan Miller,

It doesn’t hurt it at all. There is a simple mechanism by which an accumulation of small changes can look like a poof out of nowhere when it isn’t. You need to exclude that.

On what basis are you claiming that data which is contradictory to Dr. Swamidass’s hypothesis is not contradictory or problematic?
Rumraket on March 11, 2017 at 6:30 pm said:

Mung: Durston, quite clearly, is talking about a particular kind of evolutionary process. He’s also clear about what he means by impossible. There is no need to misrepresent him other than a lack of good will.

Can you explain in your own words, what “particular kind” of evolutionary process Kirk Durston has shown is impossible, and what he means by impossible?

And am I the only one who finds it odd that swamidass, as a theistic evolutionist, is arguing that a blind and mindless evolutionary process can do what God cannot do?

He isn’t arguing that. Rather, he seems to be arguing that God isn’t required for certain evolutionary transitions to be plausible, or to take place. This is different from claiming God can’t do them. Rather, there’s no good reason to invoke him for this particular problem (in the same way you wouldn’t need to invoke God to explain planetary orbits, or the ballistic trajectory of cannon balls).

To quote the saying commonly attributed to Laplace: “I had no need of that hypothesis”.
Rumraket on March 11, 2017 at 6:32 pm said:

colewd: On what basis are you claiming that data which is contradictory to Dr. Swamidass’s hypothesis is not contradictory or problematic?

That a simple and observed mechanism accounts for it. Time and extinction.
colewd on March 11, 2017 at 6:37 pm said:

Rumraket,

That a simple and observed mechanism accounts for it. Time and extinction.

How would you falsify his hypothesis?
GlenDavidson on March 11, 2017 at 6:49 pm said:

colewd:
Rumraket,

How would you falsify his hypothesis?

Observe evidence for other mechanisms that operated in the past that could account for it better.

Is that so hard to fathom?

Have you observed evidence for mechanisms that operated in the past that could account for it? Poofs, Godly actions, etc.? Actual evidence for design–not your fantastic privileging of evidenceless design as the “only alternative”? No? Then, presently understood causes are not falsified.

Glen Davidson
stcordova on March 11, 2017 at 6:56 pm said:

Allan Miller:

There is a simple mechanism by which an accumulation of small changes can look like a poof out of nowhere when it isn’t. You need to exclude that.

Like the ancestor having the gene and then some extant organisms losing it while others retain it?

OK, so the ancestor of yeast and vertebrates had insulin proteins? The “mechanism” only looks good superficially in phylogenetic explanations until the details are considered from a mechanical standpoint.

The evolvability of insulin entails not just the insulin proteins but the machinery that makes sense of the insulin protein. We created transgenic yeast and bacteria with insulin in the process of developing treatments for diabetes. Did yeast and bacteria with insulin genes have much use for the insulin? Doubtful, and even if they did use the insulin, it certainly would not be in the way vertebrates use insulin.

There are better ways to argue functional improbability and un-evolvability than MISA. MISA is good for ID’s goal of identifying steganography. The patterns of diversity and similarity are not evidence of common descent, imho, but are partly there to create scientific discoverability. It is an intelligently designed feature.

The lysine residue on the lipoyl domain looked 95%+ conserved, in contrast with the lack of conservation at the neighboring positions. It was a dead giveaway pointing to the critical residue, which I highlight in yellow on the molecule and in yellow in the “MISA” region where the “K” (lysine) appears (below).

Now one might complain, “that’s just one lousy amino acid residue, nothing to crow about.” But the issue is all the machinery that makes that one residue highly important. It is a critical site where a key co-factor (lipoic acid) has to attach. It is a highly conserved single residue!

You can see the power of the MISA in the yellow highlight of the sequences on the left of the figure below and the resulting identification of the corresponding critical lysine “K” residue highlighted in yellow on the right side of the diagram below.

You can get more details here:
https://www.ncbi.nlm.nih.gov/Structure/cdd/cddsrv.cgi?uid=133458
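
The kind of per-position conservation described above (one column above 95% conserved, its neighbors not) can be sketched in a few lines. This is only a rough illustration under stated assumptions: the four toy sequences are made up, the function names are hypothetical, and the score is a plain "fraction of the most common residue" per column. It is not Durston's MISA or functional information measure, and it is not the method the Conserved Domain Database uses.

```python
from collections import Counter


def column_conservation(alignment):
    """Return (most common residue, its fraction) for each alignment column.

    `alignment` is a list of equal-length, already-aligned sequences.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for column in zip(*alignment):  # walk the alignment column by column
        residue, count = Counter(column).most_common(1)[0]
        scores.append((residue, count / len(column)))
    return scores


def conserved_positions(alignment, threshold=0.95):
    """Return 0-based column indices whose top residue meets the threshold."""
    return [
        index
        for index, (_, fraction) in enumerate(column_conservation(alignment))
        if fraction >= threshold
    ]


# Made-up toy alignment (NOT real lipoyl domain sequences): only the
# middle column, a lysine (K), is identical in every sequence.
toy_alignment = ["AEKTG", "SDKVA", "GEKIS", "TQKLG"]

for position in conserved_positions(toy_alignment):
    residue, fraction = column_conservation(toy_alignment)[position]
    print(f"column {position}: {residue} conserved in {fraction:.0%} of sequences")
```

A score like this only says that a column is invariant in the sampled sequences. Whether that reflects functional constraint, shared history, or sampling is exactly the question under dispute in this thread.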
Allan Miller on March 11, 2017 at 6:57 pm said:

colewd,

On what basis are you claiming that data which is contradictory to Dr. Swamidass’s hypothesis is not contradictory or problematic?

You seem to be labouring under the misapprehension that data which is not supportive of a hypothesis is antithetical to it. Labour no more.
stcordova on March 11, 2017 at 7:06 pm said:

Rumraket argues his point by citing literature which says:

Amongst the different proteins in the family, very little of the sequence is conserved.

Uh, Rumraket, are you sure you want to highlight the problems with your hypothesis such as lack of sequence conservation as proof of your claim of evolvability? 🙂
Rumraket on March 11, 2017 at 7:10 pm said:

colewd: How would you falsify his hypothesis?

Parsimony. I guess you could say that, in a sense, there’s a point at which invoking things like erosion of phylogenetic signal due to time and extinction is being called upon to account for so much data it overwhelms the main signal (not sure I got this right, someone tell me if this doesn’t make sense).

Or you could have all the orphan gene families converge on the same point in time (basically, the whole organism looks like it “popped up”, as opposed to a small subset of its genes), and if there was no other data indicating evolution happened before this time, you then technically wouldn’t have any evidence that evolution produced that particular set of genes, or that the organism was at that stage the result of an evolutionary process.
Allan Miller on March 11, 2017 at 7:11 pm said:

stcordova,

Like the ancestor having the gene and then some extant organisms losing it while others retain it?

No. I’d hoped you’d ask rather than go off at a tangent on what you thought I might have meant.

Here’s an illustration of the mechanism I alluded to.

Take a sheet of paper. Photocopy it. Photocopy the photocopies. Do this for a few ‘generations’. From time to time, make marks on random sheets. Given that we are illustrating the relationship between small change and retrospective determination of ‘poof’, you are only allowed to make one mark on any given sheet.

At the end of a period, score the sheets according to the number of differences between them. Take the copies with the greatest difference, and destroy all the others.

In that remaining dataset, are we looking at ‘poof’ – a massive number of differences happening all at once – or increment?
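
The photocopy exercise above can be run as a small simulation. This is a minimal sketch under assumed parameters: every sheet is copied twice per generation, each fresh copy gets at most one new mark with an assumed probability of 0.3, and only the single most different pair is kept at the end. The function names and numbers are illustrative choices, not part of the original illustration.

```python
import random
from itertools import combinations


def photocopy(generations=8, mark_prob=0.3, seed=0):
    """Copy every sheet twice per generation, adding at most one mark per copy.

    Each sheet is represented as a frozenset of unique mark ids.
    """
    rng = random.Random(seed)
    sheets = [frozenset()]  # start from one unmarked original
    next_mark = 0
    for _ in range(generations):
        copies = []
        for sheet in sheets:
            for _ in range(2):
                marks = set(sheet)
                if rng.random() < mark_prob:  # "from time to time"
                    marks.add(next_mark)      # never more than one new mark
                    next_mark += 1
                copies.append(frozenset(marks))
        sheets = copies
    return sheets


def most_different_pair(sheets):
    """Keep the two sheets that differ by the most marks; discard the rest."""
    return max(combinations(sheets, 2), key=lambda pair: len(pair[0] ^ pair[1]))


sheets = photocopy()
a, b = most_different_pair(sheets)
print("sheets produced:", len(sheets))
print("marks separating the two survivors:", len(a ^ b))
print("largest change in any single copying step: 1 (by construction)")
```

With the intermediates destroyed, the two survivors can be separated by many marks even though no single copying step ever added more than one.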
Rumraket on March 11, 2017 at 7:13 pm said:

stcordova:
Rumraket argues his point by citing literature which says:

Uh, Rumraket, are you sure you want to highlight the problems with your hypothesis such as lack of sequence conservation as proof of your claim of evolvability?

This is literally the opposite of a problem. The fact that the sequence can change so much yet retain structure and function is what testifies to the possibility of long-term evolutionary change, and the plausibility of gradual erasure of sequence-based phylogenetic signal.
Allan Miller on March 11, 2017 at 7:15 pm said:

stcordova,

Uh, Rumraket, are you sure you want to highlight the problems with your hypothesis such as lack of sequence conservation as proof of your claim of evolvability?

You yourself, upthread, have agreed that structural conservation can remain even with the complete loss of sequence signal.
colewd on March 11, 2017 at 7:22 pm said:

Allan Miller,

You seem to be labouring under the misapprehension that data which is not supportive of a hypothesis is antithetical to it. Labour no more.

How do you deal with evidence that does not support the hypothesis? At what point does it become problematic?
Allan Miller on March 11, 2017 at 7:24 pm said:

colewd,

How do you deal with evidence that does not support the hypothesis? At what point does it become problematic?

At the point it opposes it. The hypothesis is not ‘every single sequence can be determined to be related to every other’.
colewd on March 11, 2017 at 7:33 pm said:

Allan Miller,

At the point it opposes it. The hypothesis is not ‘every single sequence can be determined to be related to every other’.

Are you arguing that Dr Swamidass has made an untestable claim?
stcordova on March 11, 2017 at 7:35 pm said:

Allan Miller:

You yourself, upthread, have agreed that structural conservation can remain even with the complete loss of sequence signal.

Absolutely I said so! The eye form is “conserved” (ahem, converged) between the octopus and human, but they aren’t related by common descent.

Like the eye form, at the molecular scale, structural form (aka secondary and tertiary structures) can be highly similar but the way the structures are built in terms of sequences can be highly dissimilar.

Similarity of tertiary structure is not evidence of common descent. For that matter, similarity of secondary structure is not evidence of common descent. For example, the secondary structure known as the alpha helix in proteins (depicted below) can be implemented by a bazillion evolutionarily unrelated sequences. All you need are amino acids that are alanine-like, with about the same Ramachandran plot.

By way of extension, it is not hard to see it is also possible tertiary structures can be “homologous” (in the Owenian sense, not Darwinian sense) without proceeding from a common ancestor.

So convergence on a common tertiary structure is no more evidence of common ancestry than the convergence of similar eyes in humans and octopuses is evidence that their eyes came from a common ancestor who had that eye.