How to calculate amino acid sequence space

I see that Mung, a long-time commenter at Uncommon Descent, asks in a thread entitled “Backwards eye wiring? Lee Spetner comments”:

How do you calculate the size of amino acid sequence space?

As this seems somewhat off-topic there, I thought I’d attempt to answer Mung’s question. I’ll try and be brief. The two most fascinating biochemicals are nucleic acids (RNA and DNA) and proteins. Proteins seem ubiquitous in cellular systems; they function as catalysts (enzymes), structural elements (keratin, collagen), signal molecules (hormones, pheromones), binding agents (antibodies). Proteins are linear sequences of amino acids joined by a condensation (called so because a molecule of water is lost) reaction forming a peptide bond. There are twenty-one amino-acids found in eukaryotes and twenty of them are directly represented in the genetic code. The special case is selenocysteine which is coded indirectly and I’ll leave that out of the calculation for the sake of simplicity.

So what number of different amino acid sequences could theoretically exist, given twenty possibilities for each aa in the polymer? I guess we shouldn’t count twenty monomers. For dimers, there are 400 possibilities. For trimers, we have 8,000 and so on. The general formula for the number of theoretically possible different protein sequences of length n is 20^n. So the answer for all possible sequences is the sum of this calculation from n=2 to, well, what? There are some very large proteins; titin being the largest known at around 30,000 aa’s. So I guess we should sum at least to that number.

This is a very big number indeed! I leave it as an exercise for the reader to try representing the number that results when taking the upper limit of n as 30,000. 🙂
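For anyone who wants to attempt the exercise, here is a minimal sketch in Python. The function names are mine and purely illustrative. It uses the closed form of the geometric series rather than adding up every term, and it counts digits with a logarithm because printing an integer this large runs into the default digit limit in recent Python versions.

```python
import math

ALPHABET_SIZE = 20  # the twenty directly coded amino acids (selenocysteine left out)


def sequences_of_length(n: int, alphabet: int = ALPHABET_SIZE) -> int:
    """Number of distinct sequences of exactly n residues: alphabet^n."""
    return alphabet ** n


def total_sequences(max_len: int, min_len: int = 2,
                    alphabet: int = ALPHABET_SIZE) -> int:
    """Sum of alphabet^n for n = min_len .. max_len, via the geometric series."""
    return (alphabet ** (max_len + 1) - alphabet ** min_len) // (alphabet - 1)


def decimal_digits(x: int) -> int:
    """Approximate digit count via log10; fine unless x sits right at a power of ten."""
    return math.floor(math.log10(x)) + 1


if __name__ == "__main__":
    total = total_sequences(30_000)
    print("decimal digits in the total:", decimal_digits(total))
```

With the upper limit set to 30,000 the total has 39,031 decimal digits, which is a little over 8 × 10^39,030.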

Now I’ve answered Mung’s question, would he like to enlarge on what it signifies?

ETA categories and remove tautology

ETA 2 correction 20^n not n^{20} (hat tip Joe Felsenstein)

182 thoughts on “How to calculate amino acid sequence space”

  1. Um, actually not n^{20} but 20 \times 20 \times 20 \times \dots \times 20 which is 20^n.

    As Alan indicates this is a very very big number. Of course any two sequences are no more than n amino acid replacements apart. In that sense the space does not look anything like a two- or three-dimensional space. It is closer in structure to the space of acquaintanceships among people, where the 300 million people in the U.S. are on average only 6 “degrees of separation” apart. It is impossible to draw a space like this in any simple way that is not visually confusing.
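    The “no more than n replacements apart” point can be sketched as a Hamming distance between two sequences of equal length. This is only an illustration: the function names and the example sequences are made up, and the neighbour count assumes the same twenty-letter alphabet used in the post.

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def single_step_neighbours(n: int, alphabet: int = 20) -> int:
    """How many sequences lie exactly one replacement away from a length-n sequence."""
    return n * (alphabet - 1)


if __name__ == "__main__":
    # Hypothetical four-residue peptides in one-letter code.
    print(hamming_distance("MKVL", "MRVL"))   # differ at one position
    print(hamming_distance("AAAA", "GGGG"))   # differ everywhere, so distance equals n
    print(single_step_neighbours(100))        # immediate neighbours of a 100-mer
```

    The distance can never exceed the sequence length, while every sequence has 19n immediate neighbours, which is why the space is so hard to draw.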

  2. Worth noting that calculations based upon 20 distinct ‘letters’ are misleading. Amino acids group on chemical property – some acids simply have different lengths of side chain for example, but the groups capping those chains are the same, which renders a hydrophobic residue (say) substitutable by another at many of its sites. It introduces a small conformational kink due to steric interactions, but these are swamped by the bulk of the peptide chain, which continues to fold much the same either way.

    The relevance of this can be noted by reference to another favourite trope of the ‘design’ school, the fault tolerance of the genetic code. Misreads tend to fall into a neighbourhood which has the same chemical property as the ‘correct’ acid. If proteins were as inviolably specific as implied by v^n calculations, this property of the code(/”code”!) would not exist – all misreads would be duds.

    This is not true at every residue, of course – certain sites are more fixed than others. Either way, bumping the numbers up only matters if you are Fred Hoyle, or one of his disciples.
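    To put a rough number on the grouping argument, here is a sketch that treats residues in the same chemical class as interchangeable. The five-class grouping below is a coarse illustration I have chosen for the example, not a standard classification and not something taken from this thread, and it ignores the caveat that some sites are more fixed than others.

```python
import math

# Illustrative grouping only: one-letter codes sorted into five coarse classes.
ILLUSTRATIVE_CLASSES = {
    "hydrophobic": "AVLIMFW",
    "polar": "STNQCY",
    "positive": "KRH",
    "negative": "DE",
    "special": "GP",
}

CLASS_OF = {aa: name
            for name, members in ILLUSTRATIVE_CLASSES.items()
            for aa in members}


def class_pattern(seq: str) -> tuple:
    """Replace each residue by its class, so 'equivalent' sequences compare equal."""
    return tuple(CLASS_OF[aa] for aa in seq)


def log10_space(n: int, alphabet: int) -> float:
    """log10 of alphabet^n, i.e. the order of magnitude of the sequence space."""
    return n * math.log10(alphabet)


if __name__ == "__main__":
    n = 100
    print("20 letters :", round(log10_space(n, 20), 1))
    print("5 classes  :", round(log10_space(n, len(ILLUSTRATIVE_CLASSES)), 1))
    print(class_pattern("LD") == class_pattern("IE"))  # same class pattern
```

    For a 100-residue chain, that is about 10^130 sequences at the level of individual letters against about 10^70 distinct class patterns under this particular grouping. Real substitutability varies by site, so the true effect lies somewhere between the two.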

  3. Allan Miller:
    Worth noting that calculations based upon 20 distinct ‘letters’ are misleading. Amino acids group on chemical property – some acids simply have different lengths of side chain for example, but the groups capping those chains are the same, which renders a hydrophobic residue (say) substitutable by another at many of its sites. It introduces a small conformational kink due to steric interactions, but these are swamped by the bulk of the peptide chain, which continues to fold much the same either way.

    The relevance of this can be noted by reference to another favourite trope of the ‘design’ school, the fault tolerance of the genetic code. Misreads tend to fall into a neighbourhood which has the same chemical property as the ‘correct’ acid. If proteins were as inviolably specific as implied by v^n calculations, this property of the code(/”code”!) would not exist – all misreads would be duds.

    This is not true at every residue, of course – certain sites are more fixed than others. Either way, bumping the numbers up only matters if you are Fred Hoyle, or one of his disciples.

    Also, with regard to the hypothetical origins of translation, the genetic code, and the first proteins: assuming the amino acids were first synthesized abiotically, not only would the available amino acid alphabet have been smaller (see this), the amino acids would not have existed in equal amounts either. So at any given site of a residue in a hypothetical primordial “first peptide”, the individual acids are not equiprobable.
    That means the kind of peptide that will spontaneously assemble (or perhaps be “randomly translated”) from such a mixture will at any given site be significantly more likely to have a Glycine-residue, than an Alanine residue. And Alanine will be more likely than Arginine, which will be more likely than Valine and so on. The paper I linked gives the estimated relative abundances.

    With this in mind, it becomes significantly harder to calculate the probability of some given peptide sequence, while it also significantly biases the types of outcomes that are bound to predominate.
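    As a sketch of what non-equiprobable residues do to the arithmetic: if each site is filled independently, the probability of a given peptide is the product of the per-residue frequencies rather than (1/alphabet)^n. The frequencies below are invented for illustration and are not the abundances estimated in the linked paper; the three-letter alphabet and the function name are likewise assumptions.

```python
import math

# Made-up relative abundances for a tiny three-letter alphabet (must sum to 1).
HYPOTHETICAL_FREQS = {"G": 0.5, "A": 0.3, "V": 0.2}


def log10_probability(seq: str, freqs: dict) -> float:
    """log10 of the chance of drawing seq, with each site filled independently."""
    return sum(math.log10(freqs[aa]) for aa in seq)


if __name__ == "__main__":
    uniform = {aa: 1 / len(HYPOTHETICAL_FREQS) for aa in HYPOTHETICAL_FREQS}
    for peptide in ("GGGG", "GAGA", "VVVV"):
        biased = 10 ** log10_probability(peptide, HYPOTHETICAL_FREQS)
        flat = 10 ** log10_probability(peptide, uniform)
        print(peptide, f"biased={biased:.4f}", f"uniform={flat:.4f}")
```

    Under the uniform assumption every four-residue peptide is equally likely; under the biased mixture the glycine-rich ones dominate and the valine-rich ones become rare, which is the bias in outcomes described above.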

    Here’s an extremely interesting fact to consider the implications of: In phylogenetic analysis of the oldest conserved protein domains in life, there is a statistically significant bias in the relative abundance of amino acids that resembles the distribution seen in prebiotic synthesis experiments.
    Do I need to point out the implication of this? I will: Translation originated in some kind of environment where amino acids were most probably synthesized abiologically.
    Also, and it’s been a while since I read something on this so I might be misremembering something, but I seem to remember that there is something about the genetic code itself (which codons are assigned to which amino acids) that implied the earlier stages of the code contained fewer codons, and that those codons coded for the same amino acids which are most frequently seen synthesized in various abiotic settings.

    This also implies, I think, that if the translational apparatus originated in some kind of RNA-world perhaps augmented with some very small peptides (big if, I admit), those peptides should be small enough and be of such amino acid sequences that their eventual reproduction through random translation was probable. Which implies they’d also contain mostly glycines, Alanines and so on.

    It would be very interesting to see if ribosomal translation could be made to work without ribosomal proteins, or perhaps if the ribosome is functionally dependent on the presence of some proteins, what their sequences are.

  4. Rumraket,

    Do I need to point out the implication of this? I will: Translation originated in some kind of environment where amino acids were most probably synthesized abiologically.

    Another possibility (not exclusive): biology tends to follow paths of least resistance, thermodynamically, those being the same paths as tend to be followed by abiotic synthesis.

    It is plausible that building blocks were originally externally synthesised, but equally, when biosynthesis commenced, it was likely to start with the easy ones.

    It would be very interesting to see if ribosomal translation could be made to work without ribosomal proteins […]

    It’s certainly plausible in principle, since protein does not come close to the peptidyl transferase centre. In fact, since present ribosomal proteins are ribosomally synthesised, it’s hard to avoid the conclusion that they are only essential now. A primitive scheme of non-ribosomally generated proteins superseded by takeover of ribosomal versions has a ‘hard problem’ of sequence conservation during the transition, which is avoided by assuming that protein is a secondary feature. But until we work out how to take them out without leaving a fatal ‘hole’, it’s conjecture.

  5. Allan Miller: Another possibility (not exclusive): biology tends to follow paths of least resistance, thermodynamically, those being the same paths as tend to be followed by abiotic synthesis.
    It is plausible that building blocks were originally externally synthesised, but equally, when biosynthesis commenced, it was likely to start with the easy ones.

    Yes but you see here is the disconnect, because to synthesize amino acids, you either synthesize them abiotically, or you use protein enzymes. It is not within the catalytic repertoire of RNA to make amino acids.
    I think on the face of this, an RNA-based translation system most probably, originally, worked by starting with randomly synthesizing peptides with abiotically produced amino acids.

    With regards to your latter point, I don’t agree with this part:

    Allan Miller: In fact, since present ribosomal proteins are ribosomally synthesised, it’s hard to avoid the conclusion that they are only essential now… A primitive scheme of non-ribosomally generated proteins superseded by takeover of ribosomal versions has a ‘hard problem’ of sequence conservation during the transition, which is avoided by assuming that protein is a secondary feature.

    I think it is rather easy to avoid that conclusion, since as I suggested, you could postulate that those ribosomal proteins began initially as small peptides (so I guess, strictly not “proteins”) with highly probable sequences (abundant with glycines, Arginines and so on) making them available in the environment, without being necessarily required to be made through translation.*
    Also, the structure of the ribosomal RNA itself could have been different, with the current one being dependent on the current proteins, and the primordial one being dependent on primordial peptides. In this sense, both the ribosomal RNA and its associated proteins co-evolved into what we see today, but were always co-dependent. What makes you think the scenario you suggest is more likely?

    * After all, that is the same kind of reasoning one would erect to argue that an RNA-world could even get started: Unless it was just some extremely rare statistical fluke, there must have been something about the first replicating RNA system that made its emergence probable. This implies that it was either very small, or that the environment was somehow constraining the outcomes of random combinatorial RNA polymer generation such that the kinds of outcomes that environment produced would be of the sort that were likely to contain a system that could self-replicate (and later start to make proteins through translation).

  6. Alan Fox: You’re right, Joe. I did get the dimer and trimer right, though! Corrected.

    Amusing, really. Nick Matzke made the exact same initial mistake over at UD (and corrected it).

    So I see two ways to decrease the size of the amino acid sequence space.

    Reduce the number of amino acids or reduce the length of the sequence.

    I was just seeing if I could get Nick to agree with me. No luck. He seemed to think there was a third way.

    Anyways, Nick went off on how wrong IDists are in how they calculate the size of amino acid sequence space. You know how he is. Post and run. That’s the background. I was trying to get him to defend his claim or retract it. 😉

  7. Mung,

    Educating UD would be a full time unpaid job if you’re not careful. Many of the design advocates don’t understand basic stats!

  8. A thankless unpaid job, since no one at UD ever remembers anything from one thread to another.

  9. Meanwhile, here at TSZ, “Definitions don’t matter” is forever enshrined in our consciousness.

  10. Mung: Amusing, really. Nick Matzke made the exact same initial mistake over at UD (and corrected it).

    Yeah, well. Nick possibly had a mild dose of the man ‘flu that is still causing my brain to work at tick-over.

    So I see two ways to decrease the size of the amino acid sequence space.

    Reduce the number of amino acids or reduce the length of the sequence.

    That early organisms functioned on a smaller set of amino acids seems an unavoidable conclusion. As Allan Miller remarked upthread:

    Amino acids group on chemical property – some acids simply have different lengths of side chain for example, but the groups capping those chains are the same, which renders a hydrophobic residue (say) substitutable by another at many of its sites.

    Similarly, with sequence length, one can postulate pathways from shorter sequences to longer ones.

    On the other hand, I was wondering if you were going to raise the needle-in-a-haystack argument such as that advanced by Kirk Durston. The assumption that unknown sequences are mostly non-functional is unsupported.

  11. Rumraket,

    It is not within the catalytic repertoire of RNA to make amino acids.

    Is that definitive or provisional?

    With regards to your latter point, I don’t agree with this part: […]

    I think it is rather easy to avoid that conclusion, since as I suggested, you could postulate that those ribosomal proteins began initially as small peptides (so I guess, strictly not “proteins”) with highly probable sequences (abundant with glycines, Arginines and so on) making them available in the environment, without being necessarily required to be made through translation.

    The main reason you have for thinking protein necessary for a functioning ribosome is the presence of peptides in there now. But those peptides are ribosomally synthesised. Of course there could have been ‘other peptides’ first. But one does have a transitional problem. The ‘other peptides’ sequences or structures cannot be passed across to the ribosomally synthesised ones, so one has to, for no particularly compelling reason, imagine one set of essential peptides being replaced by another, without breaking function. I see no more reason to accept this as opposed to starting with a purely-RNA ribosome, even if one can imagine getting lucky with functional short-peptide few-acid sequences. Some ‘luck’ must be there somewhere, but it seems more likely to me that a useful peptide replaces no-peptide than that one useful peptide replaces another.

    Also, the structure of the ribosomal RNA itself could have been different, with the current one being dependent on the current proteins, and the primordial one being dependent on primordial peptides. In this sense, both the ribosomal RNA and its associated proteins co-evolved into what we see today, but were always co-dependent. What makes you think the scenario you suggest is more likely?

    I think it more likely because of the transition issue. Of course one can imagine ways round it, but I just don’t see why we have to be so keen to get peptides in there early. It seems to be based on a prejudice relating to modern biochemistry. Certainly there is co-evolution. But it is co-evolution between the rRNA and the DNA that is translated into ribosomal protein. Speculative precursor co-evolution with a different peptide generation system is possible but unnecessary. And it cannot be continuous co-evolution from the one peptide system to the other. The root of the tree of modern ribosomal protein sequences must come after the root of the tree of rRNA. Whether this is the beginning of ribosomal protein in toto is unclear, but there is no strong reason to doubt it.

  12. Mung,

    Meanwhile, here at TSZ, “Definitions don’t matter” is forever enshrined in our consciousness.

    That may be how you are reading things, but no. Definitions matter, which is why we argue about them. But what you can’t do is use a particular definition to prove something ‘out there’ in the world. Such as … ooh, I dunno, just plucking an example out of thin air here … “the genetic code is a code as per this definition, therefore … X”.

  13. Alan Fox: That early organisms functioned on a smaller set of amino acids seems an unavoidable conclusion.

    Similarly, with sequence length, one can postulate pathways from shorter sequences to longer ones.

    We seem to be in agreement then.

    The size of amino acid sequence space can be reduced by:

    1.) reducing the number of amino acids
    2.) reducing the length of the sequence.

    By the way, Joe made a good point upthread wrt visualizing the space.

    The sequence space has one dimension per amino acid or nucleotide in the sequence leading to highly dimensional spaces.

    https://en.wikipedia.org/wiki/Sequence_space_%28evolution%29

    Nick seemed to think there was some other way to reduce the size of amino acid sequence space (the size of the search space). That’s what my post at UD was about. I’m still waiting to see if he will answer what other way there is. He seemed to think he had a knockdown argument against ID based on it. whatever.

  14. Mung: He seemed to think he had a knockdown argument against ID based on it.

    There is no knockdown argument against ID, even though there are knockdown arguments about specific claims made by IDists.

  15. petrushka: There is no knockdown argument against ID, even though there are knockdown arguments about specific claims made by IDists.

    A knockdown argument against ID would be one that demonstrated life can arise with just matter and energy, ie physicochemical processes.

  16. Allan Miller:
    Mung,

    That may be how you are reading things, but no. Definitions matter, which is why we argue about them. But what you can’t do is use a particular definition to prove something ‘out there’ in the world. Such as … ooh, I dunno, just plucking an example out of thin air here … “the genetic code is a code as per this definition, therefore … X”.

    Alan a genetic code is a code and all knowledge and experience says that codes need an intelligent agency (humans). Seeing that humans could not have designed the genetic code we infer it was some other intelligent agency. And if you can show otherwise, that mother nature can produce codes, then you could win 3.1 million dollars.

  17. Frankie: A knockdown argument against ID would be one that demonstrated life can arise with just matter and energy, ie physicochemical processes.

    Life IS physicochemical processes. A cell undergoing cell division is life being created by physicochemical processes. So there you go.

  18. Frankie: Alan a genetic code is a code and all knowledge and experience says that codes need an intelligent agency (humans). Seeing that humans could not have designed the genetic code we infer it was some other intelligent agency.

    We could also infer that the opening premise is wrong (that codes require “intelligent agency”), since there was no known intelligent agency around to design the genetic code.
    So it is possible the genetic code is a counterexample to the claim that codes need intelligent agency to originate. How do we find out which is true?

    Frankie: And if you can show otherwise, that mother nature can produce codes, then you could win 3.1 million dollars.

    Fraunhofer lines are barcodes for the elemental composition of the light emitting object.

  19. The genetic code is not a code in the ‘needs-a-designer’ sense. At least, it has not been shown to be so, merely attempted to be defined as such, which was my point.

    There is no lookup, no mapping, no sender or recipient, simply a causal chain involving n (currently 20) sequence-related aminoacyl-tRNA synthetases. If you can get from 0 to 1, you can get from 1 to 20. What would prevent natural processes from aminoacylating a single tRNA with one acid?

    Go on, insist ‘it’s a code’, again! I love the sound of endless repetition in the morning.

  20. Rumraket: Life IS physicochemical processes. A cell undergoing cell division is life being created by physicochemical processes. So there you go.

    That’s your opinion and only your opinion. So there you go

  21. Rumraket: We could also infer that the opening premise is wrong (that codes require “intelligent agency”), since there was no known intelligent agency around to design the genetic code.
    So it is possible the genetic code is a counterexample to the claim that codes need intelligent agency to originate. How do we find out which is true?

    Fraunhofer lines are barcodes for the elemental composition of the light emitting object.

    3.1 million dollars. Until it gets collected you have nothing.

  22. Frankie, I’ve moved a comment of yours to guano. Please try and post comments with some content relevant to the topic in hand.

  23. Mung: Nick seemed to think there was some other way to reduce the size of amino acid sequence space (the size of the search space). That’s what my post at UD was about. I’m still waiting to see if he will answer what other way there is. He seemed to think he had a knockdown argument against ID based on it.

    I don’t see where that occurred. Do you have a link?

    Add to your list that potentially functional protein sequences may not be sparse in sequence space and that new proteins tend to be variations on existing proteins rather than complete “shots-in-the-dark”.

  24. Frankie: That’s your opinion and only your opinion.

    So you are aware of a process or interaction taking place inside, say, E coli, when it undergoes cell division, that isn’t a physicochemical process? I’d like to hear about that one with some citations.

  25. Rumraket: So you are aware of a process or interaction taking place inside, say, E coli, when it undergoes cell division, that isn’t a physicochemical process? I’d like to hear about that one with some citations.

    Where’s your citation to support your claim?

  26. Alan Fox:
    Frankie, I’ve moved a comment of yours to guano. Please try and post comments with some content relevant to the topic in hand.

    My comment was relevant to the post I was responding to.

  27. Allan Miller:
    The genetic code is not a code in the ‘needs-a-designer’ sense. At least, it has not been shown to be so, merely attempted to be defined as such, which was my point.

    There is no lookup, no mapping, no sender or recipient, simply a causal chain involving n (currently 20) sequence-related aminoacyl-tRNA synthetases. If you can get from 0 to 1, you can get from 1 to 20. What would prevent natural processes from aminoacylating a single tRNA with one acid?

    Go on, insist ‘it’s a code’, again! I love the sound of endless repetition in the morning.

    The genetic code needs a designer as yours doesn’t have a mechanism capable of producing one and codes only come from intelligent agencies.

  28. I’ve released some comments by Frankie. That doesn’t imply any endorsement. Quite the reverse. 🙂 Ignoring might be the least time-wasting option.

  29. Alan Fox:
    I’ve released some comments by Frankie. That doesn’t imply any endorsement. Quite the reverse. Ignoring might be the least time-wasting option.

    Ignoring is all you can do as you cannot support your position

  30. Frankie: The genetic code needs a designer as yours doesn’t have a mechanism capable of producing one

    Non-sequitur fallacy. The conclusion doesn’t follow from the premise.

    Frankie: and codes only come from intelligent agencies.

    Question-begging fallacy.

  31. Rumraket: Non-sequitur fallacy. The conclusion doesn’t follow from the premise.

    Question-begging fallacy.

    LoL! Nice non-arguments. Your entire position is a question-begging fallacy. We are still waiting for your testable mechanism for producing the genetic code.

  32. Let’s see if I understand the argument.

    Codes require a designer because they can only be the product of an intelligent agency.

    The genetic code needs a designer because we don’t know how it came about.

    I’d say this is the apex of intelligent design reasoning. I don’t know of any better arguments.

  33. petrushka:
    Let’s see if I understand the argument.

    Codes require a designer because they can only be the product of an intelligent agency.

    The genetic code needs a designer because we don’t know how it came about.

    I’d say this is the apex of intelligent design reasoning. I don’t know of any better arguments.

    LoL! You don’t understand the argument.

    All known codes come from an intelligent agency
    Mother nature is incapable of producing codes
    Therefore when we observe a code and don’t know the cause it is safe to infer it was via some intelligent agency

    Science 101. And to refute that all you have to do is show that mother nature can produce codes- you will also win 3.1 million dollars.

    And BTW, that argument is better than any argument you can muster in support of evolutionism. The “apex” of evolutionism is to whine about ID.

  34. Frankie: Mother nature is incapable of producing codes

    I’m interested in how you have proven this negative. Care to share?

  35. Frankie: Therefore when we observe a code and don’t know the cause it is safe to infer it was via some intelligent agency

    You mean god, right?

  36. Frankie: That’s your opinion and only your opinion. So there you go

    It sure beats the opinion of a scientifically untrained and unknowledgeable toaster repairman.

  37. Adapa: No, this is false. Many examples have already been provided.

    That is a false statement. No one has ever provided any evidence that mother nature can produce a code. You have proven that you don’t know what a code is.

  38. Adapa: It sure beats the opinion of a scientifically untrained and unknowledgeable toaster repairman.

    I understand science better than you ever will.

  39. OMagain: I’m interested in how you have proven this negative. Care to share?

    The same way archaeologists do when determining something is an artifact. The same way SETI will determine a signal is artificial. The same way forensics determines a crime has been committed. It’s called science.

  40. Frankie: LoL! A literature bluff? Really? Is that the best that you can do?

    The known processes of binary fission are described therein. They all have a molecular and physical basis. There is no bluff here, it’s not just my opinion. It is simply a demonstrable fact.

    I recommend you read the opening paragraphs in the chapter
    “Part I Chemical and Molecular Foundations”
    and the subsections
    “1 Molecules, Cells, and Evolution”
    and perhaps most importantly:
    “2 Chemical Foundations”

    I take it you can guesstimate the subject from the titles of the chapters alone. I’d quote them at you if it wouldn’t be a total waste of space in the thread to copy-paste what you can just go and read yourself.

    I’ll leave you with this little nugget however: “The life of a cell depends on the thousands of chemical interactions and reactions exquisitely coordinated with one another in time and space under the influence of the cell’s genetic instructions and its environment. By understanding at a molecular level these interactions and reactions, we can begin to answer fundamental questions about cellular life: How does a cell extract nutrients and information from its environment? How does a cell convert the energy stored in nutrients into the work of movement or metabolism? How does a cell transform nutrients into the cellular components required for its survival? How does a cell link itself to other cells to form a tissue? How do cells communicate with one another so that a complex, efficiently functioning organism can develop and thrive? One of the goals of Molecular Cell Biology is to answer these and other questions about the structure and function of cells and organisms in terms of the properties of individual molecules and ions.”

  41. Frankie: LoL! Nice non-arguments. Your entire position is a question-begging fallacy. We are still waiting for your testable mechanism for producing the genetic code.

    I don’t care that I’m not proving to you how the genetic code originated (I don’t know how the genetic code originated in any satisfying level of detail). I don’t have to do that.
    I need merely show, as I have done, how the arguments you use to try and prove the code was designed fail to accomplish this task.
