I see that Mung, a long-time commenter at Uncommon Descent, asks in a thread entitled "Backwards eye wiring? Lee Spetner comments":
How do you calculate the size of amino acid sequence space?
As this seems somewhat off-topic there, I thought I’d attempt to answer Mung’s question. I’ll try to be brief. The two most fascinating classes of biochemicals are nucleic acids (RNA and DNA) and proteins. Proteins are ubiquitous in cellular systems; they function as catalysts (enzymes), structural elements (keratin, collagen), signal molecules (hormones, pheromones) and binding agents (antibodies). Proteins are linear sequences of amino acids joined by peptide bonds, each formed in a condensation reaction (so called because a molecule of water is lost). There are twenty-one amino acids found in eukaryotes, and twenty of them are directly represented in the genetic code. The special case is selenocysteine, which is coded indirectly, and I’ll leave that out of the calculation for the sake of simplicity.
So what number of different amino acid sequences could theoretically exist, given twenty possibilities for each aa in the polymer? I guess we shouldn’t count the twenty monomers. For dimers, there are 400 possibilities. For trimers, we have 8,000, and so on. The general formula for the number of theoretically possible different protein sequences of length $n$ is $20^n$. So the answer for all possible sequences is the sum of this calculation from $n = 2$ to, well, what? There are some very large proteins, titin being the largest known at around 30,000 aa’s. So I guess we should sum at least to that number.
This is a very big number indeed! I leave it as an exercise for the reader to try representing the number that results when taking the upper limit of $n$ as 30,000. 🙂
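For anyone who wants to attempt the exercise by machine, here is a minimal Python sketch. It assumes the count for length n is 20^n and that the sum runs from dimers (n = 2) up to a chosen maximum length. The function names are my own, purely for illustration. The closed form is just the usual geometric-series shortcut, (20^(N+1) - 20^2) / 19, which should give the same integer as adding the terms one by one.

```python
import math

AMINO_ACIDS = 20  # the twenty directly coded amino acids (selenocysteine left out)


def sequence_space(max_length, min_length=2, alphabet=AMINO_ACIDS):
    """Add up alphabet**n for every length n from min_length to max_length."""
    return sum(alphabet ** n for n in range(min_length, max_length + 1))


def sequence_space_closed_form(max_length, min_length=2, alphabet=AMINO_ACIDS):
    """Same total via the geometric-series formula (the division is exact)."""
    return (alphabet ** (max_length + 1) - alphabet ** min_length) // (alphabet - 1)


def decimal_digits(value):
    """Count decimal digits of a positive integer using log10.

    This avoids converting a huge integer to a string, which recent Python
    versions refuse to do by default. It can be off by one for values
    extremely close to an exact power of ten.
    """
    return math.floor(math.log10(value)) + 1


if __name__ == "__main__":
    # Dimers plus trimers only: 400 + 8,000
    print(sequence_space(3))

    # Up to titin-sized proteins: far too long to print, so report its size
    total = sequence_space_closed_form(30_000)
    print(decimal_digits(total), "decimal digits")
```

Run as written, the second figure comes out at about 39,000 decimal digits, so "representing the number" in full would take a good many pages.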
Now I’ve answered Mung’s question, would he like to enlarge on what it signifies?
ETA categories and remove tautology
ETA 2 correction not (hat tip Joe Felsenstein)