Craig Venter has achieved celebrity status in Creationist circles for little more than a slightly embarrassed smile. In a discussion involving, among others, Richard Dawkins, Lawrence Krauss and Paul Davies, Venter makes the eyebrow-raising statement that he does not regard Mycoplasma as the same ‘life-form’ as other prokaryotes, or eukaryotes. His reasoning is that they have ‘different genetic codes’. Dawkins reasonably points out that their codes are ‘all but identical’ (they differ in just one position, Trp for STOP). Creationist videos of the exchange tend to fade on the aforementioned smile, given in response. The videos are presented, breathlessly, as “Venter denies Common Descent in front of Richard Dawkins!”. However … one difference? Is that really enough to justify a claim of separate origins? This would be like claiming that Norwegian and Swedish had separate origins on the strength of the difference between æ and ä or ø and ö.
When last I looked, there were about 18 known variant codes. More are being discovered all the time. It’s instructive to compare the differences – but first, I’ll run through the basic mechanics of protein coding from DNA.
The code comprises groups of three DNA bases – triplets. With 4 different bases, there are thus 4 × 4 × 4 = 64 different possible triplets. One DNA strand’s sequence is transcribed to ‘messenger RNA’ (mRNA) using base-pairing affinities. A pairs with T (or, in RNA, with U; T is simply a methylated U) and C with G, so the transcript of ACGT is UGCA. This is a physical interaction and not merely an informatic one – ACGT binds most strongly to its complementary sequence (T/U)GCA. Once the mRNA is formed, possibly after some post-transcriptional editing, it is passed to a ribosome for translation into protein. A pool of up to 64 ‘transfer RNA’ (tRNA) molecules is maintained by up to 20 enzymes called aminoacyl tRNA synthetases – aaRSs. The aaRSs are the main Keepers of the Code. Each different aaRS will charge one or more of the different tRNAs with a specific amino acid. The acid is attached to a specific sequence – CCA – at the 3′ end of the tRNA. At the other end lies an ‘anticodon’ – a triplet with the same sequence as the transcribed DNA triplet (with U in place of T), which therefore has the strongest binding affinity for the complementary sequence present in the mRNA – the ‘codon’. Charged tRNAs are like paint brushes with a dab of colour on one end and a unique shape on the other. Given a template to line them up and dock the appropriate shapes, dabbing the other ends on a growing string, the same pattern can be produced repeatedly and mechanically.
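The triplet count and the pairing rule are easy to check mechanically. What follows is a minimal Python sketch, not a model of real transcription: the names `BASES`, `PAIRING` and `transcribe` are illustrative, and, like the text above, it ignores strand direction.

```python
from itertools import product

BASES = "ACGT"

# Every ordered triplet of the four DNA bases.
TRIPLETS = ["".join(t) for t in product(BASES, repeat=3)]

# DNA template base -> the RNA base that pairs with it.
PAIRING = {"A": "U", "C": "G", "G": "C", "T": "A"}


def transcribe(template: str) -> str:
    """Return the RNA sequence that base-pairs with a DNA template."""
    return "".join(PAIRING[base] for base in template)


print(len(TRIPLETS))       # 64
print(transcribe("ACGT"))  # UGCA
```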
Translation commences at a ‘Start’ codon – typically, though not universally, AUG, which corresponds to the amino acid methionine. Given that binding is strongest between a codon and its anticodon, each exposed mRNA triplet has the greatest affinity for just one tRNA. The relevant tRNA docks and the peptide is elongated with the amino acid at its other end. Translation ceases when no tRNA can be found corresponding to a given codon. Such codons – those without tRNAs – are termed STOP codons. Typically, termination is more than a simple passive mechanism: the growing peptide does not merely ‘fall off’ the ribosome for want of an acid; instead, release factors are triggered, which break the bond between the synthesised peptide chain and the tRNA to which the last acid was attached.
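The start-read-stop logic can be sketched in a few lines of Python. This is a toy illustration only: it uses the ‘standard’ code written in the usual one-letter amino acid notation (with `*` for STOP), always starts at the first AUG, and ignores release factors and everything else a ribosome does. The function name `translate` and the example mRNA are made up for the purpose.

```python
BASES = "UCAG"
# One-letter amino acids for the 64 codons in UCAG order; '*' marks STOP.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
STANDARD = dict(zip(CODONS, AMINO))


def translate(mrna: str, table: dict = STANDARD) -> str:
    """Translate from the first AUG until a STOP codon or the end of the mRNA."""
    start = mrna.find("AUG")
    if start < 0:
        return ""  # no Start codon, so nothing is made
    peptide = []
    for i in range(start, len(mrna) - 2, 3):
        residue = table[mrna[i:i + 3]]
        if residue == "*":  # no tRNA for this codon: translation ends
            break
        peptide.append(residue)
    return "".join(peptide)


# Hypothetical mRNA: two leading bases, then AUG UGG UUU UGA GGG.
print(translate("GCAUGUGGUUUUGAGGG"))  # MWF
```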
The mechanism described above is universal – all organisms on earth synthesise their proteins in the same way. That alone should cause Venter to hesitate in his dubious assertions regarding ‘life forms’ of separate origin. If everything on earth shares the same system, that hardly indicates separate origins, however much the details differ. Venter argues elsewhere that sequence data starts to lose sight of a simple tree in deep time, and this is true to some extent, due to such factors as horizontal gene transfer – genes passed among organisms rather than inherited from common ancestors. Yet the very fact that gene transfer is even possible indicates that transfers are between relatives. The kind of things that might more securely indicate separate origins are, for example, L-sugars giving oppositely-coiled DNA, or D-amino acids, or acids other than the canonical 20, or base pairs other than the standard 2. What one would not expect is to be able to pass a ‘well-formed string’ from one organism to another of completely separate origin, yet have an equally well-formed string emerge after processing in a truly foreign system. This is like Jeff Goldblum uploading a virus to the alien computer from an Apple Mac in Independence Day – pure hokum.
Another aspect to consider is the structure of the aaRSs that determine the code. The 20 enzymes divide neatly, ten apiece, between Class I and Class II, based on their reaction chemistry. Class I enzymes attach the amino acid to the 2′-OH of the substrate; all but one of the Class II enzymes attach it to the 3′-OH. Either way, the acid migrates to the 3′ position before the ribosome gets hold of it, so there is no effective difference. Every enzyme in each class bears a sequence relationship to all other enzymes in that class – but there is no such relationship between the classes. It appears from these data that aaRSs arose at least twice, and also that the modern code of 20 acids most likely arose from a much smaller set. All of this activity would have to have preceded LUCA, the last universal common ancestor, because the various quirks noted above are common to all life. How can that be? OK, one might say ‘design’ – a catch-all that can be invoked to explain all data – but that’s not what Venter thinks, so it’s not clear what explanation he would have for that massive commonality of mechanism and structure, if not genetic common origin.
But of course – as Dawkins pointed out – it’s not merely the common details of mechanism, but the code itself that speaks loudly of a single surviving origin. The commonest code is termed the ‘standard’ code. This is used in most eukaryotic nuclei, most bacteria and archaea, and plant plastids. On the principle of parsimony, this is most likely the code used by LUCA itself. It has 3 STOP codons – UAA, UAG, UGA – and usually – not always – starts with methionine. In Mycoplasma, UGA codes for tryptophan, which is coded solely by its close neighbour UGG in the ‘standard’ code. This is actually a very common substitution – the codes of nearly all mitochondria do exactly the same at this position, for example. Thus, Venter is effectively saying that our own mitochondria are not the same ‘life-form’ as our nuclei – a perverse notion, since many mitochondrial genes have migrated to the nucleus, and are synthesised there before re-export. He is saying, by implication, that the difficulties we would have synthesising Mycoplasma proteins, or vice versa, preclude gene migration from the mitochondrion, unless that migration preceded the amendment to the mitochondrial code. That’s a strong claim, even if made unknowingly.
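To see how small the difference is, one can build both tables and list the codons on which they disagree. The Python sketch below derives the Mycoplasma table from the ‘standard’ one by the single reassignment described above; it compares amino acid assignments only and ignores differences in which codons are annotated as Starts. The variable names are illustrative.

```python
BASES = "UCAG"
# One-letter amino acids for the 64 codons in UCAG order; '*' marks STOP.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
STANDARD = dict(zip(CODONS, AMINO))

# The Mycoplasma / mitochondrial variant: UGA reads as tryptophan (W).
MYCOPLASMA = dict(STANDARD)
MYCOPLASMA["UGA"] = "W"

# (codon, standard meaning, variant meaning) wherever the tables disagree.
differences = [
    (codon, STANDARD[codon], MYCOPLASMA[codon])
    for codon in CODONS
    if STANDARD[codon] != MYCOPLASMA[codon]
]

print(differences)                                # [('UGA', '*', 'W')]
print(len(CODONS) - len(differences), "of 64 codons agree")
```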
But how can the genetic code change – how could we possibly have common descent of all codes? There are 3 principal possibilities for a code change:
- Substitution of a STOP by an acid
- Acid-for-acid substitution
- Substitution of an acid by a STOP
I have ranked them in order of increasing assumed constraint against them.
- If a STOP is substituted by an acid, it will have the effect of adding a small ‘tail’ to those proteins having that STOP, leaving the core untouched, which is less likely to be damaging.
- If one acid is substituted by another, this may well affect the core of many proteins, which will have more impact, but the effect is mitigated if the codon is of low usage, or the substitution is for one of similar chemical property.
- The third is probably rare – it would have the effect of chopping proteins into short segments, depending on the stochastic occurrence of the relevant codon. Nonetheless, in short genomes with biases against certain codons or base pairs, it may occur with some frequency.
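The three cases can be compared side by side on a toy gene. The Python sketch below applies one reassignment of each kind to the ‘standard’ code and translates the same made-up mRNA each time. The mRNA, the choice of AAA as the codon turned into a STOP, and the helper names `reassign` and `translate` are all hypothetical, chosen only to make the effect visible.

```python
BASES = "UCAG"
# One-letter amino acids for the 64 codons in UCAG order; '*' marks STOP.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
STANDARD = dict(zip(CODONS, AMINO))


def reassign(table: dict, codon: str, meaning: str) -> dict:
    """Return a copy of a code table with one codon given a new meaning."""
    changed = dict(table)
    changed[codon] = meaning
    return changed


def translate(mrna: str, table: dict) -> str:
    """Translate from the first AUG until a STOP codon or the end of the mRNA."""
    start = mrna.find("AUG")
    if start < 0:
        return ""
    peptide = []
    for i in range(start, len(mrna) - 2, 3):
        residue = table[mrna[i:i + 3]]
        if residue == "*":
            break
        peptide.append(residue)
    return "".join(peptide)


# Hypothetical gene: five sense codons, a UGA stop, then untranslated sequence.
MRNA = "".join(["AUG", "GCU", "CUU", "AAA", "GGC", "UGA", "CUU", "GGC", "UAA"])

baseline = translate(MRNA, STANDARD)                            # MALKG
stop_to_acid = translate(MRNA, reassign(STANDARD, "UGA", "W"))  # MALKGWLG
acid_to_acid = translate(MRNA, reassign(STANDARD, "CUU", "T"))  # MATKG
acid_to_stop = translate(MRNA, reassign(STANDARD, "AAA", "*"))  # MAL

for label, protein in [("baseline", baseline), ("STOP -> acid", stop_to_acid),
                       ("acid -> acid", acid_to_acid), ("acid -> STOP", acid_to_stop)]:
    print(f"{label:13s} {protein}")
```

In this toy case the first change leaves the original protein intact and appends a short tail, the second alters a residue in the core, and the third truncates the product.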
Now, the Mycoplasma/mitochondrial variant is of the first type – the assumed ancestral code has apparently been amended by adding UGA to the codons served by tryptophanyl tRNA synthetase (an aaRS): UGG alone in our nuclei, but UGA/UGG in mitochondria. A and G are both purines, by contrast with the pyrimidines U, C and T. Purines have a double-ring structure and pyrimidines a single ring, so the two classes present quite different profiles to enzymes, and are more easily told apart than the separate bases within each class. In this case, the change would require a loss of distinction between the purines, rather than a gain of specificity. Nonetheless, it is very common within the code to see a 3rd-position distinction made simply on the basis of whether it is a purine or a pyrimidine, rather than at the level of individual base, so there is nothing exceptional about this particular substitution.
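The 3rd-position pattern can be tallied directly from the ‘standard’ table. The Python sketch below groups the 64 codons into 16 ‘boxes’ sharing their first two bases, then asks of each box whether the third base makes no difference, splits the box cleanly into purine and pyrimidine halves, or discriminates between individual bases. The three category labels and the function name `classify` are my own, not standard terminology.

```python
from collections import Counter

BASES = "UCAG"
# One-letter amino acids for the 64 codons in UCAG order; '*' marks STOP.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
STANDARD = dict(zip(CODONS, AMINO))


def classify(box: str, table: dict = STANDARD) -> str:
    """Classify how the third base is read within one two-base 'box'."""
    pyrimidine_end = {table[box + "U"], table[box + "C"]}
    purine_end = {table[box + "A"], table[box + "G"]}
    if len(pyrimidine_end | purine_end) == 1:
        return "unsplit"                # third base is irrelevant
    if len(pyrimidine_end) == 1 and len(purine_end) == 1:
        return "purine/pyrimidine"      # only the base class matters
    return "base-specific"              # an individual base is singled out


boxes = [a + b for a in BASES for b in BASES]
tally = Counter(classify(box) for box in boxes)
print(dict(tally))
print([box for box in boxes if classify(box) == "base-specific"])  # ['UG', 'AU']
```

Under this classification, 8 boxes are unsplit, 6 split purely by purine versus pyrimidine, and only 2 (UGx and AUx) single out an individual base. UGx is exactly the box that the Mycoplasma/mitochondrial variant turns into a plain purine/pyrimidine split.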
The supposition that filling in of a STOP is the likeliest amendment is borne out by analysis of the genetic codes of extant organisms. It appears that the wholesale replacement of one acid by another has hardly occurred at all since the common ancestor; almost all variant sites function as STOP in one or more variant codes. Of 13 codons that vary in one species or another, 7 are a STOP in at least one code. If we ignore the 4 codons restricted to yeast mitochondria, in which the entire CUx group has seen a substitution of threonine for leucine, it becomes 7 out of 9, which is striking. There is a plausible mechanistic reason for this, relating to the relatively mild effect of this particular substitution on existing proteins.
This is possibly the means by which the code itself arose – initially only a small amount of the 64-codon matrix was covered by assignments, as suggested by the apparent coalescence of the 2×10 Class I and Class II aaRS enzymes upon fewer ancestral enzymes. In such a system, consisting of ‘mostly STOPs’, the extra tail added by assignment of a STOP would be quite short – as STOPs become fewer, the length of the average tail will increase, but in early evolution the constraint against this substitution would be less severe. Gradually, as the codon matrix becomes filled in and proteins become more widely used and longer, the code ‘freezes’.
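A back-of-envelope model puts rough numbers on the tail. Assume, purely for illustration, that the codons downstream of a newly assigned STOP are drawn independently and uniformly from all 64 triplets. Read-through then continues until one of the remaining STOPs turns up, so the tail length (counting the reassigned codon itself) is geometrically distributed with mean 64 / (remaining STOPs). The function name `expected_tail` and the STOP counts tried below are hypothetical.

```python
from fractions import Fraction


def expected_tail(stops_before: int, n_codons: int = 64) -> Fraction:
    """Mean tail length, in residues, after one STOP is given to an acid.

    Assumes downstream codons are uniform and independent (an idealisation).
    """
    remaining = stops_before - 1
    if remaining < 1:
        raise ValueError("at least one STOP must remain to end translation")
    return Fraction(n_codons, remaining)


# From a 'mostly STOPs' matrix down to the 3 STOPs of the standard code.
for stops in (48, 33, 17, 5, 3):
    print(f"{stops:2d} STOPs before the change -> mean tail "
          f"{float(expected_tail(stops)):.1f} residues")
```

On these assumptions the tail averages under 2 residues while half the matrix is still unassigned, but 32 residues once only the three modern STOPs remain.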
Any acid-for-acid substitutions that divided a codon group, either due to purine/pyrimidine specificity, or individual base distinction, would tend to favour chemically conservative substitutions. This would generate the much-vaunted fault tolerance of the code – translation errors frequently result in a viable product due to chemically related neighbourhoods in the code. On this model, there is no need to use design or positive selection to achieve favourable arrangements; they are a by-product of the constraint on wholesale substitution.
Gradual filling in of STOPs would thus have led to both a richer code, biased towards construction of chemically conservative neighbourhoods, and incidentally to a gradual lengthening of proteins, amending both the v and the n of the assumed v^n ‘search space’ in which some imagine proteins must arise fully formed in a single bound.
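To put rough numbers on that search space: v is the number of distinct acids the code can specify and n is the length of the protein in residues. The values in the Python sketch below are arbitrary illustrations, not estimates of any real ancestral system.

```python
def search_space(v: int, n: int) -> int:
    """Number of distinct sequences of length n over an alphabet of v acids."""
    return v ** n


early = search_space(4, 10)      # hypothetical early code: 4 acids, 10 residues
modern = search_space(20, 100)   # 20 acids, a hypothetical 100-residue protein

print(early)             # 1048576
print(len(str(modern)))  # 131 (digits)
```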
And so, having reached more-or-less its present form in the population of which LUCA formed a part, descendants such as Mycoplasma and us sprang forth – with minor variations.
Edit to add – my spreadsheet of the up-to-date codes table. A code 32 was added just in the last day or so – although, pace J-mac, there are not 32 codes, but around 25, due to mergers. It is not always a straightforward issue to determine whether a code should be considered truly ‘different’ or not. For example, the GUG start codon is occasionally used in our own cells, but comes up as a difference when used in other codes because it is not annotated as part of the ‘standard’ code. And ciliates are just plain bizarre!
The sheet is colour coded to highlight the differences – red for assignment variants from the standard code, and yellow for Start/Stop differences.