The extent of variation present in human populations and the consequences of genetic load seem to be topics of perennial interest here (see, for example, recent comments in the Evolution Visualized thread). Recent issues of Nature have published a flurry of papers aimed at getting a better handle on just how much genetic diversity is likely to exist among humans. One notable paper from last August is the following:
Analysis of protein-coding genetic variation in 60,706 humans
In this study, Monkol Lek and many, many colleagues sequenced the exomes–i.e., the portion of DNA sequences that code for proteins along with some accompanying untranslated regions–of more than 60,000 people. The results were pretty spectacular. The paper is incredibly dense, but here are some highlights:
- The authors found more than 7 million reliably identified variants. Most were single base pair substitutions, but the variants also include more than 300,000 insertions/deletions.
- On average, 1 out of every 8 nucleotides is variable. However, the overwhelming majority of variants are rare; that is, they are found in only one or a few individuals.
- The frequency of different kinds of variants reflects both the rate at which they arise and the extent to which they are likely to be deleterious. This is not at all surprising, but it’s neatly demonstrated. For example, 63.1% of all possible CpG transitions (i.e., a cytosine adjacent to a guanine that mutates to a thymine) were observed, while only 3% of possible transversions were present. CpG transitions are among the most common types of substitution in mammals, while transversions are less frequent. Likewise, the proportion of possible synonymous variants that were actually observed was much higher than the proportion of possible nonsynonymous variants that were observed, which is consistent with the generally accepted notion that nonsynonymous mutations are usually subject to stronger purifying selection than synonymous mutations.
- They identified almost 180,000 different protein-truncating variants (PTVs), which are variants predicted to shorten the product of a protein-coding gene by introducing a stop codon, causing a frameshift, or removing a critical splice site. Amazingly (to me at least), the average individual in their dataset carries 85 PTVs in the heterozygous state and almost 35 PTVs in the homozygous state.
- They identified more than 100 variants previously thought to contribute to disease phenotypes that are present at anomalously high frequencies in human populations (> 1%). Because the evidence of pathogenicity for most of these variants is actually extremely weak, the authors suggest that these variants are most likely benign.
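To get a feel for the scale of these numbers, here is a rough back-of-envelope sketch in Python. It uses only the rounded figures quoted in the list above, so the results are approximations rather than values taken from the paper. The variable names are my own, and the assumption of one variant per variable site is a simplification, since some sites carry more than one variant.

```python
# Back-of-envelope arithmetic using only the rounded figures quoted above.
# These are illustrative approximations, not numbers reported by the authors.

variants_observed = 7_000_000   # "more than 7 million" variants (a lower bound)
bases_per_variant = 8           # "1 out of every 8 nucleotides is variable"

# Assumption: one variant per variable site (ignores multi-allelic sites),
# so this is only a rough estimate of how much sequence was surveyed.
implied_bases = variants_observed * bases_per_variant

# How much more completely were possible CpG transitions sampled
# than possible transversions?
cpg_pct_observed = 63.1
transversion_pct_observed = 3.0
fold_difference = cpg_pct_observed / transversion_pct_observed

# PTVs per person: heterozygous sites carry one truncated copy,
# homozygous sites carry two.
het_ptvs = 85
hom_ptvs = 35                   # "almost 35", rounded up here
ptv_sites = het_ptvs + hom_ptvs
ptv_allele_copies = het_ptvs + 2 * hom_ptvs

print(f"Implied sequence surveyed: ~{implied_bases / 1e6:.0f} million bases")
print(f"CpG transitions vs. transversions: ~{fold_difference:.0f}-fold")
print(f"PTV sites per person: ~{ptv_sites}")
print(f"Truncated allele copies per person: ~{ptv_allele_copies}")
```

The fold difference between CpG transitions and transversions is the kind of gap you would expect if mutation rate strongly shapes which variants turn up in a sample of this size.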
There is a lot more in the paper that’s worth chewing over, so give it a read. This is easily the largest dataset of its type ever generated, but it has limitations. The sampling is heavily biased toward Europeans, and there is likely some variation missing, especially from the Middle East and Central Asia.
I imagine that within a few years, we’ll have datasets of similar size consisting of high-coverage, whole genome sequences, which will no doubt show even larger amounts of genomic variation. It’s an exciting time to be interested in biology!