The rise of genetic architecture

In science, as in most things, one prefers simple over complex whenever possible. You keep adding variables until the explanatory juice starts hitting diminishing marginal returns. So cystic fibrosis is due to a mutation at one gene, and the disease expresses recessively at that locus. The reality is that one mutation accounts for ~65-70% of cystic fibrosis cases around the world, and there are nearly 1,400 known mutations on the CFTR locus. How about skin color? Mutations on a dozen genes can probably explain ~90% of the variance in the trait value across the world between populations. In fact, one single mutation on one base pair can explain ~30-40% of the trait value difference between Europeans and Africans. This is a more complex story than cystic fibrosis; you have not just many mutations, but many mutations across many genes. But the number of genes and mutations is manageable. You can keep track of most of them in your head (e.g., I can tell you that SLC24A5, SLC45A2, KITLG, and HERC2 can explain most of the trait value difference between Africans and Europeans without looking it up).

Now think about something like height. The only gene I can think of off the top of my head is HMGA2. With obesity I know FTO. The reason is that there's a veritable alphabet soup of genes which pop out of the numerous studies focusing on these traits. But the reality is that it seems possible that there are many genes which harbor variants of small effect size which in totality account for the range of the trait value. Abstractly this isn't really that much more complex than the models above. You can imagine it as a concrete instantiation of the central limit theorem. But in practice it does change things when you can't focus on one gene, or a few genes, but have to understand that there exists a huge class of genetic causes which modulate the expression of the phenotype. We've reached a stage where the mapping from genotype to phenotype is getting a bit on the baroque side. We have come to confront and wrestle with 'genetic architecture.' Here's what Wikipedia says about this term:

Genetic architecture refers to the underlying genetic basis of a phenotypic trait. A synonymous term is the 'genotype-phenotype map', the way that genotypes map to the phenotypes. The genotype-phenotype map has been analyzed in terms of several principal axes: epistasis, polygeny, pleiotropy, quasi-continuity, modularity, phenotypic plasticity, robustness, and evolvability.
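
To make the central limit theorem picture above concrete, here is a minimal sketch of a purely additive polygenic trait. Everything in it is an assumption chosen for illustration: the number of loci, the allele frequency range, the Gaussian effect sizes, and the independence of loci (no linkage, no epistasis, no environment). It is a toy, not a model of height or any real trait.

```python
import random
import statistics


def simulate_trait(n_individuals, n_loci, seed=0):
    """Sum small additive allelic effects into one trait value per individual."""
    rng = random.Random(seed)
    # Hypothetical architecture: one allele frequency and one effect per locus.
    freqs = [rng.uniform(0.1, 0.9) for _ in range(n_loci)]
    effects = [rng.gauss(0.0, 1.0) for _ in range(n_loci)]
    traits = []
    for _ in range(n_individuals):
        value = 0.0
        for p, a in zip(freqs, effects):
            # Diploid genotype: number of copies of the effect allele (0, 1, or 2).
            copies = (rng.random() < p) + (rng.random() < p)
            value += copies * a
        traits.append(value)
    return traits


def skewness(xs):
    """Sample skewness; close to 0 for a bell-shaped distribution."""
    mu = statistics.fmean(xs)
    sd = statistics.pstdev(xs)
    return sum(((x - mu) / sd) ** 3 for x in xs) / len(xs)


def excess_kurtosis(xs):
    """Sample excess kurtosis; close to 0 for a normal distribution."""
    mu = statistics.fmean(xs)
    sd = statistics.pstdev(xs)
    return sum(((x - mu) / sd) ** 4 for x in xs) / len(xs) - 3.0


if __name__ == "__main__":
    for n_loci in (1, 10, 200):
        traits = simulate_trait(2000, n_loci, seed=1)
        print(n_loci, "loci:", len(set(traits)), "distinct trait values")
```

With one locus the trait can take at most three values, which is the Mendelian picture. With a couple of hundred loci the distribution is close to continuous, and `skewness` and `excess_kurtosis` both land near zero, which is the bell curve you expect from summing many small independent effects.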

And it gets more complicated. Epistasis comes in different flavors. As for the polygenic traits, they also exhibit differences. Pigmentation seems to be a trait where there really are common variants of very large effect. In contrast, for height, obesity, schizophrenia, and I.Q., no one has found them yet if they exist. So polygeny itself has many shades. Combine pleiotropy, the effect of one gene on multiple traits, with polygeny and epistasis, and the tangle of abstraction gets intractable very quickly. This is why the arguments about synthetic associations can be difficult to unpack. Not only do you have the old problems with complex genetic architectures, but you also have to keep track of concepts such as linkage disequilibrium as well as a model of the physical embodiment of genetic information in the chromosome. Alas, we're way past the "spherical cow" phase of simplifying for purposes of intelligibility.

So why does this matter? It's about the "missing heritability". We know that height is about ~80-90% heritable in developed societies. If you are adopted your height is going to correlate with your birth parents, not your adoptive parents. But very little of the variance in height can be accounted for by genes detected in linkage or genome-wide association studies (GWAS). Neither of these techniques has the power to pick out thousands of alleles of small effect. Linkage is good at detecting rare large effect variants (usually in families), while GWAS picks up more modest effect but common variants (usually in study samples of the same ethnicity). Unfortunately GWAS hasn't been that effective in accounting for much of the variation which we see around us. Old fashioned quantitative genetics using statistical techniques based on family relationships is still a better bet for many traits and diseases (e.g., I have a family history of type 2 diabetes, but 23andMe gives me no elevated risk).

A group last year suggested a solution to the conundrum of why GWAS wasn't picking up most of the genetic variation: synthetic associations. Let me jump to their author summary:

It has long been assumed that common genetic variants of modest effect make an important contribution to common human diseases, such as most forms of cardiovascular disease, asthma, and neuropsychiatric disease. Genome-wide scans evaluating the role of common variation have now been completed for all common disease using technology that claims to capture greater than 90% of common variants in major human populations. Surprisingly, the proportion of variation explained by common variation appears to be very modest, and moreover, there are very few examples of the actual variant being identified. At the same time, rare variants have been found with very large effects. Now it is demonstrated in a simulation study that even those signals that have been detected for common variants could, in principle, come from the effect of rare ones. This has important implications for our understanding of the genetic architecture of human disease and in the design of future studies to detect causal genetic variants.

To understand the logic, you need to recall that the SNP which is reported in a GWAS may not be the causal variant. In other words, the SNP is just a marker near the real genetic cause, correlated with it closely enough that one can stand in for the other when predicting trait value. This has cropped up as a major issue with the genetics of blue eyes. This is a 'quasi-Mendelian' trait. It looks like most of the variation in Europeans is due to differences in the genomic region spanning the nearby genes HERC2 and OCA2, but different studies report different SNPs and haplotypes as diagnostic. It is unlikely that all of these markers are causal, so most of them are just strongly correlated with the true functional variant. Because of recombination, where chromosomal regions cross over and swap partners, these sorts of associations break down over time. So linkage disequilibrium, where genetic variants (alleles) across loci (genes) exhibit non-random statistical associations, varies over time as the correlations decay due to recombination. Synthetic associations are hypothesized to be cases where very low frequency large effect variants are associated with a more common variant, the latter of which shows up in a GWAS as the signal associated with the trait. Because the correlation between the causal variant and the more common variant is going to be imperfect, the common marker will explain only a small proportion of the variance (if allele 1 at locus A has frequency ~0.001 and allele 1 at locus B has frequency ~0.20, their association has to be less than 1 because the latter so outnumbers the former in terms of copies). Additionally, there may also be several low frequency causal variants associated with the common marker. In other words, the missing heritability isn't very missing at all.
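
The frequency argument in the parenthetical above can be worked through with the standard r² measure of linkage disequilibrium. The sketch below is only an illustration: the function names are mine, and it assumes the most favorable case for the marker, where every haplotype carrying the rare allele also carries the common one.

```python
def r_squared(p_ab, p_a, p_b):
    """Squared correlation between alleles at two loci.

    p_ab is the frequency of haplotypes carrying both alleles;
    p_a and p_b are the individual allele frequencies.
    """
    d = p_ab - p_a * p_b  # linkage disequilibrium coefficient D
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))


def max_r_squared(p_rare, p_common):
    """Upper bound on r^2 between a rare allele and a more common one."""
    if not 0 < p_rare <= p_common < 1:
        raise ValueError("expected 0 < p_rare <= p_common < 1")
    # Best case (an assumption): the rare allele always sits on a haplotype
    # that also carries the common allele, so the joint frequency is p_rare.
    return r_squared(p_rare, p_rare, p_common)


if __name__ == "__main__":
    print(max_r_squared(0.001, 0.20))
```

For frequencies of 0.001 and 0.20 the bound comes out at roughly 0.004. Under a simple additive model the marker captures at most that fraction of the variance attributable to the causal variant, which is the dampening described here. When the two frequencies match, the bound rises to 1.
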
The GWAS are picking up genuine signals, only dampened because of the imperfect correlations between the high frequency marker and the low frequency causal variant. This has practical implications:

...The distance over which synthetic associations occur also offers an alternative explanation to the increasingly common observation of rare variants that occur within the vicinity of a GWAS signal but cannot explain that signal entirely. A simple explanation for such observations is that extending the sequencing to at least 4 Mb and ideally up to 10 Mb around the GWAS signal would pick up other rare variants. In some cases, identifying all the contributing rare variants may explain all of the original signal, whereas in other cases, there could be a combination of rare and common variants contributing. In addition, if synthetic associations are responsible for many of the observed signals, then sequencing in a small number of control samples (even over a much broader genomic region) is also unlikely to succeed. Under our model, the causal sites are both rare and relatively high-penetrant contributors to disease, and will therefore be unlikely to be detected in a small number of control samples. Finally, the focus of attention on genes that are near GWAS signals may be incomplete or misleading in that the actual causal sites may occur in many different genes surrounding the implicated common variant. It is also worth emphasizing that as few as one or two rare variants, at much lower frequency than the associated common SNP, can create a significant synthetic association. In such a case, sequencing a small number of cases that carry the “at risk” common variant might miss entirely the causal rare variants even if the correct genome region is resequenced. These considerations argue for caution in efforts to resequence around genome-wide associations and argue instead that genome-wide sequencing in carefully phenotyped cohorts might be a better use of resources.

One of the papers rebutting the one above (Rare Variants Create Synthetic Genome-Wide Associations) will be covered at Genomes Unzipped. So let's look at the other one, Synthetic Associations Created by Rare Variants Do Not Explain Most GWAS Results. Frankly I found the paper hard going. The basic units of each section are intelligible, but recalling them as a coherent whole is not as easy. Part of the reason is that they take the simulations of the Dickson et al. paper, and raise them one. And simulations are to some extent "black boxes," at least unless you replicate them and get a feel for how modulating the parameters tweaks the outcomes.

First they explored how varying the number of rare causal variants associated with a common associated SNP would affect the distribution of frequencies of the latter, and how those frequencies compared to the empirical distribution detected. What's interesting here are panels A, D, E, and F. The first just shows the distribution of frequencies of detected SNPs in GWAS. They go from 0 to 1. D, E, and F simply show you the expected frequencies of the allele associated with the rare causal variants for k = 1, 9, and 18 causal variants, respectively. What you see is that for synthetic associations the distribution of variants associated with the rare causal SNPs should skew toward the lower end. Also, they found that irrespective of k the associated SNP only explained ~10% of the trait variance. Finally, they also suggested that the effect size of the rare variants would have to be very large indeed for the GWAS to pick up the associated SNP. This is a problem since there's only so much variance to go around. And it raises the question: if the variants are of such large effect, why didn't linkage studies pick any of them up? Speaking of large effect, once you start adding up k variants at a locus you begin to narrow the regions of the genome in which causal variants can concentrate. The authors indicate that such clustering within the genome is simply not found, another argument against numerous synthetic associations.
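
To see why rare variants need very large effects, and why there is only so much variance to go around, it helps to recall the textbook expression for the variance contributed by a single additive locus under Hardy-Weinberg proportions: 2p(1-p)a², where p is the allele frequency and a is the per-allele effect. The sketch below assumes a trait standardized to unit variance, so a is in standard deviations. The 1% target and the two frequencies are illustrative numbers of my own, not values from the paper.

```python
import math


def variance_explained(p, a):
    """Variance contributed by one additive biallelic locus.

    p: allele frequency; a: per-allele effect in trait standard deviations.
    Assumes Hardy-Weinberg proportions and a trait with total variance 1.
    """
    return 2 * p * (1 - p) * a * a


def effect_needed(p, fraction):
    """Per-allele effect (in SD) needed to explain `fraction` of the variance."""
    return math.sqrt(fraction / (2 * p * (1 - p)))


if __name__ == "__main__":
    # Hypothetical comparison: a rare and a common allele, each asked to
    # account for 1% of the trait variance.
    for p in (0.001, 0.20):
        print(p, round(effect_needed(p, 0.01), 3))
```

An allele at frequency 0.20 needs an effect of only about 0.18 standard deviations to explain 1% of the variance, while one at frequency 0.001 needs about 2.2 standard deviations. Effects of that size are what you would expect linkage studies in families to have caught.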

Next they looked at results from schizophrenia research, and attempted to see how they mapped onto the predictions entailed by a synthetic association model. The top panel shows the observed data. Not quite a uniform distribution, but there are rare variants, and common variants, and variants in the mid-range frequency. The bottom panel shows simulated results using the synthetic model. As expected you see a skew toward rare alleles, and a deviation from what is observed. Additionally, they note that they ran the simulations with a lot of different parameters, and those that included common variant alleles always tended to have a better fit with the realized results than the synthetic model predicated on rare alleles of large effect size. The short of it is that the authors conclude that the model outlined last year simply does not fit the empirical results very well. They do not deny the existence of possible synthetic associations, but they seem to suggest that this variety of association is not that important in explaining the missing heritability. Additionally, they note that rare alleles of large effect should not span populations, since they are likely to be evolutionarily novel. But recent work in fact suggests that risk alleles in one population are highly portable to another population. So genetic architecture may not matter as much as we suspected when it comes to inter-population differences. Why is this important? Money and time, which are both finite:

Empirical observation suggests that much of the missing heritability is contributed by causal variants (including loci comprising multiple rare variants) having effect size too small to be detected with stringent statistical significance...Larger samples for GWAS are needed to detect these which would directly compete with research funds used in sequencing studies. ...Genes identified through GWAS harbouring common variants are likely to be good targets for identification of rare variants and for sorting the wheat from the chaff in next generation sequencing studies. We expect that continued GWAS will make valuable contributions to our understanding of many complex traits and will, for some time, remain as one important tool in a growing set of technologies to probe the full spectrum of genetic variation efficiently.

At the end of the day I'm interested in evolution. But to understand evolution you need to understand the genetic architecture of the traits which are the targets of natural selection. I've only skimmed the paper, so I really recommend you read the original for the "blood & guts." Actually, read it a few times! Also, please see David Goldstein's response. I felt he was rather cordial, given the rather forceful tone of the two papers which challenged the one that came out of his lab. Citation:

Wray NR, Purcell SM, & Visscher PM (2011). Synthetic Associations Created by Rare Variants Do Not Explain Most GWAS Results. PLoS Biology. doi:10.1371/journal.pbio.1000579

Image Credit: Sailko.
