Glups! My original idea was to share some summaries I wrote of very interesting papers on Medical Genetics. However, due to workload I am afraid I won't be able to maintain this section regularly, so I will add new summaries in the near future when I can. I hope they are useful to someone. In any case, I strongly recommend reading these papers yourself. Finally, I did not ask anyone for permission to post these summaries, but I hope the authors do not dislike the idea! (I stopped at page 96 of the "Egutegia" (calendar) document; next time I will start from there.)
Literature in Genetic Epidemiology
- 2010/12 - ESTIMATION OF EFFECT SIZE DISTRIBUTION FROM GWAS (JH Park et al. NG)
- 2010/11 - CONSISTENT ASSOCIATION OF T2D GWAS ALLELES IN DIVERSE ANCESTRIES (KM Waters et al. PLoS Gen)
- 2010/09 - TOWARDS COMPLETE RESOLUTION OF GENETIC ARCHITECTURE OF DISEASE (AB Singleton et al. Trend. Gen)
- 2010/07 - COMMON SNPS EXPLAIN 50% HEIGHT HERITABILITY (Yang J, Visscher PM et al. Nat Gen)
- 2010/05 - GWAS IN DIFFERENT POPULATIONS (Rosenberg et al. Nat Rev Gen)
- 2009/12 - DISEASES ARE QUANTITATIVE TRAITS (Plomin et al. Nat Rev Gen)
- 2009/10 - CHARACTERISTICS OF GWAS SNPS (Manolio et al. PNAS) - Added 2012/01
- 2008/02 - WHAT CAN GWAS TELL US ABOUT THE GENETICS OF COMMON DISEASE? (Mark I. Iles PLoS Gen) - Added 2012/01
- 2003/05 - CONFOUNDING, ASC. BIAS, AND THE BLIND QUEST FOR A GENETIC FOUNTAIN OF YOUTH (Terwilliger Ann Med)
- 2003/01 - METAANALYSIS OF ASSOCIATION STUDIES SUPPORTS CD/CV (Lohmueller et al. AJHG)
- 2002/10 - COMPLEX DISEASES: CV/CD or RV/CD? (JK Pritchard et al. HMG)
- 2010/10 - ADAPTATION NOT BY SWEEPS ALONE (JK Pritchard, A Di Rienzo Nat Rev Gen)
SUMMARY: From detected loci * power to detect => true distribution of risk alleles
INTRODUCTION
- Second-generation GWAS have a problem: the low-hanging fruit has already been detected
- We integrate power over the number of unidentified susceptibility loci that probably exist, accounting for the distribution of RR and allele frequency
- The observed effect-size distribution is skewed (biased toward larger effect sizes, for which power is greater)
RESULTS
- Height: TO FINISH!
1.- All 19 SNPs are polymorphic (MAF > 0.05) in all groups (except rs10923931 - NOTCH2 - in Japanese and Hawaiians, and rs7903146 - TCF7L2 - in Japanese)
2.- Power is low for several SNPs, but the direction is consistent: 50% expected by chance, versus 63% in EA up to 100% in Japanese
3.- Hom OR > Het OR, as expected under allelic dosage
4.- When EA are left out, ALL (!) show the same direction and 13 are significant at p = 0.05
5.- Although the ORs are consistent, they don't explain differences in disease prevalence
1.- Most, if not all, of these alleles show directionally similar association to T2D across many populations. Such a pattern indicates that the causal alleles at these validated risk loci (which have yet to be found) likely predate the migrations that separated these populations after the exit from Africa.
2.- Unexpected for "synthetic associations" hypothesis
Literature in Human Population Genetics
They examined 19 validated risk alleles for T2D in EA, AA, Latino, Japanese American, and Native Hawaiian T2D cases (n = 6,142) and controls (n = 7,403) from the population-based Multiethnic Cohort study (MEC)
- It is a short review of the possible spectrum of disease variants (Mendelian OMIM, common GWAS and now rare RESEQUENCING). Good statement of where they think the missing heritability lies:
- "We hypothesize that a substantial proportion of this undiscovered fraction will be heterozygous loss-of-function alleles. Our argument is as follows. We now know that many common low-risk variants have modest effects (10-20%) on gene expression (see below). Loss-of-function variants effectively lead to a 50% loss of expression and would therefore generally be predicted to have larger effects on disease risk. We also know that there are many ways of disrupting gene function: whole or partial gene deletions, missense mutations, and splice-site changes. Each individual variant will be rare, although the combined burden of all the rare variants at each locus could be substantial. Loss-of-function changes will necessarily have low population frequencies because homozygotes would suffer from severe early-onset disease consequences, and so the most likely scenario for such loci is that rare and diverse loss-of-function alleles will occur in any population. Perhaps the best example of a moderate-risk, low-frequency locus is provided by the loss-of-function alleles of glucocerebrosidase in Parkinson's disease. These were identified through the clinical observation that first and second-degree relatives of Gaucher's disease have increased incidence of Parkinson's disease."
- 3,925 individuals with pairwise relationship less than 0.025
- Restricted maximum likelihood (REML) explains 45% of the variance!
- Where is the rest? LD? If LD is the problem, the matrix of relationships between individuals at the causal loci (Gjk) differs from the one estimated from the chip SNPs (Ajk)
- They have no causal loci, but test what happens when taking 50, 100 or 150K SNPs, or MAF > 0.1, 0.2, 0.3
- They correct for low-MAF SNPs not captured by the chip, which would explain all of h2 (85%)!
- The variance explained doesn't depend on the number of genotyped SNPs (300K gives the same as 500K SNPs)
- To validate, they simulate known disease loci in two situations: when they have all SNPs or when they have just SNPs with MAF > 0.1
- In the discussion: if SNPs are very rare, they are in low LD and are never caught in GWAS
- They correct for PCA (eigenvectors), but get the same results, and AIMs are uncorrelated with height
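The relationship matrix Ajk mentioned above can be illustrated with a small sketch. It assumes the usual allele-frequency-standardised estimator for pairs of different individuals, A_jk = (1/N) * sum_i (x_ij - 2p_i)(x_ik - 2p_i) / (2p_i(1 - p_i)); the function names and the toy genotypes are hypothetical, and this is not the authors' software.

```python
def allele_frequencies(genotypes):
    """genotypes[j][i] = copies (0, 1 or 2) of the reference allele at SNP i in individual j."""
    n_ind = len(genotypes)
    n_snp = len(genotypes[0])
    return [sum(g[i] for g in genotypes) / (2 * n_ind) for i in range(n_snp)]


def relationship(genotypes, j, k, freqs):
    """Estimated relationship A_jk between two different individuals j and k."""
    total, used = 0.0, 0
    for i, p in enumerate(freqs):
        if p <= 0.0 or p >= 1.0:
            continue  # monomorphic SNPs carry no information
        total += (genotypes[j][i] - 2 * p) * (genotypes[k][i] - 2 * p) / (2 * p * (1 - p))
        used += 1
    return total / used


# Toy data (hypothetical): 4 individuals x 3 SNPs
toy = [[0, 1, 2], [0, 1, 2], [2, 1, 0], [1, 1, 1]]
freqs = allele_frequencies(toy)
a_01 = relationship(toy, 0, 1, freqs)  # identical genotypes: high relationship
a_02 = relationship(toy, 0, 2, freqs)  # opposite genotypes: negative relationship
```

In the paper's setting, pairs with an estimated relationship above the 0.025 cutoff would be dropped before running REML.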
a) Why do we need more populations?
- Some variants may exist just in some populations (MYBPC3 in Indians)
- Shared, but allele frequency might differ among populations (TCF7L2 or KCNQ1 in T2D)
- Diseases have different prevalence among populations (power)
- Different OR among populations (APOE)
b) Challenges in non-Europeans?
- Marker ascertainment: there is a European bias, which can decrease power outside Europe
- Tag-SNP portability
- Genotype imputation (difficult in Africans or admixed)
c) Problems with independence
However, in standard GWA analyses, in which alleles that are more common in cases than in controls are identified by testing contingency tables locally along the genome, an implicit assumption is that the genotypes of separate individuals can be treated as independent random variates. Approximating separate individuals as independent has been productive as a first approximation, but more information is potentially available by accounting for correlation among individuals resulting from shared descent. A genealogical perspective suggests that replication studies are in fact pseudoreplication studies, as potential correlations could arise from shared genealogy. From this viewpoint, particularly in small populations, separate association studies that identify the same risk variant in a population might not provide the same degree of confirmation as replication studies that are conducted in a context in which events are truly independent. Although efforts have been devoted to statistical issues of replication in relation to sample size and measured effect-size (ref 122,123), studies of the population genetics of replication are in their infancy. As the frequency of replication studies continues to increase, methods for evaluating intrinsic correlations between study outcomes and their effects on interpretations of replication studies would provide a useful development.
d) Example of common variants in T2D
- TCF7L2: Grant et al.132 identified common alleles in TCF7L2 as being associated with T2D, a finding that has been confirmed in many populations, including other Europeans133,134, Africans135, East Asians136, South Asians137 and Mexicans138. Subsequent analysis of T2D in East Asians suggests that whereas OR are similar for TCF7L2 variants in East Asians, risk allele frequencies are substantially lower, so larger samples are needed to identify the association139.
- KCNQ1: The first T2D GWA studies in East Asians identified variants in KCNQ1 (13,14). A recent meta-analysis in Europeans carried out by DIAGRAM found a similar effect size, but the association was not genome-wide significant due to a much lower risk allele frequency (DIAGRAM Consortium, personal communication). Interestingly, this same meta-analysis identified a second genome-wide significant T2D association signal ~150 kb from those discovered in East Asians
- INTRODUCTION: In our opinion, the polygenic liabilities that emerge from GWA research will lead to common disorders being thought of as the extremes of quantitative traits and, ultimately, to a scientific focus on quantitative traits rather than disorders. The Mendelians were wrong in assuming that complex traits show simple Mendelian patterns of inheritance. The biometricians were wrong in concluding that Mendelian laws of inheritance are particular to pea plants and do not apply to complex traits. Quantitative geneticists investigated the genetics of naturally occurring phenotypic variation by using methods that estimated the cumulative effect of genetic influence regardless of the number of genes involved or the complexity of their effects. By contrast, the progenitors of molecular genetics studied how genes work rather than particular phenotypes: they focused on rare naturally occurring monogenic effects or experimentally induced ones.
- THINKING QUANTITATIVELY: The finding that multiple DNA variants are associated with common disorders is leading to disorders being thought of in quantitative terms. But which quantitative traits underlie the common disorders that we study? For some disorders, the relevant quantitative traits seem obvious, such as body mass index (BMI) for obesity, blood pressure for hypertension, and mood for depression. The relevant quantitative traits are not at all clear for most disorders - including alcoholism, arthritis, autism, cancers, dementia, diabetes and heart disease. For some traits, such as type 2 diabetes (T2D), a quantitative approach has already been embraced, with striking results9. Recent studies of Crohn's disease (CD)12 have provided less well-known examples of how quantitative traits that are relevant to disease might arise from GWA studies. There is considerable and unexpected pleiotropic overlap among the genetic variants that affect different human diseases. These findings have inspired the concept of the human diseasome - the synthesis of all human genetic disorders (the disease phenome) and all human disease genes (the disease genome). We use the term polygenic risk score to refer specifically to the set of multiple DNA variants that are associated with a disorder. (Such composites have also been called polygenic susceptibility scores27, genomic profiles28, SNP sets29, genetic risk scores30 and aggregate risk scores31.) Polygenic risk scores are beginning to be used to predict the population-wide genetic risk for common disorders, such as breast cancer32, atherosclerosis30, coronary heart disease33, age-related macular degeneration34, recurrent venous thrombosis35 and T2D36. From an evolutionary perspective, averageness might be an adaptive trade-off against the mixture of costs and benefits of more extreme polygenic liabilities, especially given the fluctuating nature of selection37.
An adaptive edge for averageness also suggests an evolutionary mechanism that would maintain genetic variation in the population and therefore account for the ubiquitous high heritabilities of traits across the life sciences. Studying populations rather than probands might lead to multivariate quantitative traits that reflect polygenic risk scores better than do the disorders themselves. Research on the full range of normal variation in a population also has implications for GWA studies, as described in the following section. First, for common disorders, statistical power can be greatly enhanced by conducting GWA studies that compare the low and high extremes of quantitative traits or by studying the entire distribution, rather than dichotomizing the same distribution into cases and controls39. Case-control studies of rare disorders already achieve a form of selection of one population extreme by the overrepresentation of cases compared with normal controls; however, as disorders become more common, a quantitative approach becomes more powerful because the cases become less extreme and the control group becomes increasingly contaminated by near cases - that is, individuals who nearly reach the diagnostic thresholds for the disorder
- PERSPECTIVES: When several genes affect a disorder, the polygenic liability is continuously distributed as a normal bell-shaped curve, which means that what we call common disorders are, in fact, the quantitative extremes of continuous distributions of genetic risk. The obvious test of this hypothesis is that polygenic risk scores will be associated not only with differences between cases and controls but also with individual differences throughout the entire range of variation. For example, genes that are found to be associated with hypertension in case-control studies are predicted to be correlated with the entire range of variation in blood pressure, just as genes that are found to be associated with obesity in case-control studies are correlated with the entire range of variation in BMI43. The problem is that for most disorders, we do not know what the relevant quantitative traits are. Even when we think we know the relevant quantitative traits, we are likely to be surprised, as research on polygenic risk scores considers the full range of normal variation in population cohort studies and captures the diffuse pleiotropic effects of genes
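A polygenic risk score as described above can be sketched in a few lines. The unweighted version simply counts risk alleles; weighting each count by log(OR) is a common convention that the summary itself does not specify, so treat it as an assumption. The function name and the example values are hypothetical.

```python
import math


def polygenic_risk_score(risk_allele_counts, odds_ratios=None):
    """Sum of risk-allele counts (0, 1 or 2 per SNP), optionally weighted by log(OR).

    The log(OR) weighting is an assumed convention, not taken from the paper.
    """
    if odds_ratios is None:
        return float(sum(risk_allele_counts))
    if len(odds_ratios) != len(risk_allele_counts):
        raise ValueError("one odds ratio per SNP is required")
    return sum(c * math.log(o) for c, o in zip(risk_allele_counts, odds_ratios))


# Hypothetical individual genotyped at three risk SNPs
counts = [0, 1, 2]
unweighted = polygenic_risk_score(counts)
weighted = polygenic_risk_score(counts, [1.2, 1.5, 1.1])
```

Because the score is a sum over many loci, it is roughly continuous in a population, which is the point of treating disorders as the extremes of quantitative traits.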
CHARACTERISTICS OF ASSOCIATED SNPS
- 151 (of 237) GWAS (2008/12) => 531 SNPs at 5*10^-8 => median RAF 36%, IQR 21-53
- 465 unique TASs: 43% intergenic, 45% intronic, 9% nsSNP, 2% UTR and 2% synonymous
- 227 (43%) for discrete traits => OR from 1.04 to 29.4 (median 1.33, IQR 1.20-1.61)
- Trait prevalence not associated with OR or RAF
- 18 associated with >1 trait, and several known from previous studies: APOE, HLA, KCNJ11, PPARG and CARD15
- TASs tend to be common, with low OR and low FST, and enriched in nsSNPs and promoters (but 88% are intergenic or intronic)
FUNCTIONAL AND EVOLUTIONARY ANALYSIS
- TASs and SNPs at r2 > 0.9 with them are assigned to 20 annotation sets, counting the sets in which a TAS had at least 1 hit. They test how many blocks (%) map to a set compared to 100 random blocks
- 9 categories enriched: nonsynonymous sites, 1kb promoters, 5kb promoters, most conserved sequences (MCSs), 3' UTRs, microRNA target sites, introns, CpG islands and experimentally validated regulatory regions from ORegAnno
- Especially non-synonymous sites (more so if only predicted deleterious ones are taken)
- The proportion of TASs with iHS > 90th percentile (1.635) is slightly higher (OR = 1.3 [0.8-2.1], p = 0.2)
- Some alleles with extreme iHS (thrifty gene hypothesis). An ancestral/derived study would be worthwhile!
- Replicated SNPs tend to be very common and frequency inversely correlated with estimated effect size
- Our simulations suggest that such observations are the result of power
- We simulated data to see what underlying distributions could give rise to the observed risk distributions
- 54 studies (22 different diseases) (Table S1) examined. In total, 45 disease-associated SNPs
- MAF => median frequency of 0.40 (95% CI: 0.37, 0.48) and a mean of 0.43 (95% CI: 0.39, 0.49). 3/43 MAF < 0.1
- Pearson's correlation between RAF and OR => -0.28 (95% CI: -0.53,0.07, p = 0.07) (Figure 1D). But correlation between MAF and OR => -0.48: 95% CI: -0.66, -0.19, p = 0.001
- Thus, it could be power rather than purifying selection ==> let's see simulations
- Mutation rates chosen so that risk alleles are mostly low frequency, less low, or mainly common (betaS = 0.1, 1 and 3). N = 1000 or 3000 and GRR of 1.2, 1.5 and 2. They "genotype" SNPs mimicking those from chips, do the Cochran-Armitage test and set genome-wide significance at 5*10^-7.
- When GRR = 2, the distributions are easily discernible. When GRR = 1.5, the distributions are similar for n = 1000, and when n = 3000 there is a clear increase in rare variants for beta 0.1 or 1. Finally, with GRR = 1.2, all look similar
- Whatever true risk distribution, results suggest that unless the effect size or sample size is large (GRR > 1.5 or n = 3,000), simulations with mostly rare (betaS = 0.1) or common (betaS = 3) susceptibility alleles produce similar distributions of risk alleles that look Normal and not too skewed (median of 0.2-0.4)
- The frequency of associated allele is thus a good indicator of the frequency of the genuine disease allele
- These results suggest that the distribution of frequencies for confirmed disease-associated alleles is far more reflective of power than of the underlying distribution of disease alleles
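The claim that detected allele frequencies mostly reflect power can be illustrated with a rough power calculation. This is a sketch under stated assumptions, not the paper's simulation: it uses a normal approximation to a two-sample allelic test (not the Cochran-Armitage trend test), a multiplicative model for a rare disease so that controls have the population frequency, and it reads N as the number of cases and of controls. The function names are hypothetical.

```python
from statistics import NormalDist


def case_allele_frequency(p, grr):
    """Risk-allele frequency in cases under a multiplicative model (rare disease assumed)."""
    return p * grr / (1 + p * (grr - 1))


def allelic_test_power(p, grr, n_cases, n_controls, alpha=5e-7):
    """Approximate power of a two-sided allelic z-test at significance level alpha."""
    norm = NormalDist()
    q = case_allele_frequency(p, grr)
    # Each individual contributes two alleles
    se = (q * (1 - q) / (2 * n_cases) + p * (1 - p) / (2 * n_controls)) ** 0.5
    z_crit = norm.inv_cdf(1 - alpha / 2)
    z = abs(q - p) / se
    return norm.cdf(z - z_crit) + norm.cdf(-z - z_crit)


# Same GRR and sample size, rare versus common risk allele
power_rare = allelic_test_power(0.01, 1.5, 1000, 1000)
power_common = allelic_test_power(0.30, 1.5, 1000, 1000)
```

With the same GRR and sample size, the common allele is far more likely to reach significance, so confirmed associations are enriched for common alleles whatever the true frequency distribution.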
- Chronic diseases typically strike late in life and are etiologically complex
- When we find alleles, they tend to be involved in only a small subset of all cases of such diseases
- "Genetic" has two senses: first, that genes act on everything; second, that variation in genes produces variation in phenotypes
- People do mix them up: that genes control biology does not mean that genetic variation is a fundamental cause of every disease
- People usually forget one part of the thrifty allele hypothesis: the alleles become related to disease because of environmental changes (you cannot propose to change the genes!).
- Association studies: they might be the future, but a larger sample size obviously brings more heterogeneity (reduced power per sample)
- Association studies use the fact that all humans form one pedigree (if something is related to disease, it is identical by descent within an LD block)
- It is called association because it is between phenotype and genotype (not correlation between relatives at a location, as in linkage)
- For LD-based mapping to work, we need to ascertain a large sample in which the phenotype is due to the same genotype, so as to have enough power to reach significance; but this only holds for common variants with a very marginal effect that nonetheless account for a large proportion of disease aetiology (if you see the phenotype, the genotype is likely there)
- People say "well, low OR, but large attributable fraction of the disease"; this makes sense in epidemiology, but not in genetics (you cannot kill the people who carry this allele!)
- It is better to find those with a very large OR (therapy makes sense!).
25 Associations >> 301 studies after excluding 1st positive report. 59 and 26 studies at p < 0.05 / 0.01 (only 15 and 3 expected)
TWO MAIN QUESTIONS
- Publication bias or true associations??
1.- 47/59 in the same direction (while 50% expected)
2.- A lot of unpublished studies would be needed to see 59 at p < 0.05 (and 47 with the same allele):
a. 1/20 at 0.05 and same allele = 1/40; 47 / (1/40) = 1880 (1302-2501) ==> 1579 unpublished?? Too many.
b. If 12 with the other allele at 0.05: 12 / (1/40) = 480 total pool of studies ==> still 179 unpublished?
3.- Funnel plot analysis
4.- Moreover, 44/47 replications with same allele accumulate in just 11 associations
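The arithmetic behind the publication-bias argument above can be checked with a short sketch. The binomial sign test for 47 of 59 replications in the same direction is added here for illustration and is not necessarily the test used in the paper; the variable and function names are hypothetical.

```python
from math import comb

N_STUDIES = 301  # published follow-up studies, excluding first positive reports


def binomial_upper_tail(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def implied_pool(n_significant, one_in=40):
    """Studies needed to see n_significant hits if each occurs by chance once in `one_in` studies."""
    return n_significant * one_in


# Significant studies expected by chance among the 301
expected_p05 = N_STUDIES * 0.05
expected_p01 = N_STUDIES * 0.01

# Chance of p < 0.05 AND the same allele as the first report: 1/20 * 1/2 = 1/40
pool_same_allele = implied_pool(47)
pool_other_allele = implied_pool(12)
unpublished_a = pool_same_allele - N_STUDIES
unpublished_b = pool_other_allele - N_STUDIES

# Sign test: 47 of 59 significant studies in the same direction, 50% expected
sign_test_p = binomial_upper_tail(47, 59)
```

The implied pools (1880 and 480 studies) are both larger than the 301 published studies, which is why publication bias alone is an unlikely explanation.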
- If so, why such inconsistency??
They model heterogeneity among studies. They take chi-square values for each study and use a Pearson chi-square goodness-of-fit test. If p > 0.05, the studies are homogeneous. But 10/25 have p < 0.05. However, for 6/10 the heterogeneity disappears after taking out the highest chi-square value (or less than 5% of values). Still, there are 4 associations with heterogeneity FOR WHICH separating by ethnicity does not improve it!! (After checking homogeneity, they use Epilog to calculate the pooled OR and CI with both fixed and random effects models)
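As a rough illustration of pooling, here is a minimal fixed-effect sketch. It uses inverse-variance weighting of log odds ratios and Cochran's Q as the heterogeneity statistic, which is a standard textbook approach and not necessarily what Epilog or the authors' Pearson goodness-of-fit procedure computes. The function name and the example inputs are hypothetical, not data from the paper.

```python
import math


def pooled_or_fixed(odds_ratios, standard_errors):
    """Fixed-effect pooled OR, 95% CI and Cochran's Q.

    standard_errors are the standard errors of log(OR) for each study.
    """
    weights = [1 / se**2 for se in standard_errors]
    log_ors = [math.log(o) for o in odds_ratios]
    pooled_log = sum(w * b for w, b in zip(weights, log_ors)) / sum(weights)
    pooled_se = math.sqrt(1 / sum(weights))
    # Q is compared with a chi-square on (number of studies - 1) degrees of freedom
    q = sum(w * (b - pooled_log) ** 2 for w, b in zip(weights, log_ors))
    ci = (math.exp(pooled_log - 1.96 * pooled_se),
          math.exp(pooled_log + 1.96 * pooled_se))
    return math.exp(pooled_log), ci, q


# Hypothetical example: a strong first report followed by weaker replications
pooled, ci, q = pooled_or_fixed([2.4, 1.3, 1.2, 1.4], [0.40, 0.15, 0.20, 0.25])
```

In this made-up example the pooled OR sits well below the first report's OR, the pattern the notes describe as upward bias in the first study.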
Lack of power (several clues pinpoint this option)
1.- Most pooled ORs are < 2 (needs a large sample size; for 8 of the 11 replicated associations the meta-analysis is significant)
2.- 24/25 show upward bias in OR (higher in 1st study than in their pooled OR)
3.- Winner's curse in 23/25
- Linkage does not seem useful for complex diseases. Individual risk depends on genetics, environment, lifestyle and stochastic factors. Allelic architecture refers to the number of distinct alleles that impact disease susceptibility and their allelic frequencies (together with their penetrances). They model the degree of allelic identity and the frequency of disease alleles (S). If CD/CV is true, association mapping works well. But if allelic identity is low (high allelic heterogeneity), association mapping is difficult. Mendelian diseases have large allelic heterogeneity (haemophilia B and CFTR). For complex disease genes there are few successes: either allelic heterogeneity is high or, due to lack of power, only high-penetrance alleles have been identified. They build a model with a mutation rate from neutral to disease allele, the reverse rate (usually lower) and the frequency of the disease allele, together with some weak purifying selection (the alleles do not affect fitness much). The mutation rate to disease alleles is the key (big differences between Pritchard's and Risch's). They assume constant population size.
- They set beta(s) = 4Ne*mu(s) (the scaled mutation rate to susceptibility alleles) from 0.1 to 5, while Risch sets beta(s) = 0.13. They find that most alleles reach extreme frequencies (0 or 1). Therefore, many genes could affect disease but show low variability; so, few genes have intermediate frequencies. To reach intermediate frequency, a gene needs weak purifying selection and a large mutation rate. Conclusion: some genes are never seen with current methods because they have low-frequency disease variants (so our findings are biased). Allelic heterogeneity (which affects association studies) depends on beta(s), the mutation rate to susceptibility alleles. If beta = 0.1, the most common disease allele represents 94% of the disease alleles of a gene, but if beta goes to 1 or 5 it drops to 65% or 33%.
- "Many Mendelian disorders display extreme levels of allelic heterogeneity that would doom LD-based techniques for finding low-penetrance complex disease variants. The CDCV hypothesis represents the best-case scenario for large-scale association mapping, so it is important to ask whether this hypothesis is unreasonably optimistic. In particular, why should the architecture of complex disease loci differ from that of Mendelian loci? Mendelian disease mutations are highly penetrant, and usually under very strong selection, which keeps them at low frequencies. Susceptibility variants involved in complex diseases seem to have low or medium penetrance, and are probably not subject to such strong selection. This suggests one crucial difference: it is possible for the class of S alleles at a complex disease locus to reach intermediate frequencies (5-10% or more). If current mutation rate predictions for complex disease loci are in the right range (7,8), then it seems that loci with intermediate p are not expected to have devastating levels of allelic heterogeneity."
- "The NOD2 example provides a good cautionary tale for would-be mappers of complex disease genes (24,25). Several features contributed to the success in finding this gene despite the presence of moderate allelic heterogeneity: the mutations had relatively high penetrance (24), the gene made good biological sense, and the susceptibility variants included a frameshift mutation and several non-synonymous mutations. But would our current mapping strategies be able to find a gene with similar allelic architecture under less favourable circumstances-for example, with lower penetrance, in a gene of unknown function, with multiple non-coding variants?"
1.- We argue that this view of adaptation that proceeds by selective sweeps at key loci is too limited. By contrast, classical selection models of artificial and natural selection in the quantitative genetics literature emphasize the importance of modest changes in allele frequencies at many loci (that is, 'polygenic adaptation'). We believe that polygenic models have received too little attention in the population genetics community
2.- Suppose that a population is well-adapted to its environment until the environment changes (for example, the population moves to a new location with a different climate), thereby changing the optimal value of one or more phenotypes. If there is already considerable heritable variation underlying the phenotype in question, then the population can adapt rapidly to the new conditions at a speed that depends on the strength of selection and the heritability of the trait
The key point is that we should expect such an adaptation to occur by small allele frequency shifts spread across many loci. As a hypothetical example, consider the adaptation of human height (a trait for which there are probably hundreds of SNPs that each affect height by a few millimeters). Strong selection for increased height could be very effective, as height is extremely heritable. But at the level of individual SNPs, the effect of selection would be rather weak, exerting just a small upward pressure in favour of each of hundreds of 'tall' alleles. Suppose that at 500 SNPs, the tall alleles each increase the expected height of a person by 2 mm. Then, an average shift of just 10% in the population allele frequency of each tall allele would increase average height in the population by 20 cm (assuming that SNPs contribute additively)
3.- If polygenic adaptation is indeed an important mode of adaptation, then what types of trait would be most likely to evolve in this way? Traits that are strongly heritable prior to selection (that is, that have standing variation for selection to act on) with many loci of small effect, are likely to evolve by the polygenic model. These conditions probably hold for most quantitative traits.
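The height example in point 2 can be checked with a small calculation. It assumes, as the text does, that the SNPs act additively, and reads the 2 mm effect as per allele copy in a diploid, so a frequency shift dp raises the mean number of tall alleles per locus by 2*dp; under that reading the 20 cm figure is reproduced. The function name is hypothetical.

```python
def mean_shift_mm(n_snps, effect_mm, freq_shift, ploidy=2):
    """Change in mean trait value when every trait-increasing allele shifts by freq_shift.

    Additive model: each copy of an allele adds effect_mm, and each individual
    carries `ploidy` copies per locus.
    """
    return ploidy * n_snps * effect_mm * freq_shift


# 500 SNPs, 2 mm per tall allele, 10% average shift in allele frequency
shift_mm = mean_shift_mm(500, 2.0, 0.10)
shift_cm = shift_mm / 10  # about 20 cm
```

No single locus changes much, so a sweep-based scan would see almost nothing even though the phenotype has moved a long way.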