Whether differences in intelligence are due to people’s different genes or to their different environments has long been contentious. One answer to this question comes from twin studies and adoption studies. By comparing outcomes for identical twins (who share all their genes) with those of fraternal twins and with unrelated children, one can deduce the relative influences of genes in comparison with “shared environment” (all environmental factors shared by siblings growing up together) and un-shared environment (everything else, which can include things like randomness in embryonic development). Such studies give high estimates for the genetic contribution to differences in intelligence, such that the heritability of IQ is typically estimated as around 70%.
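The comparison of identical and fraternal twins can be made concrete with Falconer's classic formulas. These are a textbook approximation, not the structural-equation models most modern twin studies fit. They assume that identical (MZ) twins share all their genes and fraternal (DZ) twins, on average, half of their segregating genes. They also assume that both kinds of twin share their family environment to the same degree and that genetic effects are additive. Under those assumptions the twin correlations r_MZ and r_DZ give h² = 2(r_MZ − r_DZ), c² = 2r_DZ − r_MZ and e² = 1 − r_MZ. The correlations below are hypothetical, picked only to illustrate the arithmetic, and the function and field names are illustrative.

```python
from dataclasses import dataclass


@dataclass
class VarianceShares:
    heritability: float   # h^2: additive genetic share of the variance
    shared_env: float     # c^2: environment shared by siblings reared together
    unshared_env: float   # e^2: everything else, including measurement error


def falconer(r_mz: float, r_dz: float) -> VarianceShares:
    """Falconer's approximation from identical (MZ) and fraternal (DZ) twin correlations."""
    h2 = 2 * (r_mz - r_dz)   # MZ pairs share twice the additive genetic variance of DZ pairs
    c2 = 2 * r_dz - r_mz     # similarity left over once genes are accounted for
    e2 = 1 - r_mz            # anything that makes even identical twins differ
    return VarianceShares(h2, c2, e2)


if __name__ == "__main__":
    # Hypothetical IQ correlations, for illustration only.
    shares = falconer(r_mz=0.85, r_dz=0.50)
    print(f"h2={shares.heritability:.2f}, c2={shares.shared_env:.2f}, e2={shares.unshared_env:.2f}")
```

With these made-up inputs the sketch prints h2=0.70, c2=0.15, e2=0.15. The three shares sum to 1 by construction.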
A different method is to look directly at genes, through Genome-Wide Association Studies (GWAS), which sample large numbers of genetic variants in large numbers of people, attempting to measure and add up the effect of each one on IQ. This typically gives much lower estimates for the effect of genes, and the marked difference between estimates from twin studies and those from GWAS studies is referred to as the “missing heritability” problem.
Recently the Harvard geneticist Sasha Gusev argued that twin studies are unreliable and that the true heritability is nearer the much-lower estimates from current GWAS studies. Saying that “intelligence is not like height”, he argues that, while a trait like height might be strongly influenced by genes, intelligence is not. “Adding up all of the genetic variants only predicts a small fraction of IQ score”, he says, adding that: “the largest genetic analysis of IQ scores built a predictor that had an accuracy of 2–5% in Europeans […]”.
In response to Gusev’s critique, Noah Carl wrote a defense of twin studies. Here I add to that by arguing that current GWAS studies must be overlooking much of the genetic influence on intelligence. In short, intelligence must be affected by vast numbers of genes, which means that most of them must have very small effects, and current GWAS studies do not have the statistical power to detect small-enough effects. This is not a new suggestion (see, e.g., Yang et al. 2017), but it could well resolve the issue.
Being taught at school about Mendel and smooth versus wrinkly peas might leave the impression that traits can be determined by only one or a few genes. While this might be true in a few cases, most traits are affected by very many genes, and, in particular, complex traits related to human personality and behaviour must involve huge numbers of genes. (A simpler trait like height could, in principle, involve fewer genes.)
Intelligence is among the most complex traits, which means that any genetic recipe for intelligence must contain a lot of information. If I asked you how many lines of code you’d need to program an intelligent robot, you’d reply: “Eek, millions at least!” Genes provide, of course, a developmental recipe rather than a direct program, but the underlying point still stands: this recipe must contain a vast amount of information and so be encoded in tens of thousands of genes. It then follows that most of these genes must individually have a very small effect. If N thousand genes each contributed equally to intelligence, then each would account for only a 1/N-thousandth share of the total effect.
A basic rule of statistics is that to find smaller effects you need a larger sample. Typically the uncertainty scales as one over the square root of the sample size. So if an opinion poll samples 1000 people, then you get an error range of about 3% [1/square-root(1000) ≈ 0.032]. To do ten times better (a 0.3% error) you’d need a sample 100 times bigger. (Though you’d also run into systematic error, such as whether your sample is representative of the population.) And, of course, to find a tiny effect you need an error range smaller than that effect, preferably quite a bit smaller.
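The square-root rule can be checked in a few lines. This sketch assumes the simplest case: a proportion near 50% estimated from a simple random sample, for which the 95% margin of error is about 1.96·√(p(1−p)/n). At p = 0.5 that is close to the 1/√n used above. The function name is illustrative.

```python
import math


def margin_of_error(n: int, p: float = 0.5, z: float = 1.96) -> float:
    """Approximate 95% margin of error for a proportion p from a simple random sample of size n."""
    return z * math.sqrt(p * (1 - p) / n)


if __name__ == "__main__":
    for n in (1_000, 100_000):
        print(f"n={n:>7,}: ±{100 * margin_of_error(n):.2f}%")
```

A hundredfold larger sample shrinks the margin only tenfold. Systematic error, such as an unrepresentative sample, is not modelled here at all.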
In writing about GWAS studies I should own up that I am not a geneticist, but in my “day job” in astrophysics we have exactly the same problem; that’s why we build telescopes with large mirrors to collect many photons. We are studying tiny signals from faint galaxies at large distances in the universe, and every time we apply for time on a large telescope we calculate how much time we need in order to collect enough photons to have enough statistical power to find the small signal that we are looking for.
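As an illustration of that kind of calculation, here is a deliberately simplified exposure-time estimate. It assumes a purely photon-limited (Poisson) source with no sky background or detector noise, so that collecting N photons gives a signal-to-noise ratio of √N. Real exposure-time calculators include those extra noise terms. The photon rate is a made-up number, and the function name is illustrative.

```python
def exposure_time(target_snr: float, photon_rate: float) -> float:
    """Seconds needed to reach target_snr for a photon-limited source.

    Poisson noise on N detected photons is sqrt(N), so SNR = sqrt(N), which
    means N = target_snr**2 photons; time = N / photon_rate (photons per second).
    """
    photons_needed = target_snr ** 2
    return photons_needed / photon_rate


if __name__ == "__main__":
    rate = 0.5  # hypothetical detected photons per second from a faint galaxy
    for snr in (10, 100):
        hours = exposure_time(snr, rate) / 3600
        print(f"SNR {snr}: {hours:.2f} h")
```

A tenfold better signal-to-noise ratio costs a hundredfold more photons, the same square-root rule as for opinion polls. A bigger mirror helps because it raises the photon rate in proportion to its collecting area.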
GWAS studies examine one type of genetic variability, Single Nucleotide Polymorphisms (or SNPs, usually pronounced “snips”). They might typically record SNPs at hundreds of thousands to millions of locations out of 3 billion nucleotides, examining those SNPs over a sample of 10,000 to 100,000 genomes. For a discussion of the statistical power of GWAS studies I refer to the paper Wu et al. 2022 (“… Statistical power … of genome-wide association studies of complex traits”). The authors confirm that: “The statistical power for an individual SNP is determined by its effect size, the sample size, and the desired significance level. In a random sample of size n, the test statistic for the association between a quantitative phenotype and a SNP is β√n, …” (where β is the effect size). Making a range of assumptions (for which see the paper), they develop a model leading to the following plot:
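[Figure (Wu et al. 2022): sample size (number of genomes) needed to detect SNPs of small, moderate and large effect size.]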
The plot shows the sample size (number of genomes) needed to find SNPs of small, moderate and large effect size, by which they mean 0.01%, 0.1%, and 1% of total SNP heritability. This shows that to detect a gene accounting for 0.1% of the variance requires a sample size of ~ 30,000. A similar paper that again develops a statistical model, making a different set of assumptions (Wang & Xu 2019), concludes that finding a SNP that explains 0.20% of the phenotypic variance requires a sample size of 10,000, which is consistent with the first paper. It’s also worth remarking that real-life studies will almost certainly do worse than these estimates, since there are always sources of noise not accounted for in theoretical models.
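The relationship quoted from Wu et al. can be turned into a rough sample-size calculator. This is a generic normal-approximation sketch, not the model used in either paper. It assumes a SNP whose effect explains a fraction q² of the total phenotypic variance, so the standardized effect is β = √q² and the test statistic is roughly β√n. It also assumes the conventional genome-wide significance threshold of 5 × 10⁻⁸ and a target power of 80%. The function names are illustrative. The sketch leaves out the extra sources of noise the papers model, and Wu et al. express effects as a share of SNP heritability rather than of total variance, so it should not be expected to reproduce their numbers exactly.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()
GENOME_WIDE_ALPHA = 5e-8  # conventional genome-wide significance threshold


def power(n: int, q2: float, alpha: float = GENOME_WIDE_ALPHA) -> float:
    """Approximate power of a two-sided test for one SNP explaining fraction q2 of the variance.

    The test statistic is modelled as Normal(beta * sqrt(n), 1) with beta = sqrt(q2).
    """
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = math.sqrt(q2 * n)
    # Probability of landing beyond either critical value.
    return STD_NORMAL.cdf(shift - z_crit) + STD_NORMAL.cdf(-shift - z_crit)


def required_n(q2: float, target_power: float = 0.8, alpha: float = GENOME_WIDE_ALPHA) -> int:
    """Smallest sample size giving roughly target_power (ignoring the negligible opposite tail)."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = STD_NORMAL.inv_cdf(target_power)
    return math.ceil((z_crit + z_power) ** 2 / q2)


if __name__ == "__main__":
    for q2 in (0.01, 0.002, 0.001, 0.0001):
        print(f"variance explained {q2:.2%}: n ≈ {required_n(q2):,}")
```

The key feature is that n scales as 1/q². A SNP explaining ten times less of the variance needs a sample ten times larger, whatever the exact threshold and power chosen.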
Hence GWAS studies sampling tens of thousands of genomes could find the genes associated with intelligence if there were only a few hundred of them, but if they number a few thousand then that requires hundreds of thousands of genomes at a minimum, and if intelligence involves over ten thousand SNPs then that’s well beyond current GWAS studies.
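The argument above can be put into rough numbers by combining it with the equal-split idea from earlier. If a total heritability h² were shared equally among M SNPs, each would explain h²/M of the variance. The sample needed to detect any one of them would then grow in proportion to M. This is a sketch under loudly illustrative assumptions:

- h² = 0.7, the twin-study figure, generously treated as entirely additive and carried by SNPs;
- exactly equal effects;
- a normal-approximation test at genome-wide significance of 5 × 10⁻⁸ and 80% power.

The function name is illustrative.

```python
import math
from statistics import NormalDist


def n_to_detect_each(total_h2: float, num_snps: int,
                     alpha: float = 5e-8, target_power: float = 0.8) -> int:
    """Sample size to detect one SNP when total_h2 is split equally over num_snps SNPs."""
    q2_per_snp = total_h2 / num_snps  # fraction of variance explained by each SNP
    std_normal = NormalDist()
    z_sum = std_normal.inv_cdf(1 - alpha / 2) + std_normal.inv_cdf(target_power)
    return math.ceil(z_sum ** 2 / q2_per_snp)


if __name__ == "__main__":
    H2 = 0.7  # illustrative: the twin-study figure, treated as purely additive
    for m in (300, 3_000, 10_000, 30_000):
        print(f"{m:>6,} SNPs -> each explains {H2 / m:.4%}, n ≈ {n_to_detect_each(H2, m):,}")
```

Real effect sizes are unequal, so in practice the larger effects would be found first, and a long tail of smaller ones only in much bigger samples.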
The GWAS study in the quote from Gusev above (Savage et al. 2018) sampled the genomes of 270,000 people. They do indeed report that the genetic differences that they found account for only “up to 5.2% of the variance in intelligence”, but they have only found 205 “associated genomic loci” (blocks of SNPs associated with a trait). It seems wildly implausible that these 205 genetic differences are all that there is to a recipe for intelligence. (If you disagree, feel free to write down a developmental recipe for human-like intelligence in only 205 lines of instructions!) It’s worth pointing out that AI models such as ChatGPT-4 are based on hundreds of billions of neural-network “weights” (though of course this is the end product, not the recipe).
Indeed a more recent and bigger study (Okbay et al. 2022) analysed 3 million genomes and found 3,952 SNPs associated with educational attainment that together account for 12 to 16% of the variance in educational attainment. So (setting aside that intelligence, IQ and educational attainment are not quite the same thing), they have a larger sample, they find that many more SNPs are involved, and in total this accounts for a larger fraction of the variation.
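Inverting a standard power calculation shows why the bigger study turns up more, and weaker, associations. As a rough sketch, the smallest fraction of variance a single SNP can explain and still be reliably detected falls as 1/n. The sketch assumes a normal approximation, genome-wide significance of 5 × 10⁻⁸, 80% power and no extra sources of noise. The sample sizes are the ones quoted above. The printed thresholds are model estimates, not figures reported by either study, and the helper name `min_detectable_q2` is illustrative.

```python
from statistics import NormalDist


def min_detectable_q2(n: int, alpha: float = 5e-8, target_power: float = 0.8) -> float:
    """Smallest fraction of variance a single SNP must explain to reach target_power at sample size n."""
    std_normal = NormalDist()
    z_sum = std_normal.inv_cdf(1 - alpha / 2) + std_normal.inv_cdf(target_power)
    return z_sum ** 2 / n


if __name__ == "__main__":
    studies = {"Savage et al. 2018": 270_000, "Okbay et al. 2022": 3_000_000}
    for name, n in studies.items():
        print(f"{name}: n = {n:,}, smallest detectable effect ≈ {min_detectable_q2(n):.4%} of variance")
```

Going from 270,000 to 3 million genomes lowers this threshold about elevenfold. That is consistent with the much larger number of associations found in the bigger study.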
Even then, this is likely to be merely the tip of the iceberg of genetic variability relevant to intelligence and educational attainment. There are 3 billion nucleotides in the human genome, and we have no good way of estimating what fraction might have some effect on intelligence. Further, GWAS techniques study only one type of genetic variation, the SNP, whereas there are, in principle, lots of other ways in which genomes can vary. And, further, GWAS estimates assume a simple “additive” model, where the overall effect on the phenotype is simply the sum of the effect of each SNP individually. This could well be a good first-order approximation, but the reality of a recipe for intelligence is likely to involve myriads of subtle and complex interactions.
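To see why the additive assumption matters, consider a deliberately extreme toy case. It is an illustration only, not a claim about how intelligence is actually wired. Two loci each carry a variant at 50% frequency, coded 0 or 1, and the trait is raised only when exactly one of the two variants is present. Genotype completely determines the trait, yet each locus tested on its own, as a GWAS does, shows zero association. The function names are illustrative.

```python
from fractions import Fraction
from itertools import product


def trait(x1: int, x2: int) -> int:
    """Toy interaction: trait is 1 when exactly one of the two variants is present (XOR)."""
    return x1 ^ x2


def marginal_slope(locus: int) -> Fraction:
    """Least-squares slope of trait on one locus, over four equally likely genotypes."""
    genotypes = list(product((0, 1), repeat=2))  # (0,0), (0,1), (1,0), (1,1)
    xs = [g[locus] for g in genotypes]
    ys = [trait(*g) for g in genotypes]
    n = len(genotypes)
    mean_x = Fraction(sum(xs), n)
    mean_y = Fraction(sum(ys), n)
    # Exact covariance and variance, using Fractions to avoid rounding.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    var = sum((x - mean_x) ** 2 for x in xs) / n
    return cov / var


if __name__ == "__main__":
    print("slope at locus 1:", marginal_slope(0))
    print("slope at locus 2:", marginal_slope(1))
```

Real traits are unlikely to be this extreme. The point is only that variance arising from interactions need not show up as per-SNP additive effects.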
In short, since our understanding of the genetic developmental recipe for intelligence is close to non-existent, and since we are only guessing wildly at how many SNPs it might involve and how those SNPs interact, there is no way that we can conclude that current GWAS studies are sensitive to most of the relevant genetic variability. Hence we cannot conclude that adding up the known effects gives anything like a true estimate of the heritability of any complex trait, such as intelligence. All we can say is that it gives a lower limit.
Note that this argument is not attributing the missing heritability (as has sometimes been suggested) to relatively rare genes of large effect, which, being rare, are simply not sampled in GWAS studies (this idea used to be plausible, but is getting less so as GWAS studies get bigger). Instead, it is attributing the missing heritability to large numbers of common genes that are sampled in GWAS studies, but whose individual effects are too small to be detected with the statistical power of current GWAS studies.
In contrast, twin studies do not depend on knowing anything about how the genes produce intelligence. It is purely a suck-it-and-see method that takes whole genomes (identical twins, fraternal twins, and unrelated children) and evaluates the later-life outcomes. Twin studies do have their own assumptions, including the “equal environment assumption”. For example, do parents tend to treat identical twins differently from fraternal twins? This is one reason why the gold standard is studies of twins that were separated at birth and reared apart. This is a rare occurrence today, but before the ready availability of abortion and the pill in Western countries there was a steady stream of young, un-married mothers giving up babies to be adopted at birth, and it was common to separate twins. More recently, China’s one-child policy has led to twins being separated and adopted at birth (e.g., Segal & Pratt-Thompson 2024). Such studies give high values of ~ 0.7 for the heritability of IQ.
As a result of checks like this, heritability estimates from twin and adoption studies have been extensively examined and seem robust. We have no good reason to think that twin studies are severely overestimating the heritability of IQ. In contrast, there is good reason to suspect that the GWAS estimates are only a lower limit, and are currently much too low.
