Here we focus on heritability analyses that use individual-level data (for analyses using summary statistics, see the SumHer pages).

Please note that heritability analyses require very careful quality control. In particular, estimates can be very sensitive to genotyping errors, and inflated by familial relatedness or population structure. Therefore, you should begin by reading Quality Control, which provides guidelines for cleaning your dataset, as well as ways to test for inflation due to genotyping errors or cryptic relatedness.

Furthermore, most heritability analyses require a large number of unrelated samples. For example, reliably estimating SNP heritability (standard deviation less than 5%) typically requires at least 5,000 unrelated samples. If, after performing quality control, your dataset contains fewer than 5,000 unrelated samples, then you will not be able to perform heritability analysis, sorry. However, you may instead be able to perform single-SNP or gene/chunk-based association analysis. See Small Datasets for more details.

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Having performed quality control, the first step is to Calculate Kinships. The number of kinship matrices, and how each is calculated, is determined by the choice of Heritability Model.
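To make the idea of a kinship matrix concrete, here is a minimal Python sketch. It assumes the simplest possible setup, in which every SNP is standardized to variance one and all SNPs are weighted equally; this is only one possible choice, and the Heritability Model you pick may call for different scalings or weightings. The function name kinship_from_genotypes and the simulated data are hypothetical and are not part of the software.

```python
import numpy as np

def kinship_from_genotypes(genotypes):
    """Return an n x n kinship matrix from an n x m matrix of allele counts (0, 1, 2).

    Assumption: each SNP is standardized and all SNPs are weighted equally.
    """
    X = np.asarray(genotypes, dtype=float)
    mean = X.mean(axis=0)
    sd = X.std(axis=0)
    keep = sd > 0  # drop monomorphic SNPs, which cannot be standardized
    Z = (X[:, keep] - mean[keep]) / sd[keep]
    # Average the per-SNP cross-products over the SNPs that were kept.
    return Z @ Z.T / Z.shape[1]

# Simulated allele counts for 50 samples and 200 SNPs (illustration only).
rng = np.random.default_rng(0)
freqs = rng.uniform(0.1, 0.5, size=200)
G = rng.binomial(2, freqs, size=(50, 200))
K = kinship_from_genotypes(G)
```

Under this scaling the matrix is symmetric and its diagonal averages to one, so off-diagonal entries can be read as relatedness relative to the sample average.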

The second step is to estimate the heritability contributed by each kinship matrix. For this, you can use restricted maximum likelihood (REML), Haseman-Elston (HE) regression, or phenotype-correlation genotype-correlation (PCGC) regression. REML and HE can be used for both quantitative and binary phenotypes, whereas PCGC can only be used for binary phenotypes.
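As an illustration of the simplest of these estimators, the sketch below implements a basic form of HE regression for a single kinship matrix: for every pair of distinct samples, the product of their standardized phenotypes is regressed on their kinship, and the slope is taken as the heritability estimate. This is a textbook version under stated assumptions (one kinship matrix, no covariates, a regression with intercept), not a description of how the software implements it. The function name he_regression and the simulated data are hypothetical.

```python
import numpy as np

def he_regression(K, y):
    """Estimate heritability as the slope from regressing y_i * y_j on K_ij (i < j)."""
    y = np.asarray(y, dtype=float)
    y = (y - y.mean()) / y.std()            # standardize the phenotype
    i, j = np.triu_indices(len(y), k=1)     # all pairs of distinct samples
    products = y[i] * y[j]
    kinships = K[i, j]
    # Ordinary least-squares slope (with intercept).
    kc = kinships - kinships.mean()
    return float(kc @ (products - products.mean()) / (kc @ kc))

# Simulate genotypes and a phenotype with heritability 0.5 (illustration only).
rng = np.random.default_rng(1)
n, m, h2 = 1000, 500, 0.5
freqs = rng.uniform(0.1, 0.5, size=m)
G = rng.binomial(2, freqs, size=(n, m)).astype(float)
Z = (G - G.mean(axis=0)) / G.std(axis=0)    # standardized genotypes
K = Z @ Z.T / m                             # kinship matrix
beta = rng.normal(0.0, np.sqrt(h2 / m), size=m)
y = Z @ beta + rng.normal(0.0, np.sqrt(1.0 - h2), size=n)

h2_hat = he_regression(K, y)
print(f"HE estimate of heritability: {h2_hat:.3f}")
```

With only 1,000 simulated samples the estimate is noisy, which is consistent with the sample-size guidance above.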

REML is generally preferred because it produces the most precise estimates of heritability (those with the smallest standard deviation). Furthermore, you can subsequently obtain estimates of SNP effect sizes via BLUP (Best Linear Unbiased Prediction).

However, if you have many samples (say, over 30,000), you should consider instead using HE regression (if the phenotype is quantitative) or PCGC regression (if the phenotype is binary). These two methods are less computationally intensive than REML (especially when there are multiple kinship matrices). Moreover, because they make fewer assumptions than REML (in particular, they do not assume effect sizes are Gaussian), they will tend to produce less-biased estimates. Furthermore, it has been shown that for binary traits, PCGC should be preferred to REML when covariates explain a substantial proportion of phenotype variation (for example, if analysing a disease for which sex is an important risk factor).
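The guidance above can be summarized as a small decision rule. The Python sketch below is only a restatement of that guidance; the function name suggest_method, its parameters, and the use of 30,000 as a hard cut-off are assumptions made for illustration (the text gives 30,000 as a rough figure, and does not quantify what counts as covariates explaining a substantial proportion of variation).

```python
def suggest_method(n_samples, binary, strong_covariates=False, large_n=30000):
    """Suggest an estimation method following the rough guidance in the text.

    n_samples         -- number of unrelated samples after quality control
    binary            -- True for a binary phenotype, False for a quantitative one
    strong_covariates -- True if covariates explain much of the phenotypic variation
    large_n           -- assumed threshold for "many samples" (hypothetical default)
    """
    if binary and strong_covariates:
        return "PCGC"                       # preferred to REML in this situation
    if n_samples > large_n:
        return "PCGC" if binary else "HE"   # cheaper alternatives to REML
    return "REML"                           # most precise estimates otherwise

print(suggest_method(10000, binary=False))
print(suggest_method(50000, binary=False))
print(suggest_method(50000, binary=True))
```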