Quality Control

Prior to performing heritability analysis, it is necessary to perform very careful quality control of samples and predictors. This is because heritability analyses tend to be very sensitive to genotyping errors. Estimates of heritability represent the sum of the contributions of many predictors, so even if the error at individual predictors is small, this can accumulate across predictors to produce a large overall error. Further, estimates of heritability are sensitive to inflation due to population structure and familial relatedness; the former can result in correlations between predictors and the phenotype that are due to ancestry, not causality, while the latter leads to long-range linkage disequilibrium (meaning that we can no longer be confident that the heritability estimates reflect only the heritability contributed by the predictors in the dataset).

Quality control is also important when constructing prediction models. Genotyping errors, population structure and familial relatedness can all bias estimates of predictor effect sizes, meaning that the model will under perform when used to predict phenotypes for an independent set of samples.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Here we provide some guidelines for performing quality control. You can obtain many of the required metrics (i.e., allele frequencies and missing rates) by Calculating Statistics. Then we provide ways to test whether the quality control steps are sufficient. You should be aware that the thresholds mentioned below are generally suggestions, and you should check that they are sensible for your data. For example, when analysing the UK Biobank data, our information score threshold was 0.95. However, if in your data, almost all (very few) predictors exceed this threshold, then you should consider increasing (reducing) it. For some good advice on how to decide thresholds, see Mike Weale's Chapter on Quality Control.

Note that the example below shows that the Test Datasets contain ancestral outliers, as well as related samples. Ideally, the outliers should be excluded from all subsequent analyses, while the related samples should be excluded when estimating heritabilities (e.g., using REML, Haseman-Elston or PCGC Regression). However, for simplicity, most examples on this website instead use all samples (i.e., assume the data have already been subjected to quality control).
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Filtering of samples:

Our aim is to obtain a set of ancestrally-homogeneous, unrelated samples. Here we explain how we did this when analysing UK Biobank data for our paper Evaluating and improving heritabilty models using summary statistics (Nature Genetics, 2020).

We first identified SNPs in the UK Biobank data that were common (MAF>0.01), high-quality (information score >0.95 and missingness <0.01) and autosomal (not located on a sex chromosome). We then calculated a kinship matrix using the 1000 Genomes Project data (restricting to the common, high-quality, autosomal UK Biobank SNPs). When Calculating Kinships, we assumed a thinned version of the GCTA Model, that uses only SNPs in approximate linkage equilibrium (see the example below). We then obtained the two principal axes of this matrix, and the corresponding SNP loadings (see Principal Component Analysis or the example below).

Next we projected the UK Biobank data onto these two axes (see Calculating Scores or the example below). This enabled us to identify the UK Biobank samples whose genotypes were consistent with those recorded as being "White British" (Supplementary Note 5 shows the projections and explains how we defined consistent). Using these samples, we created a UK Biobank kinship matrix, restricting to the same SNPs as above, and again assuming a thinned version of the GCTA Model. Finally, using this second kinship matrix, we removed UK Biobank samples until no pair remained with kinship greater than the absolute value of the smallest observed (this is the default when Filtering Relatedness).
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Filtering of predictors:

Most heritability analyses are designed for common predictors (typically MAF>0.01 or MAF>0.005), as it is not currently clear how best to include rare variants. We also generally restrict to common predictors when constructing prediction models, as it is very difficult to accurately estimate effect sizes for rare predictors. Further, both heritability and prediction analyses will usually focus on autosomal predictors, as it is not straightforward to include sex chromosomes.

Our preferred measure of SNP quality is the information score from imputation, which reflects how closely a SNP's genotypes match those expected given the neighbouring SNPs. We typically exclude SNPs with information score below 0.95 or 0.99. While it is reasonable to also filter SNPs based on other metrics, such as missingness or Hardy-Weinberg Equilibrium, in general we find this is not necessary (at least for common SNPs). Note that some imputation software, such as IMPUTE2, compute multiple information scores, in which case we would filter based on all scores. Alternatively, given genotype probabilities, LDAK is able to compute its own information score, that represents the expected correlation between the recorded genotypes and the true genotypes (see Make Data).
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Testing for inflation due to population structure and familial relatedness:

For each phenotype, we advise computing T1=(h2A+h2B+h2C+h2D-h2SNP)/3, where h2A, h2B, h2C and h2D are estimates of heritability from quarters of the genome, while h2SNP is an estimate from the whole genome. For example, if analysing human data, h2A, h2B, h2C and h2D could be estimates of heritability from Chromosomes 1-3, 4-7, 8-12 and 13-22, respectively.

If there is no population structure and the samples are unrelated, then the sum of estimates from each quarter should equal the estimate from the whole genome, and as a result, T1 will be close to zero. By contrast, if there is (substantial) structure or relatedness, then SNPs will be correlated across chromosomes. Therefore, each of h2A, h2B, h2C and h2D will capture not just the heritability contributed by SNPs in the corresponding quarter, but also a fraction of heritability contributed by SNPs in other quarters, and thus their sum will be greater than the heritability contributed by all SNPs.

The estimates of h2A, h2B, h2C, h2D and h2SNP can be obtained using either REML or Haseman-Elston Regression. To test whether T1 is significantly greater than zero, it is necessary to estimate its distribution. We do this empirically by repeatedly sampling h2A, h2B, h2C, h2D and h2SNP from their estimated distributions. Here is some sample R code :

#eSNP, eA, eB, eC, eD are the five estimates of heritability
#sSNP, sA, sB, sC, sD are the five corresponding standard deviations

N=100000
rSNP=rnorm(N,eSNP,sSNP);
rA=rnorm(N,eA,sA);
rB=rnorm(N,eB,sB);
rC=rnorm(N,eC,sC);
rD=rnorm(N,eD,sD);

T1samp=(rA+rB+rC+rD-rSNP)/3;
pvalue=mean(T1samp<0);
pvalue;

We find that estimates of T1 are not sensitive to the choice of Heritability Model, and therefore, it is easiest to assume the LDAK-Thin Model.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Testing for inflation due to genotyping errors:

This test is only possible when samples have been genotyped in distinct batches.

For each phenotype, we advise computing T2=h2Same-h2Diff, where h2Same and h2Diff are estimates of SNP heritability calculated, respectively, across pairs of samples in the same genotyping batch, and across pairs of samples in different genotyping batches. Note that if the phenotype is binary, then each batch must contain both cases and controls.

Genotyping errors will cause pairs of samples in the same batch to appear more genetically similar than they actually are, and pairs of samples in different batches to appear less genetically similar. This is problematic when pairs of samples in the same batch are more phenotypically similar than pairs of samples in different batches. In this case, h2Same will tend to be higher than h2Diff, and therefore T2 will be positive.

The estimates of h2Same and h2Diff can be obtained by adding --subset-number <integer> and --subset-prefix <subprefix> when performing Haseman Elston or PCGC Regression (see Subset Options for more details). As well as calculating T2, LDAK will also report whether it is significantly different to zero.

We have found that estimates of T2 are not sensitive to the choice of Heritability Model, and therefore, it is easiest to assume the LDAK-Thin Model.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Example:

Here we use the binary PLINK files human.bed, human.bim, human.fam, hapmap.bed, hapmap.bim and hapmap.fam, the phenotype quant.pheno, and the lists of samples ind1 and ind2 from the Test Datasets. We use the genetic data with prefix human as our main dataset (the one on which we perform quality control), while the data with prefix hapmap is a global dataset (which we use to infer ancestry).

Note that in this example, we only perform sample quality control. For real data, you should also perform predictor quality control (as described above).
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

1 - Filter samples based on population structure.

First we make a kinship matrix using the global dataset, considering only SNPs present in the main dataset. When calculating kinships, we assume a thinned version of the GCTA Model, that restricts to SNPs in approximate linkage equilibrium.

awk < human.bim '{print $2}' > human.snps
./ldak.out --thin hapmap --bfile hapmap --window-prune .05 --window-cm 1 --extract human.snps
./ldak.out --calc-kins-direct hapmap --bfile hapmap --ignore-weights YES --power -1 --extract hapmap.in

The kinship matrix is saved with stem hapmap. Next we perform principal component analysis, and obtain SNP loadings for the first two axes.

./ldak.out --pca hapmap --grm hapmap --axes 20
./ldak.out --calc-pca-loads hapmap --pcastem hapmap --grm hapmap --bfile hapmap

The loadings are saved in the file hapmap.load. Finally, we project the main dataset onto these loadings

./ldak.out --calc-scores hapmap --scorefile hapmap.load --bfile human --power 0

The projections are saved in the file hapmap.profile. Click here to see a plot of the first two axes. Note that the lack of separation between ancestry clusters reflects that this a toy example, with very few SNPs. Nonetheless, there are ten clear outliers in the main dataset, that we can identify with the following command

awk < hapmap.profile '(NR>1 && ($5>.002||$7<-.01)){print $1, $2}' > outliers.ind

We will exclude the samples in outliers.ind from future analyses.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

2 - Filter samples based on relatedness.

We now make a second kinship matrix, the same as above, except using the main dataset (excluding the outlier samples)

./ldak.out --calc-kins-direct human --bfile human --ignore-weights YES --power -1 --extract hapmap.in --remove outliers.ind

We use this to filter based on relatedness.

./ldak.out --filter human --grm human

LDAK filters so that no pair of samples remains with estimated kinship greater than 0.23 (because the smallest observed kinship was -0.23). Note that for proper-sized datasets, this value will be much closer to zero (e.g., for the UK Biobank data, the smallest observed kinshp was -0.02). The files human.keep and human.lose list the 403 samples that were kept, and the 11 samples that were lost. When performing heritability analyses, we should use only the samples in human.keep.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

3 - Test for inflation due to structure and relatedness

Usually, we would create kinship matrices corresponding to quarters of the genome (e.g., Chromosomes 1-3, 4-7, 8-12 and 13-22), then compare the sum of estimates of heritability from each quarter (obtained by regressing the phenotype on each kinship matrix separately), to the estimate of heritability from all SNPs (obtained by regressing the phenotype on all four kinship matrices simultaneously). However, in this toy example, we only have data for Chromosomes 21 & 22, so instead we use only two kinship matrices, one for each chromosome. We will create kinship matrices assuming the LDAK-Thin Model. Therefore, we first create a weights file that gives weight one to the predictors that remain after thinning for duplicates

./ldak.out --thin thin.clean --bfile human --window-prune .98 --window-kb 100 --remove outliers.ind
awk < thin.clean.in '{print $1, 1}' > weights.thin.clean

Next we make kinship matrices for each chromosome

./ldak.out --calc-kins-direct chr21.clean --bfile human --weights weights.thin.clean --power -.25 --chr 21 --remove outliers.ind
./ldak.out --calc-kins-direct chr22.clean --bfile human --weights weights.thin.clean --power -.25 --chr 22 --remove outliers.ind

We estimate the heritability of each chromosome separately (using only the samples in human.keep)

./ldak.out --reml chr21.clean --grm chr21.clean --pheno quant.pheno --keep human.keep
./ldak.out --reml chr22.clean --grm chr22.clean --pheno quant.pheno --keep human.keep

Then we estimate the heritability of both chromosomes together

echo "chr21.clean
chr22.clean" > both.txt
./ldak.out --reml both --mgrm both.txt --pheno quant.pheno --keep human.keep

The heritability estimates are saved in the files chr21.reml, chr22.reml and chrboth.reml. We see that the sum of the estimates of heritability from Chromosomes 21 & 22 are 0.41 (SD 0.08) and 0.25 (SD 0.08), while the estimate from both chromosomes is 0.59 (SD 0.09). Therefore, the estimate of inflation is 0.40+0.25-0.59=0.07 (note that here we do not divide by three, as we are only using two kinship matrices). Using the R code above, we find this is not significantly greater than zero (P=0.3).
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

4 - Test for inflation due to genotyping errors

We first compute a kinship matrix across all SNPs (as above, assuming the LDAK-Thin Model)

./ldak.out --thin thin.clean --bfile human --window-prune .98 --window-kb 100 --remove outliers.ind
awk < thin.clean.in '{print $1, 1}' > weights.thin.clean

./ldak.out --calc-kins-direct LDAK-Thin.clean --bfile human --weights weights.thin.clean --power -.25 --remove outliers.ind

For this example, we suppose that the samples in the file ind1 were genotyped separately to those in the file ind2. Therefore, we provide these subsets when estimating SNP heritability using HE Regression.

./ldak.out --he batch --pheno quant.pheno --grm LDAK-Thin.clean --subset-prefix ind --subset-number 2 --keep human.keep

The file batch.he shows that the estimate of the heritability contributed by the kinship matrix (calculated using all samples) is 0.53 (SD 0.15). The files batch.he.within and batch.he.across show that the estimated heritability is instead 0.58 (SD 0.19) and 0.49 (SD 0.20) if calculated using only samples in the same or in different cohorts. batch.he.compare shows that the difference between these two estimates is not significant (P=0.7).