Sample Subsets

When samples originate from different cohorts, sample subsets specify which samples belong to each cohort. Their original purpose was to safeguard against genotyping errors when calculating LDAK Weightings (for details, see below). They can now also be used when estimating heritability using Haseman-Elston or PCGC Regression, in which case LDAK will additionally estimate heritability using only pairs of samples in the same cohort, and using only pairs of samples in different cohorts.

To provide sample subsets, use --subset-number <number> to specify the number of subsets, and --subset-prefix <subprefix> to specify the prefix for the sample lists. In the example below, the prefix ind with two subsets refers to the sample lists ind1 and ind2.
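
As a rough sketch of how such lists might be prepared, suppose a hypothetical file cohorts.txt holds one line per sample with a family ID, an individual ID and a cohort label (1 or 2). The file name, its layout, the prefix cohort, and the two-column (family ID, individual ID) layout of the output lists are all assumptions made for illustration; check them against your own data and the LDAK file format documentation.

```shell
# Work in a scratch directory so no existing sample lists are overwritten.
workdir=$(mktemp -d)

# Hypothetical input: one line per sample with columns FID IID COHORT.
# Toy data so the sketch runs on its own; replace with your real cohort labels.
printf '%s\n' 'F1 S1 1' 'F2 S2 2' 'F3 S3 1' 'F4 S4 2' 'F5 S5 2' > "$workdir/cohorts.txt"

# Write the two ID columns of each sample to <prefix><COHORT>,
# giving the files cohort1 and cohort2.
awk -v prefix="$workdir/cohort" '{ print $1, $2 > (prefix $3) }' "$workdir/cohorts.txt"

# Each sample should appear in exactly one list.
wc -l "$workdir/cohort1" "$workdir/cohort2"
```

Under these assumptions, the resulting lists would be supplied with --subset-prefix cohort --subset-number 2.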
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Example:

Here we use the binary PLINK files human.bed, human.bim and human.fam, the phenotypes quant.pheno and binary.pheno, and the lists of samples ind1 and ind2 from the Test Datasets. We also use the kinship matrix with stem HumDef created in the example for Calculate Kinships.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

To allow for sample subsets when calculating weightings, use the commands

./ldak.out --cut-weights sections2 --bfile human
./ldak.out --calc-weights-all sections2 --bfile human --subset-number 2 --subset-prefix ind

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

To allow for sample subsets when estimating SNP heritability for quant.pheno using Haseman-Elston Regression, run

./ldak.out --he he4 --pheno quant.pheno --grm HumDef --subset-prefix ind --subset-number 2

To allow for sample subsets when estimating SNP heritability for binary.pheno using PCGC Regression, run

./ldak.out --pcgc pcgc4 --pheno binary.pheno --grm HumDef --prevalence .01 --subset-prefix ind --subset-number 2
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Explanation:

Quality control is very important in heritability analysis, more so than in association studies. In the latter, we care predominantly about preventing false positives (spurious associations strong enough to achieve genome-wide significance). In heritability analysis, each estimate in effect measures the total association across a large number of SNPs, so even small spurious associations, if sufficiently widespread, can accumulate to greatly inflate the estimates.

In particular, heritability analysis can be very sensitive to genotyping errors. This is most relevant in case-control studies when cases or controls (or subsets thereof) have been genotyped separately, as then genotyping errors will almost certainly correlate with outcome and produce spurious associations. We have observed that the LDAK weightings can exaggerate the inflation caused by genotyping errors. Poorly genotyped SNPs will typically appear as low-LD SNPs, so they will receive higher weight, and thus be given more emphasis than when using a non-weighted kinship matrix.

Our recommended solution is to use the sample subsets to tell LDAK which batches of samples were genotyped together. The first step when calculating weightings is to compute correlations between pairs of SNPs to assess (local) levels of LD. When provided with sample subsets, LDAK will calculate these correlations for each batch of samples separately, then take forward the maximum value observed. In this way, even if a SNP has been poorly genotyped in one batch, leading to low correlations with neighbouring SNPs, provided it has been well genotyped in another, the correct pattern of correlations should be recovered. Although using sample subsets introduces an approximation, we have found in simulations that the effect of this approximation is slight.
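
The idea of taking the maximum correlation across batches can be illustrated with a small sketch. This is not LDAK's implementation: it assumes the squared Pearson correlation as the measure of LD (the text above says only "correlations"), and the toy genotypes, in which the second SNP is well genotyped in batch 1 but noisy in batch 2, are invented for illustration.

```shell
# Toy data (hypothetical): columns are batch, genotype at SNP A, genotype at SNP B.
# SNP B tracks SNP A in batch 1 but is noisy in batch 2.
data='1 0 0
1 1 1
1 2 2
1 1 1
1 0 0
1 2 1
2 0 2
2 1 0
2 2 1
2 1 1
2 0 0
2 2 1'

result=$(printf '%s\n' "$data" | awk '
# Squared Pearson correlation from running sums (assumed LD measure).
function r2(m, px, py, pxx, pyy, pxy,    num, den) {
    num = m * pxy - px * py
    den = (m * pxx - px * px) * (m * pyy - py * py)
    return (den > 0) ? num * num / den : 0
}
{
    b = $1; x = $2; y = $3
    # Per-batch sums.
    n[b]++; sx[b] += x; sy[b] += y
    sxx[b] += x * x; syy[b] += y * y; sxy[b] += x * y
    # Pooled sums over all samples, for comparison.
    N++; SX += x; SY += y
    SXX += x * x; SYY += y * y; SXY += x * y
}
END {
    # Take forward the largest per-batch value.
    max = 0
    for (b in n) {
        v = r2(n[b], sx[b], sy[b], sxx[b], syy[b], sxy[b])
        if (v > max) max = v
    }
    printf "pooled %.3f\nmax %.3f\n", r2(N, SX, SY, SXX, SYY, SXY), max
}')
echo "$result"
```

With this toy data, the pooled squared correlation is about 0.20, whereas the maximum across batches is about 0.79, the value from the batch in which both SNPs were well genotyped.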

Note that even when using sample subsets, very strict quality control remains necessary. Using subsets when calculating weightings will prevent exaggeration of the inflation of heritability estimates caused by genotyping errors, but all steps should be taken to minimise the genotyping errors in the first place.