Subset Options

When analysing a binary phenotype, unless all samples have been genotyped together, we advise using the subset options to guard against the dangers of differential genotyping.  A basic use is to instruct LDAK to calculate the SNP-SNP correlations separately for each genotyping batch and then, for each pair of SNPs, use only the highest recorded correlation.  For a fuller account of why we advise this and what it entails, see the explanation below. The two options are:

--subset-number <number> - states the number of subsets (normally these are separately genotyped subsets of samples).
--subset-prefix <prefix> - specifies the prefix for the sample lists.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Example: for this we use the binary PLINK files test.bed, test.bim and test.fam available in the Test Datasets, as well as the binary phenotype phen_binary.pheno. Suppose the cases and controls in the test dataset were genotyped separately. First make the files sub1 and sub2, which list the controls (phenotype=1) and the cases (phenotype=2) respectively:

awk < phen_binary.pheno '($3==1){print $1, $2}' > sub1
awk < phen_binary.pheno '($3==2){print $1, $2}' > sub2

Then slightly edit the example provided in Get weightings:

../ldak.out --cut-weights subtions --bfile test
../ldak.out --calc-weights-all subtions --bfile test --subset-number 2 --subset-prefix sub

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

The Explanation:

Quality control (QC) is very important in heritability analysis, more so than in association studies. In the latter, we care predominantly about preventing false positives (spurious associations strong enough to achieve genome-wide significance). In heritability analysis, the h2 estimate essentially represents the sum of associations across all SNPs, so even small spurious associations, if sufficiently widespread, can accumulate to greatly inflate the estimate of h2.

In particular, heritability analysis can be very sensitive to genotyping errors. This is most relevant in case-control studies when cases or controls (or subsets thereof) have been genotyped separately, as then genotyping errors will almost certainly correlate with outcome and produce spurious associations.  We have observed that our weightings can exaggerate the inflation caused by genotyping errors. Poorly genotyped SNPs will typically appear as low-LD SNPs, so will receive higher weightings, and thus be given more emphasis in the mixed model analysis than when using a non-weighted kinship matrix. Our recommended solution is to use the Subset Options to tell LDAK which batches of samples were genotyped together. The first step when calculating weightings is to compute correlations between pairs of SNPs to assess the (local) levels of LD. The Subset Options instruct LDAK to calculate these correlations separately for each batch of samples, then take forward the maximum value observed. In this way, even if a SNP has been poorly genotyped in one batch, leading to low correlations with neighbouring SNPs, provided it has been well genotyped in another, the correct pattern of correlations should be recovered. Although subsetting introduces an approximation, we have shown through simulation that the effect of this approximation is slight.
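
The per-batch maximum described above can be illustrated with a small sketch. This is not LDAK's implementation: the function name max_batch_correlations, the use of squared Pearson correlations, and the simulated data are all assumptions made for illustration. Two SNPs are simulated in perfect LD, and the second SNP is then replaced by random calls in the second batch to mimic poor genotyping.

```python
import numpy as np

def max_batch_correlations(genotypes, batches):
    """Illustrative only (hypothetical helper, not part of LDAK).

    genotypes: (n_samples, n_snps) array of allele counts.
    batches:   list of index arrays, one per genotyping batch.
    Returns the element-wise maximum, over batches, of the squared
    SNP-SNP correlation matrix.
    """
    n_snps = genotypes.shape[1]
    best = np.zeros((n_snps, n_snps))
    for idx in batches:
        # SNPs that are constant within a batch give NaN; treat as 0
        with np.errstate(invalid="ignore", divide="ignore"):
            r = np.corrcoef(genotypes[idx], rowvar=False)
        best = np.maximum(best, np.nan_to_num(r ** 2))
    return best

# Simulated data (assumption): 400 samples, two SNPs in perfect LD
rng = np.random.default_rng(0)
snp_a = rng.integers(0, 3, size=400).astype(float)
snp_b = snp_a.copy()
# Batch 2 (samples 200-399): snp_b is poorly genotyped (random calls)
snp_b[200:] = rng.integers(0, 3, size=200)

G = np.column_stack([snp_a, snp_b])
batches = [np.arange(0, 200), np.arange(200, 400)]

pooled_r2 = np.corrcoef(G, rowvar=False)[0, 1] ** 2
recovered_r2 = max_batch_correlations(G, batches)[0, 1]
print("pooled r2:", pooled_r2)
print("max over batches r2:", recovered_r2)
```

In this toy setting the pooled correlation is diluted by the badly genotyped batch, whereas the per-batch maximum takes its value from the well-genotyped batch, so the second SNP no longer looks like a low-LD SNP.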

Note that even when using Subset Options, very strict quality control remains necessary. The features on this page try to prevent the weightings exaggerating inflation of heritability estimates caused by genotyping errors, but all steps should be taken to minimise the genotyping errors in the first place.