Subset Options

Whenever analysing a binary phenotype, unless all samples have been genotyped together, we advise use of the subset options to guard against the dangers of differential genotyping.  A basic use is to instruct LDAK to calculate the LD (correlation) between each pair of SNPs separately for each subset (e.g., cases and controls), then use whichever value is higher.  For a fuller explanation of why we advise this and what this entails, see the explanation below. The two options are:

--subset-number <number> – states the number of subsets (normally these are separately genotyped subsets of samples).
--subset-prefix <prefix> – provides the prefix for the sample lists (for example, with prefix sub and two subsets, the lists are sub1 and sub2).

Note that we have found that using the subset options can slow down the calculation of weightings, so consider increasing the number of iterations or the runtime.

For example, considering the phenotype stored in phen_binary.pheno, suppose the cases and controls in the test dataset were genotyped separately. First make files sub1 and sub2, which contain the control samples (phenotype=1) and the case samples (phenotype=2), respectively. Then slightly edit the example commands when calculating weightings:

../ldak.out --cut-weights sectionsB --bfile test
../ldak.out --calc-weights sectionsB --section 1 --bfile test --subset-number 2 --subset-prefix sub
../ldak.out --calc-weights sectionsB --section 2 --bfile test --subset-number 2 --subset-prefix sub
../ldak.out --join-weights sectionsB
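
The sample lists sub1 and sub2 used in these commands could be made along the following lines. This is only a sketch: it assumes the phenotype file has three whitespace-separated columns (family ID, individual ID, phenotype) and that each sample list holds the two IDs, one sample per line. The file toy.pheno is a made-up stand-in for phen_binary.pheno, written to a scratch directory so that nothing real is overwritten.

```shell
# Work in a scratch directory so nothing real is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Toy stand-in for phen_binary.pheno (assumed columns: FID IID phenotype).
printf '%s\n' 'f1 i1 1' 'f2 i2 2' 'f3 i3 1' 'f4 i4 2' 'f5 i5 NA' > toy.pheno

# Controls (phenotype=1) go to sub1, cases (phenotype=2) go to sub2.
# In this sketch, samples with any other value (here NA) appear in neither list.
awk '$3 == 1 { print $1, $2 }' toy.pheno > sub1
awk '$3 == 2 { print $1, $2 }' toy.pheno > sub2

# Quick check of how many samples ended up in each list.
wc -l sub1 sub2
```

For a real analysis, replace toy.pheno with phen_binary.pheno and run the two awk commands in the directory where the weightings are calculated.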
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

The Explanation:

Quality control (QC) is very important in heritability analysis, more so than in association studies. In the latter, we care predominantly about preventing false positives, that is, spurious associations strong enough to achieve (genome-wide) significance. With heritability analysis, where the h2 estimate essentially represents the sum of associations across all SNPs, even small spurious associations, if sufficiently widespread, can accumulate to greatly inflate the estimate of h2.

In particular, heritability analysis can be very sensitive to genotyping errors. This is most relevant in case-control studies when cases or controls (or subsets thereof) have been genotyped separately, as then genotyping errors will almost certainly correlate with outcome and produce spurious associations. We have observed that our weightings can exaggerate the inflation caused by genotyping errors. Poorly genotyped SNPs will typically appear as low-LD SNPs, so will receive higher weightings, and thus be given more emphasis in the mixed model analysis than when using a non-weighted kinship matrix. Our recommended solution is to use the Subset Options to tell LDAK which samples are cases and which are controls. The first process in calculating weightings is to compute correlations between pairs of SNPs to assess the (local) levels of LD. The Subset Options will instruct LDAK to calculate these correlations first over cases, then over controls, then take forward the maximum value observed. In this way, even if a SNP has been poorly genotyped in one set of samples, leading to low correlations with neighbouring SNPs, provided it has been well genotyped in the other, the correct pattern of correlations should be recovered, and more appropriate weightings obtained. Although subsetting introduces an approximation, we have shown through simulation the effect of this approximation to be slight.
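
To illustrate the idea of taking forward the maximum value, here is a toy calculation for a single pair of neighbouring SNPs. It is not LDAK's own code, only a sketch under stated assumptions: the genotypes are invented allele counts, the file names are hypothetical, and the squared Pearson correlation is used as the measure of LD (the text above speaks only of correlations). The second SNP is noisy in subset 2, so its correlation there, and over the pooled sample, is low, while the per-subset maximum recovers the value seen in the well-genotyped subset.

```shell
# Work in a scratch directory so nothing real is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Toy genotypes: subset label, then allele counts (0/1/2) at two neighbouring SNPs.
# Subset 1 is well genotyped at both SNPs; in subset 2 the second SNP is noisy.
printf '%s\n' '1 0 0' '1 1 1' '1 2 2' '1 1 1' \
              '2 0 1' '2 1 0' '2 2 1' '2 1 2' > toy_geno.txt

awk '
# Accumulate sums for subset s (also called with "all" for the pooled sample).
function add(s, x, y) {
  n[s]++; sx[s] += x; sy[s] += y
  sxx[s] += x * x; syy[s] += y * y; sxy[s] += x * y
}
# Squared Pearson correlation from the accumulated sums (0 if a SNP has no variation).
function r2(s,   cxy, cxx, cyy) {
  cxy = sxy[s] - sx[s] * sy[s] / n[s]
  cxx = sxx[s] - sx[s] * sx[s] / n[s]
  cyy = syy[s] - sy[s] * sy[s] / n[s]
  return (cxx > 0 && cyy > 0) ? cxy * cxy / (cxx * cyy) : 0
}
{ add($1, $2, $3); add("all", $2, $3) }
END {
  # Take the highest per-subset value, skipping the pooled entry.
  best = 0
  for (s in n) {
    if (s == "all") continue
    if (r2(s) > best) best = r2(s)
  }
  printf "pooled %.2f\nsubset1 %.2f\nsubset2 %.2f\nmax %.2f\n", r2("all"), r2(1), r2(2), best
}' toy_geno.txt > toy_r2.txt

cat toy_r2.txt
```

With these invented genotypes the pooled value is 0.25, the per-subset values are 1.00 and 0.00, and the maximum is 1.00, so the pair would be treated as being in high LD rather than low LD.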

If samples have been genotyped in more than two sets, then more than two subsets can be used. However, normally this will not be necessary, as it is differential genotyping between cases and controls that we care most about, because this will directly affect the heritability estimation. Similarly, subsetting could be used with a continuous phenotype, again telling LDAK which subsets of samples were genotyped separately. However, this will only be necessary in extreme circumstances, for example, if higher phenotype samples have been genotyped separately from lower phenotype samples, as otherwise it will be unlikely that differential genotyping will correlate with outcome.

Note: even when using Subset Options, very strict quality control remains necessary. The features on this page try to prevent the weightings exaggerating inflation of heritability estimates caused by genotyping errors, but all steps should be taken to minimise the genotyping errors in the first place.