Thin Predictors

There are two main reasons to thin predictors. The first is when constructing a kinship matrix to assess population structure or relatedness (see Quality Control for more details). Here, we recommend performing a strong thinning, in order to obtain a subset of predictors in approximate linkage equilibrium (e.g., so that there are no predictors within 1cM with squared correlation above 0.05). This is because we are interested in genome-wide correlations between predictors (caused by population structure and relatedness), and not local correlations (caused by linkage disequilibrium).

The second reason for thinning predictors is to implement the LDAK-Thin Model, our recommended Heritability Model when analysing individual-level or non-human data. Here we perform a light thinning, with the aim only of ensuring no duplicate predictors remain (e.g., so that there are no predictors within 100kb with squared correlation above 0.98).

Always read the screen output, which suggests arguments and estimates memory usage.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

The main argument is --thin <output>.

With this, you must provide

--bfile/--gen/--sp/--speed <datastem> - to specify the genetic data files (see File Formats).

--window-prune <float> - to specify the correlation squared threshold.

--window-cm <float>, --window-kb <float> or --window-length <integer> - to specify the window size (how far to search for correlated predictors).

When thinning in order to identify a subset of predictors in approximate linkage equilibrium, we recommend using --window-prune 0.05 and --window-cm 1. However, you should change these values if too few or too many predictors remain (when analysing human SNP data, we usually aim for between 50,000 and 100,000 SNPs). When thinning in order to implement the LDAK-Thin Model, we recommend using --window-prune 0.98 and --window-kb 100. While this will likely fail to capture a few duplicate predictors (those more than 100kb apart), the impact of this on subsequent analyses is usually negligible.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Example:

Here we use the binary PLINK files human.bed, human.bim and human.fam from the Test Datasets.

To thin to obtain a subset of predictors in approximate linkage equilibrium, use

../ldak.out --thin le --bfile human --window-prune 0.05 --window-cm 1

The list of predictors that remain after thinning will be saved in le.in.
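
To judge whether too few or too many predictors remain, it can help to count the entries in the .in file. The following is a minimal sketch, assuming the file lists one predictor per line; the function names count_kept and check_kept are illustrative and not part of LDAK.

```shell
# Count the predictors kept by thinning (assumes one predictor per line).
count_kept() {
  # $1 = path to the .in file; blank lines are ignored
  awk 'NF > 0 { n++ } END { print n + 0 }' "$1"
}

# Compare the count against a target range and suggest a direction to adjust.
check_kept() {
  # $1 = .in file, $2 = lower bound, $3 = upper bound
  n=$(count_kept "$1")
  if [ "$n" -lt "$2" ]; then
    echo "$n predictors kept: too few, consider raising --window-prune"
  elif [ "$n" -gt "$3" ]; then
    echo "$n predictors kept: too many, consider lowering --window-prune"
  else
    echo "$n predictors kept: within target range"
  fi
}
```

For example, check_kept le.in 50000 100000 applies the range suggested above for human SNP data. The bounds are placeholders to adjust for your own dataset; a small test dataset may not reach the lower bound at any threshold.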

To thin duplicate predictors in preparation for using the LDAK-Thin Model, use

../ldak.out --thin thin --bfile human --window-prune 0.98 --window-kb 100

The list of predictors that remain after thinning will be saved in thin.in.
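
A later step that uses the thinned predictors may need them in a different layout. The sketch below assumes (this page does not state it) that the downstream step accepts a two-column weights file holding a predictor name and a weight, and that each retained predictor gets weight one. The function name make_unit_weights and the output name weights.thin are illustrative; check the Heritability Model documentation for the exact format expected.

```shell
# Write a two-column file "<predictor> 1" for every predictor in a .in file.
# Assumption: the downstream step wants one name and one weight per line.
make_unit_weights() {
  # $1 = .in file produced by --thin, $2 = output weights file
  awk 'NF > 0 { print $1, 1 }' "$1" > "$2"
}
```

For example, make_unit_weights thin.in weights.thin writes one line per retained predictor.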