There are two main reasons to thin predictors. The first is when constructing a kinship matrix in order to assess population structure or relatedness (see Quality Control for more details). Here, we recommend thinning strongly, in order to obtain a subset of predictors in approximate linkage equilibrium (e.g., so that no two predictors within 1cM have squared correlation above 0.05). This is because we are interested in genome-wide correlations between predictors (caused by population structure and relatedness), and not local correlations (caused by linkage disequilibrium).
The second reason for thinning predictors is when wishing to implement the LDAK-Thin Model, our recommended Heritability Model when analysing individual-level or non-human data. Here we perform a light thinning, with the aim only of ensuring no duplicate predictors remain (e.g., so that there are no predictors within 100kb with squared correlation above 0.98).
Note that if you would like to thin only the significant predictors (e.g., when processing the results from Association Testing), you should instead use Clumping.
Always read the screen output, which suggests arguments and estimates memory usage.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
The main argument is --thin <output>.
This requires the options
--bfile/--gen/--sp/--speed <datastem> - to specify the genetic data files (see File Formats).
--window-prune <float> - to specify the correlation squared threshold.
--window-cm <float>, --window-kb <float> or --window-length <integer> - to specify the window size (how far to search for correlated predictors, where the units are centiMorgans, kilobases or number of predictors, respectively). Note that --window-length -1 will tell LDAK to consider all predictors on the same chromosome.
By default, when LDAK finds a highly correlated pair of predictors, it will remove one of the pair at random. However, if you use the option --pvalues <pvalues>, then LDAK will instead exclude the less significant predictor of the pair (this option is used when Clumping).
The lists of retained and discarded predictors are saved in the files <output>.in and <output>.out, respectively.
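To make the roles of the threshold and window concrete, here is a minimal Python sketch of greedy, window-based thinning on a single chromosome. It is an illustration only and is not LDAK's implementation: the function names (squared_correlation, thin), the in-memory data layout, and the order in which pairs are visited are all assumptions made for this example. The prune argument plays the role of --window-prune, window plays the role of the window size, and pvalues mimics the effect of --pvalues.

```python
import random


def squared_correlation(x, y):
    """Return the squared Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant predictor is treated as uncorrelated
    return (sxy * sxy) / (sxx * syy)


def thin(columns, positions, prune, window, pvalues=None, seed=None):
    """Greedy thinning of predictors on one chromosome (hypothetical helper).

    columns   -- one list of genotype values per predictor, sorted by position
    positions -- position of each predictor, in the same units as `window`
    prune     -- squared-correlation threshold
    window    -- maximum distance between predictors that are compared
    pvalues   -- optional p-values; if given, the less significant predictor
                 of a correlated pair is dropped instead of a random one
    Returns the indices of the retained predictors.
    """
    rng = random.Random(seed)
    keep = [True] * len(columns)
    for i in range(len(columns)):
        if not keep[i]:
            continue
        for j in range(i + 1, len(columns)):
            if positions[j] - positions[i] > window:
                break  # sorted by position, so later predictors are also out of range
            if not keep[j]:
                continue
            if squared_correlation(columns[i], columns[j]) <= prune:
                continue
            if pvalues is not None:
                drop = i if pvalues[i] > pvalues[j] else j
            else:
                drop = rng.choice((i, j))
            keep[drop] = False
            if drop == i:
                break  # predictor i has been removed, so stop comparing against it
    return [k for k, kept in enumerate(keep) if kept]
```

In this sketch, a high threshold such as 0.98 removes only near-duplicate predictors, whereas a low threshold such as 0.05 also removes moderately correlated ones, and predictors further apart than the window are never compared.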
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
When thinning in order to identify a subset of predictors in approximate linkage equilibrium, we recommend using --window-prune 0.05 and --window-cm 1. However, you should change these values if too few or too many predictors remain (when analysing human SNP data, we usually aim for between 50,000 and 100,000 SNPs). When thinning in order to implement the LDAK-Thin Model, we recommend using --window-prune 0.98 and --window-kb 100. While this will likely fail to capture a few duplicate predictors (those more than 100kb apart), the impact of this on subsequent analyses is usually negligible.
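One way to check whether a strong thinning has left a suitable number of predictors is to count the entries in <output>.in. The Python sketch below assumes that this file contains one predictor name per line, which is an assumption of the example. The helper names (count_predictors, suggest_adjustment) are hypothetical, and the default range simply restates the 50,000 to 100,000 target mentioned above for human SNP data.

```python
def count_predictors(lines):
    """Count non-empty lines, assuming one predictor name per line."""
    return sum(1 for line in lines if line.strip())


def suggest_adjustment(n_retained, low=50_000, high=100_000):
    """Compare a retained-predictor count with a target range."""
    if n_retained < low:
        # Thinning was too aggressive: relax the threshold or narrow the search.
        return "too few: try a higher --window-prune or a smaller window"
    if n_retained > high:
        # Thinning was too lenient: tighten the threshold or widen the search.
        return "too many: try a lower --window-prune or a larger window"
    return "within target range"
```

For example, after thinning with --thin le, you could open le.in, pass the file handle to count_predictors, and pass the resulting count to suggest_adjustment. The suggested directions follow from how the two settings act: a higher threshold or a smaller window results in fewer predictors being removed.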
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
Example:
Here we use the binary PLINK files human.bed, human.bim and human.fam from the Test Datasets.
To thin to obtain a subset of predictors in approximate linkage equilibrium, use
../ldak.out --thin le --bfile human --window-prune 0.05 --window-cm 1
The list of predictors that remain after thinning will be saved in le.in.
To thin duplicate predictors in preparation for using the LDAK-Thin Model, use
../ldak.out --thin thin --bfile human --window-prune 0.98 --window-kb 100
The list of predictors that remain after thinning will be saved in thin.in.