When performing a Single-Predictor Analysis, you will typically find clumps of significant predictors. This is because nearby predictors tend to be highly correlated due to linkage disequilibrium. The same is true, albeit to a lesser degree, when performing a multi-predictor analysis (i.e., a Gene-Based Analysis or LDAK-GBAT). Clumping can then be used to decide which of the significant loci in each clump are most likely to be causal, and to estimate the total number of causal loci.
Note that when clumping, it is necessary to specify the window size and the pruning threshold. There is no consensus regarding the best choices for these (different GWAS use different values). When clumping the results of a single-predictor analysis, we recommend that you filter so that no pair of predictors within 1cM has squared correlation greater than 0.05 (if genetic distances are not available, you can use 1000kb instead of 1cM). Meanwhile, when clumping the results of a gene-based analysis, we recommend that you filter so that no pair of genes on the same chromosome has squared correlation greater than 0.1.
Always read the screen output, which suggests arguments and estimates memory usage.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
The main argument is --thin-tops <output>
This requires the options
--bfile/--gen/--sp/--speed <datastem> - to specify the genetic data files (see File Formats). Note that when clumping results from a gene-based analysis, you will have to add --SNP-data NO (else LDAK will complain that predictor values are not between 0 and 2).
--window-prune <float> - to specify the correlation squared threshold.
--window-cm <float>, --window-kb <float> or --window-length <integer> - to specify the window size (how far to search for correlated predictors, where the units are centiMorgans, kilobase or number of predictors, respectively). Note that --window-length -1 will tell LDAK to consider all predictors on the same chromosome.
--pvalues <pvalues> - to provide p-values for each predictor (the file <pvalues> should have two columns, providing predictor names then p-values). When LDAK finds two highly correlated predictors, it will discard the one with the higher p-value.
--cutoff <float> - to provide the p-value threshold (LDAK will only consider predictors with p-values below this threshold).
The lists of retained and discarded predictors are saved in the files <output>.in and <output>.out, respectively.
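The way these options interact can be illustrated with a small greedy procedure. The Python sketch below is only an illustration of the general idea of clumping (consider predictors below the cutoff, visit them from most to least significant, and discard any that is too strongly correlated with an already retained predictor inside the window). It is not LDAK's implementation; the function names, the in-memory data layout and the toy values are all assumptions made for the example.

```python
from math import fsum


def squared_correlation(x, y):
    """Squared Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = fsum(x) / n, fsum(y) / n
    sxy = fsum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = fsum((a - mean_x) ** 2 for a in x)
    syy = fsum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant predictor is treated as uncorrelated
    return sxy * sxy / (sxx * syy)


def greedy_clump(predictors, positions, pvalues, cutoff, prune, window):
    """Hypothetical greedy clumping (an assumption, not LDAK's code).

    predictors: list of per-predictor value lists (one value per sample)
    positions:  per-predictor positions, in the same units as window
    pvalues:    per-predictor p-values
    Returns (kept, discarded) lists of predictor indices. Predictors with
    p-values at or above the cutoff are ignored and appear in neither list.
    """
    candidates = [j for j, p in enumerate(pvalues) if p < cutoff]
    candidates.sort(key=lambda j: pvalues[j])  # most significant first
    kept, discarded = [], []
    for j in candidates:
        clash = any(
            abs(positions[j] - positions[k]) <= window
            and squared_correlation(predictors[j], predictors[k]) > prune
            for k in kept
        )
        (discarded if clash else kept).append(j)
    return kept, discarded


# Toy data: predictor 1 is almost a copy of predictor 0, while
# predictor 2 is uncorrelated with predictor 0.
toy_predictors = [
    [0, 0, 2, 2, 0, 0, 2, 2],
    [0, 0, 2, 2, 0, 0, 2, 1],
    [0, 2, 0, 2, 0, 2, 0, 2],
    [2, 2, 0, 0, 1, 1, 0, 0],
]
toy_positions = [100, 150, 900, 950]      # in kb
toy_pvalues = [1e-10, 1e-9, 1e-8, 0.5]

kept, discarded = greedy_clump(
    toy_predictors, toy_positions, toy_pvalues,
    cutoff=5e-8, prune=0.05, window=1000,
)
print(kept, discarded)  # [0, 2] [1]
```

In this toy example, the second predictor is discarded because its squared correlation with the more significant first predictor exceeds 0.05, while the fourth is never considered because its p-value is above the cutoff.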
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
When clumping results from a single-predictor analysis, we recommend using --window-prune 0.05 and --window-cm 1 (or --window-prune 0.05 and --window-kb 1000, if genetic distances are not available). When clumping results from a gene-based analysis, we recommend using --window-prune 0.1 and --window-length -1 (the latter will tell LDAK to consider all pairs of predictors on the same chromosome).
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
Example:
Here we use the binary PLINK files human.bed, human.bim and human.fam, the phenotype quant.pheno, and the annotations file anns.txt from the Test Datasets.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
1 - Clumping results from a single-predictor analysis
First, we will perform a single-predictor analysis using the command
./ldak.out --linear quant --bfile human --pheno quant.pheno
The p-values from the analysis are saved in the file quant.pvalues. We then clump the results using the command
./ldak.out --thin-tops clump --bfile human --window-prune 0.05 --window-cm 1 --pvalues quant.pvalues --cutoff 5e-8
In total, there were six predictors with p-value less than 5e-8; after clumping, four of these remain (listed in the file clump.in).
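Before clumping, it can be helpful to check how many predictors pass the cutoff. The Python sketch below counts them, assuming only that the p-values file is whitespace-delimited with predictor names in the first column and p-values in the second (as described for --pvalues). The function name and the toy predictor names are hypothetical; lines whose second field is not a number, such as a header line or a missing value, are skipped.

```python
def count_below_cutoff(lines, cutoff):
    """Count lines whose second field is a p-value below cutoff.

    lines: any iterable of text lines (for example, an open file handle).
    """
    count = 0
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # blank or malformed line
        try:
            pvalue = float(fields[1])
        except ValueError:
            continue  # header or non-numeric entry
        if pvalue < cutoff:
            count += 1
    return count


# Toy input with made-up predictor names and p-values.
toy_lines = [
    "Predictor P",
    "rs1 1e-9",
    "rs2 0.2",
    "rs3 4.9e-8",
    "rs4 NA",
]
print(count_below_cutoff(toy_lines, 5e-8))  # 2
```

To apply this to the example above, pass an open handle for quant.pvalues in place of toy_lines.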
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2 - Clumping results from a gene-based analysis
First, we will perform a gene-based analysis using the commands
./ldak.out --cut-genes genes --bfile human --genefile anns.txt
./ldak.out --calc-genes-reml genes --pheno quant.pheno --bfile human --ignore-weights YES --power -.25
./ldak.out --join-genes-reml genes
The estimates of the genetic contributions of each gene are saved in the files genes/remls.all.sp, genes/remls.all.bim and genes/remls.all.fam, while the corresponding p-values are saved in genes/remls.all.pvalues. We then clump the results using the command
./ldak.out --thin-tops clump2 --sp genes/prs.all --SNP-data NO --window-prune 0.1 --window-length -1 --pvalues genes/prs.all.pvalues --cutoff 2.5e-6
Note that 2.5e-6=0.05/20000 is the Bonferroni significance threshold if testing 20,000 genes (the typical number of genes when analyzing genome-wide human data). In total, there were two predictors with p-value less than 2.5e-6; after clumping, only one of these remains (listed in the file clump2.in).
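If your analysis tests a different number of genes, the cutoff can be recomputed in the same way. The short Python sketch below reproduces the arithmetic; the function name is hypothetical, and the family-wise error rate of 0.05 is the value used in the text above.

```python
def bonferroni_threshold(alpha, num_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    if num_tests <= 0:
        raise ValueError("num_tests must be positive")
    return alpha / num_tests


threshold = bonferroni_threshold(0.05, 20000)
print(f"{threshold:.1e}")  # 2.5e-06
```

The resulting value is what would be passed to --cutoff.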