Adaptive MultiBLUP

To run (basic) Adaptive MultiBLUP, there are six steps.

Step 1: divide the genome into chunks --cut-genes <folder>
Step 2: test the chunks for association --calc-genes-reml <folder>
Step 3: create regions based on significant chunks --join-genes-reml <folder>
Step 4: calculate the background kinship matrix --calc-kins-direct <output>
Step 5: estimate variance components for the MultiBLUP Model --reml <output>
Step 6: estimate predictor effect sizes and calculate predictions --calc-blups <output>

Steps 1 and 3 are almost instantaneous, while Steps 2 and 6 are very fast (typically minutes rather than hours; moreover, Step 2 can be parallelised). For large datasets, Step 4 can be parallelised by following the instructions in Get Kinships. When Step 3 finds a large number of regions, it often suffices to omit the background kinship matrix when performing REML (Step 5), in which case Step 4 can be skipped.
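As a sketch of how Step 2 might be parallelised, the following Python snippet generates one command per partition (the partition count of 4 is an assumption; in practice you would read the number reported by Step 1, and submit the printed commands to your job scheduler rather than print them):

```python
# Generate one Step 2 command per partition so the jobs can run in parallel.
# num_partitions is hypothetical; set it to the number produced by --cut-genes.
num_partitions = 4

commands = [
    f"./ldak.out --calc-genes-reml chunks --bfile test --pheno phen.pheno "
    f"--ignore-weights YES --partition {p}"
    for p in range(1, num_partitions + 1)
]

for cmd in commands:
    print(cmd)  # in practice, write these to a script or submit via qsub/sbatch
```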

Covariates can be included by adding --covar in either Step 2 or Step 5; use --keep to consider only a subset of samples (for example, if you wish to fit the model on a training sample).

As we discuss here, we advise against using SNP weightings (i.e., in Steps 2, 4 and 5). However, there may be benefits to including them in Step 2, to improve resolution and reduce the computational burden.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Suppose we wish to perform Adaptive MultiBLUP using the binary PLINK files test.bed, test.bim and test.fam and the phenotypic values in phen.pheno provided in the Test Datasets.

Step 1: divide the genome into chunks. For human data, we suggest chunks of size 75,000 base pairs.

./ldak.out --cut-genes chunks --chunks-bp 75000 --bfile test --ignore-weights YES

Alternatively, you can divide the genome based on SNP weights, numbers of SNPs or gene annotations, or specify the divisions yourself; see Gene-based Analysis for the possible options. The chunks will be allocated to PARTITIONS, each by default containing 1,000,000 predictors.
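To illustrate the idea of base-pair chunking, here is a minimal sketch (not LDAK's implementation) of how positions on one chromosome map to 75,000 base-pair chunks; the SNP positions are hypothetical:

```python
# Each SNP is assigned to the fixed-width window covering its position.
# Illustration of the chunking idea only, not LDAK's exact code.

def chunk_index(position_bp, chunk_size=75_000):
    """Return the (0-based) chunk a base-pair position falls in."""
    return position_bp // chunk_size

# Hypothetical SNP positions on one chromosome
positions = [1_000, 74_999, 75_000, 200_000]
print([chunk_index(p) for p in positions])  # -> [0, 0, 1, 2]
```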

Step 2: test the chunks for association.

./ldak.out --calc-genes-reml chunks --bfile test --pheno phen.pheno --ignore-weights YES --partition 1

(This step generates many errors about predictors with too many missing values, so avoid these by performing better QC than me!) If you have more than one PARTITION, repeat this step for the other ones.

Typically, you would include, say, the top 5 principal components as covariates to allow for moderate dataset heterogeneity. Sometimes this will not be sufficient (e.g., if there is high relatedness or there are distinct subpopulations), in which case use --grm to include a kinship matrix; this approach is the multi-SNP equivalent of single-SNP mixed-model association testing, where a kinship matrix is included to account for confounding. Doing so will slow down the analysis considerably, so it is advisable to increase the number of PARTITIONS using --partition-length and then parallelise.

In this example, instead of --ignore-weights YES you could use --weights sections_save/weightsALL (for this step only). This might identify important chunks with better resolution, and has computational advantages, although it requires first calculating weightings for all predictors.

Step 3: create regions based on significant chunks. For human data, we suggest significance thresholds of 1e-5 and 0.01; alternatively, you can set --sig1 to 0.05 Bonferroni-corrected for the number of chunks (i.e., 0.05 divided by the number of chunks).

./ldak.out --join-genes-reml chunks --bfile test --sig1 1e-5 --sig2 0.01

Regions will be formed by merging each chunk with P < 1e-5 (--sig1) with all adjacent chunks with P < 0.01 (--sig2). These will be saved as chunks/region1, chunks/region2, ..., chunks/regionX. The number of regions (X) will be saved in chunks/region_number, while the list of background SNPs will be saved in chunks/region0.
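The merging rule can be sketched as follows (an illustration of the behaviour described above, not LDAK's code; the p-values are made up):

```python
# Each chunk with P < sig1 is merged with the run of adjacent chunks
# having P < sig2; a run containing no sig1 chunk is discarded.
# Bonferroni alternative for sig1: 0.05 / (number of chunks).

def form_regions(pvalues, sig1=1e-5, sig2=0.01):
    """Return regions as lists of chunk indices (chunks assumed ordered)."""
    regions, current, seeded = [], [], False
    for i, p in enumerate(pvalues):
        if p < sig2:
            current.append(i)
            seeded = seeded or (p < sig1)
        else:
            if seeded:
                regions.append(current)
            current, seeded = [], False
    if seeded:
        regions.append(current)
    return regions

pvals = [0.5, 1e-6, 0.005, 0.3, 0.009, 0.2, 1e-7]
print(form_regions(pvals))  # -> [[1, 2], [6]]
```

Note that chunk 4 (P = 0.009) passes the weaker threshold but is not adjacent to any chunk passing the stronger one, so it does not form a region.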

Step 4: calculate the background kinship matrix.

./ldak.out --calc-kins-direct chunks/region0 --bfile test --extract chunks/region0 --ignore-weights YES

If you wish to parallelise this step, use --cut-kins, --calc-kins and --join-kins, as described in Get Kinships, and in Steps 5 and 6 use the joined kinship matrix in place of chunks/region0.

If Step 3 produces many regions (say 5 or more), you can probably skip Step 4 and exclude the background kinship matrix from the MultiBLUP model with little consequence. However, if in doubt, it is probably best to keep it in.
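For intuition, a common form of kinship estimate standardizes each SNP and averages cross-products over SNPs, K = ZZ'/m. The numpy sketch below illustrates that general idea under those assumptions (random genotypes, monomorphic SNPs dropped); it is not LDAK's exact formula, and real analyses use --calc-kins-direct:

```python
# Illustrative kinship (GRM) calculation on random genotypes.
import numpy as np

rng = np.random.default_rng(0)
genotypes = rng.integers(0, 3, size=(6, 50)).astype(float)  # 6 samples x 50 SNPs

freqs = genotypes.mean(axis=0) / 2          # allele frequencies
poly = (freqs > 0) & (freqs < 1)            # drop monomorphic SNPs
G, f = genotypes[:, poly], freqs[poly]
Z = (G - 2 * f) / np.sqrt(2 * f * (1 - f))  # standardized genotypes
K = Z @ Z.T / Z.shape[1]                    # samples-by-samples kinship

print(K.shape)  # (6, 6)
```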

Step 5: estimate variance components for the MultiBLUP Model.

./ldak.out --reml out --grm chunks/region0 --pheno phen.pheno --region-number X --region-prefix chunks/region --bfile test --ignore-weights YES

where X is the number of regions detected in Step 3 (here X=2, as recorded in chunks/region_number). The variance component estimates will be stored in out.reml, with estimates of the random effects in out.indi.blp and regional effect sizes in out.reg.blup.
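For intuition about what Step 5 estimates: the model assigns one variance component to each region, one to the background kinship matrix and one to noise, and the fitted values can be read as shares of phenotypic variance. A sketch with made-up numbers (not outputs of any real run):

```python
# Hypothetical variance component estimates; only the structure is meaningful.
components = {
    "background": 0.0,  # chunks/region0 (can be estimated as zero, as in the text)
    "region1": 0.15,
    "region2": 0.10,
    "noise": 0.75,
}

total = sum(components.values())
shares = {k: v / total for k, v in components.items()}
print(shares)  # each genetic share is the variance explained by that term
```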

Step 6: estimate SNP effect sizes and calculate predictions.

./ldak.out --calc-blups out --grm chunks/region0 --remlfile out.reml --bfile test

The effect size estimates will be stored in out.blup, with predictions for all samples contained in out.pred.

Including the background kinship matrix will greatly increase the number of predictors in the prediction model, yet when there are only a few regions it will often improve prediction accuracy only modestly (in this example, the matrix is assigned variance zero in Step 5, so does not contribute at all). Therefore, when Step 3 identifies one or more significant local regions, it is typically worth leaving out the background kinship matrix (skipping Step 4 and omitting --grm chunks/region0 in Steps 5 and 6).
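For intuition about what Step 6 produces: given per-predictor effect sizes, a sample's prediction is the allele-count-weighted sum of effects. A toy sketch with hypothetical numbers (the real estimates go to out.blup and out.pred):

```python
# Toy genotypes (allele counts) and hypothetical per-SNP effect sizes.
genotypes = [
    [0, 1, 2],  # sample 1's allele counts at three SNPs
    [2, 0, 1],  # sample 2
]
effects = [0.5, -0.2, 0.1]

# Prediction = genotype row dotted with the effect-size vector.
predictions = [sum(g * b for g, b in zip(row, effects)) for row in genotypes]
print(predictions)  # -> [0.0, 1.1]
```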