Please note that we no longer recommend using MultiBLUP; this page exists only for completeness.
To run (basic) Adaptive MultiBLUP, there are six steps.
Step 1: divide the genome into chunks --cut-genes <folder>
Step 2: test the chunks for association --calc-genes-reml <folder>
Step 3: create regions based on significant chunks --join-genes-reml <folder>
Step 4: calculate the background kinship matrix --calc-kins-direct <output>
Step 5: estimate variance components for the MultiBLUP Model --reml <output>
Step 6: estimate predictor effect sizes and calculate predictions --calc-blups <output>
Steps 1 and 3 are almost instantaneous, while Steps 2 and 6 are very fast (typically minutes rather than hours, and Step 2 can be parallelised). For large data, Step 4 can be parallelised by following the instructions in Calculate Kinships. When Step 3 finds a large number of regions, it often suffices to omit the background kinship matrix from Steps 5 and 6, meaning that Step 4 can be skipped entirely (see the example below).
Use the option --covar <covarfile> to provide covariates (in PLINK format) in either Step 2 or Step 5. Use --keep <keepfile> and/or --remove <removefile> to consider only a subset of samples; this is useful if you wish to fit the model on training samples, then measure predictive accuracy for the test samples.
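For example, adding --keep to the Step 5 command restricts model fitting to the listed samples; as a sketch (using the same placeholders as above), the command might take the form
./ldak.out --reml <output> --grm <kinfile> --pheno <phenofile> --bfile <datafile> --keep <keepfile>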
As we discuss here, we advise against using SNP weightings (i.e., in Steps 2, 4 or 5). However, there may be benefits to including them in Step 2, where they can improve resolution and reduce the computational burden.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
Example:
Here we use the binary PLINK files human.bed, human.bim and human.fam, and the phenotype quant.pheno from the Test Datasets.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
Step 1: divide the genome into chunks. For human data, we suggest chunks of size 75,000 base pairs.
./ldak.out --cut-genes chunks --chunks-bp 75000 --bfile human --ignore-weights YES
Alternatively, you can divide by SNP weights, number of SNPs, gene annotations, or specify your own divisions (see Gene-based Analysis for the possible options). The chunks will be allocated to PARTITIONS (by default 1, but you can create more using --partition-length <integer>, as sketched below).
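For example, to place the chunks into multiple partitions (so that Step 2 can be parallelised), you might add --partition-length to the Step 1 command; the value 1000 below is only illustrative, and we assume the integer determines the size of each partition (see Gene-based Analysis for its exact meaning).
./ldak.out --cut-genes chunks --chunks-bp 75000 --bfile human --ignore-weights YES --partition-length 1000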
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
Step 2: test the chunks for association.
./ldak.out --calc-genes-reml chunks --bfile human --pheno quant.pheno --ignore-weights YES --power -1 --partition 1
If you have more than one PARTITION, repeat this step for each of them (see the loop sketched below). Typically, you would use --covar <covarfile> to include, say, the top 5 principal components as covariates, to allow for moderate dataset heterogeneity. However, sometimes this will not be sufficient (e.g., if there is high relatedness or distinct subpopulations), in which case use --grm <kinfile> to include a kinship matrix; this approach is the multi-SNP equivalent of single-SNP mixed-model association testing, where a kinship matrix is included to account for confounding. Including a kinship matrix will slow down the analysis considerably, so it is advisable to increase the number of PARTITIONS created in Step 1, so that this step can be parallelised.
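For example, supposing Step 1 created 4 partitions and your covariates (say, principal components) are stored in a hypothetical file covar.pcs, you could submit the four commands as separate jobs, or run them sequentially with a simple loop:
for j in {1..4}; do
./ldak.out --calc-genes-reml chunks --bfile human --pheno quant.pheno --covar covar.pcs --ignore-weights YES --power -1 --partition $j
done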
Instead of --ignore-weights YES, you could use --weights <weightsfile>, where <weightsfile> contains the LDAK weightings (for this step only). This might identify important chunks with better resolution, and will have computational advantages, although it requires you to first Calculate Weightings.
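For example, assuming the LDAK weightings have been saved in a (hypothetical) file called weights.short, the Step 2 command would become:
./ldak.out --calc-genes-reml chunks --bfile human --pheno quant.pheno --weights weights.short --power -1 --partition 1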
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
Step 3: identify significant chunks. For human data, we suggest significance thresholds of 1e-5 and 0.01; alternatively, you can set --sig1 equal to 0.05 Bonferroni-corrected for the number of chunks (for example, were there 5,000 chunks, this would give 0.05/5000 = 1e-5).
./ldak.out --join-genes-reml chunks --bfile human --sig1 1e-5 --sig2 0.01
Regions will be formed by merging each chunk with P<1e-5 with all adjacent chunks with P<0.01. These will be saved as chunks/region.1, chunks/region.2, ..., chunks/region.X. The number of regions (X) will be saved in chunks/region.number, while the list of background SNPs will be saved in chunks/region.0.
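For example, you can check how many regions were found by printing the saved count (this value is read again in Step 5 below):
cat chunks/region.number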
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
Step 4: calculate the background kinship matrix.
./ldak.out --calc-kins-direct chunks/region.0 --bfile human --extract chunks/region.0 --ignore-weights YES --power -1
See Calculate Kinships for instructions on how to parallelise this step (a sketch is given below). If Step 3 produces many regions (say 5 or more), you can probably skip Step 4 and exclude the background kinship matrix from the MultiBLUP model with little consequence. However, if in doubt, it is probably best to keep it in.
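As a rough sketch of parallelising Step 4, assuming the cut/calc/join pattern described on the Calculate Kinships page (check that page for the exact options and output names):
./ldak.out --cut-kins partitions --bfile human --extract chunks/region.0
./ldak.out --calc-kins partitions --bfile human --extract chunks/region.0 --ignore-weights YES --power -1 --partition 1
./ldak.out --join-kins partitions
The middle command should be repeated (or run in parallel) for each partition.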
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
Step 5: estimate variance components for the MultiBLUP Model.
X=`awk < chunks/region.number '{print $1}'`
./ldak.out --reml mb --grm chunks/region.0 --pheno quant.pheno --region-number $X --region-prefix chunks/region. --bfile human --ignore-weights YES --power -1
The first line reads the number of regions from chunks/region.number, and saves this as the variable X. If you decided to skip Step 4, then you should exclude --grm chunks/region.0 when running this step. The variance components will be stored in mb.reml (with estimates of random effects in mb.indi.blp and regional effect sizes in mb.reg.blup).
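For example, if you skipped Step 4, the Step 5 command simply drops the background kinship matrix:
./ldak.out --reml mb --pheno quant.pheno --region-number $X --region-prefix chunks/region. --bfile human --ignore-weights YES --power -1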
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
Step 6: estimate SNP effect sizes and calculate predictions.
./ldak.out --calc-blups mb --grm chunks/region.0 --remlfile mb.reml --bfile human
If you decided to skip Step 4, then you should exclude --grm chunks/region.0 when running this step. The effect size estimates will be stored in mb.blup.
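For example, if you skipped Step 4, the Step 6 command becomes:
./ldak.out --calc-blups mb --remlfile mb.reml --bfile human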