Here we explain how to construct a prediction model using MegaPRS. These instructions assume that you are analysing summary statistics (if instead you are analysing individual-level data, we recommend you use Elastic-Predict). Please note that when running MegaPRS, it is no longer necessary to first compute Per-Predictor Heritabilities.

Note that the instructions on this page use pseudo cross-validation to determine suitable model parameters (to understand how this works, see Pseudo Summaries). However, if you have access to a validation dataset (i.e., an independent set of individuals for whom you have both genetic data and phenotypes), you can instead read the instructions for running MegaPRS Using a Validation Dataset.

Some prediction software recommend you exclude regions of high linkage disequilibrium (e.g., the MHC) when constructing models. This IS NOT NECESSARY when using MegaPRS (i.e., you can construct prediction models using all predictors).

Always read the screen output, which suggests arguments and estimates memory usage.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Calculate predictor-predictor correlations:

The main argument is --calc-cors <outfile>.

The only required option is

--bfile/--gen/--sp/--speed <datastem> - to specify the genetic data files (see File Formats).

By default, LDAK will set the window size (how far to search for pairs of correlated predictors) to 3Mb; you can change this using --window-kb <float> or --window-cm <float>.

You can use --keep <keepfile> and/or --remove <removefile> to restrict to a subset of samples, and --extract <extractfile> and/or --exclude <excludefile> to restrict to a subset of predictors. In particular, you can reduce the computational burden by restricting to predictors for which you have summary statistics. Note that if you plan to analyse summary statistics from multiple association studies, each of which used different SNPs, then it is probably easiest to calculate predictor-predictor correlations once using all available predictors, then use these correlations for all analyses.

The predictor-predictor correlations will be saved in files with prefix <outfile>; <outfile>.cors.bin contains the predictor-predictor correlations, (this is a binary file, so not human-readable), <outfile>.cors.root and <outfile>.cors.bim contain details of the datafiles, while <outfile>.cors.noise contains values used by LDAK to generate Pseudo Summaries.

To parallelize this process, you can add --chr <integer> to calculate correlations for each chromosome separately, then combine these with the argument --join-cors <output>, using the option --corslist <corsstems> to specify the per-chromosome correlations (see the example below).
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Construct the prediction model:

The main argument is --mega-prs <outfile>.

This requires the options

--model <type> - to specify the tool. MegaPRS contains Lasso-SS (--model lasso), Lasso-SS-Sparse (--model lasso-sparse), Ridge-SS (--model ridge), Bolt-SS (--model bolt), BayesR-SS (--model bayesr), BayesR-SS-Shrink (--model bayesr-shrink) and Elastic-SS (--model elastic). If you use --model mega, MegaPRS will use the data to decide which of Lasso-SS, Ridge-SS, BayesR-SS and Elastic-SS is most appropriate. In general, we recommend using Elastic-SS (i.e., using --model elastic).

--summary <sumsfile> - to specify the file containing the summary statistics.

--power <float> - to specify how predictors are scaled. In general, we recommend using --power -0.25.

--cors <corstem> - to specify the predictor-predictor correlations.

--high-LD <highldpredictors> - to specify predictors in regions of high linkage disequilibrium (LD). High-LD Regions explains how to identify predictors in high-LD regions when analysing human data. If there are no predictors in high-LD regions, you should use --check-high-LD NO. Please note that predictors in high-LD regions will be excluded only when using pseudo cross-validation to determine model parameters (in particular, they will be included in the final prediction model).

While it is still possible to use --ind-hers <indhersfile> to provide Per-Predictor Heritabilities, this is NO LONGER RECOMMENDED.

By default, LDAK will use 90%/10% cross-validation to determine suitable prior distribution parameters. You can change the fraction of test samples using --cv-proportion <float>,  or turn off cross-validation, using --skip-cv YES (LDAK will then output multiple models, each trained using 100% of samples).

You can use --extract <extractfile> and/or --exclude <excludefile> to restrict to a subset of predictors.

By default, LDAK will ignore alleles with ambiguous alleles (those with alleles A & T or C & G) to protect against possible strand errors. If you are confident that these are correctly aligned, you can force LDAK to include them by adding --allow-ambiguous YES.

LDAK will estimates effect sizes using overlapping windows of predictors. The step size is the window size divided by the number of segments. The default window size is 1Mb, but you can change this using --window-kb <float> or --window-cm <float>. Meanwhile, the default number of segments is eight, but you can change this using --segments <powerof2>. For example, if you used --window-kb 3000 and --segments 2, then each predictor will be included in two windows (so the windows will start 1.5Mb apart).

The prediction model is saved in <outfile>.effects. This file will have five columns, providing the predictor name, its A1 and A2 alleles, the average number of A1 alleles, then its raw effect (relative to the A1 allele) and is ready to be used for Calculating Scores (i.e., to predict the phenotypes of new samples).
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _


Here we use the binary PLINK files human.bed, human.bim and human.fam, and the phenotype quant.pheno from the Test Datasets, as well as the file highld.txt from High-LD Regions.

Although we have individual-level data (i.e., we have genotypes and phenotypes for the same samples), for this example we will pretend we are using summary statistics. Therefore, we will first create summary statistics by running

./ldak.out --linear quant --bfile human --pheno quant.pheno

The summary statistics are saved in quant.summaries (already in the format required by LDAK). For more details on this command, see Single-Predictor Analysis. Then we will use the genetic data files as the reference panel.

We identify SNPs that are within high-LD regions by running

./ldak.out --cut-genes highld --bfile human --genefile highld.txt

Usually, the lists of SNPs in high-LD regions would be saved in the file highld/genes.predictors.used. However, in this case, the script fail because our test data contains only Chromosomes 21 & 22, which do not include any high-LD regions.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

1 - Calculate predictor-predictor correlations.

We run the command

./ldak.out --calc-cors cors --bfile human

The significant correlations are saved in cors.cors.bin, with some details in cors.cors.root. If analysing very large data, we can parallelise the process by computing correlations separately for each chromosome, then merging

for j in {21..22}; do
./ldak.out --calc-cors cors$j --bfile human --chr $j

rm list.txt; for j in {21..22}; do echo "cors$j" >> list.txt; done
./ldak.out --join-cors cors --corslist list.txt

Note that in these scripts, we loop from 21 to 22, because our example dataset contains only these two chromosomes; usually you would loop from 1 to 22.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

2 - Construct the prediction model.

Normally, we would run the command

./ldak.out --mega-prs mega --model elastic --summary quant.summaries --power -0.25 --cors cors --high-LD highld/genes.predictors.used --allow-ambiguous YES

However, this script will fail because, for the reason explained above, the file highld/genes.predictors.used does not exist. Therefore, for this example only, we will run

./ldak.out --mega-prs mega --model elastic --summary quant.summaries --power -0.25 --cors cors --check-high-LD NO --allow-ambiguous YES

The prediction model is saved in mega.effects.