Here we explain how to construct a prediction model using MegaPRS. These instructions assume that you are analysing summary statistics (if instead you are analysing individual-level data, you should use Ridge-Predict, Bolt-Predict or BayesR-Predict). Further, they require that you have already estimated Per-Predictor Heritabilities.

Note that the instructions on this page use pseudo cross-validation to determine suitable model parameters (to understand how this works, see Pseudo Summaries). However, if you have access to a validation dataset (i.e., an independent set of individuals for whom you have both genetic data and phenotypes), you can instead read the instructions for running MegaPRS Using a Validation Dataset.

Some prediction software recommend you exclude regions of high linkage disequilibrium (e.g., the MHC) when constructing models. This IS NOT NECESSARY when using MegaPRS (i.e., you can construct prediction models using all predictors).

Always read the screen output, which suggests arguments and estimates memory usage.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Calculate predictor-predictor correlations:

The main argument is --calc-cors <outfile>.

This requires the options

--bfile/--gen/--sp/--speed <datastem> - to specify the genetic data files (see File Formats).

--window-cm <float> - to specify the window size (LDAK will only compute correlations for predictions within this distance). It generally suffices to use --window-cm 3. If the genetic data files do not contain genetic distances, an approximate solution is to instead use --window-kb 3000.

You can use --keep <keepfile> and/or --remove <removefile> to restrict to a subset of samples, and --extract <extractfile> and/or --exclude <excludefile> to restrict to a subset of predictors. In particular, you can reduce the computational burden by restricting to predictors for which you have summary statistics. Note that if you plan to analyse summary statistics from multiple association studies, each of which used different SNPs, then it is probably easiest to calculate predictor-predictor correlations once using all available predictors, then use these correlations for all analyses.

The predictor-predictor correlations will be saved in files with prefix <outfile>; <outfile>.cors.bin contains the predictor-predictor correlations, (this is a binary file, so not human-readable), <outfile>.cors.root and <outfile>.cors.bim contain details of the datafiles, while <outfile>.cors.noise contains values used by LDAK to generate Pseudo Summaries.

To parallelize this process, you can add --chr <integer> to calculate correlations for each chromosome separately, then combine these with the argument --join-cors <output>, using the option --corslist <corsstems> to specify the per-chromosome correlations (see the example below).
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Construct the prediction model:

The main argument is --mega-prs <outfile>.

This requires the options

--model <type> - to specify the tool. MegaPRS contains Lasso-SS (--model lasso), Lasso-SS-Sparse (--model lasso-sparse), Ridge-SS (--model ridge), Bolt-SS (--model bolt), BayesR-SS (--model bayesr) and BayesR-Shrink (--model bayesr-shrink). If you use --model mega, MegaPRS will use the data to decide which of Lasso-SS, Ridge-SS, Bolt-SS and BayesR-SS is most appropriate. In general, we recommend using BayesR-SS (i.e., using --model bayesr).

--ind-hers <indhersfile> - to specify the per-predictor heritabilities.

--summary <sumsfile> - to specify the file containing the summary statistics.

--cors <corstem> - to specify the predictor-predictor correlations.

--cv-proportion <float> - to specify the fraction of samples used for testing different parameters of the prior distribution when performing pseudo cross-validation. We suggest using --cv-proportion 0.1, so that models are trained using 90% of samples, then tested using the remaining 10%. Note that if you have manually generated Pseudo Summaries, you should instead use --pseudos <pseudostem>, while to turn off cross-validation, you should instead use --skip-cv YES (LDAK will then output multiple models, each trained using 100% of samples).

--high-LD <highldpredictors> - to specify predictors in regions of high linkage disequilibrium (LD). High-LD Regions explains how to identify predictors in high-LD regions when analysing human data. If there are no predictors in high-LD regions, you should use --check-high-LD NO. Please note that predictors in high-LD regions will be excluded only when using pseudo cross validation to determine model parameters (in particular, they will be included in the final prediction model).

--window-cm <float> - to specify the window size (see below). It generally suffices to use --window-cm 1. If the genetic data files do not contain genetic distances, an approximate solution is to instead use --window-kb 1000.

You can use --keep <keepfile> and/or --remove <removefile> to restrict to a subset of samples, and --extract <extractfile> and/or --exclude <excludefile> to restrict to a subset of predictors.

By default, LDAK will ignore alleles with ambiguous alleles (those with alleles A & T or C & G) to protect against possible strand errors. If you are confident that these are correctly aligned, you can force LDAK to include them by adding --allow-ambiguous YES.

LDAK will estimates effect sizes using overlapping windows of predictors. The step size is the window size (specified using --window-cm <float> or --window-kb <float>) divided by the number of segments. By default, there are eight segments, but you can change this using --segments <powerof2>. For example, if you used --window-cm 3 and --segments 2, then each predictor will be included in two windows (so the windows will start 1.5cM apart).

The prediction model is saved in <outfile>.effects. This file will have five columns, providing the predictor name, its A1 and A2 alleles, the average number of A1 alleles, then its raw effect (relative to the A1 allele) and is ready to be used for Calculating Scores (i.e., to predict the phenotypes of new samples).
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _


Here we use the binary PLINK files human.bed, human.bim and human.fam, and the phenotype quant.pheno from the Test Datasets, as well as the file highld.txt from High-LD Regions. We also use the file ldak.thin.ind.hers, created in the example for Per-Predictor Heritabilities. This contains estimates of the heritability contributed by each predictor, obtained assuming the LDAK-Thin Model (note that we normally recommend using the BLD-LDAK Model, but as this is only an example, we use the simpler model).

Although we have individual-level data (i.e., we have genotypes and phenotypes for the same samples), for this example we will pretend we are using summary statistics. Therefore, we will first create summary statistics by running

./ldak.out --linear quant --bfile human --pheno quant.pheno

The summary statistics are saved in quant.summaries (already in the format required by LDAK). For more details on this command, see Single-Predictor Analysis. Then we will use the genetic data files as the reference panel.

We identify SNPs that are within high-LD regions by running

./ldak.out --cut-genes highld --bfile human --genefile highld.txt

Usually, the lists of SNPs in high-LD regions would be saved in the file highld/genes.predictors.used. However, in this case, the script fail because our test data files include only Chromosomes 21 & 22, which do not include any high-LD regions.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

1 - Calculate predictor-predictor correlations.

We run the command

./ldak.out --calc-cors cors --bfile human --window-cm 3

The significant correlations are saved in cors.cors.bin, with some details in cors.cors.root. If analysing very large data, we can parallelise the process by computing correlations separately for each chromosome, then merging

for j in {21..22}; do
./ldak.out --calc-cors cors$j --bfile human --window-cm 3 --chr $j

rm list.txt; for j in {21..22}; do echo "cors$j" >> list.txt; done
./ldak.out --join-cors cors --corslist list.txt

Note that in these scripts, we loop from 21 to 22, because our example dataset contains only these two chromosomes; usually you would loop from 1 to 22.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

2 - Construct the prediction model.

Normally, we would run the command

./ldak.out --mega-prs megabayesr --model bayesr --ind-hers ldak.thin.ind.hers --summary quant.summaries --cors cors --cv-proportion .1 --high-LD highld/genes.predictors.used --window-cm 1 --allow-ambiguous YES

However, this script will fail because, for the reason explained above, the file highld/genes.predictors.used does not exist. Therefore, for this example only, we will run

./ldak.out --mega-prs megabayesr --model bayesr --ind-hers ldak.thin.ind.hers --summary quant.summaries --cors cors --cv-proportion .1 --check-high-LD NO --window-cm 1 --allow-ambiguous YES

The prediction model is saved in megabayesr.effects.