Here we explain how to construct a prediction model using MegaPRS, which contains the tools Lasso-SS, Ridge-SS, Bolt-SS and BayesR-SS. These instructions assume you are analysing summary statistics (if instead you are analysing individual-level data, you should use Ridge-Predict, Bolt-Predict or BayesR-Predict). Further, they require that you have already estimated Per-Predictor Heritabilities.

MegaPRS will begin by constructing prediction models corresponding to multiple prior distribution parameters (e.g., when you select the tool BayesR-SS, it will consider 84 prior parameters). To decide the best parameters, MegaPRS requires independent training and test summary statistics. On rare occasions, you will have these (e.g., if you have results from L cohorts in a meta-analysis, you could construct training summary statistics by combining results from L-1 cohorts, and use results from the final cohort as the test summary statistics). However, you will most likely only have a single set of summary statistics, computed using all samples, in which case you should generate Pseudo Summaries. Note that when using pseudo training summary statistics to test different prior parameters, we recommend you exclude High-LD Regions, as explained below.

MegaPRS requires a Reference Panel, which is used to calculate predictor-predictor correlations and to test different prior parameters. If you are generating pseudo summaries (which also requires a reference panel), then you must use different reference panel samples for each step. Therefore, if you have only one reference panel, we suggest you divide its samples into three, then use one third to generate pseudo summary statistics, one third to calculate predictor-predictor correlations, and one third to test training prediction models. If you are not using pseudo summaries, then it is fine to use all reference panel samples both when calculating predictor-predictor correlations and testing training models.

Always read the screen output, which suggests arguments and estimates memory usage.

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

**Calculate predictor-predictor correlations:**

The main argument is --calc-cors <outfile>.

This requires the options

--bfile/--gen/--sp/--speed <datastem> - to specify the genetic data files (see File Formats).

--window-cm <float> - to specify the window size (LDAK will only compute correlations for predictors within this distance). It generally suffices to use 3cM. If the genetic data files do not contain genetic distances, an approximate solution is to instead use --window-kb 3000.

Use --keep <keepfile> and/or --remove <removefile> to restrict to a subset of samples. As explained above, if you are using pseudo-summaries, then you must ensure the samples used to calculate predictor-predictor correlations are distinct from those used to generate the pseudo summaries and those used to test prior parameters.

By default, LDAK will save correlations between pairs that are significant (P<0.01). This corresponds to predictor pairs with squared correlation greater than 1-exp(-6.6/n), where n is the sample size. To change this threshold use --min-cor <thresh>.
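To get a feel for what the default threshold implies, the squared-correlation cutoff can be computed directly from the formula above (a quick sketch; the sample size n=10000 is just an illustrative value):

```shell
# Default cutoff on squared correlation is 1-exp(-6.6/n);
# for n=10000 reference samples this is roughly 0.00066
awk 'BEGIN{n=10000; printf "%.6f\n", 1-exp(-6.6/n)}'
```

Larger reference panels therefore retain weaker correlations by default, because even small squared correlations are significant when n is large.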

To specify a subset of predictors, use --extract <extractfile> and/or --exclude <excludefile>. In particular, you can reduce the computational burden by restricting to predictors for which you have summary statistics. Note that if you plan to analyse summary statistics from multiple association studies, each of which used different SNPs, then it is probably easier to calculate predictor-predictor correlations once using all available predictors, then use these correlations for all analyses.

The predictor-predictor correlations will be saved in the file <outfile>.cors.bin (this is a binary file, so not human-readable), while details of the data files are provided in <outfile>.cors.root.

To parallelize this process, you can add --chr <integer> to calculate correlations for each chromosome separately, then combine these with the argument --join-cors <outfile>, using the option --corslist <corsstems> to specify the per-chromosome correlations (see the example below).

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

**Estimate effect sizes for training and full prediction models:**

The main argument is --mega-prs <outfile>.

This requires the options

--model <type> - to specify the tool. MegaPRS contains Lasso-SS (--model lasso), Lasso-SS-Sparse (--model lasso-sparse), Ridge-SS (--model ridge), Bolt-SS (--model bolt), BayesR-SS (--model bayesr) and BayesR-Shrink (--model bayesr-shrink). If you use --model mega, MegaPRS will use the data to decide which of Lasso-SS, Ridge-SS, Bolt-SS and BayesR-SS is most appropriate. In general, we recommend using BayesR-SS (i.e., adding --model bayesr).

--cors <corstem> - to specify the predictor-predictor correlations.

--bfile/--gen/--sp/--speed <datastem> - to specify the genetic data files (see File Formats).

--summary <sumsfile> and --summary2 <sums2file> - to specify files containing summary statistics. Typically <sumsfile> will contain full summary statistics, while <sums2file> will contain training summary statistics. If you prefer to provide only one summary statistic file, use --one-sums YES instead of --summary2 <sums2file>. More details below.

--ind-hers <indhersfile> - to specify the per-predictor heritabilities.

--window-cm <float> - to specify the window size (see below). It generally suffices to use 1cM. If the genetic data files do not contain genetic distances, an approximate solution is to instead use --window-kb 1000.

By default, LDAK will ignore predictors with ambiguous alleles (those with alleles A & T or C & G) to protect against possible strand errors. If you are confident that these are correctly aligned, you can force LDAK to include them by adding --allow-ambiguous YES.

To specify a subset of predictors, use --extract <extractfile> and/or --exclude <excludefile>. Note that LDAK will report an error if any predictors are missing summary statistics (in which case, you can use --extract <extractfile> to specify the predictors with summary statistics).

LDAK estimates effect sizes using overlapping windows of predictors. The step size is the window size (specified using --window-cm <float> or --window-kb <float>) divided by the number of segments. By default, there are eight segments, but you can change this using --segments <powerof2>. For example, if you used --window-cm 3 and --segments 2, then each predictor will be included in two windows (so the windows will start 1.5cM apart).
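As a quick check of the arithmetic, the step size is simply the ratio of the window size to the segment count (the values here are illustrative, matching the example above):

```shell
# Step size = window size / number of segments:
# --window-cm 3 with --segments 2 gives 1.5cM steps,
# while the default --segments 8 would give 0.375cM steps
awk 'BEGIN{printf "%.3f %.3f\n", 3/2, 3/8}'
```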

The most common way to run MegaPRS is using both --summary <sumsfile> and --summary2 <sums2file>. LDAK will then expect <sumsfile> to contain full summary statistics (calculated using all samples) and <sums2file> to contain training summary statistics (computed across training samples or using Pseudo Summaries). LDAK will first use the training summary statistics to create training prediction models (these will be saved in the file <outfile>.effects.train), then use the full summary statistics to create full prediction models (these will be saved in <outfile>.effects.final).

If you instead use --summary <sumsfile> and --one-sums YES, LDAK will only make one set of models (saved in <outfile>.effects). This approach is useful if you wish to use different predictor-predictor correlations when creating training and full prediction models. For example, if you only have a small reference panel, you can use a third of samples when making the training models, but all samples when making the full models.

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

**Test training prediction models:**

You should use the argument --calc-scores <outfile> to measure the accuracies of the training models. For details, see the examples below or Calculating Scores. Make sure you use --scorefile <scorefile> to provide the training models, and --final-effects <finaleffectsfile> to provide the full models. The final model (i.e., the full model corresponding to the most accurate training model) will be saved in <outfile>.effects.best. This file has five columns, providing the predictor name, its A1 and A2 alleles, the average number of A1 alleles, then its estimated effect (relative to the A1 allele).
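For illustration, here is a toy file in the same five-column layout (the predictor names and values are made up), together with an awk one-liner that pulls out the predictor names and effect sizes:

```shell
# Toy <outfile>.effects.best (hypothetical data):
# predictor, A1, A2, average number of A1 alleles, effect
cat > toy.effects.best <<EOF
rs1 A G 0.98 0.012
rs2 C T 1.53 -0.004
EOF
# Print each predictor with its estimated effect (columns 1 and 5)
awk '{print $1, $5}' toy.effects.best
```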

If you are using pseudo summary statistics, there are two points to note. Firstly, you should ensure that the samples in the reference panel are distinct from both those used to generate the pseudo summaries and those used to calculate predictor-predictor correlations (you can use --keep <keepfile> and/or --remove <removefile> to restrict to a subset of samples). Secondly, you should use --extract <extractfile> to exclude SNPs in High-LD Regions.

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

**Example:**

Here we use the binary PLINK files human.bed, human.bim and human.fam, and the phenotype quant.pheno from the Test Datasets, as well as the file highld.txt from High-LD Regions. We also use the file ldak.thin.ind.hers, created in the example for Per-Predictor Heritabilities. This contains estimates of the heritability contributed by each predictor, obtained assuming the LDAK-Thin Model (note that we normally recommend using the BLD-LDAK Model, but as this is only an example, we use the simpler model).

Although we have individual-level data (i.e., we have genotypes and phenotypes for the same samples), for this example we will pretend we are using summary statistics. Therefore, we will first create summary statistics by running

./ldak.out --linear quant --bfile human --pheno quant.pheno

The summary statistics are saved in quant.summaries (already in the format required by LDAK). For more details on this command, see Single-Predictor Analysis. Then we will use the genetic data files as the reference panel.

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Because we only have one set of summary statistics, we will be using pseudo summary statistics. We first divide the reference panel individuals into three by running

awk < human.fam '(NR%3==1){print $0 > "keepa"}(NR%3==2){print $0 > "keepb"}(NR%3==0){print $0 > "keepc"}'

This script creates keepa, keepb and keepc, each of which contains a third of the samples; we will use keepa samples to create pseudo summary statistics, keepb samples to calculate predictor-predictor correlations, and keepc samples to test prior distributions.

Next we generate pseudo summary statistics by running

./ldak.out --pseudo-summaries quant --bfile human --summary quant.summaries --training-proportion .9 --keep keepa --allow-ambiguous YES

The pseudo training and test summary statistics are saved in quant.train.summaries and quant.test.summaries, respectively. Note that in this command (and those below), we add --allow-ambiguous YES (because the summary statistics were obtained from the genetic data we are using as a reference panel, we know the strands must be consistent).
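Roughly speaking, --training-proportion .9 splits the information in the summary statistics so that the pseudo training summaries behave like results from 90% of the samples, and the pseudo test summaries like results from the remaining 10%. For a study of, say, 1000 samples (an illustrative number, not the size of the test dataset), the corresponding effective sample sizes are:

```shell
# Effective sample sizes under --training-proportion .9
# (n=1000 is an illustrative value)
awk 'BEGIN{n=1000; p=0.9; printf "%.0f %.0f\n", n*p, n*(1-p)}'
```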

Finally, we must identify SNPs that are within high-LD regions, by running

./ldak.out --cut-genes highld --bfile human --genefile highld.txt

Usually, the list of SNPs in high-LD regions would be saved in the file highld/genes.predictors.used. However, in this case, the script fails because our test data files include only Chromosomes 21 & 22, which do not contain any high-LD regions.

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

**1 - Calculate predictor-predictor correlations.**

We run the command

./ldak.out --calc-cors cors --bfile human --window-kb 3000 --keep keepb

The significant correlations are saved in cors.cors.bin, with some details in cors.cors.root. If analysing very large datasets, we can parallelise the process by computing correlations separately for each chromosome, then merging the results

for j in {21..22}; do

./ldak.out --calc-cors cors$j --bfile human --window-kb 3000 --keep keepb --chr $j

done

rm list.txt; for j in {21..22}; do echo "cors$j" >> list.txt; done

./ldak.out --join-cors cors --corslist list.txt

Note that in these scripts, we loop from 21 to 22, because our example dataset contains only these two chromosomes; usually you would loop from 1 to 22.

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

**2 - Estimate effect sizes for training and full prediction models.**

We create training and full models by running the command

./ldak.out --mega-prs megabayesr --model bayesr --bfile human --cors cors --ind-hers ldak.thin.ind.hers --summary quant.summaries --summary2 quant.train.summaries --window-kb 1000 --allow-ambiguous YES

For this example, we have created BayesR models (see above for how to change the model type). LDAK will create 84 pairs of models, corresponding to 84 prior distributions. The training models (those calculated using quant.train.summaries) are saved in megabayesr.effects.train, while the full models (those calculated using quant.summaries) are saved in megabayesr.effects.final.

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

**3 - Test training prediction models.**

Normally, we would identify the most accurate model by running the command

./ldak.out --calc-scores megabayesr --bfile human --scorefile megabayesr.effects.train --summary quant.test.summaries --power 0 --final-effects megabayesr.effects.final --keep keepc --allow-ambiguous YES --exclude highld/genes.predictors.used

However, this script will fail because, for the reason explained above, the file highld/genes.predictors.used does not exist. Therefore, for this example only, we will run

./ldak.out --calc-scores megabayesr --bfile human --scorefile megabayesr.effects.train --summary quant.test.summaries --power 0 --final-effects megabayesr.effects.final --keep keepc --allow-ambiguous YES

LDAK first measures the accuracy of the 84 training models; the estimated correlations between predicted and observed phenotypes are stored in megabayesr.cors. Having determined the most accurate training model, it saves effect sizes for the corresponding full model in megabayesr.effects.best.
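Although LDAK identifies the best model automatically, you may wish to inspect the per-model correlations yourself. As a sketch (the two-column layout and model names below are hypothetical, not necessarily the actual format of the .cors file), the highest-correlation row of such a file could be found with:

```shell
# Toy correlations file: model name, estimated correlation
# (hypothetical layout for illustration only)
printf 'Model1 0.21\nModel2 0.35\nModel3 0.28\n' > toy.cors
# Numeric sort on column 2, then take the last (largest) row
sort -n -k2,2 toy.cors | tail -n 1
```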