Here we explain how to construct a prediction model using MegaPRS. These instructions assume you are analysing summary statistics (if instead you are analysing individual-level data, you should use Ridge-Predict, Bolt-Predict or BayesR-Predict). Further, they require that you have already estimated Per-Predictor Heritabilities.

We generally recommend that you use MegaPRS to create BayesR Models. However, it can also create Lasso, Ridge Regression and Bolt models. For each model type, it will try multiple prior distributions (e.g., when constructing lasso models, it will try 100 distributions). To decide the best prior distribution, MegaPRS requires independent training and test summary statistics. On rare occasions, you will have these (e.g., if you have results from L cohorts in a meta-analysis, you could construct training summary statistics by combining results from L-1 cohorts, and use results from the final cohort as the test summary statistics). However, most likely, you will only have a single set of summary statistics, computed using all samples, in which case you should generate Pseudo Summaries.

MegaPRS requires a Reference Panel, which is used to calculate predictor-predictor correlations and to test different prior distributions. If you are generating pseudo summaries (which also requires a reference panel), then you must use different reference panel samples for each step. Therefore, if you have only one reference panel, we suggest you divide its samples into three, then use one third to generate pseudo summary statistics, one third to calculate predictor-predictor correlations, and one third to test prior distributions. If you are not using pseudo summaries, then it is fine to use all reference panel samples both when calculating predictor-predictor correlations and testing prior distributions.

Always read the screen output, which suggests arguments and estimates memory usage.

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

**Calculate predictor-predictor correlations:**

The main argument is --calc-cors <outfile>.

This requires the options

--bfile/--gen/--sp/--speed <datastem> - to specify the genetic data files (see File Formats).

--window-cm <float> - to specify the window size (LDAK will only compute correlations for predictions within this distance). It generally suffices to use 3cM. If the genetic data files do not contain genetic distances, an approximate solution is to instead use --window-kb 3000.

Use --keep <keepfile> and/or --remove <removefile> to restrict to a subset of samples. As explained above, if you are using pseudo-summaries, then you must ensure the samples used to calculate predictor-predictor correlations are distinct from those used to generate the pseudo summaries and those used to test prior distributions.

By default, LDAK will save correlations between pairs that are significant (P<0.01). This corresponds to predictors pairs with squared correlation greater than 1-exp(-6.6/n), where n is the sample size. To change this threshold use --min-cor <thresh>.

To specify a subset of predictors, use --extract <extractfile> and/or --exclude <excludefile>. In particular, you can reduce the computational burden by restricting to predictors for which you have summary statistics. Note that if you plan to analyse summary statistics from multiple association studies, each of which used different SNPs, then it is probably easier to calculate predictor-predictor correlations once using all available predictors, then use these correlations for all analyses.

The predictor-predictor correlations will be saved in the file <outfile>.cors.bin (this is a binary file, so not human-readable), while details of the data files are provided in <outfile>.cors.root.

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

**Estimate effect sizes:**

The main argument is --mega-prs <outfile>.

This requires the options

--model <type> - to specify the model type. MegaPRS can create Lasso (--model <lasso>), non-sparse Lasso (--model <lasso-shrink>), Ridge Regression (--model <ridge>), Bolt (--model <bolt>), BayesR models (--model <bayesr>) and non-sparse BayesR (--model <bayesr-shrink>) models. If you use --model <mega>, MegaPRS will create lasso, ridge regression, Bolt-LMM and BayesR Models. In general, we recommend creating BayesR models (i.e., using --model bayesr).

--cors <corstem> - to specify the predictor-predictor correlations.

--bfile/--gen/--sp/--speed <datastem> - to specify the genetic data files (see File Formats).

--summary <sumsfile> and --summary2 <sums2file> - to specify files containing summary statistics. Typically <sumsfile> will contain full summary statistics, while <sums2file> will contain training summary statistics. If you prefer to provide only one summary statistic file, use --one-sums YES instead of --summary2 <sums2file>. More details below.

--ind-hers <indhersfile> - to specify the per-predictor heritabilities.

--window-cm <float> - to specify the window size (see below). It generally suffices to use 1cM. If the genetic data files do not contain genetic distances, an approximate solution is to instead use --window-kb 1000.

By default, LDAK will ignore alleles with ambiguous alleles (those with alleles A & T or C & G) to protect against possible strand errors. If you are confident that these are correctly aligned, you can force LDAK to include them by adding --allow-ambiguous YES.

To specify a subset of predictors, use --extract <extractfile> and/or --exclude <excludefile>. Note that LDAK will report an error if any predictors are missing summary statistics (in which case, you can use --extract <extractfile> to specify the predictors with summary statistics).

LDAK will estimates effect sizes using overlapping windows of predictors. The step size is the window size (specified using --window-cm <float> or --window-kb <float>) divided by the number of segments. By default, there are eight segments, but you can change this using --segments <powerof2>. For example, if you used --window-cm 3 and --segments 2, then each predictor will be included in two windows (so the windows will start 1.5cM apart).

The most common way to run MegaPRS is using both --summary <sumsfile> and --summary2 <sums2file>. LDAK will then expect <sumsfile> to contain full summary statistics (calculated using all samples) and <sums2file> to contain training summary statistics (computed across training samples or using Pseudo Summaries). LDAK will create two sets of models, first using the training summary statistics (these models will be saved in <outfile>.effects.train), then using the full summary statistics (these models will be saved in <outfile>.effects.final).

If you instead use --summary <sumsfile> and --one-sums YES, LDAK will only make one set of models (saved in <outfile>.effects). This approach is useful if you wish to use different predictor-predictor correlations with training and full summary statistics. For example, if you only have a small reference panel, you can use a third of samples when making the training models, but all samples when making the full models.

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

**Test prior distributions:**

Having created training models corresponding to a variety of prior distributions, you should measure their accuracies using the command --calc-scores <outfile>. For details on this command, see the examples below or Calculating Scores. Make sure you use --scorefile <scorefile> to provide effect sizes estimated using training summary statistics, and --final-effects <finaleffectsfile> to provide effect sizes estimated using full summary statistics. The best model (i.e., the full model corresponding to the most accurate prior distribution) will be saved in <outfile>.effects.best.

Note that you will need to provide genetic data. Usually, this will be a reference panel, in which case you should also use --summary <sumsfile> to provide test summary statistics. However, if you have phenotypes for samples in the genetic data (i.e., you have individual-level validation data), you should instead use --pheno <phenofile> to provide these (in PLINK format).

If you are using pseudo-summaries, then you must ensure the samples used to test prior distributions are distinct from those used to generate the pseudo summaries and those used to calculate predictor-predictor correlations (you can use --keep <keepfile> and/or --remove <removefile> to restrict to a subset of samples).

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

**Example:**

Here we use the binary PLINK files human.bed, human.bim and human.fam, and the phenotype quant.pheno from the Test Datasets. We also use the file ldak.thin.ind.hers, created in the example for Per-Predictor Heritabilities. This contains estimates of the heritability contributed by each predictor, obtained assuming the LDAK-Thin Model (note that we normally recommend using the BLD-LDAK Model, but as this is only an example, we use the simpler model).

Although we have individual-level data (i.e., we have genotypes and phenotypes for the same samples), for this example we will pretend we are using summary statistics. Therefore, we will first create summary statistics by running

./ldak.out --linear quant --bfile human --pheno quant.pheno

The summary statistics are saved in quant.summaries (already in the format required by LDAK). For more details on this command, see Single-Predictor Analysis. Then we will use the genetic data files as the reference panel.

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Before using MegaPRS, we create pseudo test and training summary statistics by running

awk < human.fam '(NR%3==1){print $0 > "keepa"}(NR%3==2){print $0 > "keepb"}(NR%3==0){print $0 > "keepc"}'

./ldak.out --pseudo-summaries quant --bfile human --summary quant.summaries --training-proportion .9 --keep keepa --allow-ambiguous YES

This script first creates keepa, keepb and keepc, each of which contains a third of the samples (we then used keepa samples to create pseudo summary statistics, while we will use keepb samples to calculate predictor-predictor correlations, and keepc samples to test prior distributions). Note that in this script (and the scripts below), we add --allow-ambiguous YES (because the summary statistics were obtained from the genetic data we are using as a reference panel, we know the strands must be consistent).

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

**1 - Calculate predictor-predictor correlations.**

We run the command

./ldak.out --calc-cors cors --bfile human --window-kb 3 --keep keepb

The significant correlations are saved in cors.cors.bin, with some details in cors.cors.root.

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

**2 - Estimate effect sizes.
**

We create training and full models by running the command

./ldak.out --mega-prs megabayesr --model bayesr --bfile human --cors cors --ind-hers ldak.thin.ind.hers --summary quant.summaries --summary2 quant.train.summaries --window-kb 1 --allow-ambiguous YES

For this example, we have created BayesR models (see above for how to change the model type). LDAK will create 84 pairs of models, corresponding to 84 prior distributions. The training models (those calculated using quant.train.summaries) are saved in megabayesr.effects.train, while the full models (those calculated using quant.summaries) are saved in megabayesr.effects.final.

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

**3 - Test prior distributions
**

We identify the most accurate model by running the command

./ldak.out --calc-scores megabayesr --bfile human --scorefile megabayesr.effects.train --summary quant.test.summaries --power 0 --final-effects megabayesr.effects.final --keep keepc --allow-ambiguous YES

LDAK first measures the accuracy of the 84 training models; the estimated correlations between predicted and observed phenotypes are stored in megabayesr.cors. Having determined the most-accurate training model, it saves effect sizes for the corresponding full model in megabayesr.effects.best.