Using a Validation Dataset

Here we explain how to construct a prediction model using MegaPRS when you have access to a validation dataset (i.e., an independent set of individuals for whom you have both genetic data and phenotypes). These instructions assume that you are analysing summary statistics (if instead you are analysing individual-level data, you should use Ridge-Predict, Bolt-Predict or BayesR-Predict). Further, they require that you have already estimated Per-Predictor Heritabilities.

Please note that if you do not have access to a validation dataset, you should instead read the regular instructions for running MegaPRS.

Always read the screen output, which suggests arguments and estimates memory usage.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Calculate predictor-predictor correlations:

The main argument is --calc-cors <outfile>.

This requires the options

--bfile/--gen/--sp/--speed <datastem> - to specify the genetic data files (see File Formats).

--window-cm <float> - to specify the window size (LDAK will only compute correlations for predictions within this distance). It generally suffices to use --window-cm 3. If the genetic data files do not contain genetic distances, an approximate solution is to instead use --window-kb 3000.

You can use --keep <keepfile> and/or --remove <removefile> to restrict to a subset of samples, and --extract <extractfile> and/or --exclude <excludefile> to restrict to a subset of predictors. In particular, you can reduce the computational burden by restricting to predictors for which you have summary statistics. Note that if you plan to analyse summary statistics from multiple association studies, each of which used different SNPs, then it is probably easiest to calculate predictor-predictor correlations once using all available predictors, then use these correlations for all analyses.

The predictor-predictor correlations will be saved in the file <outfile>.cors.bin (this is a binary file, so not human-readable), while details of the data files are provided in <outfile>.cors.root.

To parallelize this process, you can add --chr <integer> to calculate correlations for each chromosome separately, then combine these with the argument --join-cors <output>, using the option --corslist <corsstems> to specify the per-chromosome correlations (see the example below).
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Construct multiple prediction models:

The main argument is --mega-prs <outfile>.

This requires the options

--model <type> - to specify the tool. MegaPRS contains Lasso-SS (--model lasso), Lasso-SS-Sparse (--model lasso-sparse), Ridge-SS (--model ridge), Bolt-SS (--model bolt), BayesR-SS (--model bayesr) and BayesR-Shrink (--model bayesr-shrink). If you use --model mega, MegaPRS will use the data to decide which of Lasso-SS, Ridge-SS, Bolt-SS and BayesR-SS is most appropriate. In general, we recommend using BayesR-SS (i.e., using --model bayesr).

--ind-hers <indhersfile> - to specify the per-predictor heritabilities.

--summary <sumsfile> - to specify the file containing the summary statistics.

--cors <corstem> - to specify the predictor-predictor correlations.

--skip-cv YES - this tells LDAK that you are not using pseudo cross-validation (LDAK will then output multiple models, each trained using 100% of samples).

--window-cm <float> - to specify the window size (see below). It generally suffices to use --window-cm 1. If the genetic data files do not contain genetic distances, an approximate solution is to instead use --window-kb 1000.

You can use --keep <keepfile> and/or --remove <removefile> to restrict to a subset of samples, and --extract <extractfile> and/or --exclude <excludefile> to restrict to a subset of predictors.

By default, LDAK will ignore alleles with ambiguous alleles (those with alleles A & T or C & G) to protect against possible strand errors. If you are confident that these are correctly aligned, you can force LDAK to include them by adding --allow-ambiguous YES.

LDAK will estimates effect sizes using overlapping windows of predictors. The step size is the window size (specified using --window-cm <float> or --window-kb <float>) divided by the number of segments. By default, there are eight segments, but you can change this using --segments <powerof2>. For example, if you used --window-cm 3 and --segments 2, then each predictor will be included in two windows (so the windows will start 1.5cM apart).

The prediction models corresponding to different prior distribution parameters are saved in <outfile>.effects. This file will have 4+M columns, providing the predictor name, its A1 and A2 alleles, the average number of A1 alleles, then its estimated effect (relative to the A1 allele) for each of the M models.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Use the validation dataset to find the best prediction model

The main argument is --validate <outfile>.

This requires the options

--scorefile <scorefile> - to provide the prediction models constructed in the previous step.

–bfile/–chiamo/–sp/–speed <prefix> - to specify the genetic data files in the validation dataset (see File Formats).

--pheno <phenofile> - to specify phenotypes (in PLINK format) for the samples in the validation dataset. If <phenofile> contains more than one phenotype, specify which should be used with --mpheno <integer>.

The estimated accuracy of each prediction model (the correlation between predicted and observed phenotypes) is saved in <outfile>.cors. The best prediction model (i.e, the one with highest accuracy) is saved in <outfile>.best.effects. This file has five columns, providing the predictor name, its A1 and A2 alleles, the average number of A1 alleles, then its estimated effect (relative to the A1 allele), and is ready to be used for Calculating Scores (i.e., to predict the phenotypes of new samples).
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Example:

Here we use the binary PLINK files human.bed, human.bim and human.fam, and the phenotype quant.pheno from the Test Datasets. We also use the file ldak.thin.ind.hers, created in the example for Per-Predictor Heritabilities. This contains estimates of the heritability contributed by each predictor, obtained assuming the LDAK-Thin Model (note that we normally recommend using the BLD-LDAK Model, but as this is only an example, we use the simpler model).

For this example we require summary statistics, a reference panel and an independent validation dataset. We make these by running

head -n 300 human.fam > keep
./ldak.out --linear quant.keep --bfile human --pheno quant.pheno --keep keep
./ldak.out --make-bed human.ref --bfile human --keep keep
./ldak.out --make-bed human.val --bfile human --remove keep

Summary statistics are saved in quant.keep.summaries (already in the format required by LDAK). We will use the genetic data files with prefix human.ref as the reference panel, and the files with prefix human.val as the validation dataset (phenotypes for these samples are in quant.pheno).
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

1 - Calculate predictor-predictor correlations.

We run the command

./ldak.out --calc-cors cors.ref --bfile human.ref --window-cm 3

The significant correlations are saved in cors.ref.cors.bin, with some details in cors.ref.cors.root. If analyzing very large data, we can parallelise the process by computing correlations separately for each chromosome, then merging

for j in {21..22}; do
./ldak.out --calc-cors cors.ref$j --bfile human --window-cm 3 --chr $j
done

rm list.txt; for j in {21..22}; do echo "cors.ref$j" >> list.txt; done
./ldak.out --join-cors cors.ref --corslist list.txt

Note that in these scripts, we loop from 21 to 22, because our example dataset contains only these two chromosomes; usually you would loop from 1 to 22.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

2 - Construct multiple prediction models.

We run the command

./ldak.out --mega-prs megabayesr.val --model bayesr --ind-hers ldak.thin.ind.hers --summary quant.keep.summaries --cors cors.ref --skip-cv YES --window-cm 1 --allow-ambiguous YES

Because we are using BayesR-SS, this command will construct 84 prediction models, saved in megabayesr.val.effects. Note that we add --allow-ambiguous YES (because the summary statistics were obtained from the genetic data we are using as a reference panel, we know the strands must be consistent).
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

3 - Use the validation dataset to find the best prediction model.

We run the command

./ldak.out --validate megabayesr.val --scorefile megabayesr.val.effects --bfile human.val --pheno quant.pheno

The best prediction model is saved in megabayesr.val.best.effects.