BayesR Predict

Here we explain how to construct a prediction model using a generalised version of BayesR. These instructions assume you are analysing individual-level data (if instead you are analysing summary statistics, you should use MegaPRS). Note that the original version of BayesR Predict required you to have already estimated Per-Predictor Heritabilities (but this is no longer necessary, nor recommended).

Always read the screen output, which suggests arguments and estimates memory usage.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

The main argument is --bayesr <outfile>.

This requires three options:

--bfile <datastem> or --speed <datastem> - to specify the genetic data files (see File Formats). If your genetic data are in a different format, you should first Make Data.

--pheno <phenofile> - to specify phenotypes (in PLINK format). Samples without a phenotype will be excluded. If <phenofile> contains more than one phenotype, specify which should be used with --mpheno <integer>.

--LOCO NO - to tell LDAK to focus on creating the genome-wide prediction model (instead of creating leave-one-chromosome-out models for use with LDAK-KVIK).
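
When <phenofile> contains several phenotypes, it can help to check how many there are, and how many samples have a value for each, before choosing --mpheno <integer>. The following is a minimal sketch, not part of LDAK. It assumes a file with no header line, two ID columns followed by one column per phenotype, and missing values written as NA; the file contents are made up for illustration.

```shell
# Toy phenotype file (hypothetical IDs and values): two ID columns, then two phenotypes.
pheno_file=$(mktemp)
cat > "$pheno_file" <<'EOF'
fam1 ind1 1.2 0.5
fam2 ind2 NA 0.7
fam3 ind3 0.4 NA
fam4 ind4 2.1 1.9
EOF

# Number of phenotypes = number of columns minus the two ID columns.
n_pheno=$(awk 'NR == 1 { print NF - 2 }' "$pheno_file")

# Count samples with a non-missing value for phenotype k (assumes NA marks missing).
count_present() {
    awk -v k="$1" '$(k + 2) != "NA" { n++ } END { print n + 0 }' "$pheno_file"
}

echo "phenotypes: $n_pheno"
echo "samples with phenotype 1: $(count_present 1)"
echo "samples with phenotype 2: $(count_present 2)"
```

The index passed to count_present corresponds to the integer you would give to --mpheno.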

By default, LDAK will use a fast, chunk-based algorithm; however, you can use --fast NO to revert to the slower, genome-wide algorithm (note that if you use the slow algorithm and LDAK fails to complete due to time, it can be resumed by rerunning the command with --restart YES added).

By default, LDAK will estimate the heritability and the power parameter alpha; to instead specify their values, use --her <float> and --power <float> (note that if you use --her, you must also use --power).

By default, LDAK will use 90%/10% cross-validation to determine suitable prior distribution parameters. You can change the fraction of test samples using --cv-proportion <float>, specify the test samples using --cv-samples <cvsampsfile>, or turn off cross-validation using --skip-cv YES (LDAK will then output multiple models, each trained using 100% of samples).
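
If you want to fix the test samples yourself, one way to prepare a file for --cv-samples is to draw a random 10% of the samples listed in the .fam file. The sketch below is not part of LDAK. It assumes that <cvsampsfile> lists one sample per line as two ID columns (as is usual for sample lists), and it uses a toy .fam file with made-up IDs.

```shell
# Toy .fam file (hypothetical IDs); a real file has one line per sample.
fam_file=$(mktemp)
cv_file=$(mktemp)
i=1
while [ "$i" -le 20 ]; do
    echo "fam$i ind$i 0 0 0 -9" >> "$fam_file"
    i=$((i + 1))
done

n_total=$(wc -l < "$fam_file" | tr -d ' ')
n_test=$((n_total / 10))   # 10% of samples, rounded down

# Attach a random key to each sample, sort by the key, and keep the first n_test IDs.
awk 'BEGIN { srand(42) } { printf "%.10f %s %s\n", rand(), $1, $2 }' "$fam_file" |
    sort -n | head -n "$n_test" | cut -d ' ' -f 2,3 > "$cv_file"

echo "selected $(wc -l < "$cv_file" | tr -d ' ') of $n_total samples for testing"
```

Fixing the seed in srand makes the selection repeatable with the same awk implementation, which is useful if you want to compare models trained on the same split.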

By default, LDAK will assign all predictors weighting one (equivalent to using --ignore-weights YES). If you prefer to provide your own weightings, use --weights <weightsfile>. Alternatively, you can use --ind-hers <indhersfile> to provide Per-Predictor Heritabilities (note that if using --ind-hers, you cannot use --her or --power).
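
The two restrictions on combining options (--her needs --power, and --ind-hers excludes both --her and --power) can be checked before launching a long run. The helper below is a hypothetical wrapper written for illustration, not an LDAK feature; it only inspects the argument list and never calls ldak.out. The option values shown are illustrative.

```shell
# Returns 0 if the options are consistent with the two rules, 1 otherwise.
check_opts() {
    has_her=0; has_power=0; has_indhers=0
    for arg in "$@"; do
        case "$arg" in
            --her) has_her=1 ;;
            --power) has_power=1 ;;
            --ind-hers) has_indhers=1 ;;
        esac
    done
    # Rule 1: --her may only be used together with --power.
    if [ "$has_her" -eq 1 ] && [ "$has_power" -eq 0 ]; then
        echo "--her requires --power" >&2
        return 1
    fi
    # Rule 2: --ind-hers cannot be combined with --her or --power.
    if [ "$has_indhers" -eq 1 ] && [ $((has_her + has_power)) -gt 0 ]; then
        echo "--ind-hers cannot be combined with --her or --power" >&2
        return 1
    fi
    return 0
}

check_opts --bayesr out --bfile human --pheno quant.pheno --LOCO NO --power -0.25 && echo "ok"
check_opts --bayesr out --bfile human --pheno quant.pheno --LOCO NO --her 0.5 || echo "rejected"
```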

You can use --keep <keepfile> and/or --remove <removefile> to restrict to a subset of samples, and --extract <extractfile> and/or --exclude <excludefile> to restrict to a subset of predictors (for more details, see Data Filtering).

You can use --covar <covarfile> to provide covariates (in PLINK format); the phenotype will be regressed on these prior to estimating effect sizes. If the data contains predictors that are definitely associated with the phenotype, you can specify these using --top-preds <toppredslist>. These predictors will be treated as fixed effects (their effect sizes will be estimated when the phenotype is regressed on the regular covariates), instead of as random effects (like regular predictors).

The estimated prediction model is saved in <outfile>.effects. Usually, this file has five columns, providing the predictor name, its A1 and A2 alleles, the average number of A1 alleles, then its estimated effect (relative to the A1 allele). If you used --skip-cv YES, there will be effect sizes for each of the different prior parameters. This file is ready to be used for Calculating Scores (i.e., to predict the phenotypes of new samples).
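
A quick look at the effects file can confirm that the model is sensible before it is used for scoring. The sketch below is not part of LDAK. It assumes the five-column layout described above with a single header line; the header names, predictor names and values in the toy file are made up for illustration.

```shell
# Toy effects file; header names and values are assumptions for illustration.
effects_file=$(mktemp)
cat > "$effects_file" <<'EOF'
Predictor A1 A2 Centre Effect
rs1 A G 0.52 0.013
rs2 C T 1.10 0.000
rs3 G A 0.31 -0.042
rs4 T C 0.87 0.005
EOF

# Count predictors with a non-zero effect (column 5), skipping the header line.
n_nonzero=$(awk 'NR > 1 && $5 != 0 { n++ } END { print n + 0 }' "$effects_file")

# Find the predictor with the largest absolute effect.
top_pred=$(awk 'NR > 1 { a = ($5 < 0) ? -$5 : $5; if (a > best) { best = a; name = $1 } }
                END { print name }' "$effects_file")

echo "predictors with non-zero effect: $n_nonzero"
echo "largest absolute effect: $top_pred"
```

If you used --skip-cv YES, the file contains several effect columns, so the column index would need adjusting.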
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _


Here we use the binary PLINK files human.bed, human.bim and human.fam, and the phenotype quant.pheno from the Test Datasets.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

We construct a BayesR prediction model by running the command

./ldak.out --bayesr bayesr --bfile human --pheno quant.pheno --LOCO NO

The estimated effect sizes are saved in bayesr.effects.