Prediction

LDAK includes seven new tools for constructing linear prediction models (polygenic risk scores). If you are analysing individual-level data, LDAK provides implementations of Ridge Regression, Bolt-LMM and BayesR. If you are analysing summary statistics, LDAK provides implementations of Lasso, Ridge Regression, Bolt-LMM and BayesR.
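To make "linear prediction model" concrete, here is a minimal sketch of how such a model is applied once effect sizes are available: an individual's score is the sum of allele counts weighted by effect sizes. The predictor names, effect sizes and the `polygenic_score` helper are hypothetical illustrations, not LDAK output or LDAK code, and the sketch ignores details such as centring genotypes or adding an intercept.

```python
# Hypothetical effect sizes, one per predictor (illustrative values only).
effects = {"rs1": 0.12, "rs2": -0.05, "rs3": 0.30}


def polygenic_score(genotypes, effects):
    """Return the effect-size-weighted sum of allele counts for one individual.

    Predictors missing from `effects` contribute nothing to the score.
    """
    return sum(
        effects[name] * count
        for name, count in genotypes.items()
        if name in effects
    )


# Allele counts (0, 1 or 2) for one hypothetical individual.
individual = {"rs1": 2, "rs2": 1, "rs3": 0}
score = polygenic_score(individual, effects)
print(f"score = {score:.2f}")
```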

The key advantage of our new tools is that they allow the user to specify the Heritability Model. By contrast, almost all existing prediction tools consider only the GCTA Model (each predictor contributes equal heritability). We show in our recent publication (currently available on bioRxiv) that, for 14 complex traits, prediction accuracy always improves when we improve the accuracy of the heritability model (e.g., replace the GCTA Model with either the LDAK-Thin or BLD-LDAK Model). On average, we found the increase in R² (squared correlation between observed and predicted phenotypes) was about 10%, equivalent to increasing the sample size by a quarter. Moreover, our implementations that use individual-level data are computationally efficient, making them able to handle biobank-sized data (e.g., genome-wide SNP data for over 200,000 samples).
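For readers who want to compute the accuracy measure themselves, the following is a minimal sketch of R² as defined above: the squared Pearson correlation between observed and predicted phenotypes. The function name and the example values are assumptions made for illustration; this is not how LDAK itself reports accuracy.

```python
def squared_correlation(observed, predicted):
    """Return the squared Pearson correlation between two equal-length lists."""
    if len(observed) != len(predicted) or len(observed) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")

    n = len(observed)
    mean_obs = sum(observed) / n
    mean_pred = sum(predicted) / n

    # Sums of centred cross-products and centred squares.
    cov = sum((o - mean_obs) * (p - mean_pred) for o, p in zip(observed, predicted))
    var_obs = sum((o - mean_obs) ** 2 for o in observed)
    var_pred = sum((p - mean_pred) ** 2 for p in predicted)

    if var_obs == 0 or var_pred == 0:
        raise ValueError("correlation is undefined when either list is constant")

    return cov * cov / (var_obs * var_pred)


# Hypothetical phenotypes and predictions for four test samples.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [2.0, 4.0, 6.0, 8.0]
r2 = squared_correlation(observed, predicted)
```

Because the correlation is squared, the measure does not depend on the scale or sign of the predictions, only on how well they track the observed values linearly.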
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Constructing a prediction model takes two steps. The first is to estimate Per-Predictor Heritabilities; in this step, you must specify the heritability model. The second is to estimate effect sizes. If you are analysing individual-level data, you should use Ridge-Predict, Bolt-Predict or BayesR-Predict. If you are analysing summary statistics, you should use MegaPRS (which can construct Lasso, Ridge Regression, Bolt-LMM and BayesR prediction models).

When analysing individual-level data:

1 - You should perform careful Quality Control, to avoid confounding due to population structure, relatedness or genotyping errors.
2 - You will require summary statistics from Single-Predictor Analysis (these are used to estimate per-predictor heritabilities).

When analysing summary statistics:

1 - The summary statistics should be in the format required by LDAK (see Summary Statistics for details).
2 - You will require a (well-matched) reference panel. Ideally, this should contain at least 2000 samples; if you do not have such a panel, Reference Panel provides scripts for constructing a smaller one from 1000 Genomes Project data.
3 - You will require training and test summary statistics. If you do not have these, you can create approximate versions using Pseudo Summaries.
4 - When using SNP data, we recommend excluding predictors with ambiguous alleles (A & T or C & G), in order to avoid strand errors.
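The recommendation to exclude ambiguous alleles can be sketched as a simple filter. A predictor is ambiguous when its two alleles are complementary bases (A & T or C & G), because swapping strands then gives the same pair of alleles and a strand error cannot be detected. The predictor list and helper name below are hypothetical; in practice you would read names and alleles from your own summary statistics or genotype files.

```python
# The two allele pairs that read the same on both strands.
AMBIGUOUS_PAIRS = {frozenset("AT"), frozenset("CG")}


def is_ambiguous(allele1, allele2):
    """Return True if the two alleles are complementary (A & T or C & G)."""
    return frozenset((allele1.upper(), allele2.upper())) in AMBIGUOUS_PAIRS


# Hypothetical predictors as (name, allele 1, allele 2).
predictors = [
    ("rs1", "A", "G"),
    ("rs2", "A", "T"),  # ambiguous
    ("rs3", "C", "G"),  # ambiguous
    ("rs4", "T", "C"),
]

# Keep only predictors whose alleles are not strand-ambiguous.
kept = [name for name, a1, a2 in predictors if not is_ambiguous(a1, a2)]
```

The resulting list of names could then be used to restrict the analysis to the retained predictors.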
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Of the seven tools, we recommend Bolt-Predict, as this tends to produce the most accurate models. However, this requires individual-level data. If you only have access to summary statistics, we recommend that you instead use MegaPRS to construct a BayesR Model.

Finally, in Worked Examples we construct two prediction models from scratch, first analysing individual-level data, then summary statistics.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Note that it is still possible to view the pages for MultiBLUP, the individual-level prediction tool we created in 2014. However, please be aware that we now recommend using the newer tools described above.