Prediction

LDAK includes seven new tools for constructing linear prediction models (polygenic risk scores), using either individual-level data or summary statistics. The key advantage of our new tools is that they allow the user to specify the Heritability Model. By contrast, almost all existing prediction tools consider only the GCTA Model (each predictor contributes equal heritability). We show in our recent publication, Improved genetic prediction of complex traits from individual-level data or summary statistics (Nature Communications, 2021), that for 223 out of 225 complex traits, prediction accuracy improves when we use a more realistic heritability model. For example, we found that when we replaced the GCTA Model with the BLD-LDAK Model, the average increase in R² (squared correlation between observed and predicted phenotypes) was 14%, equivalent to increasing the sample size by a quarter. Moreover, our tools that use individual-level data are computationally efficient, making them able to handle biobank-sized data (e.g., genome-wide SNP data for over 200,000 samples).
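To make the quantities above concrete, the Python sketch below computes a linear prediction (a polygenic score: genotype values weighted by effect sizes and summed) and the squared correlation R² between observed and predicted phenotypes. It is not part of LDAK. The function names `polygenic_scores` and `squared_correlation`, and the toy genotypes and effect sizes, are assumptions made purely for illustration.

```python
def polygenic_scores(genotypes, effects):
    """Linear prediction: for each sample, sum genotype * effect size over predictors.

    genotypes: list of per-sample lists of genotype values (hypothetical toy format)
    effects:   list of per-predictor effect sizes, same length as each genotype row
    """
    scores = []
    for row in genotypes:
        if len(row) != len(effects):
            raise ValueError("each sample needs one genotype per effect size")
        scores.append(sum(g * b for g, b in zip(row, effects)))
    return scores


def squared_correlation(observed, predicted):
    """Squared Pearson correlation between observed and predicted phenotypes."""
    n = len(observed)
    if n != len(predicted) or n < 2:
        raise ValueError("need two equal-length vectors with at least 2 values")
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_o == 0 or var_p == 0:
        raise ValueError("correlation is undefined when a vector is constant")
    return cov * cov / (var_o * var_p)


# Toy example (made-up numbers): three samples, two predictors.
toy_genotypes = [[0, 1], [1, 2], [2, 0]]
toy_effects = [0.5, -0.2]
toy_predicted = polygenic_scores(toy_genotypes, toy_effects)
toy_observed = [0.1, 0.0, 0.9]
toy_r2 = squared_correlation(toy_observed, toy_predicted)
```

In practice R² should be measured on samples that were not used to estimate the effect sizes, otherwise it will overstate prediction accuracy.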
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

There are two steps to construct a prediction model. The first is to estimate Per-Predictor Heritabilities. In this step, you must specify the heritability model. The second step is to estimate effect sizes. If you are analysing individual-level data, you should use Ridge-Predict, Bolt-Predict or BayesR-Predict. If you are analysing summary statistics, you should use Lasso-SS, Ridge-SS, Bolt-SS or BayesR-SS (these four tools are contained within MegaPRS).

When analysing individual-level data:

1 - You should perform careful Quality Control, to avoid confounding due to population structure, relatedness or genotyping errors.
2 - You will require summary statistics from Single-Predictor Analysis (these are used to estimate per-predictor heritabilities).

When analysing summary statistics:

1 - The summary statistics should be in the format required by LDAK (see Summary Statistics for details).
2 - You will require a (well-matched) reference panel. Ideally, this should contain at least 2000 samples (otherwise, Reference Panel provides scripts for constructing a smaller panel from 1000 Genomes Project data).
3 - It is easiest to run the full version of MegaPRS if you have a validation dataset. If you do not have a validation dataset, you can either generate Pseudo Summaries or use MegaPRS-Lite.
4 - When using SNP data, we recommend excluding predictors with ambiguous alleles (A & T or C & G), in order to avoid strand errors.
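The ambiguous-allele check in point 4 can be sketched in a few lines of Python. This is only an illustration of the rule (a SNP is ambiguous when its two alleles are A & T or C & G, because each pair is its own strand complement). It is not LDAK code, and the function names and the `(name, A1, A2)` tuple layout are assumptions rather than LDAK's file format.

```python
# Allele pairs that read the same on either strand, so a strand flip cannot be detected.
AMBIGUOUS_PAIRS = {frozenset(("A", "T")), frozenset(("C", "G"))}


def is_ambiguous(a1, a2):
    """Return True if the two alleles form an A/T or C/G pair (in either order)."""
    return frozenset((a1.upper(), a2.upper())) in AMBIGUOUS_PAIRS


def keep_unambiguous(predictors):
    """Keep only predictors whose alleles are not strand-ambiguous.

    predictors: iterable of (name, a1, a2) tuples (hypothetical layout).
    """
    return [(name, a1, a2) for name, a1, a2 in predictors if not is_ambiguous(a1, a2)]


# Toy example with made-up predictor names.
toy_predictors = [
    ("snp1", "A", "G"),  # kept
    ("snp2", "A", "T"),  # ambiguous, dropped
    ("snp3", "G", "C"),  # ambiguous, dropped
    ("snp4", "c", "t"),  # kept (case-insensitive comparison)
]
toy_kept = keep_unambiguous(toy_predictors)
```

The names of the retained predictors could then be written to a file and used to restrict both the summary statistics and the reference panel to the same set.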
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Of the seven tools, we recommend Bolt-Predict, as it tends to produce the most accurate models. However, Bolt-Predict requires individual-level data. If you only have access to summary statistics, we recommend that you instead use BayesR-SS (contained within MegaPRS).

Finally, in Worked Examples we construct two prediction models from scratch, first analysing individual-level data, then summary statistics.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Note that it is still possible to view the pages for MultiBLUP, the individual-level prediction tool we created in 2014. However, please be aware that we now recommend using the newer tools described above.