Prediction

LDAK includes multiple tools for constructing linear prediction models (polygenic risk scores), using either individual-level data or summary statistics. The key advantage of our tools is that they allow the user to specify the Heritability Model. By contrast, almost all existing prediction tools consider only the GCTA Model (each predictor is expected to contribute equal heritability). In our publication, "Improved genetic prediction of complex traits from individual-level data or summary statistics" (Nature Communications, 2021), we showed that for 223 out of 225 complex traits, prediction accuracy improves when we use a more realistic heritability model. For example, we found that when we replaced the GCTA Model with the BLD-LDAK Model, the average increase in R² (squared correlation between observed and predicted phenotypes) was 14%, equivalent to increasing the sample size by a quarter. Moreover, our tools that use individual-level data are computationally efficient, making them able to handle biobank-sized data (e.g., genome-wide SNP data for over 200,000 samples).
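
To make the accuracy measure concrete, here is a minimal Python sketch of how a linear prediction model produces a score and how R² can be computed from it. This is not LDAK code: the function names, the toy genotypes, the effect sizes and the phenotypes are all made up for illustration.

```python
def polygenic_score(genotypes, effects):
    """Linear prediction for one sample: sum of allele count times effect size."""
    return sum(g * b for g, b in zip(genotypes, effects))


def squared_correlation(observed, predicted):
    """R^2: squared Pearson correlation between observed and predicted phenotypes."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov * cov / (var_o * var_p)


# Toy data (hypothetical): 4 samples, 3 predictors coded as allele counts 0/1/2.
genotypes = [[0, 1, 2], [1, 1, 0], [2, 0, 1], [0, 2, 2]]
effects = [0.5, -0.2, 0.1]          # made-up per-predictor effect sizes
observed = [0.1, 0.4, 1.2, -0.3]    # made-up phenotypes

predicted = [polygenic_score(row, effects) for row in genotypes]
r2 = squared_correlation(observed, predicted)
print(f"R^2 = {r2:.3f}")
```

In practice R² should be measured in samples that were not used to train the model; the toy data here only illustrate the arithmetic.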

Please note that we have simplified the MegaPRS software. Specifically, we now recommend using the Human Default Heritability Model instead of the BLD-LDAK Model, which means it is no longer necessary to first calculate per-predictor heritabilities.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

When analysing individual-level data:

1 - We recommend using Elastic-Predict.
2 - You should perform careful Quality Control, to avoid confounding due to population structure or genotyping errors.

Note that while we recommend using Elastic-Predict, it is still possible to use the 2021 tools Ridge-Predict, Bolt-Predict and BayesR-Predict, or even the 2014 tool MultiBLUP.

When analysing summary statistics:

1 - We recommend using Elastic-SS (contained within MegaPRS).
2 - Your summary statistics should be in the format required by LDAK (see Summary Statistics for details).
3 - You will require a (well-matched) reference panel. Ideally, this should contain at least 2000 samples (if you do not have such a panel, Reference Panel provides scripts for constructing a smaller one from 1000 Genomes Project data).
4 - When using SNP data, we recommend excluding predictors with ambiguous alleles (A & T or C & G), in order to avoid strand errors.
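
The ambiguous-allele check in point 4 can be sketched in a few lines of Python. This is only an illustration, not part of LDAK: the function names are hypothetical, and the column names Predictor, A1 and A2 and the example rows are assumptions, so adapt them to your own file.

```python
# Strand-ambiguous pairs: each allele is the complement of the other, so a
# strand flip cannot be detected from the alleles alone.
AMBIGUOUS_PAIRS = {frozenset("AT"), frozenset("CG")}


def is_ambiguous(a1, a2):
    """Return True if the two alleles are A & T or C & G (in either order)."""
    return frozenset((a1.upper(), a2.upper())) in AMBIGUOUS_PAIRS


def drop_ambiguous(records):
    """Keep only records whose allele pair is not strand-ambiguous."""
    return [r for r in records if not is_ambiguous(r["A1"], r["A2"])]


# Hypothetical summary-statistic rows (names and values are made up).
stats = [
    {"Predictor": "1:1000", "A1": "A", "A2": "G"},
    {"Predictor": "1:2000", "A1": "A", "A2": "T"},  # ambiguous
    {"Predictor": "1:3000", "A1": "G", "A2": "C"},  # ambiguous
    {"Predictor": "1:4000", "A1": "C", "A2": "T"},
]

kept = drop_ambiguous(stats)
print([r["Predictor"] for r in kept])
```

Comparing unordered allele pairs means the check does not depend on which allele is listed first.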

QuickPRS:

If you are analysing human SNP data, we have now added Quick PRS, a super-fast way to construct state-of-the-art prediction models that requires only summary statistics.

Step-by-step examples:

In Worked Examples we construct two prediction models from scratch, first analysing individual-level data (using Elastic-Predict), then summary statistics (using Elastic-SS).