Prediction

LDAK includes seven new tools for constructing linear prediction models (polygenic risk scores), using either individual-level data or summary statistics. The key advantage of our new tools is that they allow the user to specify the Heritability Model. By contrast, almost all existing prediction tools consider only the GCTA Model (each predictor is expected to contribute equal heritability). We show in our recent publication, "Improved genetic prediction of complex traits from individual-level data or summary statistics" (Nature Communications, 2021), that for 223 out of 225 complex traits, prediction accuracy improves when we use a more realistic heritability model. For example, we found that when we replaced the GCTA Model with the BLD-LDAK Model, the average increase in R² (squared correlation between observed and predicted phenotypes) was 14%, equivalent to increasing the sample size by a quarter. Moreover, our tools that use individual-level data are computationally efficient, making them able to handle biobank-sized data (e.g., genome-wide SNP data for over 200,000 samples).
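
As an illustration of the accuracy measure used above, the following minimal Python sketch computes R² as the squared Pearson correlation between observed and predicted phenotypes. It is not part of LDAK; the function name squared_correlation and the plain-list inputs are assumptions made for this example.

```python
def squared_correlation(observed, predicted):
    """Return R^2, the squared Pearson correlation between two sequences.

    Hypothetical helper for illustration only; not an LDAK function.
    """
    if len(observed) != len(predicted) or len(observed) < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")

    n = len(observed)
    mean_obs = sum(observed) / n
    mean_pred = sum(predicted) / n

    # Sums of cross-products and squares of the centred values.
    cov = sum((o - mean_obs) * (p - mean_pred) for o, p in zip(observed, predicted))
    var_obs = sum((o - mean_obs) ** 2 for o in observed)
    var_pred = sum((p - mean_pred) ** 2 for p in predicted)

    if var_obs == 0 or var_pred == 0:
        raise ValueError("R^2 is undefined when either sequence is constant")

    # Squaring the correlation removes its sign, so R^2 lies between 0 and 1.
    return (cov * cov) / (var_obs * var_pred)
```

Because the correlation is squared, a model whose predictions are perfectly negatively correlated with the phenotype would also score 1, so it is worth checking the sign of the correlation separately.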
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

There are two steps to construct a prediction model. The first is to estimate Per-Predictor Heritabilities. In this step, you must specify the heritability model. The second step is to estimate effect sizes. If you are analysing individual-level data, you should use Ridge-Predict, Bolt-Predict or BayesR-Predict. If you are analysing summary statistics, you should use Lasso-SS, Ridge-SS, Bolt-SS or BayesR-SS (these four tools are contained within MegaPRS).

If you are analysing human SNP data, we have now added Quick PRS, a super-fast way to construct state-of-the-art prediction models that requires only summary statistics.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

When analysing individual-level data:

1 - You should perform careful Quality Control, to avoid confounding due to population structure, relatedness or genotyping errors.
2 - You will require summary statistics from Single-Predictor Analysis (these are used to estimate per-predictor heritabilities).
3 - If you are analysing human SNP data, we recommend that your predictor names are in the form Chr:BP using genomic positions from the GRCh37/hg19 assembly (this form is required if using the BLD-LDAK Annotations).
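
The Chr:BP naming convention in point 3 can be sketched in Python as follows. This is only an illustration, not an LDAK utility: the function names are hypothetical, and the pattern assumes numeric chromosome labels and positive base-pair positions (non-numeric labels such as X are deliberately not handled).

```python
import re

# Assumed form: numeric chromosome, a colon, then a positive base-pair position.
CHR_BP_PATTERN = re.compile(r"[0-9]+:[1-9][0-9]*")


def make_predictor_name(chromosome, position):
    """Build a Chr:BP name, e.g. chromosome 1 and position 752566 -> '1:752566'."""
    return f"{int(chromosome)}:{int(position)}"


def is_chr_bp_name(name):
    """Return True if the whole name matches the assumed Chr:BP form."""
    return CHR_BP_PATTERN.fullmatch(name) is not None


if __name__ == "__main__":
    # Example names: one built from coordinates, one rs-style identifier.
    for name in [make_predictor_name(1, 752566), "rs3094315"]:
        print(name, is_chr_bp_name(name))
```

A check like this only confirms the form of the names; it cannot confirm that the positions come from the intended genome assembly.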

When analysing summary statistics:

1 - The summary statistics should be in the format required by LDAK (see Summary Statistics for details).
2 - If you are using MegaPRS, you will require a (well-matched) reference panel. Ideally, this should contain at least 2000 samples; if you do not have such a panel, Reference Panel provides scripts for constructing a smaller panel from 1000 Genomes Project data.
3 - When using SNP data, we recommend excluding predictors with ambiguous alleles (A & T or C & G), in order to avoid strand errors.
4 - If you are analysing human SNP data, we recommend that your predictor names are in the form Chr:BP using genomic positions from the GRCh37/hg19 assembly (this form is required if using the BLD-LDAK Annotations or Quick PRS).
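
The recommendation in point 3 to exclude ambiguous alleles can be sketched in Python as below. A and T are complementary bases, as are C and G, so a SNP with one of these allele pairs reads the same on either strand, and a strand flip between studies cannot be detected. This is a hedged illustration rather than an LDAK feature: the function names and the record keys "Predictor", "A1" and "A2" are assumptions made for this example.

```python
# Allele pairs that are their own complement, so strand cannot be inferred.
AMBIGUOUS_PAIRS = {frozenset("AT"), frozenset("CG")}


def is_strand_ambiguous(allele1, allele2):
    """Return True for A/T or C/G allele pairs, in either order and any case."""
    return frozenset((allele1.upper(), allele2.upper())) in AMBIGUOUS_PAIRS


def drop_ambiguous(records):
    """Keep only records whose alleles are not strand-ambiguous.

    Each record is assumed to be a dict with keys "A1" and "A2".
    """
    return [r for r in records if not is_strand_ambiguous(r["A1"], r["A2"])]


if __name__ == "__main__":
    # Made-up rows for illustration only.
    rows = [
        {"Predictor": "1:1000", "A1": "A", "A2": "G"},
        {"Predictor": "1:2000", "A1": "A", "A2": "T"},
        {"Predictor": "1:3000", "A1": "G", "A2": "C"},
    ]
    for row in drop_ambiguous(rows):
        print(row["Predictor"])
```

Comparing unordered allele pairs keeps the check independent of which allele is listed first.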

Note that we previously recommended dividing the reference panel into three parts, and using separate parts for different analyses. This is NO LONGER NECESSARY, and you can now use the full reference panel for each analysis.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Of the seven tools, we recommend Bolt-Predict, as this tends to produce the most accurate models. However, this requires individual-level data. If you only have access to summary statistics, we recommend that you instead use BayesR-SS (contained within MegaPRS).

Finally, in Worked Examples we construct two prediction models from scratch, first analysing individual-level data (using Bolt-Predict), then summary statistics (using BayesR-SS).
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Note that it is still possible to view the pages for MultiBLUP, the individual-level prediction tool we created in 2014. However, please be aware that we now recommend using the newer tools described above.