Here we explain how to construct a prediction model using a generalised version of Ridge Regression. These instructions assume you are analysing individual-level data (if instead you are analysing summary statistics, you should use MegaPRS). Further, they require that you have already estimated Per-Predictor Heritabilities.
Always read the screen output, which suggests arguments and estimates memory usage.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
The main argument is --ridge <outfile>.
This requires the following options:
--bfile <datastem> or --speed <datastem> - to specify the genetic data files (see File Formats). If your genetic data are in a different format, you should first Make Data.
--pheno <phenofile> - to specify phenotypes (in PLINK format). Samples without a phenotype will be excluded. If <phenofile> contains more than one phenotype, specify which should be used with --mpheno <integer>.
--ind-hers <indhersfile> - to specify the per-predictor heritabilities. <indhersfile> should have two columns, providing predictor names then estimated heritabilities.
--cv-proportion <float> - to specify the fraction of samples used for testing different parameters of the prior distribution. We suggest using --cv-proportion 0.1, so that models are trained using 90% of samples (picked at random), then tested using the remaining 10%. If you prefer to specify the test samples you should instead use --cv-samples <cvsampsfile>, while to turn off cross-validation, you should instead use --skip-cv YES (LDAK will then output multiple models, each trained using 100% of samples).
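Before running --ridge, it can help to confirm that <indhersfile> has the two-column layout described above. The following is a minimal sketch using standard awk, not an LDAK feature. The function name check_ind_hers and the toy file are hypothetical, and the check only counts columns, so it behaves the same whether or not the file begins with a header row.

```shell
# Hypothetical helper (not part of LDAK): count lines that do not have exactly
# two whitespace-separated fields, the layout expected for <indhersfile>.
check_ind_hers() {
    awk 'NF != 2 { bad++ } END { print bad + 0 }' "$1"
}

# Toy file for illustration only; the predictor names and values are made up.
# The third line is deliberately missing its heritability.
toy_hers=$(mktemp)
printf 'rs1 0.0010\nrs2 0.0005\nrs3\n' > "$toy_hers"

n_bad=$(check_ind_hers "$toy_hers")
echo "lines with the wrong number of columns: $n_bad"
rm -f "$toy_hers"
```

On a real file, you would call check_ind_hers with the path to your own <indhersfile> and expect a count of zero.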
You can use --keep <keepfile> and/or --remove <removefile> to restrict to a subset of samples, and --extract <extractfile> and/or --exclude <excludefile> to restrict to a subset of predictors (for more details, see Data Filtering).
You can use --covar <covarfile> to provide covariates (in PLINK format); the phenotype will be regressed on these prior to estimating effect sizes. If the data contain predictors that are definitely associated with the phenotype, you can specify these using --top-preds <toppredslist>. These predictors will be treated as fixed effects (their effect sizes will be estimated when the phenotype is regressed on the regular covariates), instead of as random effects (like regular predictors).
If LDAK does not finish in the time available (for example, because a job reaches its time limit), it can be resumed by rerunning the same command with --restart YES added.
The estimated prediction model is saved in <outfile>.effects. Usually, this file has five columns, providing the predictor name, its A1 and A2 alleles, the average number of A1 alleles, then its estimated effect (relative to the A1 allele). If you used --skip-cv YES, there will be effect sizes for each of the different prior parameters. This file is ready to be used for Calculating Scores (i.e., to predict the phenotypes of new samples).
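As a quick sanity check of the five-column layout described above, the effects file can be summarised with awk. This is only a sketch: the toy file below stands in for <outfile>.effects, its values are made up, and the presence of a header row (and its column names) is an assumption; if your file has no header, remove the NR > 1 conditions.

```shell
# Toy stand-in for <outfile>.effects; values are made up, and the header row
# is an assumption rather than something stated in this documentation.
toy_effects=$(mktemp)
cat > "$toy_effects" <<'EOF'
Predictor A1 A2 Centre Effect
rs1 A G 0.40 0.012
rs2 C T 1.10 -0.030
rs3 G A 0.75 0.000
EOF

# Count predictors with a non-zero effect (column 5), skipping the assumed header.
n_nonzero=$(awk 'NR > 1 && $5 != 0 { n++ } END { print n + 0 }' "$toy_effects")

# Find the predictor whose effect is largest in absolute value.
top_pred=$(awk 'NR > 1 { a = ($5 < 0) ? -$5 : $5; if (a > max) { max = a; top = $1 } } END { print top }' "$toy_effects")

echo "non-zero effects: $n_nonzero; largest absolute effect: $top_pred"
rm -f "$toy_effects"
```

The same two awk commands can be pointed at a real effects file. They assume the single-model, five-column case and would need adapting for the multi-model output produced with --skip-cv YES.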
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
Example:
Here we use the binary PLINK files human.bed, human.bim and human.fam, and the phenotype quant.pheno from the Test Datasets. We also use the file ldak.thin.ind.hers, created in the example for Per-Predictor Heritabilities. This contains estimates of the heritability contributed by each predictor, obtained assuming the LDAK-Thin Model (note that we normally recommend using the BLD-LDAK Model, but as this is only an example, we use the simpler model).
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
We construct a ridge regression model by running the command
./ldak.out --ridge ridge --pheno quant.pheno --bfile human --ind-hers ldak.thin.ind.hers --cv-proportion 0.1
The estimated effect sizes are saved in ridge.effects.
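The options described earlier can be combined with this example in a few ways. The sketch below builds two variants of the example command as strings and prints them instead of executing them, so it runs without the LDAK binary or the test data. The variable names are hypothetical; every flag is taken from the descriptions above.

```shell
# Arguments shared by both variants, copied from the example command.
base="./ldak.out --ridge ridge --pheno quant.pheno --bfile human --ind-hers ldak.thin.ind.hers"

# Variant 1: the example command rerun with --restart YES, to resume a run
# that did not finish.
resume_cmd="$base --cv-proportion 0.1 --restart YES"

# Variant 2: cross-validation turned off, so that LDAK outputs multiple models,
# each trained using 100% of samples.
nocv_cmd="$base --skip-cv YES"

# Print rather than execute, so the commands can be checked first.
echo "$resume_cmd"
echo "$nocv_cmd"
```

To run a variant for real, paste the printed line into the shell from the directory containing ldak.out and the data files.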