Calculate Scores

Here we explain how to calculate linear combinations of predictor values (the linear projection of genetic data onto predictor effect sizes). These are most commonly used for creating polygenic risk scores and testing the performance of prediction models. However, they are also used for Quality Control (e.g., you can infer sample ancestry by using a global dataset to calculate population axes, then projecting your dataset onto these axes).

Always read the screen output, which suggests arguments and estimates memory usage.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

The main argument is --calc-scores <outfile>.

This requires the following options

--scorefile <scorefile> - to provide the predictor effect sizes. The score file should have at least five columns. The first 4 columns provide the name of each predictor, its two alleles (test allele followed by reference allele), then its centre (the mean number of test alleles); the remaining columns provide sets of predictor effect sizes. The file should have a header row, whose first element must be "Predictor" or "SNP". Note that if centre is "NA" for a predictor, then LDAK will centre values based on the mean number of test alleles in the genetic data.

–bfile/–chiamo/–sp/–speed <prefix> - to specify genetic data files (see File Formats)

--power <float> - to specify how predictors are scaled (see below). Usually, the score file contains raw effects, so you should use --power 0.

Typically, the sets of effect sizes in the score file will correspond to different prediction models, and we are interested in measuring their accuracy. There are two ways to do this. Either you have phenotypes for samples in the genetic data (i.e., you have individual-level validation data), in which case you should provide these using --pheno <phenofile>. LDAK will then calculate the correlation between scores and phenotypic values for the samples in the genetic data. Alternatively, you have summary statistics from an association study, in which case you should provide these using --summary <sumsfile>. LDAK will then estimate the correlation between scores and phenotypic values for the samples in the association study (the genetic data will be used a Reference Panel, in order to estimate predictor-predictor correlations for the samples in the association study).

The profile scores will be saved in <outfile>.profile, while if you use either --pheno <phenofile> or --summary <sumsfile>, the (estimated) correlation between scores and phenotypic values will be saved in <outfile>.cors.

Sometimes you will have two sets of prediction models, for example, one computed using training samples and one computed using all samples. You can then use --scorefile <scorefile> to provide the first set of prediction models, --pheno <phenofile> or --summary <sumsfile> to provide phenotypic values or summary statistics, and --final-effects <finaleffectsfile> to provide the second set of prediction models. LDAK will save to <outfile>.effects.best the prediction model from the second set that corresponds to the most accurate model from the first set (for example, if Model 5 from the first set has highest correlation, LDAK will save Model 5 from the second set).
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Formula:

Suppose there are m predictors. Let cj and bj be the centre and effect size of Predictor j. By default, the score of Sample i is

Pi = b1 (Xi1-c1) (c1(1-c1/2))^(power/2) + ... + bm (Xim-cm) (cm(1-cm/2))^(power/2)

where Xij is the value of Sample i for Predictor j. When the predictors are SNPs, cj/2 is an estimate of the allele frequency of Predictor j, and therefore (cj(1-cj/2)) is its expected variance assuming Hardy-Weinberg Equilibrium. When power equals one, predictors are divided by their expected standard deviation (i.e., are standardised). Therefore, you should use --power=-1 when the score file contains standardised effect sizes. By contrast, when power equals zero, the formula reduces to

Pi = b1 (Xi1-c1) + ... + bm (Xim-cm)

so that predictors are no longer scaled. Therefore, you should use --power=0 when the score file contains raw effect sizes.

If you add --hwe-stand NO, then LDAK will instead use the formula

Pi = bj (Xi1-c1) v1^(power/2) + ... + bm (Xim-cm) vm^(power/2)

where vj is the variance of Predictor j in the genetic data.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Example:

Here we use the binary PLINK files human.bed, human.bim and human.fam, and the phenotype quant.pheno from the Test Datasets. We will construct a toy score file using the following command

echo "Predictor A1 A2 Centre Effect1 Effect2
21:14642464 A G 0.88 0.3 -0.1
21:14649798 C A 0.97 -0.2 0.4" > scores.txt

Note that 21:14642464 and 21:14649798 are the names of the first two SNPs in human.bim. We can see they are both common SNPs (their centres are close to 1, meaning that their minor allele frequencies are close to 0.5).

We calculate scores by running

./ldak.out --calc-scores scores --scorefile scores.txt --bfile human --power 0

The file scores.profile contains two profiles, corresponding to the two sets of effect sizes. For the first profile, the score of Sample i is

Pi = 0.3 (Xi1 - 0.88) + -0.2 (Xi2 - 0.97)

while for the second profile, the score of Sample i is

Pi = -0.1 (Xi1 - 0.88) + 0.4 (Xi2 - 0.97)

where Xi1 is the value of Sample i for Predictors 1 (the number of A alleles), while Xi2 is the value of Sample i for Predictor 2 (the number of alleles).