Here we explain how to calculate linear combinations of predictor values (the linear projection of genetic data onto predictor effect sizes). These are most commonly used for creating polygenic risk scores and testing the performance of prediction models. However, they are also used for Quality Control (e.g., you can infer sample ancestry by using a global dataset to calculate population axes, then projecting your dataset onto these axes).

Always read the screen output, which suggests arguments and estimates memory usage.

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

The main argument is --calc-scores <outfile>

which requires the following options

--scorefile <scorefile> - to provide the predictor effect sizes. The score file should have at least five columns. The first 4 columns provide the name of each predictor, its two alleles (test allele followed by reference allele), then its centre (the mean number of test alleles); the remaining columns provide sets of predictor effect sizes. The file should have a header row, whose first element must be "Predictor" or "SNP". Note that if centre is "NA" for a predictor, then LDAK will centre values based on the mean number of test alleles in the genetic data.

–bfile/–chiamo/–sp/–speed <prefix> - to specify genetic data files (see File Formats)

--power <float> - to specify how predictors are scaled. In general, if the score file contains raw effects, you should use --power 0, while if it contains standardised effect sizes, you should use --power -1 (see below).

Typically, the sets of effect sizes in the score file will correspond to different prediction models, and we are interested in measuring their accuracy. There are two ways to do this. Either you have phenotypes for samples in the genetic data (i.e., you have individual-level validation data), in which case you should provide these using --pheno <phenofile>. LDAK will then calculate the correlation between scores and phenotypes for the samples in the genetic data. Alternatively, you have summary statistics from an association study, in which case you should provide these using --summary <sumsfile>. LDAK will then estimate the correlation between scores and phenotypes for the samples in the association study (the genetic data will be used a Reference Panel, in order to estimate predictor-predictor correlations for the samples in the association study).

The profile scores will be saved in <outfile>.profile, while if you use either --pheno <phenofile> or --summary <sumsfile>, the (estimated) correlation between scores and phenotypes will be saved in <outfile>.cors.

We often calculate scores when using MegaPRS. Having created training and full models for a variety of prior distributions, we then test the accuracy of the training models, in order to decide which of the full models to use. In this step, we add --final-effects <finaleffectsfile>, where <finaleffectsfile> contains effect sizes calculated using full summary statistics. LDAK will then save the best model (i.e., the full model corresponding to the most accurate prior distribution) in <outfile>.effects.best.

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

**Formula:**

Suppose there are m predictors. Let c_{j} and b_{j} be the centre and effect size of Predictor j. By default, the score of Sample i is

P_{i} = b_{1} (X_{i1}-c_{1}) (c_{1}(1-c_{1}/2))^(power/2) + ... + b_{m} (X_{im}-c_{m}) (c_{m}(1-c_{m}/2))^(power/2)

where X_{ij} is the value of Sample i for Predictor j. When the predictors are SNPs, c_{j}/2 is an estimate of the allele frequency of Predictor j, and therefore (c_{j}(1-c_{j}/2)) is its expected variance assuming Hardy-Weinberg Equilibrium. When power equals one, predictors are divided by their expected standard deviation (i.e., are standardised). Therefore, you should use --power=-1 when the score file contains standardised effect sizes. By contrast, when power equals zero, the formula reduces to

P_{i} = b_{1} (X_{i1}-c_{1}) + ... + b_{m} (X_{im}-c_{m})

so that predictors are no longer scaled. Therefore, you should use --power=0 when the score file contains raw effect sizes.

If you add --hwe-stand NO, then LDAK will instead use the formula

P_{i} = b_{j} (X_{i1}-c_{1}) v_{1}^(power/2) + ... + b_{m} (X_{im}-c_{m}) v_{m}^(power/2)

where v_{j} is the variance of Predictor j in the genetic data.

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

**Example:**

Here we use the binary PLINK files human.bed, human.bim and human.fam, and the phenotype quant.pheno from the Test Datasets. We will construct a toy score file using the following command

echo "Predictor A1 A2 Centre Effect1 Effect2

21:14642464 A G 0.88 0.3 -0.1

21:14649798 C A 0.97 -0.2 0.4" > scores.txt

Note that 21:14642464 and 21:14649798 are the names of the first two SNPs in human.bim. We can see they are both common SNPs (their centres are close to 1, meaning that their minor allele frequencies are close to 0.5).

We calculate scores by running

./ldak.out --calc-scores scores --scorefile scores.txt --bfile human --power 0

The file scores.profile contains two profiles, corresponding to the two sets of effect sizes. For the first profile, the score of Sample i is

P_{i} = 0.3 (X_{i1 }- 0.88) + -0.2 (X_{i2 }- 0.97)

while for the second profile, the score of Sample i is

P_{i} = -0.1 (X_{i1 }- 0.88) + 0.4 (X_{i2 }- 0.97)

where X_{i1} is the value of Sample i for Predictors 1 (the number of A alleles), while X_{i2} is the value of Sample i for Predictor 2 (the number of alleles).