Profile Scores

Suppose you have used MultiBLUP to create a prediction model (list of effect sizes) and wish to predict the phenotypic values for new individuals for whom you have genotypic data. To obtain the predicted phenotypes for these individuals, use the argument

--calc-scores <output>

which requires the following options (to restrict to a subset of data, see Data Filtering):

--scorefile <scorefile> - to provide the sets of effect sizes (in the format described below)

--bfile/--chiamo/--sp/--speed <prefix> - to specify data files (see File Formats)

--power <float> - predictor values are scaled by [2fj(1-fj)]^(power/2), where fj is the MAF of predictor j. Therefore, if your scorefile contains raw effect sizes (which is generally the case), you should use --power 0, but if your scorefile contains standardized effect sizes, use --power -1.
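
As a rough illustration of the scaling factor described above, the following Python sketch evaluates [2f(1-f)]^(power/2) for a single predictor. The function name maf_scale and the example MAF of 0.25 are hypothetical choices for this illustration; this is not LDAK code.

```python
def maf_scale(maf: float, power: float) -> float:
    """Return [2 f (1 - f)]^(power / 2) for a predictor with MAF f.

    Hypothetical helper for illustration only.
    """
    if not 0.0 < maf <= 0.5:
        raise ValueError("MAF must lie in (0, 0.5]")
    return (2.0 * maf * (1.0 - maf)) ** (power / 2.0)


# --power 0: the factor is 1, so allele counts are left unscaled.
raw_scale = maf_scale(0.25, 0.0)

# --power -1: the factor is 1 / sqrt(2 f (1 - f)), which standardizes the counts.
std_scale = maf_scale(0.25, -1.0)
```

With power 0 the factor is exactly 1 whatever the MAF, which is why raw effect sizes pair with --power 0.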

The score file should have one row per predictor and 4+X columns, where X is the number of models provided. The first 4 columns provide the predictor name, Allele 1 (test allele), Allele 2 (reference allele) and the predictor centre; the remaining columns provide the predictor effect sizes for each model. The file should have a header row also containing 4+X elements, the first of which must read "Predictor".

For example, suppose the scorefile contains
Predictor A1 A2 Centre Axis_1 Axis_2
rs1 A C 0.5 0.3 -0.1
rs2 G A 0.3 -0.2 0.4

The first risk profile will be (assuming we use --power 0)
P1 = 0.3 (S[rs1,A] - 0.5) - 0.2 (S[rs2,G] - 0.3)
while the second risk profile will be
P2 = -0.1 (S[rs1,A] - 0.5) + 0.4 (S[rs2,G] - 0.3)
where S[X,Y] is the count of Allele Y at Predictor X.

So, for Profile 1, LDAK takes each individual's (centred) allele count for rs1 (with respect to Allele A) and multiplies it by 0.3, then subtracts 0.2 times the (centred) allele count for rs2 (with respect to Allele G). Note that if we had instead used --power -1, the S[X,Y] would represent standardized allele counts.
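
The calculation above can be sketched in Python for the --power 0 case. This is a minimal illustration, not LDAK's implementation: the function names read_scorefile and profile_scores and the individual's allele counts are hypothetical, and a Centre of NA is not handled here.

```python
import io

# The example score file from the text, held in memory.
SCOREFILE = """Predictor A1 A2 Centre Axis_1 Axis_2
rs1 A C 0.5 0.3 -0.1
rs2 G A 0.3 -0.2 0.4
"""


def read_scorefile(handle):
    """Parse a score file with 4+X whitespace-separated columns."""
    header = handle.readline().split()
    if len(header) < 5 or header[0] != "Predictor":
        raise ValueError("header needs 4+X elements, the first reading 'Predictor'")
    model_names = header[4:]
    rows = []
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != len(header):
            raise ValueError("row width does not match the header")
        name, a1, a2, centre = fields[:4]
        effects = [float(x) for x in fields[4:]]
        rows.append((name, a1, a2, float(centre), effects))
    return model_names, rows


def profile_scores(rows, n_models, a1_counts):
    """Sum effect * (A1 count - centre) over predictors, one total per model."""
    scores = [0.0] * n_models
    for name, _a1, _a2, centre, effects in rows:
        centred = a1_counts[name] - centre
        for m, effect in enumerate(effects):
            scores[m] += effect * centred
    return scores


model_names, rows = read_scorefile(io.StringIO(SCOREFILE))

# Hypothetical individual: two copies of A at rs1, one copy of G at rs2.
counts = {"rs1": 2, "rs2": 1}
p1, p2 = profile_scores(rows, len(model_names), counts)
```

For this hypothetical individual, P1 = 0.3 (2 - 0.5) - 0.2 (1 - 0.3), approximately 0.31, and P2 = -0.1 (2 - 0.5) + 0.4 (1 - 0.3), approximately 0.13.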

Centre corresponds to the mean allele count, so for SNP data it will typically be twice the minor allele frequency. If you set Centre to NA for a predictor, LDAK will use the observed mean. LDAK assumes each individual has two alleles, so that the A2 allele count = 2 - the A1 allele count. Therefore, be careful when using non-SNP data; the best solution is to keep the order of alleles for each predictor the same in the data and score files.
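
The two rules in this paragraph can be sketched in Python as follows. The helper names resolve_centre and count_for_score_allele and the sample counts are hypothetical, and the allele-swapping helper only illustrates the identity A2 count = 2 - A1 count; it is not a description of how LDAK itself treats mismatched allele orders.

```python
def resolve_centre(centre_field, a1_counts):
    """Return the supplied centre, or the observed mean count when it is NA."""
    if centre_field == "NA":
        return sum(a1_counts) / len(a1_counts)
    return float(centre_field)


def count_for_score_allele(data_a1, data_a2, data_a1_count, score_a1, score_a2):
    """Express a data-file A1 count with respect to the score file's A1."""
    if (data_a1, data_a2) == (score_a1, score_a2):
        return data_a1_count
    if (data_a1, data_a2) == (score_a2, score_a1):
        # Assumes two alleles per individual, as the text states.
        return 2 - data_a1_count
    raise ValueError("alleles in data and score file do not match")


# Hypothetical A1 counts for four individuals at one predictor.
observed = [0, 1, 2, 1]
centre = resolve_centre("NA", observed)  # falls back to the observed mean

# Data file lists alleles as C/A, score file as A/C: two copies of C means zero of A.
flipped = count_for_score_allele("C", "A", 2, "A", "C")
```

For data where an individual may carry more or fewer than two alleles, the 2 - count identity does not hold, which is why keeping the allele order the same in both files is the safer route.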