Quick PRS

Quick PRS is a super-fast way to construct state-of-the-art SNP-based prediction models for complex human traits that requires only summary statistics. Quick PRS is an approximate version of MegaPRS; whereas MegaPRS requires the user to compute a tagging file, heritability matrix and predictor-predictor correlations, Quick PRS uses versions of these files pre-computed using data from the UK Biobank (for more details, read How Quick PRS Works).

Currently, we provide sets of pre-computed files corresponding to four populations: GBR (computed using 2000 white British individuals), SAS (4214 Indian and Pakistani individuals), EAS (1279 Chinese individuals) and AFR (2577 African individuals). Click here to see a principal component plot illustrating the four different populations.

For each population, we provide sets of pre-computed files corresponding to two SNP subsets: 1.0-1.2M non-ambiguous HapMap3 SNPs and 320-580k non-ambiguous directly genotyped SNPs. You should use whichever version best matches your summary statistics (in general, you should use the HapMap3 version if you have summary statistics from a GWAS that used imputation, otherwise use the directly genotyped version).

Please note that you should only use Quick PRS when you have summary statistics for the majority of the SNPs used to generate the pre-computed files. If you are missing summary statistics for more than 20% of the SNPs, you should instead use one of the alternative prediction tools (i.e., Ridge-Predict, Bolt-Predict, BayesR-Predict or MegaPRS). The example below provides a script for working out the overlap of SNPs (alternatively, LDAK will report this number when you use the command --sum-hers <outfile> to estimate the per-predictor heritabilities).
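As a toy illustration of this overlap check (file names and SNP lists here are made up; a script for real data appears in the example below), you can count what fraction of the pre-computed SNPs also appear in your summary statistics:

```shell
# Toy lists of SNP names (Chr:BP form), for illustration only
printf '1:100\n1:200\n1:300\n1:400\n1:500\n' > precomputed.snps
printf '1:100\n1:300\n1:400\n1:999\n' > mysumstats.snps

# Count the fraction of pre-computed SNPs that have summary statistics
awk '(NR==FNR){have[$1]=1;next}{total++; if($1 in have) found++}
     END{printf "%d of %d SNPs (%.0f%%)\n", found, total, 100*found/total}' \
     mysumstats.snps precomputed.snps
```

For these toy files, the command reports 3 of 5 SNPs (60%), which would be below the 80% threshold recommended above.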

Always read the screen output, which suggests arguments and estimates memory usage.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Preparation:

Your SNP names should be in the form Chr:BP using genomic positions from the GRCh37/hg19 assembly (so they match those in the pre-computed files). If you are unsure whether the genomic positions are from the GRCh37/hg19 assembly, you can check a few SNPs in the UCSC Genome Browser (if you find the positions are from a different assembly, you can update them using the LiftOver Tool).
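A quick sanity check (a sketch, using a hypothetical file snps.txt containing one SNP name per line) is to count how many names match the Chr:BP pattern:

```shell
# Toy list of SNP names; in practice, extract Column 1 of your summary statistics
printf '1:752566\n1:885689\nrs12345\n' > snps.txt

# Count names that are, and are not, in Chr:BP form
awk '($1 ~ /^[0-9]+:[0-9]+$/){good++;next}{bad++}
     END{print good" names in Chr:BP form, "bad" not"}' snps.txt
```

If many names are not in Chr:BP form (e.g., they are rs ids), convert them as described next.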

When using publicly-available summary statistics, you might only have rs ids (and not genomic positions). The following two files provide details of the HapMap3 and directly genotyped SNPs, which you can use to convert rs ids to genomic positions (see the example below).

HapMap3 SNP Details (All populations)
Directly genotyped SNP Details (All populations)

Your summary statistics should be in the format required by LDAK (see Summary Statistics for details). If you have individual-level data, you can obtain summary statistics using the command --linear <outfile> (for details on this command, see Single-Predictor Analysis).
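As a minimal check of the format (assuming your file, here called height.sums, uses the columns Predictor, A1, A2, Direction, Stat and n, as in the example later on this page), you can inspect the header row:

```shell
# Toy file in the LDAK summary-statistics format
printf 'Predictor A1 A2 Direction Stat n\n1:100 A G 0.01 4.2 250000\n' > height.sums

# Confirm the header row starts with the expected column names
head -n 1 height.sums | awk '($1=="Predictor" && $2=="A1" && $3=="A2"){print "header OK";exit}{print "unexpected header"}'
```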
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Download and extract pre-computed files:

You should download the set of pre-computed files that best matches your summary statistics (i.e., in terms of ancestry and SNP subset). Note that when using a UNIX operating system, you can download a file by typing wget followed by its web address (you can obtain the web address of the files below by right-clicking on the corresponding links).

Pre-computed files for GBR population (HapMap3 SNPs)
Pre-computed files for GBR population (directly genotyped SNPs)

Pre-computed files for SAS population (HapMap3 SNPs)
Pre-computed files for SAS population (directly genotyped SNPs)

Pre-computed files for EAS population (HapMap3 SNPs)
Pre-computed files for EAS population (directly genotyped SNPs)

Pre-computed files for AFR population (HapMap3 SNPs)
Pre-computed files for AFR population (directly genotyped SNPs)

Having downloaded a set of files, you should extract it. On a UNIX operating system, you can extract a set of files by typing tar -xzvf followed by the file name.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Construct the prediction model:

First you should estimate per-predictor heritabilities using the command --sum-hers <outfile>. You should use --summary <sumsfile>, --tagfile <taggingfile> and --matrix <matrixfile> to provide, respectively, your summary statistics, the pre-computed tagging file and the pre-computed heritability matrix. You should also add --check-sums NO (otherwise LDAK will report an error if some SNPs are missing). For details on this command, see the example below or SNP Heritability.

The estimated per-predictor heritabilities will be saved in <outfile>.ind.hers.

Then you can construct the prediction model using the command --mega-prs <outfile>. You should use --summary <sumsfile>, --ind-hers <indhersfile>, --cors <corstem> and --high-LD <highldpredictors> to provide, respectively, your summary statistics, the estimates of per-predictor heritabilities, the stem of the pre-computed predictor-predictor correlations and the pre-computed list of SNPs in regions of high linkage disequilibrium. You should also add --extract <sumsfile> (otherwise LDAK will report an error if some SNPs are missing). For details on this command, see the example below or MegaPRS. Note that when constructing the prediction model, we recommend using --model bayesr, --cv-proportion .1 and --window-cm 1.

The prediction model will be saved in <outfile>.effects. This file has five columns, providing the predictor name, its A1 and A2 alleles, the average number of A1 alleles, and its estimated effect (relative to the A1 allele). The file is ready to be used for Calculating Scores (i.e., to predict the phenotypes of new samples).
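To illustrate how an effects file is used when calculating scores, here is a toy sketch (this is not LDAK itself, and the file names and values are invented): a sample's profile score is the sum, across SNPs, of the estimated effect multiplied by the sample's A1 allele count.

```shell
# Toy effects file: predictor, A1, A2, average A1 count, estimated effect
printf '1:100 A G 0.5 0.2\n1:200 T C 1.0 -0.1\n' > toy.effects
# Toy A1 allele counts (0/1/2) for one individual, matched by predictor name
printf '1:100 2\n1:200 1\n' > toy.counts

# Profile score = sum over SNPs of (effect x A1 allele count)
awk '(NR==FNR){eff[$1]=$5;next}($1 in eff){score+=eff[$1]*$2}
     END{printf "score %.2f\n", score}' toy.effects toy.counts
```

Here the score is 0.2 x 2 + (-0.1) x 1 = 0.30; in practice you should use LDAK's Calculating Scores functions, which also handle allele alignment and missing genotypes.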
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Example:

Here we use summary statistics from the 2014 meta-analysis of height by the GIANT Consortium. We can download these using the command

wget https://portals.broadinstitute.org/collaboration/giant/images/0/01/GIANT_HEIGHT_Wood_et_al_2014_publicrelease_HapMapCeuFreq.txt.gz

The GIANT Consortium analysed European samples, and therefore we will use the pre-computed files generated using GBR individuals. The summary statistics provide results from analysing imputed SNPs, and therefore it is best to use the HapMap3 versions (we confirm this is the case below).

Note that the scripts use the tool awk, which is very efficient at processing large files and is usually installed by default on any UNIX operating system. You can read more about awk here.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

1 - Check SNP overlap and format summary statistics

First we download and unzip the details of the HapMap3 and directly genotyped SNPs

wget https://genetics.ghpc.au.dk/doug/hapmap.allpops.snp.details.gz
wget https://genetics.ghpc.au.dk/doug/genotyped.allpops.snp.details.gz
gunzip hapmap.allpops.snp.details.gz
gunzip genotyped.allpops.snp.details.gz

Next we extract the SNP names (in this case, rs ids) and alleles from the summary statistics. Note that at this point, we should perform quality control. The summary statistics contain per-predictor sample sizes, which vary between 50,003 and 253,280 (average 239,025), so we will exclude SNPs with sample size less than 200,000 (note that this threshold is fairly arbitrary, so you might choose a different one). If the summary statistics included information scores, we would also exclude SNPs with scores <0.95.

gunzip -c GIANT_HEIGHT_Wood_et_al_2014_publicrelease_HapMapCeuFreq.txt.gz | awk '(NR>1 && $8>=200000){print $1, $2, $3}' > height.details

Now we work out which of the summary statistics predictors are also in each of the SNP subsets (checking that the corresponding alleles match). Note that Column 7 of each SNP details files indicates which SNPs were used when making the GBR files (Columns 8, 9 and 10 indicate which SNPs were used when making the SAS, EAS and AFR versions, respectively). In this step, it is convenient to print out both the rs ids and Chr:Pos form for each overlapping SNP, as we can use these below when formatting summary statistics.

awk '(NR==FNR){arr[$1]=$2$3;next}($6 in arr && (arr[$6]==$4$5 || arr[$6]==$5$4) && $7=="YES"){print $1, $6}' height.details hapmap.allpops.snp.details > overlap.hapmap
wc -l overlap.hapmap
awk '($7=="YES")' hapmap.allpops.snp.details | wc -l

awk '(NR==FNR){arr[$1]=$2$3;next}($6 in arr && (arr[$6]==$4$5 || arr[$6]==$5$4) && $7=="YES"){print $1, $6}' height.details genotyped.allpops.snp.details > overlap.genotyped
wc -l overlap.genotyped
awk '($7=="YES")' genotyped.allpops.snp.details | wc -l

The first set of commands tells us that we have summary statistics for 1,010,682 of the 1,168,975 HapMap3 SNPs (86%), confirming that it is OK to use the HapMap3 set of pre-computed files. By contrast, the second set of commands tells us that we have summary statistics for only 193,106 of the 577,457 directly genotyped SNPs (33%), which is not enough to use the directly genotyped set of pre-computed files.
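The percentages above come from dividing the two counts reported by each pair of wc -l commands; for example

```shell
# 1,010,682 of 1,168,975 HapMap3 SNPs, and 193,106 of 577,457 genotyped SNPs
awk 'BEGIN{printf "%.0f%% %.0f%%\n", 100*1010682/1168975, 100*193106/577457}'
```

which prints 86% 33%.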
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Finally, we format the summary statistics, reducing to the overlapping SNPs and ensuring that the names are in the form Chr:BP.

gunzip -c GIANT_HEIGHT_Wood_et_al_2014_publicrelease_HapMapCeuFreq.txt.gz | awk '(NR==FNR){arr[$2]=$1;next}(FNR==1){print "Predictor A1 A2 Direction Stat n"}($1 in arr){snp=arr[$1];a1=$2;a2=$3;dir=$5;stat=($5/$6)^2;n=$8;print snp, a1, a2, dir, stat, n}' overlap.hapmap - > height.hapmap

Check that this file looks correct

head -n 5 height.hapmap
Predictor A1 A2 Direction Stat n
10:100012890 A G -0.0062 4.27111 252156
10:100013563 T C -0.0087 5.24169 248425
10:100016339 T C 0.011 14.3876 253135
10:100017453 T G -0.014 20.3954 251364

Check there are no duplicate predictors (if there are, the following command will print them out)

awk < height.hapmap '{print $1}' | sort | uniq -d | head
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

2 - Construct the prediction model

In Step 1, we determined that we should use the HapMap3 version of the pre-computed files generated using GBR individuals. We therefore download and extract these

wget https://genetics.ghpc.au.dk/doug/gbr.hapmap.tar.gz
tar -xzvf gbr.hapmap.tar.gz

We first estimate per-predictor heritabilities by running

./ldak.out --sum-hers height.bld.ldak --summary height.hapmap --tagfile gbr.hapmap/gbr.hapmap.bld.ldak.quickprs.tagging --matrix gbr.hapmap/gbr.hapmap.bld.ldak.quickprs.matrix --check-sums NO

The estimated per-predictor heritabilities are saved in height.bld.ldak.ind.hers. Note that the screen output confirms that we have summary statistics for 1,010,682 of the 1,168,975 SNPs (matching the number we calculated above).

We then construct a BayesR prediction model by running

./ldak.out --mega-prs height.bld.ldak.bayesr --summary height.hapmap --ind-hers height.bld.ldak.ind.hers --cors gbr.hapmap/gbr.hapmap --high-LD gbr.hapmap/highld.snps --model bayesr --cv-proportion .1 --window-cm 1 --extract height.hapmap

The prediction model will be saved in height.bld.ldak.bayesr.effects.

Note that if you wish to convert the predictor names back to rs ids, you can use the command

awk '(NR==FNR){a[$1]=$2;next}(FNR>1){$1=a[$1]}{print $0}' overlap.hapmap height.bld.ldak.bayesr.effects > height.bld.ldak.bayesr.effects.rsids