Worked Examples

If you have access to individual-level data, we recommend constructing a prediction model using Bolt-Predict assuming the BLD-LDAK Model. If you only have access to summary statistics, we recommend constructing a prediction model usingĀ  BayesR-SS (contained within MegaPRS) assuming the BLD-LDAK Model. Here we provide complete examples for using Bolt-Predict and BayesR-SS. The first example should take about 30 minutes, the second should take about three hours to run on a single CPU or less than an hour if you have access to a computer cluster.

Please note that these examples focus on the commands, and provide only a brief description of what each command does. If you wish to fully understand the process (i.e., how LDAK first estimates per-predictor heritabilities, given the heritability model, then estimates effect sizes), you should read Prediction and click on the links next to each command.

Further, in the second example, we use a reference panel containing 404 individuals and summary statistics from the GIANT Consortium. However, please note that you should ideally use a larger reference panel (e.g., at least 2000 individuals), and should not create Pseudo Summaries when the summary statistics have been subjected to genomic control.

To run these examples, you require the LDAK executable (obtain from Downloads), which we assume you have renamed ldak.out. To avoid duplicate file names, we suggest you run each example in its own folder (which at the start contains only the LDAK executable).
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Example of running Bolt-Predict (uses individual-level data):

Here we use the binary PLINK files human.bed, human.bim and human.fam, and the phenotype quant.pheno from the Test Datasets.

Always read the screen output, which suggests arguments and estimates memory usage.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

1 - Preparation

Download and extract the test dataset.

wget https://www.dropbox.com/s/5vcfcree3xnvhew/data.zip?dl=1 -O data.zip
unzip data.zip
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

2 - Estimate per-predictor heritabilities, assuming the BLD-LDAK Model.

Compute summary statistics (for more details, see Single-Predictor Analysis).

./ldak.out --linear quant --bfile human --pheno quant.pheno

Download and extract the first 64 SNP annotations for the BLD-LDAK Model (see BLD-LDAK Annotations).

wget https://genetics.ghpc.au.dk/doug/bld.zip
unzip bld.zip

Compute LDAK weightings, and rename these bld65 (see LDAK Weightings).

./ldak.out --cut-weights sections --bfile human
./ldak.out --calc-weights-all sections --bfile human
mv sections/weights.short bld65

Calculate tagging file assuming the BLD-LDAK Model (see Calculate Taggings).

./ldak.out --calc-tagging bld.ldak --bfile human --ignore-weights YES --power -.25 --annotation-number 65 --annotation-prefix bld --window-cm 1 --save-matrix YES

Estimate heritability contributed by each predictor (see SNP Heritability).

./ldak.out --sum-hers bld.ldak --tagfile bld.ldak.tagging --summary quant.summaries --matrix bld.ldak.matrix
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

3 - Create the prediction model.

Estimate effect sizes (see Bolt-Predict).

./ldak.out --bolt bolt --pheno quant.pheno --bfile human --ind-hers bld.ldak.ind.hers --cv-proportion .1
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

The final model is saved in bolt.effects.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Example of running BayesR-SS (uses summary statistics):

Here we use summary statistics from the 2014 meta-analysis of height from the GIANT Consortium, while the reference panel uses genotype data for non-Finnish europeans from the 1000 Genomes Project.

Always read the screen output, which suggests arguments and estimates memory usage.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

1 - Preparation

Download and extract a non-Finnish European reference panel constructed from 1000 Genomes Project data (the scripts we used to make this panel are in the example of Reference Panel). Note that we recommend using a larger panel (at least 2000 samples).

wget https://genetics.ghpc.au.dk/doug/ref.tgz
tar -xzvf ref.tgz

Download and format the summary statistics (for more details, see Summary Statistics). The raw results use rs ids, so we will use the file ref.names (downloaded in the previous step), to convert these to names of the form Chr:BP, and to exclude SNPs incompatible with our reference panel.

#Download results from GIANT website, and extract SNPs with single-basepair alleles
wget https://portals.broadinstitute.org/collaboration/giant/images/0/01/GIANT_HEIGHT_Wood_et_al_2014_publicrelease_HapMapCeuFreq.txt.gz
gunzip -c GIANT_HEIGHT_Wood_et_al_2014_publicrelease_HapMapCeuFreq.txt.gz | awk '(NR>1){snp=$1;a1=$2;a2=$3;dir=$5;stat=($5/$6)^2;n=$8}(NR==1){print "Predictor A1 A2 Direction Stat n"}(NR>1 && (a1=="A"||a1=="C"||a1=="G"||a1=="T") && (a2=="A"||a2=="C"||a2=="G"||a2=="T")){print snp, a1, a2, dir, stat, n}' - > height.raw

#Use ref.names to convert rs ids and exclude SNPs with inconsistent alleles
awk '(NR==FNR){arr[$1]=$2;ars[$1]=$3$4;next}(FNR==1){print $0}($1 in arr && ($2$3==ars[$1]||$3$2==ars[$1])){$1=arr[$1];print $0}' ref.names height.raw > height.txt

For all analyses, we will only use non-ambiguous reference panel SNPs for which we have summary statistics.

awk '(NR==FNR && (($2=="A"&&$3=="C") || ($2=="A"&&$3=="G") || ($2=="C"&&$3=="A") || ($2=="C"&&$3=="T") || ($2=="G"&&$3=="A") || ($2=="G"&&$3=="T") || ($2=="T"&&$3=="C") || ($2=="T"&&$3=="G")) ){arr[$1];next}($2 in arr){print $2}' height.txt ref.bim > use.snps

As we will be using pseudo summary statistics, we must divide the reference panel samples into three.

awk < ref.fam '(NR%3==1){print $0 > "keepa"}(NR%3==2){print $0 > "keepb"}(NR%3==0){print $0 > "keepc"}'

Further, we must download details of long-range linkage disquilibrium regions and identify which reference panel SNPs they contain (see High-LD Regions).

wget http://dougspeed.com/wp-content/uploads/highld.txt
./ldak.out --cut-genes highld --bfile ref --genefile highld.txt
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

2 - Estimate per-predictor heritabilities, assuming the BLD-LDAK Model.

Download and extract the first 64 SNP annotations for the BLD-LDAK Model (see BLD-LDAK Annotations).

wget https://genetics.ghpc.au.dk/doug/bld.zip
unzip bld.zip

Compute LDAK weightings, and rename these bld65 (see LDAK Weightings).

for j in {1..22}; do
./ldak.out --cut-weights sections$j --bfile ref --extract use.snps --chr $j

./ldak.out --calc-weights-all sections$j --bfile ref --extract use.snps --chr $j
done
cat sections{1..22}/weights.short > bld65

Calculate tagging file assuming the BLD-LDAK Model (see Calculate Taggings).

./ldak.out --calc-tagging bld.ldak --bfile ref --extract use.snps --ignore-weights YES --power -.25 --annotation-number 65 --annotation-prefix bld --window-cm 1 --save-matrix YES

Estimate heritability contributed by each SNP (see SNP Heritability).

./ldak.out --sum-hers bld.ldak --tagfile bld.ldak.tagging --summary height.txt --matrix bld.ldak.matrix
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

3 - Create the prediction model.

Generate pseudo training and test summary statistics (see Pseudo Summaries). Note that these summary statistics have been subjected to genomic control, so we do not advise using them to make psuedo summary statistics.

./ldak.out --pseudo-summaries height --bfile ref --extract use.snps --summary height.txt --training-proportion .9 --keep keepa

Calculate predictor-predictor correlations (see MegaPRS).

./ldak.out --calc-cors cors --bfile ref --extract use.snps --window-cm 3 --keep keepb

Estimate effect sizes for training and full prediction models (see MegaPRS).

./ldak.out --mega-prs bayesR --model bayesr --bfile ref --extract use.snps --cors cors --ind-hers bld.ldak.ind.hers --summary height.txt --summary2 height.train.summaries --window-cm 1

Test the full prediction models, excluding SNPs in high-LD regions (see MegaPRS).

./ldak.out --calc-scores bayesR --bfile ref --keep keepc --scorefile bayesR.effects.train --summary height.test.summaries --power 0 --final-effects bayesR.effects.final --exclude highld/genes.predictors.used

Note that the commands in this step will produce warnings that there are many trivial SNPs, which reflects that we are using a small reference panel.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

The final model is saved in bayesR.effects.best.