Worked Examples

If you have access to individual-level data, we recommend constructing a prediction model using Bolt-Predict assuming the BLD-LDAK Model. If you only have access to summary statistics, we recommend constructing a prediction model usingĀ  BayesR-SS (contained within MegaPRS) assuming the BLD-LDAK Model. Here we provide complete examples for using Bolt-Predict and BayesR-SS. The first example should take about 30 minutes, the second should take about three hours to run on a single CPU or less than an hour if you have access to a computer cluster. Both examples require less than 10Gb memory.

Please note that these examples focus on the commands, and provide only a brief description of what each command does. If you wish to fully understand the process (i.e., how LDAK first estimates per-predictor heritabilities, given the heritability model, then estimates effect sizes), you should read the links next to each command.

Further, in the second example, we use a reference panel containing 404 individuals and summary statistics from the GIANT Consortium. However, please note that you should ideally use a larger reference panel (e.g., at least 2000 individuals).

To run these examples, you require the LDAK executable (obtain from Downloads), which we assume you have renamed ldak.out. To avoid duplicate file names, we suggest you run each example in its own folder (which at the start contains only the LDAK executable).
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Example of running Bolt-Predict (uses individual-level data):

Here we use the binary PLINK files human.bed, human.bim and human.fam, and the phenotype quant.pheno from the Test Datasets.

Always read the screen output, which suggests arguments and estimates memory usage.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

1 - Preparation

Download and extract the test dataset.

wget https://www.dropbox.com/s/5vcfcree3xnvhew/data.zip?dl=1 -O data.zip
unzip data.zip
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

2 - Estimate per-predictor heritabilities, assuming the BLD-LDAK Model.

Compute summary statistics (for more details, see Single-Predictor Analysis). Note that if you have multiple CPUs, you can speed up the analysis by adding (say) --max-threads 4.

./ldak.out --linear quant --bfile human --pheno quant.pheno

Download and extract the first 64 SNP annotations for the BLD-LDAK Model (see BLD-LDAK Annotations).

wget https://genetics.ghpc.au.dk/doug/bld.zip
unzip bld.zip

Compute LDAK weightings, and rename these bld65 (see LDAK Weightings).

./ldak.out --cut-weights sections --bfile human
./ldak.out --calc-weights-all sections --bfile human
mv sections/weights.short bld65

Calculate the tagging file and heritability matrix assuming the BLD-LDAK Model (see Calculate Taggings). Note that if you have multiple CPUs, you can speed up the analysis by adding (say) (say) --max-threads 4.

./ldak.out --calc-tagging bld.ldak --bfile human --ignore-weights YES --power -.25 --annotation-number 65 --annotation-prefix bld --window-cm 1 --save-matrix YES

Estimate the heritability contributed by each predictor (see SNP Heritability).

./ldak.out --sum-hers bld.ldak --tagfile bld.ldak.tagging --summary quant.summaries --matrix bld.ldak.matrix
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

3 - Construct the prediction model.

Estimate effect sizes (see Bolt-Predict). Note that if you have multiple CPUs, you can speed up the analysis by adding (say) --max-threads 4.

./ldak.out --bolt bolt --pheno quant.pheno --bfile human --ind-hers bld.ldak.ind.hers --cv-proportion .1

The final model is saved in bolt.effects.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Example of running BayesR-SS (uses summary statistics):

Here we use summary statistics from the 2014 meta-analysis of height from the GIANT Consortium, while the reference panel uses genotype data for non-Finnish europeans from the 1000 Genomes Project. The scripts use the tool awk, which is very efficient at processing large files and is usually installed by default on any UNIX operating system. You can read more about awk here.

Always read the screen output, which suggests arguments and estimates memory usage.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

1 - Preparation

Download and extract a non-Finnish European reference panel constructed from 1000 Genomes Project data (the scripts we used to make this panel are in the example of Reference Panel). Note that we recommend using a larger panel (at least 2000 samples).

wget https://genetics.ghpc.au.dk/doug/ref.tgz
tar -xzvf ref.tgz

Download and format the summary statistics (for more details, see Summary Statistics). The raw results use rs ids, so we will use the file ref.names (downloaded in the previous step), to convert these to names of the form Chr:BP, and to exclude SNPs incompatible with our reference panel. We additionally exclude SNPs with per-predictor sample size below 200,000 (note that this threshold is fairly arbitrary, so you might choose a different threshold). If the summary statistics included information scores, we would also exclude SNPs with score <0.95.

#Download results from GIANT website, and extract SNPs with single-basepair alleles
wget https://portals.broadinstitute.org/collaboration/giant/images/0/01/GIANT_HEIGHT_Wood_et_al_2014_publicrelease_HapMapCeuFreq.txt.gz
gunzip -c GIANT_HEIGHT_Wood_et_al_2014_publicrelease_HapMapCeuFreq.txt.gz | awk '(NR>1){snp=$1;a1=$2;a2=$3;dir=$5;stat=($5/$6)^2;n=$8}(NR==1){print "Predictor A1 A2 Direction Stat n"}(NR>1 && (a1=="A"||a1=="C"||a1=="G"||a1=="T") && (a2=="A"||a2=="C"||a2=="G"||a2=="T")){print snp, a1, a2, dir, stat, n}' - > height.raw

#Use ref.names to convert rs ids and exclude SNPs with inconsistent alleles
awk '(NR==FNR){arr[$1]=$2;ars[$1]=$3$4;next}(FNR==1){print $0}($1 in arr && ($2$3==ars[$1]||$3$2==ars[$1])){$1=arr[$1];print $0}' ref.names height.raw > height.txt

For all analyses, we will only use non-ambiguous reference panel SNPs for which we have summary statistics with sample size at least 200,000.

awk '(NR==FNR && (($2=="A"&&$3=="C") || ($2=="A"&&$3=="G") || ($2=="C"&&$3=="A") || ($2=="C"&&$3=="T") || ($2=="G"&&$3=="A") || ($2=="G"&&$3=="T") || ($2=="T"&&$3=="C") || ($2=="T"&&$3=="G")) && $6>=200000){arr[$1];next}($2 in arr){print $2}' height.txt ref.bim > use.snps

As we will be using pseudo summary statistics, we must download details of long-range linkage disquilibrium regions and identify which reference panel SNPs they contain (see High-LD Regions).

wget https://dougspeed.com/wp-content/uploads/highld.txt
./ldak.out --cut-genes highld --bfile ref --genefile highld.txt
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

2 - Estimate per-predictor heritabilities, assuming the BLD-LDAK Model.

Download and extract the first 64 SNP annotations for the BLD-LDAK Model (see BLD-LDAK Annotations).

wget https://genetics.ghpc.au.dk/doug/bld.zip
unzip bld.zip

Compute LDAK weightings, and rename these bld65 (see LDAK Weightings).

for j in {1..22}; do
./ldak.out --cut-weights sections$j --bfile ref --extract use.snps --chr $j

./ldak.out --calc-weights-all sections$j --bfile ref --extract use.snps --chr $j
done
cat sections{1..22}/weights.short > bld65

Calculate the tagging file and heritability matrix assuming the BLD-LDAK Model (see Calculate Taggings). Note that if you have multiple CPUs, you can speed up the analysis by adding (say) --max-threads 4.

./ldak.out --calc-tagging bld.ldak --bfile ref --extract use.snps --ignore-weights YES --power -.25 --annotation-number 65 --annotation-prefix bld --window-cm 1 --save-matrix YES

Estimate the heritability contributed by each SNP (see SNP Heritability).

./ldak.out --sum-hers bld.ldak --tagfile bld.ldak.tagging --summary height.txt --matrix bld.ldak.matrix
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

3 - Construct the prediction model.

Calculate predictor-predictor correlations (see MegaPRS). Note that if you have multiple CPUs, you can speed up the analysis by adding (say) --max-threads 4.

./ldak.out --calc-cors cors --bfile ref --window-cm 3 --extract use.snps

Construct the prediction model (see MegaPRS). Note that if multiple CPUs are available, you can speed up computation by adding (say) --max-threads 4.

./ldak.out --mega-prs bayesr --model bayesr --ind-hers bld.ldak.ind.hers --summary height.txt --cv-proportion .1 --pseudos height --high-LD highld/genes.predictors.used --window-cm 1 --extract use.snps

The final model is saved in bayesr.effects.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _