Reference Panel

A reference panel is required when analysing summary statistics. It is used to estimate the correlations between nearby predictors (the linkage disequilibrium). In most cases, the summary statistics will correspond to SNPs (i.e., contain the results from an association study that regressed the phenotype on each SNP individually). In this case, the reference panel should also contain SNP data, from samples ancestrally similar to those used in the association study from which the summary statistics come. For example, when analysing results from a European association study, we often use genotypes for 2000 samples from the UK Biobank.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

If your aim is to perform heritability analysis (i.e., to use SumHer to estimate SNP heritability, heritability enrichments, genetic correlations or the selection-related parameter alpha), then we recommend you use an extensive reference panel. We suggest using an imputed or sequenced dataset (ideally with at least 2000 samples), excluding SNPs with MAF below 0.005 or information score below 0.8. Following these guidelines, the panel will typically contain 8-10M SNPs. Note that we previously recommended that the reference panel contain only very high-quality common SNPs for which there were summary statistics. However, in our paper "Evaluating and improving heritability models using summary statistics" (Nature Genetics, 2020), we showed that it is better to use a more extensive panel.

If instead your aim is to construct a prediction model (i.e., to use MegaPRS), then the reference panel need only contain predictors for which you have summary statistics. As above, we suggest using an imputed or sequenced dataset (ideally with at least 2000 samples). In addition to excluding SNPs with MAF below 0.005 or information score below 0.8, you can also exclude those missing summary statistics (and likely also those with ambiguous alleles). Of course, if you plan to analyse summary statistics from multiple association studies, each of which used different SNPs, then it is probably easier to create one extensive reference panel, rather than multiple reduced panels.
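
As a rough illustration of the MAF and information score thresholds, here is a minimal awk sketch. It assumes a hypothetical whitespace-delimited file with a header row and three columns (predictor name, MAF, information score). This layout and the file names are invented for the example, as imputation software reports these values in its own formats.

```shell
# Toy per-SNP statistics (hypothetical layout: Predictor, MAF, Info)
printf 'Predictor MAF Info\n' > toy_snp.stats
printf '1:1000 0.20 0.95\n' >> toy_snp.stats
printf '1:2000 0.003 0.99\n' >> toy_snp.stats
printf '1:3000 0.10 0.70\n' >> toy_snp.stats
printf '1:4000 0.005 0.80\n' >> toy_snp.stats

# Skip the header, then keep SNPs with MAF at least 0.005 and info at least 0.8
awk '(NR>1 && $2>=0.005 && $3>=0.8){print $1}' toy_snp.stats > toy_keep.snps
cat toy_keep.snps
```

In this toy file, 1:2000 fails the MAF threshold and 1:3000 fails the information score threshold, so only 1:1000 and 1:4000 are kept.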
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

If you do not already have a suitable reference panel, you can use genotype data from The 1000 Genomes Project, which contains samples of European, Asian and African ancestry. Here are scripts for downloading and extracting genotype data for the 404 non-Finnish Europeans. Note that we are unable to filter based on information scores (because these are not available), and because the sample size is relatively small, we increase the MAF threshold (from 0.005 to 0.01).

These scripts use PLINK1.9 (which you can download here). They also use the tool awk, which is very efficient at processing large files and is usually installed by default on any UNIX operating system. You can read more about awk here. Finally, having created an extensive reference panel, we make a reduced version, by reducing to SNPs in the file height.txt. This file contains results from the association study of human height by the GIANT Consortium, and was created in the example for Summary Statistics.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Download sample IDs and extract non-Finnish Europeans

awk < integrated_call_samples_v3.20130502.ALL.panel '($3=="EUR" && $2!="FIN"){print $1, $1}' > eur.keep
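
To see what this filter does, here is a small sketch that runs the same awk command on a toy panel file. The toy file assumes the four-column layout (sample, population, super-population, gender) of the 1000 Genomes panel file; the sample IDs and file names are made up for illustration.

```shell
# Toy panel file (assumed columns: sample, pop, super_pop, gender)
printf 'sample\tpop\tsuper_pop\tgender\n' > toy.panel
printf 'S1\tGBR\tEUR\tmale\n' >> toy.panel
printf 'S2\tFIN\tEUR\tfemale\n' >> toy.panel
printf 'S3\tYRI\tAFR\tfemale\n' >> toy.panel
printf 'S4\tTSI\tEUR\tmale\n' >> toy.panel

# Keep Europeans who are not Finnish; print each ID twice (family ID, individual ID)
awk < toy.panel '($3=="EUR" && $2!="FIN"){print $1, $1}' > toy.keep
cat toy.keep
```

Each ID is printed twice because the keep file has two columns (family ID and individual ID). The header row is dropped automatically, because its third field is not EUR.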

Download data for each autosome, and convert using PLINK, extracting European samples and SNPs with MAF>0.01

for j in {1..22}; do
./plink --make-bed --out chr$j --maf 0.01 --keep eur.keep --vcf ALL.chr$j.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
done

Join these together (LDAK will automatically exclude multi-allelic SNPs)

rm -f list.txt; for j in {1..22}; do echo chr$j >> list.txt; done
./ldak.out --make-bed all --mbfile list.txt --exclude-same YES --exclude-dups YES

We added --exclude-same YES and --exclude-dups YES in order to remove predictors with either the same name or the same position. The genotype data will now be stored in binary PLINK format in the files all.bed, all.bim and all.fam.
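
If you are curious which predictors share a name or a position, the following awk sketch lists them for a toy .bim file. It only illustrates what is meant by duplicates and is not how LDAK performs the exclusion; the file names are hypothetical.

```shell
# Toy .bim file (columns: chromosome, name, genetic distance, basepair, A1, A2)
printf '1 rs1 0 1000 A G\n' > toy_dups.bim
printf '1 rs2 0 1000 A C\n' >> toy_dups.bim
printf '1 rs3 0 3000 C T\n' >> toy_dups.bim
printf '1 rs3 0 4000 G T\n' >> toy_dups.bim

# Names that appear more than once
awk '{n[$2]++} END{for (k in n) if (n[k]>1) print k}' toy_dups.bim > toy_same_name.txt
# Positions (Chr:BP) that appear more than once
awk '{p[$1":"$4]++} END{for (k in p) if (p[k]>1) print k}' toy_dups.bim > toy_same_pos.txt

cat toy_same_name.txt toy_same_pos.txt
```

Here rs3 appears twice by name, while rs1 and rs2 share the position 1:1000.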

Replace predictor names with generic names of the form Chr:BP (often not required, but I find this format more convenient). It is useful to create a file containing the original names, generic names and alleles.

awk < all.bim '{$2=$1":"$4;print $0}' > clean.bim
awk < all.bim '{print $2, $1":"$4, $5, $6}' > ref.names
cp all.bed clean.bed
cp all.fam clean.fam
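
As a quick illustration, this sketch applies the same two awk commands to a toy .bim file with two SNPs (the rs numbers and file names are made up).

```shell
# Toy .bim file (columns: chromosome, name, genetic distance, basepair, A1, A2)
printf '1 rs123 0 1000 A G\n' > toy_all.bim
printf '2 rs456 0 2000 C T\n' >> toy_all.bim

# Replace the name (column 2) with Chr:BP
awk < toy_all.bim '{$2=$1":"$4;print $0}' > toy_clean.bim
# Record original name, generic name and alleles
awk < toy_all.bim '{print $2, $1":"$4, $5, $6}' > toy_ref.names

cat toy_clean.bim
cat toy_ref.names
```

The first SNP becomes "1 1:1000 0 1000 A G" in toy_clean.bim, and toy_ref.names records "rs123 1:1000 A G", so the original names can be recovered later.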

Download and insert genetic distances


for j in {1..22}; do
./plink --bfile clean --chr $j --cm-map genetic_map_b37/genetic_map_chr@_combined_b37.txt --make-bed --out map$j
done

cat map{1..22}.bim | awk '{print $2, $3}' > map.all
awk '(NR==FNR){arr[$1]=$2;next}{print $1, $2, arr[$2], $4, $5, $6}' map.all clean.bim > ref.bim
cp clean.bed ref.bed
cp clean.fam ref.fam
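
The awk command that creates ref.bim is a two-file lookup: it first stores the genetic distance of each predictor from map.all, then writes that distance into the third column of clean.bim. Here is a sketch on toy files (hypothetical names and values). The third SNP is deliberately missing from the map, to show how a simple field count can flag predictors that did not receive a distance.

```shell
# Toy lookup table (name, genetic distance) and toy .bim file
printf '1:1000 0.5\n1:2000 1.25\n' > toy_map.all
printf '1 1:1000 0 1000 A G\n' > toy_join_clean.bim
printf '1 1:2000 0 2000 C T\n' >> toy_join_clean.bim
printf '1 1:3000 0 3000 G A\n' >> toy_join_clean.bim

# First file: store distances by name. Second file: insert them as column 3
awk '(NR==FNR){arr[$1]=$2;next}{print $1, $2, arr[$2], $4, $5, $6}' toy_map.all toy_join_clean.bim > toy_join_ref.bim
cat toy_join_ref.bim

# Count rows without exactly six fields (i.e., with an empty distance)
awk 'NF!=6{n++} END{print n+0}' toy_join_ref.bim > toy_join_missing.txt
cat toy_join_missing.txt
```

With the scripts above, every predictor in clean.bim should be present in map.all, so running the same field-count check on ref.bim would be expected to report zero.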

If these scripts have run successfully, then your reference panel is saved in binary PLINK format in the files ref.bed, ref.bim and ref.fam, while ref.names provides the original predictor names. You can now delete the files with prefixes ALL, chr, all, clean and map.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Finally, we make a reduced dataset that contains only non-ambiguous SNPs for which we have summary statistics in the file height.txt

awk < height.txt '(NR>1 && (($2=="A"&&$3=="C") || ($2=="A"&&$3=="G") || ($2=="C"&&$3=="A") || ($2=="C"&&$3=="T") || ($2=="G"&&$3=="A") || ($2=="G"&&$3=="T") || ($2=="T"&&$3=="C") || ($2=="T"&&$3=="G"))){print $1}' > height.snps
./ldak.out --make-bed ref.height --bfile ref --extract height.snps
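
The awk filter keeps SNPs whose two alleles form one of the eight non-ambiguous pairs, which excludes A/T and C/G SNPs (and anything that is not a single-base substitution). Here is a sketch on a toy summary statistics file. Only the first three columns (predictor name and the two alleles) matter for the filter; the header names and the values are made up for illustration.

```shell
# Toy summary statistics (header names and values are hypothetical)
printf 'Predictor A1 A2 Direction Stat n\n' > toy_height.txt
printf '1:1000 A G 1 4.2 1000\n' >> toy_height.txt
printf '1:2000 A T -1 1.1 1000\n' >> toy_height.txt
printf '1:3000 C G 1 0.3 1000\n' >> toy_height.txt
printf '1:4000 T C -1 2.5 1000\n' >> toy_height.txt

# Skip the header, then keep names of SNPs with non-ambiguous allele pairs
awk < toy_height.txt '(NR>1 && (($2=="A"&&$3=="C") || ($2=="A"&&$3=="G") || ($2=="C"&&$3=="A") || ($2=="C"&&$3=="T") || ($2=="G"&&$3=="A") || ($2=="G"&&$3=="T") || ($2=="T"&&$3=="C") || ($2=="T"&&$3=="G"))){print $1}' > toy_height.snps
cat toy_height.snps
```

The A/T SNP (1:2000) and the C/G SNP (1:3000) are dropped, leaving 1:1000 and 1:4000.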

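Note that because the first column of height.txt contains the predictor names, you can instead pass height.txt directly to --extract. This gives the same result provided height.txt contains no ambiguous SNPs (the allele filter is not applied in this case). The command is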

./ldak.out --make-bed ref.height --bfile ref --extract height.txt