Reference Panel

A reference panel is required when analysing summary statistics. It is used to estimate the correlations between nearby predictors (the linkage disequilibrium). In most cases, the summary statistics will correspond to SNPs (i.e., contain the results from an association study that regressed the phenotype on each SNP individually). In this case, the reference panel should also contain SNP data, from samples ancestrally similar to those used in the association study from which the summary statistics come. For example, when analysing results from a European association study, we often use genotypes for 2000 samples from the UK Biobank.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

If your aim is to perform heritability analysis (i.e., to use SumHer to estimate SNP heritability, heritability enrichments, genetic correlations or the selection-related parameter alpha), then we recommend you use an extensive reference panel. We suggest using an imputed or sequenced dataset (ideally with at least 2000 samples), excluding SNPs with MAF below 0.005 or information score below 0.8. Based on these guidelines, the panel will typically have 8-10M SNPs. Note that we previously recommended that the reference panel contain only very high-quality common SNPs for which there were summary statistics. However, in our paper "Evaluating and improving heritability models using summary statistics" (Nature Genetics, 2020), we showed that it is better to use a more extensive panel.
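
The two thresholds above can be sketched with awk on a toy table. This is only an illustration: the file toy.stats and its three-column layout (name, MAF, information score) are hypothetical, so adapt the column numbers to whatever per-SNP statistics your imputation software provides.

```shell
# Toy per-SNP table (hypothetical layout): name, MAF, information score
cat > toy.stats <<'EOF'
SNP MAF INFO
1:1000 0.20 0.99
1:2000 0.003 0.95
1:3000 0.10 0.75
1:4000 0.005 0.80
EOF

# Skip the header, then keep SNPs with MAF >= 0.005 and information score >= 0.8
awk '(NR>1 && $2>=0.005 && $3>=0.8){print $1}' toy.stats > toy_stats.keep
cat toy_stats.keep
```

Here the second SNP fails the MAF threshold and the third fails the information score threshold, so only 1:1000 and 1:4000 are retained.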

If instead your aim is to construct a prediction model (i.e., to use MegaPRS), then the reference panel need only contain predictors for which you have summary statistics. As above, we suggest using an imputed or sequenced dataset (ideally with at least 2000 samples), but as well as excluding SNPs with MAF below 0.005 or information score below 0.8, you can exclude those missing summary statistics (and likely also those with ambiguous alleles). Of course, if you plan to analyse summary statistics from multiple association studies, each of which used different SNPs, then it is probably easier to create one extensive reference panel, rather than multiple reduced panels.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

If you do not already have a suitable reference panel, you can use genotype data from The 1000 Genomes Project, which contains samples of European, Asian and African ancestry. Here are scripts for downloading and extracting genotype data for the 404 non-Finnish Europeans. Note that we are unable to filter based on information scores (because these are not available), and because the sample size is relatively small, we increase the MAF threshold (from 0.005 to 0.01).

To run these scripts you need both PLINK1.9 and PLINK2. They also use the tool awk, which is very efficient at processing large files and is usually installed by default on any UNIX operating system. You can read more about awk here. Finally, having created an extensive reference panel, we make a reduced version, by reducing to SNPs in the file height.txt. This file contains results from the association study of human height by the GIANT Consortium, and was created in the example for Summary Statistics.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Download raw files


Use PLINK2 to decompress the pgen and pvar files

./plink2 --zst-decompress all_phase3_ns.pgen.zst > all_phase3_ns.pgen
./plink2 --zst-decompress all_phase3_ns.pvar.zst > all_phase3_ns.pvar

Identify non-Finnish Europeans

awk < phase3_corrected.psam '($5=="EUR" && $6!="FIN"){print 0, $1}' > eur.keep
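
To see what this filter does, here is the same awk command run on a toy file. The sample names S1 to S4 are made up, and the column layout (sex in column 4, super-population in column 5, population in column 6) is an assumption inferred from the fields the commands on this page use.

```shell
# Toy psam file with made-up samples (assumed layout: ID, father, mother, sex, super-population, population)
cat > toy.psam <<'EOF'
#IID PAT MAT SEX SuperPop Population
S1 0 0 1 EUR GBR
S2 0 0 2 EUR FIN
S3 0 0 1 AFR YRI
S4 0 0 2 EUR TSI
EOF

# Same filter as above: keep Europeans who are not Finnish, writing family ID 0 and the sample ID
awk < toy.psam '($5=="EUR" && $6!="FIN"){print 0, $1}' > toy_eur.keep
cat toy_eur.keep
```

Only S1 and S4 are kept; the header line is dropped automatically because its fifth field is not "EUR".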

Use PLINK2 to convert to binary PLINK format for non-Finnish Europeans, restricting to autosomal SNPs with MAF>0.01 (and excluding duplicates and SNPs with name ".")

echo "." > exclude.snps
./plink2 --make-bed --out raw --pgen all_phase3_ns.pgen --pvar all_phase3_ns.pvar --psam phase3_corrected.psam --maf 0.01 --autosome --snps-only just-acgt --max-alleles 2 --rm-dup exclude-all --exclude exclude.snps --keep eur.keep

The genotype data will now be stored in binary PLINK format in the files raw.bed, raw.bim and raw.fam. The following commands insert population information and sex into the fam file and replace predictor names with generic names of the form Chr:BP (the latter is not required, but I find this format more convenient). They also save the original names.

awk '(NR==FNR){arr[$1]=$5"_"$6;ars[$1]=$4;next}{$1=$2;$2=arr[$1];$5=ars[$1];print $0}' phase3_corrected.psam raw.fam > clean.fam
awk < raw.bim '{$2=$1":"$4;print $0}' > clean.bim
awk < raw.bim '{print $1":"$4, $2}' > ref.names
cp raw.bed clean.bed
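
One thing worth checking after renaming is whether any two SNPs now share a name, which may happen if the original file has two records at the same position. The sketch below shows the check on a toy bim file (the rs names are made up); to apply it to the real data, run the last awk command on clean.bim.

```shell
# Toy bim file in which two SNPs share chromosome 1, position 2000
cat > toy.bim <<'EOF'
1 rs111 0 1000 A G
1 rs222 0 2000 C T
1 rs333 0 2000 C A
2 rs444 0 1000 G A
EOF

# Rename to Chr:BP, as in the command above
awk < toy.bim '{$2=$1":"$4;print $0}' > toy_clean.bim

# List any names that now occur more than once
awk '{print $2}' toy_clean.bim | sort | uniq -d > toy.dups
cat toy.dups
```

If the list is empty, all names are unique; in this toy example it contains 1:2000.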

Download genetic distances, then insert these using PLINK1.9

./plink1.9 --bfile clean --cm-map genetic_map_b37/genetic_map_chr@_combined_b37.txt --make-bed --out ref

If these scripts have run successfully, then your European reference panel is saved in binary PLINK format in the files ref.bed, ref.bim and ref.fam (you can delete the files with prefixes raw and clean).
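
A quick sanity check of the finished panel can be sketched as follows. The helper check_panel is our own illustrative function (not part of PLINK or LDAK); it counts the lines of the fam and bim files, and the number of SNPs whose genetic distance (third column of the bim file) is still zero.

```shell
# check_panel PREFIX: report the number of samples, SNPs, and SNPs with genetic distance zero
check_panel() {
    awk 'END{print "samples:", NR}' "$1.fam"
    awk '{n++; if($3==0) z++} END{print "snps:", n+0; print "zero_cm:", z+0}' "$1.bim"
}

# Run the check only if the panel files are present
if [ -f ref.fam ] && [ -f ref.bim ]; then check_panel ref; fi
```

For the panel created above, we would expect 404 samples. A handful of SNPs with distance zero is not necessarily a problem, but if every SNP has distance zero, the genetic distances were probably not inserted.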
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Finally, we make a reduced dataset that contains only non-ambiguous SNPs for which we have summary statistics in the file height.txt

awk < height.txt '(NR>1 && (($2=="A"&&$3=="C") || ($2=="A"&&$3=="G") || ($2=="C"&&$3=="A") || ($2=="C"&&$3=="T") || ($2=="G"&&$3=="A") || ($2=="G"&&$3=="T") || ($2=="T"&&$3=="C") || ($2=="T"&&$3=="G"))){print $1}' > height.snps
./ldak.out --make-bed ref.height --bfile ref --extract height.snps
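
The awk filter above lists the eight allowed allele pairs explicitly. An equivalent way to think about it is to require two distinct single-letter alleles and then reject the complementary pairs A/T and C/G. The sketch below applies this more compact form to a toy file; only the first three columns matter, and the remaining column names are placeholders.

```shell
# Toy summary statistics: predictor name, two alleles, then placeholder columns
cat > toy_height.txt <<'EOF'
Predictor A1 A2 Direction Stat n
1:1000 A G 1 2.5 1000
1:2000 A T -1 0.3 1000
1:3000 C G 1 1.1 1000
1:4000 T C -1 4.0 1000
1:5000 AT A 1 0.9 1000
EOF

# Keep SNPs with two distinct single-base alleles that are not a complementary pair
awk < toy_height.txt '(NR>1 && $2 ~ /^[ACGT]$/ && $3 ~ /^[ACGT]$/ && $2!=$3){a=$2 $3; if(a!="AT" && a!="TA" && a!="CG" && a!="GC") print $1}' > toy_height.snps
cat toy_height.snps
```

The A/T and C/G SNPs are dropped as ambiguous, and the indel is dropped because its first allele is not a single base, leaving 1:1000 and 1:4000.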

Note that because the first column of height.txt contains the predictor names, it is equivalent to use the command

./ldak.out --make-bed ref.height --bfile ref --extract height.txt
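
Before extracting, it can be useful to know how many SNPs with summary statistics are actually present in the reference panel. The following is a minimal sketch using a helper of our own, count_overlap (not an LDAK command), which assumes the summary statistics file has a header line and predictor names in its first column, and that these names use the same Chr:BP format as the second column of the bim file.

```shell
# count_overlap STATS BIM: count SNPs in the bim file whose names appear in column 1 of the stats file
count_overlap() {
    awk '(NR==FNR){if(FNR>1) want[$1]=1; next} ($2 in want){n++} END{print n+0}' "$1" "$2"
}

# Run the count only if both files are present
if [ -f height.txt ] && [ -f ref.bim ]; then count_overlap height.txt ref.bim; fi
```

A count much lower than expected would suggest that the two files use different naming conventions (for example, rs names in one and Chr:BP in the other).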