Summary Statistics

Each set of summary statistics should be saved in a text file with a header row. Usually, it will have the following 5 or 6 columns (note that column names are case-sensitive):

Predictor - the name of the predictor (ideally in the form Chr:BP, see below).
A1 - the test allele.
A2 - the other allele.
n - number of samples used when testing the predictor.

Then there are three choices. Your file should have this column:
Z - provides a Gaussian test statistic. Note that if E represents the estimate of effect size (or log odds) of a predictor, and S is its standard deviation, then a Gaussian test statistic is E/S.

Or your file should have these two columns:
Direction - indicates whether the effect is positive or negative (with respect to the test allele). This can be an estimate of effect size (or log odds), or simply set to +1 and -1 for predictors with positive and negative effect, respectively.
Stat - provides a chi-squared test statistic. Note that if E represents the estimate of effect size (or log odds) of a predictor, and S is its standard deviation, then a chi-squared test statistic is (E/S)^2.

Or your file should have these two columns:
Direction - indicates whether the effect is positive or negative (with respect to the test allele). This can be an estimate of effect size (or log odds), or simply set to +1 and -1 for predictors with positive and negative effect, respectively.
P - provides a p-value.
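
As a minimal sketch of the first two choices, the commands below compute Z = E/S and Stat = (E/S)^2 from an effect estimate and its standard deviation. The file toy.assoc, its contents and its column order (name, A1, A2, E, S, sample size) are hypothetical, chosen only for illustration; real summary statistics will have their own layout.

```shell
# Hypothetical toy input: name, A1, A2, effect estimate (E), standard deviation (S), sample size
printf 'SNP A1 A2 E S N\n1:1000 A G 0.02 0.01 5000\n1:2000 T C -0.03 0.01 5000\n' > toy.assoc

# First choice: a single column Z = E/S
awk '(NR==1){print "Predictor A1 A2 Z n"}(NR>1){print $1, $2, $3, $4/$5, $6}' toy.assoc > toy.z

# Second choice: Direction (here the effect estimate itself) plus Stat = (E/S)^2
awk '(NR==1){print "Predictor A1 A2 Direction Stat n"}(NR>1){print $1, $2, $3, $4, ($4/$5)^2, $6}' toy.assoc > toy.stat

cat toy.z toy.stat
```

Both output files describe the same associations; the sign that Z carries directly is supplied by the Direction column when using Stat.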
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Points to note:

1 - Predictor names must be unique
2 - Only single-character alleles are allowed (usually A, C, G and T)
3 - If per-predictor sample sizes are not available, then use the total sample size.
4 - If information scores are available, we suggest excluding predictors with scores below 0.95 (you should also consider filtering based on per-predictor sample sizes, if available).
5 - You should only use results from association studies that used careful Quality Control.
6 - When using SNP data, we recommend excluding predictors with ambiguous alleles (A&T or C&G), in order to avoid strand errors.
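
Points 2 and 6 can be applied with a single awk filter. The following is a sketch under assumptions: the file toy.txt and its contents are made up, and the columns are assumed to follow the Predictor, A1, A2, Z, n layout described above.

```shell
# Hypothetical toy file: 1:2000 and 1:3000 are ambiguous, 1:4000 has a multi-character allele
printf 'Predictor A1 A2 Z n\n1:1000 A G 1.2 5000\n1:2000 A T 0.4 5000\n1:3000 C G -0.9 5000\n1:4000 AT G 2.1 5000\n1:5000 T C -1.5 5000\n' > toy.txt

# Keep the header, then keep rows whose alleles are both single characters from A, C, G, T
# and do not form an ambiguous pair (A&T or C&G)
awk '(NR==1){print; next}
{pair=$2$3}
($2 ~ /^[ACGT]$/ && $3 ~ /^[ACGT]$/ && pair!="AT" && pair!="TA" && pair!="CG" && pair!="GC"){print}' toy.txt > toy.clean

cat toy.clean
```

Only the header and the predictors 1:1000 and 1:5000 remain in toy.clean.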
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Predictor names:

We generally find it is more convenient if the predictor names are in the form Chr:BP, using genomic positions from the GRCh37/hg19 assembly. This is required if Calculating Taggings yourself assuming the BLD-LDAK or BLD-LDAK+Alpha Model, if using Pre-computed Taggings or if calculating Approximate Per-Predictor Heritabilities.

In our experience, most publicly-available summary statistics provide genomic positions (either they explicitly specify the chromosome and basepair for each SNP, or these details are coded within the SNP names). If this is the case, you should make sure the positions are from the GRCh37/hg19 assembly (either the documentation will tell you, or you can check a few SNPs in the UCSC Genome Browser). If the positions are from a different assembly, you can update them using the LiftOver Tool.

Sometimes the summary statistics will only provide rs ids. In this case, you will need to obtain the corresponding genomic positions. In the examples below, we provide a script for converting rs ids using details of the HapMap3 SNPs. Note that it is OK to lose some SNPs when converting rs ids, provided a sizeable number remain (say, at least 1M).
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Example:

Here we format summary statistics from the 2014 meta-analysis of height by the GIANT Consortium and from the 2018 analysis of neuroticism by Nagel et al. These scripts only retain SNPs where both alleles are A, C, G or T (i.e., exclude those with multi-character alleles). Further, they convert predictor names to the format Chr:BP (for height, we convert rs ids using the file hapmap3.snps, which provides details of the 1.2M HapMap3 SNPs). Note that the height summary statistics do not include information scores (meaning that we are unable to filter those with score < 0.95).

These scripts use the tool awk, which is very efficient at processing large files and is usually installed by default on any UNIX operating system. You can read more about awk here.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Download the summary statistics for height

wget https://portals.broadinstitute.org/collaboration/giant/images/0/01/GIANT_HEIGHT_Wood_et_al_2014_publicrelease_HapMapCeuFreq.txt.gz

Extract the name, alleles, beta (which indicates whether the effect is positive or negative), chi-squared test statistic and sample size for SNPs with single-character alleles (and ensure the header names are "Predictor", "A1", "A2", "Direction", "Stat" and "n").

gunzip -c GIANT_HEIGHT_Wood_et_al_2014_publicrelease_HapMapCeuFreq.txt.gz | awk '(NR>1){snp=$1;a1=$2;a2=$3;dir=$5;stat=($5/$6)^2;n=$8}(NR==1){print "Predictor A1 A2 Direction Stat n"}(NR>1 && (a1=="A"||a1=="C"||a1=="G"||a1=="T") && (a2=="A"||a2=="C"||a2=="G"||a2=="T")){print snp, a1, a2, dir, stat, n}' - > height.raw

Download the details of HapMap3 SNPs, then use this file to convert rs ids to the generic format. Note that the SNPs not present in the HapMap3 SNPs (or those with inconsistent alleles) will be excluded. If this is a problem, you should obtain genomic positions from a different source (e.g., you could download SNP details from the UK Biobank).

wget https://www.dropbox.com/s/xabjdu6squ6u56r/hapmap3.snps
awk '(NR==FNR){arr[$1]=$2;ars[$1]=$3$4;next}(FNR==1){print $0}($1 in arr && ($2$3==ars[$1]||$3$2==ars[$1])){$1=arr[$1];print $0}' hapmap3.snps height.raw > height.txt
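
To see what the conversion command does, here is a small sketch that runs the same awk logic on made-up data. The layout assumed for the lookup file (rs id, Chr:BP name, then the two alleles) is inferred from the command above rather than taken from any description of hapmap3.snps, and the files toy.snps and toy.raw are hypothetical.

```shell
# Hypothetical lookup file: rs id, Chr:BP name, allele 1, allele 2
printf 'rs1 1:1000 A G\nrs2 1:2000 T C\nrs3 1:3000 A C\n' > toy.snps

# Hypothetical summary statistics keyed by rs id:
# rs1 has its alleles swapped, rs2 has inconsistent alleles, rs4 is not in the lookup file
printf 'Predictor A1 A2 Direction Stat n\nrs1 G A 0.01 4.0 5000\nrs2 T G -0.02 9.0 5000\nrs4 A C 0.03 1.0 5000\n' > toy.raw

# While reading the first file (NR==FNR), store the new name and the allele pair for each rs id.
# For the second file, print the header, then rename and print rows whose rs id is known
# and whose alleles match the stored pair in either order.
awk '(NR==FNR){arr[$1]=$2;ars[$1]=$3$4;next}(FNR==1){print $0}($1 in arr && ($2$3==ars[$1]||$3$2==ars[$1])){$1=arr[$1];print $0}' toy.snps toy.raw > toy.conv

cat toy.conv
```

Only rs1 survives (renamed to 1:1000). Comparing the number of lines in the input and output files is a quick way to see how many SNPs were lost in the conversion.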

Check the top of the summary statistics file looks correct

head -n 5 height.txt
Predictor A1 A2 Direction Stat n
10:100012890 A G -0.0062 4.27111 252156
10:100013563 T C -0.0087 5.24169 248425
10:100016339 T C 0.011 14.3876 253135
10:100017453 T G -0.014 20.3954 251364

It is sensible to check whether there are duplicate predictor names. The following command lists names that appear more than once, using the UNIX tools awk, sort and uniq.

awk < height.txt '{print $1}' | sort | uniq -d | head

There is no output, indicating that there are no duplicates
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Download the summary statistics for neuroticism

wget https://ctg.cncr.nl/documents/p1651/sumstats_neuroticism_ctg_format.txt.gz

Extract the name, alleles, Gaussian test statistic and sample size for SNPs with single-character alleles, keeping only those with information score above 0.95 (the condition $12>0.95), and ensure the header names are "Predictor", "A1", "A2", "Z" and "n". Note that we do not need to provide the direction of effects because this can be obtained from the signs of the Gaussian test statistics.

gunzip -c sumstats_neuroticism_ctg_format.txt.gz | awk '(NR>1){snp=$3":"$4;a1=$5;a2=$6;z=$9;n=$11}(NR==1){print "Predictor A1 A2 Z n"}(NR>1 && (a1=="A"||a1=="C"||a1=="G"||a1=="T") && (a2=="A"||a2=="C"||a2=="G"||a2=="T") && $12>0.95){print snp, a1, a2, z, n}' - > neur.txt

Check the top of the summary statistics file looks correct

head -n 5 neur.txt
Predictor A1 A2 Z n
1:752478 A G 1.06 370996
1:752566 A G -0.263 371912
1:752721 A G 0.137 372903
1:753405 A C 0.057 370472

Check for duplicate predictor names

awk < neur.txt '{print $1}' | sort | uniq -d | head

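This time, there are duplicates. While we could rename predictors to avoid duplicates, it is easier to simply keep the first copy of each and discard the rest using the awk command below. The expression !seen[$1]++ is true only the first time a given predictor name is encountered (the counter seen[$1] is zero at that point, and is then incremented), so only the first line for each name is printed.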

mv neur.txt neur.txt.old
awk '!seen[$1]++' neur.txt.old > neur.txt
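
A quick way to gain confidence in this step is to try it on a small file and then repeat the duplicate check. The sketch below uses a made-up file, toy.dup, in which the name 1:2000 appears twice.

```shell
# Hypothetical toy file with a repeated predictor name (1:2000)
printf 'Predictor A1 A2 Z n\n1:1000 A G 1.2 5000\n1:2000 T C 0.4 5000\n1:2000 T G -0.9 5000\n1:3000 A C 2.1 5000\n' > toy.dup

# Print a line only if its first field has not been seen before
awk '!seen[$1]++' toy.dup > toy.nodup

# Repeat the duplicate check; no output means all names are now unique
awk '{print $1}' toy.nodup | sort | uniq -d

cat toy.nodup
```

The header and three predictors remain, and for 1:2000 it is the first of the two lines (with alleles T and C) that is kept.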