Summary Statistics

Each set of summary statistics should be saved in a text file with a header row. Usually, it will have the following columns (note that column labels are case-sensitive):

Predictor - the name of the predictor; while it is common to use rs-numbers, it is sometimes easier to instead construct generic names of the form chr:bp (e.g., replace rs1870516 with 1:5072295).

A1 - the test allele.
A2 - the other allele.

Direction - indicates whether the effect is positive or negative (with respect to the test allele). This can be an estimate of effect size (log odds), or simply set to +1 and -1 for postitive- and negative-effect predictors, respectively.

Stat - provides a chi-squared test statistic. Note that if E represents the estimate of effect size (or log odds) of a predictor, and S is its standard deviation, then a chi-squared test statistic is (E/S)2.

n - number of samples used when testing the predictor.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Points to note:

1 - Predictor names must be unique
2 - Only single-character alleles are allowed (multi-character alleles will be replaced).
3 - If info scores are provided, we suggest excluding predictors with scores < 0.95.
4 - When estimating SNP heritability, direction is not required (so can be excluded).

5 - Instead of test statistics, you can instead provide p-values in a column labelled P.
6 - If per-SNP effect sizes are not available, then use the total sample size.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

When formatting the summary statistics, we create two further files

1 - An extract list of predictors for which we have summary statistics.
2 - An exclude list of predictors in the major histocompatibility complex (MHC), which individually explain more than 1% of phenotypic variance, or are in are in LD with these (within 1 centiMorgan and with squared-correlation >0.1).

We will use these lists to filter predictors when calculating the Tagging File. Note that if your aim is to estimate Genetic Correlations, you may wish to also exclude predictors with ambiguous alleles (A & T or C & G).
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Example: this uses summary statistics from the most recent meta-analysis of height from the Giant Consortium. It also assumes your reference panel is saved in binary PLINK format in the files ref.bed, ref.bim and ref.fam.

wget https://portals.broadinstitute.org/collaboration/giant/images/0/01/GIANT_HEIGHT_Wood_et_al_2014_publicrelease_HapMapCeuFreq.txt.gz

gunzip -c GIANT_HEIGHT_Wood_et_al_2014_publicrelease_HapMapCeuFreq.txt.gz | awk '(NR>1){snp=$1;a1=$2;a2=$3;dir=$5;stat=($5/$6)^2;n=$8}(NR==1){print "Predictor A1 A2 Direction Stat n"}(NR>1 && (a1=="A"||a1=="C"||a1=="G"||a1=="T") && (a2=="A"||a2=="C"||a2=="G"||a2=="T")){print snp, a1, a2, dir, stat, n}' - > height.txt

head -n 5 height.txt
Predictor A1 A2 Direction Stat n
rs4747841 A G -0.0011 0.143876 253213
rs4749917 T C 0.0011 0.143876 253213
rs737656 A G -0.0062 4.27111 253116
rs737657 A G -0.0062 4.27111 252156

It is sensible to check whether there are duplicate predictor names; the following command identifies which names appear more than once using the unix functions awk, sort and uniq (if it produces no output, there are no duplicates).

awk < height.txt '{print $1}' | sort | uniq -d | head

If there are duplicates, we could rename, but it is easier to simply remove all except one copy using this (magic) awk command (I don't fully understand how it works!)

mv height.txt height2.txt
awk '!seen[$1]++' height2.txt > height.txt
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

To create a list of predictors for which we have summary statistics, use

awk < height.txt '(NR>1){print $1}' > height.predictors

To reduce these to predictors with non-ambiguous alleles, use

awk < height.txt '( ($2=="A"&&$3=="C") || ($2=="A"&&$3=="G") || ($2=="C"&&$3=="A") || ($2=="C"&&$3=="T") || ($2=="G"&&$3=="A") || ($2=="G"&&$3=="T") || ($2=="T"&&$3=="C") || ($2=="T"&&$3=="G") ){print $1}' > height.nonamb

To identify which (reference panel) predictors are in the MHC, use

awk < ref.bim '($1==6 && $4>25000000 && $4<34000000){print $2}' > mhc.snps

The variance explained by a predictor is Stat/(Stat+n), so we can create a list of large-effect predictors using

awk < height.txt '(NR>1 && $5>$6/99){print $1}' > height.big

For the height data, there are no large-effect predictors ... but were there some, we could then identify which (reference panel) predictors are in LD with these using the command

../ldak.out --remove-tags height --bfile ref --top-preds height.big --window-cm 1 --min-cor .1

The file height.out specifies which predictors are in LD with those in height.big. Note that if your reference panel does not contain genetic distances, an approximate solution is to replace --window-cm 1 with --window-kb 1000.  We can now combine the two lists using

cat mhc.snps height.out > height.exclude