SNP Subsets

Warning: the following is complicated, but only advanced users need to understand it.

When Calculating Taggings, you first need a Reference Panel. You must then decide on three sets of SNPs. The Reference SNPs are those used in the reference panel to compute the expected heritability tagged by each SNP. The Regression SNPs are the subset of Reference SNPs used to estimate the parameters in the heritability model (by regressing the summary statistics on the tagging file). The Heritability SNPs are the subset of Reference SNPs used to calculate estimates of SNP heritability (given the parameter estimates).

As a reference panel, we recommend using an imputed or sequenced dataset (ideally with at least 2000 samples), retaining SNPs with MAF above 0.005 and information score above 0.8 (for human data, there will typically be 8-10M Reference SNPs).
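The filtering thresholds above can be sketched in a few lines of Python. This is only an illustration, not part of the software: the column names (snp, maf, info), the function name select_reference_snps and the example values are all assumptions made for the sketch, and your own per-SNP statistics file will likely be laid out differently.

```python
import csv
import io

def select_reference_snps(rows, min_maf=0.005, min_info=0.8):
    """Return names of SNPs with MAF above min_maf and info score above min_info."""
    kept = []
    for row in rows:
        maf = float(row["maf"])    # hypothetical column: minor allele frequency
        info = float(row["info"])  # hypothetical column: imputation information score
        if maf > min_maf and info > min_info:
            kept.append(row["snp"])
    return kept

# Tiny made-up table standing in for a per-SNP statistics file
example = io.StringIO(
    "snp\tmaf\tinfo\n"
    "rs1\t0.200\t0.95\n"
    "rs2\t0.001\t0.99\n"
    "rs3\t0.100\t0.60\n"
    "rs4\t0.010\t0.85\n"
)
reference_snps = select_reference_snps(csv.DictReader(example, delimiter="\t"))
print(reference_snps)
```

Here rs2 is dropped for low MAF and rs3 for low information score, leaving rs1 and rs4.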

By default, the Reference SNPs are all SNPs in the reference panel. There is normally no reason to change this.

By default, the Regression SNPs are all Reference SNPs. There is no need to change this (changing the Regression SNPs should not affect the results), but reducing the number of Regression SNPs has computational advantages: the tagging file will be smaller, and it will take less time to regress the summary statistics on the tagging file. Therefore, we recommend using approximately 1M of the Reference SNPs (either picked at random or by restricting to HapMap3 SNPs, a list of which you can download here). Note that when regressing the summary statistics on the tagging file, Regression SNPs without summary statistics will be ignored. Therefore, if you know in advance which SNPs have summary statistics, you can first reduce to these (then from these pick 1M, or reduce to HapMap3 SNPs).
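The selection order described above (first reduce to SNPs with summary statistics, then either restrict to HapMap3 SNPs or pick about 1M at random) might look as follows in Python. This is a rough sketch under stated assumptions: the function choose_regression_snps and its arguments are hypothetical names, and the SNP lists are assumed to have already been read into memory as lists of SNP names.

```python
import random

def choose_regression_snps(reference_snps, snps_with_stats=None,
                           hapmap3_snps=None, target=1_000_000, seed=1):
    """Pick Regression SNPs from the Reference SNPs (illustrative helper)."""
    candidates = list(reference_snps)

    # Optionally reduce to SNPs known to have summary statistics
    if snps_with_stats is not None:
        stats = set(snps_with_stats)
        candidates = [s for s in candidates if s in stats]

    # Option 1: restrict to a supplied HapMap3 list
    if hapmap3_snps is not None:
        hm3 = set(hapmap3_snps)
        return [s for s in candidates if s in hm3]

    # Option 2: pick `target` SNPs at random, keeping the original order
    if len(candidates) <= target:
        return candidates
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    chosen = set(rng.sample(candidates, target))
    return [s for s in candidates if s in chosen]

# Toy usage with ten made-up SNP names
reference = [f"rs{i}" for i in range(10)]
picked = choose_regression_snps(reference,
                                snps_with_stats=["rs1", "rs3", "rs5", "rs7", "rs99"],
                                target=3)
print(picked)
```

The resulting names could then be written one per line to a text file for use as the list of Regression SNPs.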

By default, the Heritability SNPs are all Reference SNPs, and this is our preference. The authors of LDSC instead recommend using only SNPs with MAF > 0.05, but our view is that, provided you use a sensible heritability model, this should not be necessary.