Confounding Bias

Please note that we no longer recommend estimating Confounding Bias; this page exists only for completeness.

SumHer is a generalized version of LD Score Regression (LDSC), whose original aim was to estimate the average inflation of test statistics in an association study due to confounding (e.g., population structure and familial relatedness). Prior to LDSC, most people would measure confounding bias using the Genomic Inflation Factor (GIF). However, the GIF is calculated assuming there are no causal variants (it divides the median observed chi-squared test statistic by 0.455, its expected value for SNPs that are not associated with the phenotype). We now realise that most complex traits are highly polygenic, and therefore the assumption underlying use of the GIF is highly inappropriate.
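The GIF calculation described above can be sketched in a few lines. This is a minimal illustration on simulated statistics, not part of LDAK: the function name, the number of SNPs and the z-score variance used for the "polygenic" trait are all assumptions made for the example. It shows why the GIF exceeds one for a polygenic trait even when there is no confounding.

```python
import random
import statistics

# Median of a chi-squared variable with one degree of freedom, computed as the
# square of the 75th percentile of a standard normal.
NULL_MEDIAN = statistics.NormalDist().inv_cdf(0.75) ** 2


def genomic_inflation_factor(chisq_stats):
    """Median observed chi-squared(1) statistic divided by its null median."""
    return statistics.median(chisq_stats) / NULL_MEDIAN


rng = random.Random(1)
num_snps = 20000  # hypothetical number of SNPs

# SNPs not associated with the phenotype: squared standard normal z-scores.
null_stats = [rng.gauss(0.0, 1.0) ** 2 for _ in range(num_snps)]

# Hypothetical polygenic trait with no confounding: every SNP tags a little
# heritability, so z-scores have variance 1.3 (an assumed value) rather than 1.
polygenic_stats = [rng.gauss(0.0, 1.3 ** 0.5) ** 2 for _ in range(num_snps)]

gif_null = genomic_inflation_factor(null_stats)
gif_polygenic = genomic_inflation_factor(polygenic_stats)

print(f"Null median used by the GIF: {NULL_MEDIAN:.3f}")
print(f"GIF, no causal variants: {gif_null:.2f}")
print(f"GIF, polygenic trait without confounding: {gif_polygenic:.2f}")
```

In this simulation the second GIF is well above one although no confounding was simulated, so reading it as confounding bias would be a mistake.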

LDSC instead seeks to calculate the inflation in test statistics after allowing for inflation due to causal variants. While this is a worthy aim, and LDSC has been instrumental in stopping people from using the GIF, we have shown that the assumptions underlying LDSC are also flawed. Firstly, it assumes that heritability is distributed evenly (the GCTA Model), which we have shown to be suboptimal. Secondly, it assumes that confounding inflates test statistics evenly, an assumption that is hard to test (moreover, it is easy to find real data examples where inflation is SNP-specific).

Therefore, we do not feel it is possible to reliably estimate confounding bias from summary statistics (by contrast, it can be estimated from individual-level data, as explained in Quality Control). For this reason, we recommend that you only use SumHer with summary statistics when confident that these came from an association study that performed careful quality control (i.e., took care to exclude poorly-genotyped SNPs and avoid confounding due to population structure or relatedness).

Note that we do recommend allowing for confounding bias (multiplicative inflation) when estimating Genetic Correlations. Even though the estimates of confounding are not reliable (and allowing for confounding typically results in less accurate estimates of SNP heritability), doing so results in more robust estimates of genetic correlation (see here for a demonstration).
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

The process for estimating confounding bias is almost identical to that for estimating SNP Heritability, except that when using --sum-hers <outfile> to regress the test statistics onto the tagging file, you should add either the option --genomic-control YES or --intercept YES.

In the absence of confounding bias, E[S_j] = 1 + n_j v_j^2, where S_j is the test statistic for Predictor j, n_j is its sample size and v_j^2 is the amount of heritability it is expected to tag under the assumed heritability model. When estimating confounding bias, our aim is to determine how much test statistics deviate from these expectations.

We recommend using the model E[S_j] = C (1 + n_j v_j^2), where C indicates the multiplicative inflation of test statistics due to confounding. We prefer this model as it naturally allows for the impact of genomic control, which is often the major source of confounding in an association study. To specify this model, add the option --genomic-control YES.

The alternative is to use the model E[S_j] = 1 + A + n_j v_j^2, where A indicates the additive inflation of test statistics due to confounding. This is the model used by LDSC, and is specified by adding the option --intercept YES.
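The three expectations above can be compared numerically. The sketch below is only an illustration: the function names are hypothetical, and the sample size, inflation values and tagged heritabilities are assumed numbers, not LDAK output. Each function takes a predictor's sample size and the heritability it is expected to tag.

```python
def expected_no_confounding(n, v2):
    """Expected test statistic without confounding: 1 + n * v2."""
    return 1.0 + n * v2


def expected_multiplicative(n, v2, c):
    """Multiplicative model, C * (1 + n * v2), as used with --genomic-control YES."""
    return c * expected_no_confounding(n, v2)


def expected_additive(n, v2, a):
    """Additive model, 1 + A + n * v2, as used with --intercept YES."""
    return a + expected_no_confounding(n, v2)


# Hypothetical values, chosen only for illustration.
n = 10_000      # sample size of each predictor
c = 1.2         # multiplicative inflation C
a = c - 1.0     # additive inflation A giving the same expectation when v2 = 0

for v2 in (0.0, 5e-5, 2e-4):
    print(
        f"n*v2 = {n * v2:.1f}: "
        f"no confounding {expected_no_confounding(n, v2):.2f}, "
        f"multiplicative {expected_multiplicative(n, v2, c):.2f}, "
        f"additive {expected_additive(n, v2, a):.2f}"
    )
```

With these assumed values the two models agree for a predictor that tags no heritability, but the multiplicative expectation grows faster as the tagged heritability increases. This is one way to see why the two models can attribute different amounts of the observed signal to heritability.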
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Example:

Here we use the binary PLINK files human.bed, human.bim and human.fam, and the phenotype quant.pheno from the Test Datasets, as well as the tagging file HumDef.tagging created in the example for Calculate Taggings.

In order to estimate confounding bias we require summary statistics (i.e., the results from regressing the phenotype on each SNP individually). We can obtain these by running

./ldak.out --linear quant --bfile human --pheno quant.pheno

The summary statistics are saved in quant.summaries (already in the format required by SumHer). For more details on this command, see Single-Predictor Analysis.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

To estimate the multiplicative inflation of test statistics due to confounding, use

./ldak.out --sum-hers conf1 --tagfile HumDef.tagging --summary quant.summaries --genomic-control YES

The estimate of inflation is 1.47 (SD 0.15), saved in the file conf1.extra (this is reported as the scaling, because it is the average amount inflation has scaled each test statistic).

To estimate the additive inflation of test statistics due to confounding, use

./ldak.out --sum-hers conf2 --tagfile HumDef.tagging --summary quant.summaries --intercept YES

The estimate of inflation is again 1.47 (SD 0.15), saved in the file conf2.extra (following the notation of LDSC, this is reported as the intercept, which is one plus the average amount inflation has added to each test statistic).

Note that the estimated confounding is the same whether we assume multiplicative or additive inflation. However, the corresponding estimates of SNP heritability will generally be different (in these examples, the estimated SNP heritability is 0.017 when we assume multiplicative inflation and 0.025 when we assume additive inflation).
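The two SumHer commands in this example differ only in the final option, so they are easy to generate from a script. The helper below is a hypothetical convenience wrapper, not part of LDAK; it uses only the options shown on this page and prints the commands rather than running them.

```python
import shlex

# Options taken from this page; the mapping and helper are hypothetical.
CONFOUNDING_OPTIONS = {
    "multiplicative": "--genomic-control",
    "additive": "--intercept",
}


def sumher_confounding_command(outfile, tagfile, summary, model, ldak="./ldak.out"):
    """Build the argument list for a SumHer run that allows for confounding."""
    if model not in CONFOUNDING_OPTIONS:
        raise ValueError(f"model must be one of {sorted(CONFOUNDING_OPTIONS)}")
    return [
        ldak,
        "--sum-hers", outfile,
        "--tagfile", tagfile,
        "--summary", summary,
        CONFOUNDING_OPTIONS[model], "YES",
    ]


commands = [
    sumher_confounding_command("conf1", "HumDef.tagging", "quant.summaries", "multiplicative"),
    sumher_confounding_command("conf2", "HumDef.tagging", "quant.summaries", "additive"),
]

for command in commands:
    # Print each command as a single shell line; pass the list to
    # subprocess.run if you want to execute it.
    print(shlex.join(command))
```

Keeping the arguments as a list avoids quoting problems if file names contain spaces.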