The genetic correlation between two traits is the average correlation between SNP effect sizes. A positive (negative) correlation indicates that SNPs with a positive effect on Trait 1 tend to have a positive (negative) effect on Trait 2.
To estimate the genetic correlation between a pair of traits, the first step is to obtain a tagging file. You can either use Pre-computed Taggings or Calculate Taggings yourself. This step requires you to choose a Heritability Model. We have shown that estimates of genetic correlation are generally insensitive to the choice of heritability model, and therefore we recommend using the LDAK-Thin Model (while not as accurate as the BLD-LDAK Model, it is much simpler). The second step, described below, is to regress two sets of (correctly-formatted) Summary Statistics onto the tagging file.
Always read the screen output, which suggests arguments and estimates memory usage.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
The main argument is --sum-cors <outfile>.
This requires the options:
--tagfile <taggingfile> - to specify the tagging file.
--summary <sumsfile> and --summary2 <sums2file> - to specify the files containing summary statistics for each trait.
Analyses can be sensitive to large-effect loci (particularly if these are located in regions of extreme linkage disequilibrium, such as the MHC). Previously, we recommended first using --remove-tags <outfile> to identify predictors tagging loci that explain more than 1% of phenotypic variance, then excluding these using --extract <extractfile> and/or --exclude <excludefile>. However, this can now be done more easily using the option --cutoff <float>; for example, to remove predictors that explain more than 1% of phenotypic variance, add --cutoff 0.01. Note that, unlike the earlier approach, --cutoff does not also remove predictors that tag the large-effect loci, but in practice we find this makes little difference.
By default, LDAK will ignore predictors with ambiguous alleles (those with alleles A & T or C & G) to protect against possible strand errors. If you are confident that these are correctly aligned, you can force LDAK to include them by adding --allow-ambiguous YES.
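To make the notion of "ambiguous alleles" concrete: an A/T or C/G predictor looks the same after a strand flip as it does after swapping its two alleles, so a strand error cannot be detected from the allele codes alone. The following is a minimal sketch of that check, not LDAK's own code; the function name `is_ambiguous` and the example allele pairs are hypothetical.

```python
# Complement of each base on the opposite DNA strand.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_ambiguous(a1: str, a2: str) -> bool:
    """Return True when the two alleles are strand complements (A/T or C/G).

    Hypothetical helper for illustration; it is not part of LDAK.
    """
    a1, a2 = a1.upper(), a2.upper()
    # Non-ACGT codes have no complement in the table, so they return False.
    return COMPLEMENT.get(a1) == a2


# Hypothetical allele pairs, for illustration only.
for pair in [("A", "T"), ("C", "G"), ("A", "G"), ("T", "C")]:
    print(pair, "ambiguous" if is_ambiguous(*pair) else "unambiguous")
```

A check like this can be used to count how many predictors in a summary statistics file would be affected before deciding whether --allow-ambiguous YES is worth adding.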
LDAK will report an error if the summary statistics file does not provide summary statistics for all predictors in the tagging file. If a relatively small proportion of predictors are affected (say less than 20%), it should be OK to override this error by adding --check-sums NO.
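Before overriding the error, it can help to measure how many tagging-file predictors actually lack summary statistics. The sketch below assumes you have already read the predictor names from each file into Python lists (the file parsing is left out, and the names `missing_fraction`, `tagging` and `summary` are hypothetical); it simply compares the resulting fraction with the rough 20% guideline above.

```python
def missing_fraction(tagging_predictors, summary_predictors):
    """Fraction of tagging-file predictors that have no summary statistics."""
    tagging = set(tagging_predictors)
    if not tagging:
        raise ValueError("the tagging file contains no predictors")
    missing = tagging - set(summary_predictors)
    return len(missing) / len(tagging)


# Hypothetical predictor names, for illustration only.
tagging = [f"rs{i}" for i in range(1, 11)]        # rs1 ... rs10
summary = [name for name in tagging if name != "rs3"] + ["rs99"]

frac = missing_fraction(tagging, summary)
print(f"{frac:.0%} of tagging predictors lack summary statistics")
if frac < 0.2:
    print("Below the rough 20% guideline; --check-sums NO may be reasonable")
else:
    print("A large share is missing; investigate before overriding")
```

Predictors that appear only in the summary statistics file (such as `rs99` here) do not count towards the fraction, because the error concerns predictors in the tagging file.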
When estimating genetic correlations, LDAK will allow for multiplicative inflation of test statistics, as we found this guards against misspecification of the Heritability Model (see here for a demonstration). To turn this off, use --genomic-control NO, while to instead allow for additive inflation, use --intercept YES.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
The main output file is <outfile>.cors. This first reports the estimated SNP heritability of each trait, then the estimated SNP coheritability between the two traits, then the estimated genetic correlation. Next, it provides estimates of the inflation of test statistics (assuming the analysis allowed for either multiplicative or additive inflation). Finally, it provides an estimate of the overlap. This is the first term in the expression for E[Z_Aj, Z_Bj] (see the section "Estimating genetic correlation" in the Online Methods of our paper), and is the product of the phenotypic correlation between the two traits and the fraction of samples common to the two studies. Therefore, if the two studies are independent, we expect the overlap to be zero (but note that deviation from zero can reflect not only sample overlap, but also misspecification of the heritability model).
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
Example:
Here we use the binary PLINK files human.bed, human.bim and human.fam, and the phenotypes quant.pheno and quant2.pheno from the Test Datasets, as well as the tagging file LDAK-Thin.tagging created in the example for Calculate Taggings.
In order to estimate genetic correlation we require two sets of summary statistics (i.e., the results from regressing the phenotype on each SNP individually). We can obtain these by running
./ldak.out --linear quant --bfile human --pheno quant.pheno
./ldak.out --linear quant2 --bfile human --pheno quant2.pheno
The summary statistics are saved in quant.summaries and quant2.summaries (already in the format required by SumHer). For more details on this command, see Single-Predictor Analysis.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
The estimate of genetic correlation is obtained by regressing the two sets of summary statistics on the tagging file. We recommend using the tagging file created assuming the LDAK-Thin Model:
./ldak.out --sum-cors gencor --summary quant.summaries --summary2 quant2.summaries --tagfile LDAK-Thin.tagging --allow-ambiguous YES
Here, we added --allow-ambiguous YES, so that LDAK does not exclude SNPs with alleles A & T or C & G (the summary statistics for the two traits were obtained from the same data, so we know that the strands must be consistent). LDAK warns us there are large-effect loci. Usually, we should follow its advice and add --cutoff 0.01, in order to exclude SNPs that individually explain at least 1% of phenotypic variance for either of the traits (we have not done so here, but only because this is a toy example).
The main results are stored in gencor.cors. This says that the estimated SNP heritabilities are 0.06 (SD 0.20) and 0.09 (SD 0.20), the estimated SNP coheritability between the two traits is 0.09 (SD 0.19), and the estimated genetic correlation is 1.33 (the SD is nan, reflecting that this is a toy example with a very small sample size and few SNPs). The file gencor.cors.full divides these estimates into categories; however, this is not relevant here, because the LDAK-Thin Model uses only one category.
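As a rough check on how these numbers relate, the sketch below applies the standard definition of genetic correlation: the coheritability divided by the square root of the product of the two heritabilities. It is an assumption that the value LDAK reports follows exactly this formula, and the function name `genetic_correlation` is hypothetical. The inputs are the rounded values quoted above, so the result (about 1.22) differs from the reported 1.33, presumably because LDAK works with unrounded estimates.

```python
import math


def genetic_correlation(coher: float, her1: float, her2: float) -> float:
    """Coheritability scaled by the geometric mean of the two heritabilities.

    Returns nan when either heritability estimate is not positive, because
    the square root is then undefined.
    """
    if her1 <= 0 or her2 <= 0:
        return math.nan
    return coher / math.sqrt(her1 * her2)


# Rounded estimates quoted from gencor.cors in the example above.
rg = genetic_correlation(coher=0.09, her1=0.06, her2=0.09)
print(f"Genetic correlation from rounded estimates: {rg:.2f}")
```

Because the three inputs are estimated with error rather than constrained, the ratio can fall outside the range -1 to 1, as it does in this toy example.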