The heritability enrichment of a category of SNPs is the estimated share of heritability the category contributes divided by its expected share. The process for estimating heritability enrichments is the same as that for estimating SNP Heritability; you must first obtain a tagging file (either by Calculating Taggings yourself or downloading Pre-computed Taggings), then use --sum-hers <outfile> to regress the summary statistics onto this tagging file. The important thing to note, is that the choice of Heritability Model, and the options used when constructing the tagging file, determine which enrichments are estimated.
This page explains the two different scenarios when estimating enrichments, how to interpret the output files, and how to estimate enrichments for categories of your choice. It assumes you have already read Calculate Taggings and SNP Heritability. It will also greatly help if you have read Heritability Model and Technical Details.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
Enrichment scenarios:
To estimate enrichments requires a multi-parameter heritability model. The simplest way to write a multi-parameter heritability model is in the form
E[h2j] = tau1 a1j + ... + tauK aKj
where K is the number of partitions and akj is the value of the kth annotation for SNP j. However, to explain how enrichments are calculated, we instead use the (equivalent) form
E[h2j] = tau1 I1j a1j + ... + tauK IKj aKj
where Ikj indicates whether SNP j is in Category k (i.e., Ikj is either 0 or 1). Each annotation is either binary or continuous. For binary annotations, akj=Ikj, and therefore tauk represents the average amount that E[h2j] increases for SNPs that belong to Category k. For continuous annotations, akj can take any value. In this case, akj represents the relative amount that E[h2j] increases if SNP j belongs to Category k (note that it is common for continuous annotations to include all SNPs, so that Ikj=1, for all j).
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
Broadly speaking, there are two ways to use annotations. In Scenario 1, there are K categories that partition the genome (i.e., each SNP belongs to exactly one category). In Scenario 2, there are K-1 categories that do not partition the genome (i.e., some predictors will be in multiple categories and/or some will be in none), plus a base category that contains all SNPs.
The diagram above provides examples of each scenario. Example A: There are two categories, one that contains the genic SNPs, the other that contains the non-genic SNPs (i.e., the two categories partition the genome). Example B: There are two categories, one that contains the genic SNPs, the other that contains all SNPs (i.e., one "regular" category and one base category). Example C: There are three categories, one that contains the genic SNPs, one that contains the conserved regions, the other that contains all SNPs (i.e., two "regular" categories and one base category). Example A uses Scenario 1, while Examples B & C use Scenario 2. Note that the heritability models corresponding to Examples A & B are equivalent (in both, E[h2j] depends only on whether or not SNP j is genic), but because they are different scenarios, the output files from LDAK will have slightly different formats (see below).
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
The scenario determines how to provide annotations when Calculating Taggings. Specifically, for Scenario 1, use --partition-number K and --partition-prefix <prefix>, then LDAK will read annotations from <prefix>1, ..., <prefix>K. For Scenario 2, use --annotation-number K-1 and --annotation-prefix <prefix>, then LDAK will read annotations from <prefix>1, ..., <prefix>K-1.
In both cases, <prefix>k provides the values for akj. Usually, <prefix>k has two columns, where the first column lists the SNPs that belong to the category, while the second column provides the corresponding values for akj. However, for binary annotations, the second column can be omitted.
Finally, note that the more general way to write a multi-parameter heritability model is in the form
E[h2j] = wj [fj(1-fj)](1+alpha) x (tau1 I1j c1j + ... + tauK IKj cKj)
where wj and fj are the weighting and MAF of SNP j, and so its value for the kth annotation is akj = wj [fj(1-fj)](1+alpha)ckj. The file <prefix>k now provides the values for ckj, while the weightings and alpha are specified using the options --weights <weightsfile> (or --ignore-weights YES to set wj=1) and --power <float>. This form is used, for example, when assuming the BLD-LDAK or BLD-LDAK+Alpha Model (see Technical Details).
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
Output files:
When using --sum-hers <outfile> to estimate enrichments, the four main output files are <outfile>.hers, <outfile>.cats, <outfile>.share and <outfile>.enrich. To understand these files, it is necessary to realise that there is a difference between the unique heritability contributed by a category and the total heritability contributed by a category.
The unique heritability contributed by Category k is the sum of I1k tauk akj (i.e., the sum of tauk akj across all SNPs in Category k). This is the heritability directly contributed by Annotation k. The total heritability contributed by Category k is the sum of I1k E[h2j] (i.e., the sum of E[h2j] across all SNPs in Category k). This is the heritability contributed by SNPs in Category k through any annotation. Note that in Scenario 1 (categories partition the genome), the unique and total heritabilities are the same, and each sum to the SNP heritability. However, in Scenario 2 (categories overlap), the unique and total heritabilities are generally not be the same, and while the unique heritabilities sum to the SNP heritability, the total heritabilities will typically exceed the SNP heritability.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
The file <outfile>.hers has one row for each category, followed by a row corresponding to all SNPs. The second and third columns provide estimates of the unique heritability contributed by each category, followed by the estimate of the SNP heritability (the heritability contributed by all SNPs). The last two columns provide estimates of the "influence" for each category, then for all SNPs. This is a new statistic, which is an estimate of the proportion of variation in test statistics explained. We think that this statistic might be useful in trying to understand the relative importance of different categories.
The file <outfile>.cats provides estimates of the total heritability contributed by each category.
The file <outfile>.share provides estimates of the share of unique heritability contributed by each category (i.e., the unique heritability contributed by a category, divided by the estimate of SNP heritability).
The file <outfile>.enrich provides estimates of the enrichment of each category, which can be expressed as EST/EXP. Here, EST is the estimated total heritability of the category, divided by the estimated SNP heritability (i.e., the share of SNP heritability that is explained by SNPs in that category). EXP is the share of SNP heritability that category is expected to contribute, equal to the number of SNPs in that category, divided by the total number of SNPs.
Note that previously when calculating enrichments, we set EXP to the expected share of SNP heritability contributed by the category GIVEN the heritability model (this takes into account, for example, that a category that contains high-MAF SNPs, is expected to contribute more than a category that contains an equal number of low-MAF SNPs). However, in practice, we find that estimates of enrichment are not too sensitive to the definition of EXP, which is why we switched to the simpler definition.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
In the example for Pre-computed Taggings, we regressed summary statistics from the association study of human height by the GIANT Consortium, onto the tagging file bld.ldak.hapmap.gbr.tagging. Here we explain the four output files that provide details of enrichments (if you have not performed this example, you can download the files here). The names of the 66 categories in the BLD-LDAK Model are provided in BLD-LDAK Annotations.
height.hers contains estimates of the unique heritability contributed by each of the 66 categories in the BLD-LDAK Model, followed by their sum, which is the estimate of SNP heritability. The unique heritabilities are hard to intepret, so in most cases, you will only be interested in the estimate of SNP heritability, 0.56 (SD 0.01). However, we think the influences are more informative. For example, we see that in total, the BLD-LDAK Model explains 8.9% (SD 0.1%) of variation in test statistics, of which the biggest contributions come from Histone modifications: Categories 18 (H3K27ac_Hnisz), 23 (H3K4me1_Trynka) and 29 (H3K9ac_Trynka), which explain 2.7%, 4.5% and 2.3% of variation, respectively. Note that the standard deviations for Category 31 are negative, indicating they could not be reliably estimated. This is likely due to co-linearity; we are working on refining the heritability model to avoid this problem.
height.cats contains estimates of the total heritability contributed by each of the 66 categories. Note that the sum of these estimates is 14.2, much higher than the estimate of SNP heritability (0.56), and reflects that there is substantial overlap between categories. This is both because many SNPs are assigned to multiple functional categories, and because the model includes buffer categories (e.g., Category 1 contains SNPs inside coding regions, while Category 2 contains SNPs inside or within 500 basepairs of coding regions).
height.share contains estimates of the share of unique heritability contributed by each category (i.e., the estimates of unique heritability divided by 0.56). As with the unique heritabilities, these are hard to intepret.
height.enrich contains estimates of the enrichment of each category, equal to the estimated share of total heritability contributed by SNPs in a category, divided by their expected share. We see the largest enrichments are observed for Categories 1 (Coding_UCSC), 3 (Conserved_LindbladToh) and 58 (GERP.RSsup4), which are estimated to explain 5.9X, 6.0X and 6.9X more heritability than expected.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
Incorporating your own categories:
To estimate the enrichment of a category, the heritability model must contain an annotation corresponding to that category. If you are analysing human data, we suggest you start with the BLD-LDAK Model, as this already includes annotations corresponding to 26 commonly-used functional categories (e.g., coding SNP, conserved regions, DNase1 hypersensitivity sites (DHS); see BLD-LDAK Annotations for a full list). If you are interested in different categories, or are analysing non-human data, you must create a new heritability model. You can do this either from scratch, or by adding annotations to an existing heritability model.
If you are interested in Scenario 1 (e.g., you would like to estimate the heritability contributed by each chromosome), we suggest you start from scratch, and use a partitioned version of the LDAK-Thin Model. If you have P partitions, and Ikj indicates which SNPs belong to Partition k, then the heritability model would take the form
E[h2j] = wj [fj(1-fj)]0.75 x (tau1 I1j + ... + tauP IPj)
where wj are the LDAK-Thin weightings (give value one to each predictor that remains after thinning for duplicates, and weighting zero to those removed).
If you are interested in Scenario 2 (e.g., you would like to test the importance of some new functional categories), we suggest you add these to the BLD-LDAK Model. If you have P new categories, and Ipj indicates which SNPs belong to new Category p, then the heritability model would take the form
E[h2j] = [fj(1-fj)]0.75 x (tau1 b1j + ... + tau65 b65j + tau66 I1j + ... + tau65+P IPk) + tau66+P)
where b1, ..., b65 are the 65 annotations of the BLD-LDAK Model.
The reason we have different recommendations for Scenarios 1 & 2, reflects that the types of categories tend to be different. In Scenario 1, each category is typically one or a few large genomic regions (e.g., a whole chromosome), whereas in Scenario 2, it is often very many short regions. We find that when categories are multiple short regions, the accuracy of the heritability model is more important. For example, in our paper Reevaluation of SNP heritability in complex human traits (Nature Genetics, 2017), we showed that DHS appear very enriched when considered on their own, however, much of this enrichment is artificial and disappears when a more accurate heritability model is used (i.e., other annotations are added that model how heritability varies with MAF and linkage disequilibrium). By contrast, when categories are large genomic regions, these biases are less pronounced, so it suffices to include only the categories of interest in the heritability model.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
Example:
Here we use the binary PLINK files human.bed, human.bim and human.fam, and the phenotype quant.pheno from the Test Datasets. Our aim is to estimate the heritability contributed by each chromosome. This is an example of Scenario 1, because we have categories that partition the genome. We will therefore use a partitioned version of the LDAK-Thin Model.
In order to use SumHer, we require summary statistics (i.e., the results from regressing the phenotype on each SNP individually). We can obtain these by running
./ldak.out --linear quant --bfile human --pheno quant.pheno
The summary statistics are saved in quant.summaries (already in the format required by SumHer). For more details on this command, see Single-Predictor Analysis.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
To construct the tagging file, we require lists of SNPs on each chromosome. We can create these using awk
rm chr{1..22}; awk < human.bim '{print $2 > "chr"$1}'
mv chr21 set1
mv chr22 set2
When used on human data, the first command would usually create the files chr1, chr2, ..., chr22, listing the SNPs on each chromosome. However, because our example dataset contains only Chromosomes 21 & 22, we get only the files chr21 and chr22. We renamed these files set1 and set2, respectively, because the lists of SNPs must be named sequentially starting from one (i.e., have the form <prefix>1, <prefix>2, etc.).
To calculate a tagging file under the LDAK-Thin Model, we require a weightsfile that gives weighting one to the predictors that remain after thinning for duplicates, and weighting zero to those removed. This can be achieved using the commands
./ldak.out --thin thin --bfile human --window-prune .98 --window-kb 100
awk < thin.in '{print $1, 1}' > weights.thin
Now we create the tagging file, by running
./ldak.out --calc-tagging LDAK-Thin.chr --bfile human --weights weights.thin --power -.25 --window-cm 1 --partition-number 2 --partition-prefix set
The tagging file is saved in LDAK-Thin.chr.tagging. Finally, we regress the summary statistics onto this file.
./ldak.out --sum-hers snpher3 --summary quant.summaries --tagfile LDAK-Thin.chr.tagging
The main output file is snpher3.enrich. This reports that the estimated shares of SNP heritability contributed by Chromosomes 21 & 22 are 0.65 (SD 0.09) and 0.35 (0.09), respectively. Their expected shares are 0.59 and 0.41 (indicating that for this dataset, 59% of SNPs are on Chromosome 21). Their estimated enrichments are 1.1X (SD 0.2) and 0.9X (SD 0.2).
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
Now we consider a Scenario 2 example. For this, suppose the file set1 lists the SNPs in a new functional category (instead of simply listing those SNPs on Chromosome 21). To estimate the enrichment of a new functional category, we recommend adding the corresponding annotation to the BLD-LDAK Model (i.e., creating a new heritability model with 67 categories). To do this, we must first download the files bld1, bld2, ..., bld64 from the BLD-LDAK Annotations. Then we calculate the LDAK Weightings and rename them bld65, using the commands
./ldak.out --cut-weights sections --bfile human
./ldak.out --calc-weights-all sections --bfile human
mv sections/weights.short bld65
Now we change the name of the file set1 to bld66, and calculate the tagging file using
cp set1 bld66
./ldak.out --calc-tagging BLD-LDAK+set --bfile human --ignore-weights YES --power -.25 --window-cm 1 --annotation-number 66 --annotation-prefix bld
Finally, we regress the summary statistics onto this file.
./ldak.out --sum-hers snpher4 --summary quant.summaries --tagfile BLD-LDAK+set.tagging
The main output file is snpher4.enrich, which tells us the estimated enrichment of set1 (Category 66) is 0.95X (SD 0.2).