Calculate Taggings

All applications of SumHer require a tagging file, which records the (relative) expected heritability tagged by each predictor. To create this tagging file requires a (well-matched) Reference Panel. We recommend using an extensive panel (e.g., imputed or sequencing data, retaining SNPs with MAF above 0.005 and information score above 0.8).

When calculating the tagging file, you must specify a Heritability Model. We recommend using the Human Default Model when estimating SNP Heritability or Genetic Correlations, using the BLD-LDAK Model when estimating Heritability Enrichments (for pre-defined categories), and using the Alpha Model to estimate the selection-related parameter alpha.

On this page, we provide instructions for constructing tagging files assuming the Human-Default, BLD-LDAK and Alpha Models. If you wish to estimate enrichments for your own categories (i.e., categories not in the BLD-LDAK Model), you will have to construct a heritability model that includes these categories (once you have finished reading this page, see Heritability Enrichments for further details).

If this is your first time calculating tagging files, we suggest you begin with the Human-Default Model (this is the simplest model to use, and will help you learn the process). Finally, remember that if analysing human data, it generally suffices to use one of the Pre-computed Taggings, in which case there is no need to calculate the tagging file yourself.

Always read the screen output, which suggests arguments and estimates memory usage.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

The main argument is --calc-tagging <outfile>

This requires the options

--bfile/--gen/--sp/--speed <datastem> - to specify the genetic data files (see File Formats).

--power <float> - to specify how predictors are scaled.

You can use --keep <keepfile> and/or --remove <removefile> to restrict to a subset of samples, and --extract <extractfile> and/or --exclude <excludefile> to restrict to a subset of predictors (for more details, see Data Filtering).

To provide SNP annotations, you should use either --partition-number <integer> and --partition-prefix <prefix>, or --annotation-number <integer> and --annotation-prefix <prefix> (more details below).

By default, LDAK will assign all predictors weighting one (equivalent to using --ignore-weights YES), however, you can provide your own weightings using --weights <weightsfile>.

By default, LDAK will set the window size (how far to search for tagging predictors) to 1Mb; you can change this using --window-kb <float> or --window-cm <float>.

To save the heritability matrix, add --save-matrix YES (necessary if performing Prediction).

To parallelize this process, you can add --chr <integer> to calculate taggings for each chromosome separately, then combine these with the argument --join-tagging <output>, using the option --taglist <tagstems> to specify the per-chromosome tagging files (see the example below).
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Specifying the heritability model:

To understand this section, it will help if you have first read Heritability Model and Technical Details.

By default, LDAK assumes a one-parameter heritability model, in the form

E[h2j] = tau1 wj [fj(1-fj)](1+alpha)

where wj is the weighting for SNP j and fj is its minor allele frequency (MAF). To specify a one-parameter model, use the options --weights <weightsfile> to provide the weightings (or --ignore-weights YES to set wj=1) and --power <float> to provide alpha.

The simplest way to write a multi-parameter heritability model, is in the form

E[h2j] = tau1 a1j + ... + tauK aKj

where K is the number of annotations, and akj is the value of the kth annotation for SNP j. To use this form, you must use the options --ignore-weights YES and --power -1. Then there are two ways to provide the annotation values (to understand why there are two ways, see Heritability Enrichments).

The first way is to use --partition-number K and --partition-prefix <prefix>. LDAK will expect to find the files <prefix>1, ..., <prefix>K. Usually, <prefix>k will have two columns, that provide the SNP names then the corresponding values for Annotation k. If a SNP is not present in <prefix>k, it will get akj=0, while if the file has only one column, then all SNPs within the file get akj=1 (this means that if you have a binary annotation, the corresponding annotation file needs only contain the names of the SNPs within the category). The second way is to use --annotation-number K-1 and --annotation-prefix <prefix>. LDAK will read the first K-1 annotations from the files <prefix>1, ..., <prefix>K-1, then set aKj=1.

The more general way to write a multi-parameter heritability model, is in the form

E[h2j] = wj [fj(1-fj)](1+alpha) x (tau1 c1j + ... + tauK cKj)

where wj is the weighting of SNP j, fj is its MAF, while akj = wj [fj(1-fj)](1+alpha) ckj is its value for the kth annotation. Again, we can use either --partition-number K and --partition-prefix <prefix>, or --annotation-number K-1 and --partition-prefix <prefix>, except that this time, the file <prefix>k is used to provide ckj. Note that if we use --ignore-weights YES and --power -1, then wj [fj(1-fj)](1+alpha) =1, which is why we obtain the simpler form above.

The example below considers both one-parameter and multi-parameter heritability models.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Reference SNPs, Regression SNPs and Heritability SNPs:

For definitions of these three terms, see SNP Subsets. By default, the Reference SNPs will be all SNPs in the reference panel. This is our recommendation, however, if you prefer to specify a subset use --extract <extractfile> and/or --exclude <excludefile>. By default, the Regression SNPs will be all Reference SNPs. While this choice is adequate, there are computational advantages to reducing the number of Regression SNPs, in which case use --regression-predictors <regpredsfile>. By default, the Heritability SNPs will be all Reference SNPs. This is our recommendation, however if you prefer to specify a subset, use --heritability-predictors <herpredsfile>.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

The taggings are saved in the file <outfile>.tagging. This contains one row for each predictor, whose final values express the relative expected heritability the predictor tags in terms of the parameters of the heritability model (i.e., using the notation provided in Technical Details, the kth value for Predictor j is the sum of aklr2kl, where l indexes neighbouring predictors). These are the values onto which the summary statistics are regressed in order to estimate parameters. If you add --save-matrix YES, the heritability matrix will be saved in <outfile>.matrix. This also contain one row for each predictor, whose values provide the annotations for the predictor (i.e., the kth value for Predictor j is akj).
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _


Here we use the binary PLINK files human.bed, human.bim and human.fam from the Test Datasets (as the reference panel), as well as the summary statistics file height.txt, created in the example for Summary Statistics. Note that when analysing real data, you should use a much larger reference panel (e.g., imputed or sequence data, retaining all SNPs with MAF>.005 and information score >.8). For more details, see Reference Panel.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

To calculate a tagging file under the Human-Default Model, run the command

./ldak.out --calc-tagging HumDef --bfile human --power -.25

The taggings are saved in HumDef.tagging. If analysing very large data, we can parallelise the process by computing taggings separately for each chromosome, then merging

for j in {21..22}; do
./ldak.out --calc-tagging HumDef$j --bfile human --power -.25 --chr $j

rm list.txt; for j in {21..22}; do echo "HumDef$j.tagging" >> list.txt; done
./ldak.out --join-tagging HumDef --taglist list.txt

Note that in these scripts, we loop from 21 to 22, because our example dataset contains only these two chromosomes; usually you would loop from 1 to 22.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

By default, the Reference SNPs, Regression SNPs and Heritability SNPs are all SNPs in the reference panel. In general, there is no reason to change the Reference SNPs or Heritability SNPs, however, there can be computational advantages to changing the Regression SNPs (see SNP Subsets for more details).

The commands below recomputes the tagging file (again assuming the Human-Default Model), but now reduces the Regression SNPs to those for which we have summary statistics in the file height.txt.

awk < height.txt '(NR>1){print $1}' > height.snps
./ldak.out --calc-tagging HumDef.height --bfile human --power -.25 --regression-predictors height.snps

Note that because the first column of height.txt contains the predictor names, it is equivalent to use the command

./ldak.out --calc-tagging HumDef.height --bfile human --power -.25 --regression-predictors height.txt
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

If you wish to copy LDSC, you should calculate taggings assuming the GCTA Model, which can be achieved using the command

./ldak.out --calc-tagging GCTA --bfile human --power -1
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

To calculate taggings assuming the BLD-LDAK Model, we first download the files bld1, bld2, ..., bld64 from the BLD-LDAK Annotations. Then we calculate the LDAK Weightings and rename them bld65, using the commands

./ldak.out --cut-weights sections --bfile human
./ldak.out --calc-weights-all sections --bfile human
mv sections/weights.short bld65

Finally, we calculate taggings using the commands

./ldak.out --calc-tagging BLD-LDAK --bfile human --power -.25 --annotation-number 65 --annotation-prefix bld
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

To specify the Alpha Model, we must calculate taggings files for multiple values of the power parameter alpha. Specifically, we recommend calculating taggings for 31 values ranges from -1 to 0.5 (step size 0.05) using the commands

for j in {-20..10}; do
alpha=`echo $j | awk '{print -$1/20}'`; echo $alpha
./ldak.out --calc-tagging Alpha.$j --bfile human --power $alpha