All applications of SumHer require a tagging file, which records the (relative) expected heritability tagged by each predictor. To create this tagging file requires a (well-matched) Reference Panel. We recommend using an extensive panel (e.g., imputed or sequencing data, retaining SNPs with MAF above 0.005 and information score above 0.8).
When calculating the tagging file, you must specify a Heritability Model. Our recommended heritability model depends on the type of data you are using and your aim. When analysing human data, we recommend using the BLD-LDAK Model to estimate SNP Heritability or Heritability Enrichments (for pre-defined categories), using the LDAK-Thin Model to estimate Genetic Correlations, and using the BLD-LDAK+Alpha Model to estimate the selection-related parameter alpha. When analysing non-human data, we recommend always using the LDAK-Thin Model.
On this page, we provide instructions for constructing tagging files assuming the LDAK-Thin, BLD-LDAK and BLD-LDAK+Alpha Models. If you wish to estimate enrichments for your own categories (i.e., categories not in the BLD-LDAK Model), you will have to construct a heritability model that includes these categories (once you have finished reading this page, see Heritability Enrichments for further details).
If this is your first time calculating tagging files, we suggest you begin with the LDAK-Thin Model (this is the simplest model to use, and will help you learn the process). Finally, remember that if analysing human data, it generally suffices to use one of the Pre-computed Taggings, in which case there is no need to calculate the tagging file yourself.
Always read the screen output, which suggests arguments and estimates memory usage.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
The main argument is --calc-tagging <outfile>
This requires the options
--bfile/--gen/--sp/--speed <datastem> - to specify the genetic data files (see File Formats).
--weights <weightsfile> or --ignore-weights YES - to specify the predictor weightings (if using a weightsfile, this should have two columns, that provide predictor names then weightings).
--power <float> - to specify how predictors are scaled.
--window-cm <float> - to specify the window size (how far to search for tagging predictors). It generally suffices to use 1cM. If the genetic data files do not contain genetic distances, an approximate solution is to instead use --window-kb 1000.
You can use --keep <keepfile> and/or --remove <removefile> to restrict to a subset of samples, and --extract <extractfile> and/or --exclude <excludefile> to restrict to a subset of predictors (for more details, see Data Filtering).
To provide SNP annotations, you should use either --partition-number <integer> and --partition-prefix <prefix>, or --annotation-number <integer> and --annotation-prefix <prefix> (more details below).
To save the heritability matrix, add --save-matrix YES (necessary if performing Prediction).
To parallelize this process, you can add --chr <integer> to calculate taggings for each chromosome separately, then combine these with the argument --join-tagging <output>, using the option --taglist <tagstems> to specify the per-chromosome tagging files (see the example below).
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
Specifying the heritability model:
To understand this section, it will help if you have first read Heritability Model and Technical Details.
By default, LDAK assumes a one-parameter heritability model, in the form
E[h2j] = tau1 wj [fj(1-fj)](1+alpha)
where wj is the weighting for SNP j and fj is its minor allele frequency (MAF). To specify a one-parameter model, use the options --weights <weightsfile> to provide the weightings (or --ignore-weights YES to set wj=1) and --power <float> to provide alpha.
The simplest way to write a multi-parameter heritability model, is in the form
E[h2j] = tau1 a1j + ... + tauK aKj
where K is the number of annotations, and akj is the value of the kth annotation for SNP j. To use this form, you must use the options --ignore-weights YES and --power -1. Then there are two ways to provide the annotation values (to understand why there are two ways, see Heritability Enrichments).
The first way is to use --partition-number K and --partition-prefix <prefix>. LDAK will expect to find the files <prefix>1, ..., <prefix>K. Usually, <prefix>k will have two columns, that provide the SNP names then the corresponding values for Annotation k. If a SNP is not present in <prefix>k, it will get akj=0, while if the file has only one column, then all SNPs within the file get akj=1 (this means that if you have a binary annotation, the corresponding annotation file needs only contain the names of the SNPs within the category). The second way is to use --annotation-number K-1 and --annotation-prefix <prefix>. LDAK will read the first K-1 annotations from the files <prefix>1, ..., <prefix>K-1, then set aKj=1.
The more general way to write a multi-parameter heritability model, is in the form
E[h2j] = wj [fj(1-fj)](1+alpha) x (tau1 c1j + ... + tauK cKj)
where wj is the weighting of SNP j, fj is its MAF, while akj = wj [fj(1-fj)](1+alpha) ckj is its value for the kth annotation. Again, we can use either --partition-number K and --partition-prefix <prefix>, or --annotation-number K-1 and --partition-prefix <prefix>, except that this time, the file <prefix>k is used to provide ckj. Note that if we use --ignore-weights YES and --power -1, then wj [fj(1-fj)](1+alpha) =1, which is why we obtain the simpler form above.
The example below considers both one-parameter and multi-parameter heritability models.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
Reference SNPs, Regression SNPs and Heritability SNPs:
For definitions of these three terms, see SNP Subsets. By default, the Reference SNPs will be all SNPs in the reference panel. This is our recommendation, however, if you prefer to specify a subset use --extract <extractfile> and/or --exclude <excludefile>. By default, the Regression SNPs will be all Reference SNPs. While this choice is adequate, there are computational advantages to reducing the number of Regression SNPs, in which case use --regression-predictors <regpredsfile>. By default, the Heritability SNPs will be all Reference SNPs. This is our recommendation, however if you prefer to specify a subset, use --heritability-predictors <herpredsfile>.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
The taggings are saved in the file <outfile>.tagging. This contains one row for each predictor, whose final values express the relative expected heritability the predictor tags in terms of the parameters of the heritability model (i.e., using the notation provided in Technical Details, the kth value for Predictor j is the sum of aklr2kl, where l indexes neighbouring predictors). These are the values onto which the summary statistics are regressed in order to estimate parameters. If you add --save-matrix YES, the heritability matrix will be saved in <outfile>.matrix. This also contain one row for each predictor, whose values provide the annotations for the predictor (i.e., the kth value for Predictor j is akj).
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
Example:
Here we use the binary PLINK files human.bed, human.bim and human.fam from the Test Datasets (as the reference panel), as well as the summary statistics file height.txt, created in the example for Summary Statistics. Note that when analysing real data, you should use a much larger reference panel (e.g., imputed or sequence data, retaining all SNPs with MAF>.005 and information score >.8). For more details, see Reference Panel.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
To calculate a tagging file under the LDAK-Thin Model, we must first create a weightsfile that gives weighting one to the predictors that remain after thinning for duplicates, and weighting zero to those removed. This can be achieved using the commands
./ldak.out --thin thin --bfile human --window-prune .98 --window-kb 100
awk < thin.in '{print $1, 1}' > weights.thin
This first produces the file thin.in, which contains a list of predictors that remain after thinning, then these predictors are given weighting one in the file weights.thin (note that predictors not in weights.thin are automatically given weighting zero). Then we calculate taggings using the command
./ldak.out --calc-tagging LDAK-Thin --bfile human --weights weights.thin --power -.25 --window-cm 1
The taggings are saved in LDAK-Thin.tagging. If analysing very large data, we can parallelise the process by computing taggings separately for each chromosome, then merging
for j in {21..22}; do
./ldak.out --calc-tagging LDAK-Thin$j --bfile human --weights weights.thin --power -.25 --window-cm 1 --chr $j
done
rm list.txt; for j in {21..22}; do echo "LDAK-Thin$j.tagging" >> list.txt; done
./ldak.out --join-tagging LDAK-Thin --taglist list.txt
Note that in these scripts, we loop from 21 to 22, because our example dataset contains only these two chromosomes; usually you would loop from 1 to 22.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
By default, the Reference SNPs, Regression SNPs and Heritability SNPs are all SNPs in the reference panel. In general, there is no reason to change the Reference SNPs or Heritability SNPs, however, there can be computational advantages to changing the Regression SNPs (see SNP Subsets for more details).
The commands below recomputes the tagging file (again assuming the LDAK-Thin Model), but now reduces the Regression SNPs to those for which we have summary statistics in the file height.txt.
awk < height.txt '(NR>1){print $1}' > height.snps
./ldak.out --calc-tagging LDAK-Thin.height --bfile human --weights weights.thin --power -.25 --window-cm 1 --regression-predictors height.snps
Note that because the first column of height.txt contains the predictor names, it is equivalent to use the command
./ldak.out --calc-tagging LDAK-Thin.height --bfile human --weights weights.thin --power -.25 --window-cm 1 --regression-predictors height.txt
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
Previously, we recommended calculating taggings assuming the LDAK Model, which can be achieved using the command
./ldak.out --calc-tagging LDAK --bfile human --weights <weightsfile> --power -.25 --window-cm 1
where <weightsfile> contains the LDAK weightings (obtained using Calculate Weightings). If you wished to copy LDSC, you should calculate taggings assuming the GCTA Model, which can be achieved using the command
./ldak.out --calc-tagging GCTA --bfile human --ignore-weights YES --power -1 --window-cm 1
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
To calculate taggings assuming the BLD-LDAK Model, we first download the files bld1, bld2, ..., bld64 from the BLD-LDAK Annotations. Then we calculate the LDAK Weightings and rename them bld65, using the commands
./ldak.out --cut-weights sections --bfile human
./ldak.out --calc-weights-all sections --bfile human
mv sections/weights.short bld65
Finally, we calculate taggings using the commands
./ldak.out --calc-tagging BLD-LDAK --bfile human --ignore-weights YES --power -.25 --window-cm 1 --annotation-number 65 --annotation-prefix bld
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
To specify the BLD-LDAK+Alpha Model, we must calculate taggings for multiple instances of the BLD-LDAK Model (each time changing the value of alpha). Therefore, as above, we first download the files bld1, bld2, ..., bld64 from the BLD-LDAK Annotations. Then we calculate the LDAK Weightings and rename them bld65, using the commands
./ldak.out --cut-weights sections --bfile human
./ldak.out --calc-weights-all sections --bfile human
mv sections/weights.short bld65
Finally, we calculate taggings for alpha between -1 and 0.5 (step size 0.05) using the commands
for j in {-20..10}; do
alpha=`echo $j | awk '{print -$1/20}'`; echo $alpha
./ldak.out --calc-tagging BLD-LDAK+Alpha.$j --bfile human --ignore-weights YES --power $alpha --window-cm 1 --annotation-number 65 --annotation-prefix bld
done