LDAK-GBAT

LDAK-GBAT is our tool for gene-based association testing using summary statistics (if you are instead analyzing individual-level data, or wish to perform a chunk-based analysis, see Gene-Based Analysis). In our recent preprint LDAK-GBAT: fast and powerful gene-based association testing using summary statistics (currently on Biorxiv), we showed that LDAK-GBAT was more powerful than five existing tools for gene-based association testing (including MAGMA, GCTA-fastBAT and SKATO).

Note that this page provides generic instructions for LDAK-GBAT. If you would instead like to see the scripts we used for the analyses in our preprint (as well as the scripts we used when running the existing tools), see Takiy's GitHub.

To run LDAK-GBAT involves three steps: the first divides predictors into genes, the second tests each genes for association with the phenotype, while the third joins results across genes. Note that in the second step, you must specify a Heritability Model. In general, we recommend the "Human Default Model" (achieved using --power -0.25, as explained below).

Prior to running LDAK-GBAT, you must first ensure your summary statistics are in the format required by LDAK (see Summary Statistics for details). Further, you will require a (well-matched) reference panel. Ideally, this should contain at least 2000 samples (otherwise, Reference Panel provides scripts for constructing a smaller panel from 1000 Genome Project data). Lastly, you will require gene annotations. On Resources, you can download gene annotations for the human genome. The page also explains how I created these files, which might be useful if you wish to construct annotations yourself (e.g., for non-human species).

Always read the screen output, which suggests arguments and estimates memory usage.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Divide predictors into genes:

The main argument is --cut-genes <folder>.

This requires the options

--bfile/--gen/--sp/--speed <datastem> - to specify the genetic data files (see File Formats).

--genefile <genefile> - to specify the gene annotations file. The first four columns of <genefile> should specify the name, chromosome, start and end basepairs of each gene. If there is a fifth column, its value should be either + or -, indicating whether the gene is on the forward or backwards strand. Note that the basepairs should provide 0-start, half-open positions (this is the format used by Browser Extensible Data). For example, if Gene ABC is on Chromosome 7 and contains basepairs 1-10, inclusive, the corresponding row of <genefile> would be "ABC 7 0 10" (if providing strand orientation, the row would be either "ABC 7 0 10 +" or "ABC 7 0 10 -").

By default, LDAK will only consider predictors inside genes. To change this, you can use --gene-buffer <integer>, --up-buffer <integer> or --down-buffer <integer>. For example, if you use --gene-buffer 500, each gene will contain predictors inside or within 500bp of a gene. Note that you can only use --up-buffer <integer> or --down-buffer <integer> if <genefile> provides the orientation of the gene (i.e., has five columns). If you add --overlap NO, LDAK will ensure that each predictor is only allocated to the nearest gene.

For details of other available options, see Gene-Based Analysis. For example, this page assumes you are performing a gene-based analysis, but if you replace --genefile <genefile> with --chunks <float> or --chunks <float>, LDAK will instead perform a chunk-based analysis (this is useful if you do not have gene annotations, or wish to include non-genic predictors).

The main output file is <folder>/genes.details, which details which predictors are in each gene. The file <folder>/genes.predictors.used lists which predictors are within one or more gene, while <folder>/genes.distance reports the distance from each predictor to the nearest gene (negative values indicate the predictor is within a gene).
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Test each gene for association:

The main argument is --calc-genes-reml <folder>.

This requires the options

--summary <sumsfile> - to specify the file containing the summary statistics.

--bfile/--gen/--sp/--speed <datastem> - to specify the genetic data files (see File Formats). Note that if you are analysing summary statistics, the genetic data is used as a Reference Panel.

--power <float> - to specify how predictors are scaled. In general, we recommend using --power -0.25.

By default, LDAK will ignore predictors with ambiguous alleles (those with alleles A & T or C & G) to protect against possible strand errors. If you are confident that these are correctly aligned, you can force LDAK to include them by adding --allow-ambiguous YES.

To increase computational stability and robustness to choice of reference panel, LDAK prunes the predictors in each gene. By default, it prunes so that no pair of predictors has squared correlation 0.5; to change this threshold, use --gene-prune <float>.

LDAK uses permutations to derive the null distribution of the likelihood ratio test statistic. By default, LDAK performs ten permutation for each gene; to change this, use --gene-permutations <integer>. By default, the permutations will vary across runs. However, you can prevent this using --random-seed <integer> (see Advanced Options for more details).

By default, LDAK will assign all predictors weighting one (equivalent to using --ignore-weights YES). You can provide your own weightings using --weights <weightsfile>, but we have not found this beneficial.

Provisional association testing results will be saved in the file <folder>/remls.1, while the estimated genetic contributions of each gene will be saved in files with the prefix <folder>/prs.1.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Join results across genes:

The main argument is --join-genes-reml <folder>.

This requires no options.

The final association testing results will be saved in <folder>/remls.all. For each gene, this file reports the estimated heritability and two p-values (we recommend that you use the second p-value, obtained using permutations). The estimated genetic contributions of all gene will be saved in SP format in the files <folder>/prs.all.sp, <folder>/prs.all.bim and <folder>/prs.all.fam (see Genetic File Formats for details), while the corresponding p-values are saved in <folder>/prs.all.pvalues.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Example:

Here we use the binary PLINK files human.bed, human.bim and human.fam, the phenotype quant.pheno, and the annotations file anns.txt from the Test Datasets. The file anns.txt contains annotations for human genes on Chromosomes 21 & 22

head -n 5 anns.txt
MIR3648-2 21 9825831 9826011
MIR3687-2 21 9826202 9826263
TEKT4P2 21 9907188 9968594
TEKT4P2 21 9915249 9968594
TPTE 21 10906186 10990943

For example, we see that the Gene TEKT4P2 lies on Chromosome 21 from basepairs 9907189 to 9968594. Note that the same gene can have more than one entry (as is the case here). You may prefer to reduce to just one annotation per gene, but in our experience, it is fine to keep them all.

Although we have individual-level data (i.e., we have genotypes and phenotypes for the same samples), for this example we will pretend we are using summary statistics. Therefore, we will first create summary statistics by running

./ldak.out --linear quant --bfile human --pheno quant.pheno

The summary statistics are saved in quant.summaries (already in the format required by LDAK). For more details on this command, see Single-Predictor Analysis. Then we will use the genetic data files as the reference panel.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

1 - Divide predictors into genes

We run the command

./ldak.out --cut-genes gbat --bfile human --genefile anns.txt

Details of which predictors belong to each gene are saved in gbat/gene.details. Here, LDAK will only consider predictors within each gene. However, if you wanted, for example, to also include predictors within 500bp of each gene, you should add --gene-buffer 500.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

2 - Test each gene for association

We run the command

./ldak.out --calc-genes-reml gbat --summary quant.summaries --bfile human --power -.25 --allow-ambiguous YES

Here, we added --allow-ambiguous YES, so that LDAK does not exclude SNPs with alleles A & T or C & G (because the summary statistics were obtained from the genetic data we are using as a reference panel, we know the strands must be consistent).

Provisional association testing results are saved in gbat/remls.1.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

3 - Join results across genes

We run the command

./ldak.out --join-genes-reml gbat

The final results are saved in gbat/remls.all, while estimates of the genetic contributions of each gene are saved in the files gbat/prs.all.sp, gbat/prs.all.bim and gbat/prs.all.fam (the latter files can be used for Clumping). We can find the most significant gene by sorting based on the likelihood ratio test statistic (Column 10)

head -n 1 gbat/remls.all
Gene_Name Gene_Chr Gene_Start Gene_End Length Heritability SD Null_Likelihood Alt_Likelihood LRT_Stat LRT_P_Raw LRT_P_Perm Her_Liability SD_Liability

sort -n -k 10 gbat/remls.all|tail -n 1
ABCC13 21 15646120 15663706 12 0.068706 0.039202 -599.4063 -586.3585 26.0956 1.6246e-07 7.3779e-07 NA NA

We see that the most significant gene is ABCC13. Its estimated heritability is 0.10 (SD 0.05), while its test statistic is 26. There are two p-value, 1.6e-7 and 7.4e-7. We prefer the second p-value (Column 12), which was obtained via permutations (note that this p-value is stochastic, so will vary between runs).