Gene-based Analysis

LDAK is able to perform set-based association analysis. We typically use this feature for gene-based testing. However, an alternative is to perform a chunk-based association analysis (e.g., where each chunk is a 100kb section of the genome). Please note that this page is very general; it explains both gene-based and chunk-based analyses, using both individual-level data and summary statistics. If you are only interested in gene-based association testing using summary statistics, you should instead read the instructions for LDAK-GBAT.

To perform a set-based analysis involves three steps: the first divides predictors into sets, the second tests each set for association with the phenotype, while the third joins results across sets. If your sets correspond to genes, the first step requires gene annotations. On Resources, you can download gene annotations for the human genome. The page also explains how I created these files, which might be useful if you wish to construct annotations yourself (e.g., for non-human species).

Always read the screen output, which suggests arguments and estimates memory usage.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Divide predictors into sets:

The main argument is --cut-genes <folder>.

This requires the options

--genefile <genefile>, --chunks <float> or --chunks-bp <integer> - to specify the sets of predictors (see below).

--bfile/--gen/--sp/--speed <datastem> - to specify the genetic data files (see File Formats).

There are three ways to specify predictor sets. The first is to provide gene annotations using --genefile <genefile>. The first four columns of <genefile> should specify the name, chromosome, start and end basepairs of each gene. If there is a fifth column, its value should be either + or -, indicating whether the gene is on the forward or backwards strand. Note that the basepairs should provide 0-start, half-open positions (this is the format used by Browser Extensible Data). For example, if Gene ABC is on Chromosome 7 and contains basepairs 1-10, inclusive, the corresponding row of <genefile> would be "ABC 7 0 10" (if providing strand orientation, the row would be either "ABC 7 0 10 +" or "ABC 7 0 10 -").

By default, LDAK will only consider predictors inside genes. To change this, you can use --gene-buffer <integer>, --up-buffer <integer> or --down-buffer <integer>. For example, if you use --gene-buffer 500, each set will contain predictors inside or within 500bp of a gene. Note that you can only use --up-buffer <integer> or --down-buffer <integer> if <genefile> provides the orientation of the gene (i.e., has five columns). If you add --overlap NO, LDAK will ensure that each predictor is only allocated to the nearest gene.

Instead of providing gene annotations, you can use --chunks <float> to divide predictors into chunks of weighting approximately <float> (you can use --weights <weightsfile> to provide predictor weightings, else each predictor will get weighting one). Alternatively, you can use --chunks-bp <integer> to divide predictors into chunks of length <integer> base pairs. By default, LDAK will construct overlapping chunks. For example, if you use --chunk-bp 1000, the first three sets on a chromosome will span basepairs 1-1000, basepairs 501-1500 and basepairs 1001-2000. However, if you add --overlap NO, then the first three sets on a chromosome will instead span basepairs 1-1000, basepairs 1001-2000 and basepairs 2001-3000.

If you add --partition-length or --by-chr YES, LDAK will divide the genome into partitions, which can be analysed separately.

If you have predictor p-values (e.g., from Single-Predictor Analysis), you can provide these using the option --pvalues <pvalues>. The file <pvalues> should have one row per predictor and two columns, which provide predictor names then p-values. LDAK will then report the minimum p-value for each gene/chunk, and also Bonferroni-corrected p-values.

The main output file is <folder>/genes.details, which details which predictors are in each set (and perhaps the minimum and Bonferroni-corrected p-value for each set). The file <folder>/genes.predictors lists which predictors are within one or more sets. If you provided gene annotations, then the distance from each predictor to the nearest gene will be reported in the file <folder>/genes.distance (negative values indicate the predictor is within a gene).
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Test each set for association:

The main argument is --calc-genes-reml <folder>.

This requires the options

--pheno <phenofile> or --summary <sumsfile> - to specify phenotypes (in PLINK format) or summary statistics (see Summary Statistics for details of how these should be formatted). If analysing individual-level data, you should use --pheno <phenofile> (note that samples without a phenotype will be excluded, while if <phenofile> contains more than one phenotype, you should specify which to use with --mpheno <integer>). If analysing summary statistics, you should use --summary <sumsfile>.

--bfile/--gen/--sp/--speed <datastem> - to specify the genetic data files (see File Formats).

--weights <weightsfile> or --ignore-weights YES - to specify the predictor weightings (if using a weightsfile, this should have two columns, that provide predictor names then weightings). In general, we recommend using --ignore-weights YES.

--power <float> - to specify how predictors are scaled. In general, we recommend using --power -0.25.

If you used --partition-length or --by-chr YES when dividing predictors into sets, you must run this step multiple times, each time using --partition <integer> to specify which partition to analyse.

When analysing individual-level data, you can use --grm <kinfile> to provide a kinship matrix. LDAK will then include this when performing REML (in addition to the kinship matrix calculated across the set of predictors being tested). We recommend doing this when your dataset contains related samples and/or population structure (it is an effective way to guard against confouding due to cryptic relatedness). To construct the kinship matrix, we suggest Calculating Kinships assuming a thinned version of the GCTA Model, that restricts to SNPs in approximate linkage equilibrium (see below).

For each set of predictors, LDAK computes the likelihood ratio test statistic T, equal to twice the difference between the alternative likelihood (the model fit when the predictors in the set are allowed to contribute heritability) and the null likelihood (the model fit when the predictors are assumed to not contribute heritability). LDAK first computes a p-value assuming the distribution of T is 0.5 G(0.5,0.5), half a gamma distribution with shape and rate 0.5 (i.e., half a chi-squared distribution with one degree of freedom). However, as explained by Listgarten el al., this assumption is conservative. Therefore LDAK computes a second (more accurate) p-value by assuming T is distributed c G(a,b), where the shape, rate and fraction parameters (a, b & c) are estimated via permutations. By default, LDAK performs ten permutation for each predictor set; to change this, use --gene-permutations <integer>. By default, the permutations will vary across runs. However, you can prevent this using --random-seed <integer> (see Advanced Options for more details).

When analysing individual-level data, you can useĀ  --covar <covarfile> to provide covariates (in PLINK format) as fixed effect in the regression. Further, if the data contains predictors that are definitely associated with the phenotype, you can include these as fixed effects using --top-preds <toppredsfile>.

To reduce runtime and increase computational stability, LDAK prunes the predictors in each set so that no pair of predictors has squared correlation above a threshold 0.98. By default, this threshold is 0.98 if analysing individual-level data, and 0.5 if analysing summary statistics; to change this threshold, use --gene-prune <float>.

When analysing a binary phenotype, you can use --prevalence <float> to specify the population prevalence. LDAK will then additionally report heritability estimates on the liability scale. Note that if using summary statistics, you will also have to use --ascertainment <float> to provide the proportion of cases in the GWAS.

When analysing individual-level data, you can add --permute YES in order to shuffle the phenotypes (this permutation is separate to those used to calibrate p-values, described above). This option is useful if wishing to perform permutation analysis to determine the distribution of p-values or test statistics when there is no true signal.

Provisional association testing results will be saved in the file <folder>/remls.1 (or in the files <folder>/remls.1, <folder>/remls.2, ... if you have multiple partitions). Meanwhile, the estimated genetic contributions of each set will be saved in files with the prefix <folder>/prs.1 (or with prefixes <folder>/prs.1, <folder>/prs.2, ... if you have multiple partitions).
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Join results across sets:

The main argument is --join-genes-reml <folder>.

This requires no options.

The final association testing results will be saved in <folder>/remls.all. For each set, this file reports the estimated heritability and two p-values (for reasons explained above, we recommend that you use the second p-value, obtained using permutations). The estimated genetic contributions of each set will be saved in SP format in the files <folder>/prs.all.sp, <folder>/prs.all.bim and <folder>/prs.all.fam (see Genetic File Formats for details), while the corresponding p-values are saved in <folder>/prs.all.pvalues. Note that to ensure predictor names are unique, LDAK will prefix gene names by their index.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Example:

Here we use the binary PLINK files human.bed, human.bim and human.fam, the phenotype quant.pheno, and the annotations file anns.txt from the Test Datasets. The file anns.txt contains annotations for human genes on Chromosomes 21 & 22

head -n 5 anns.txt
MIR3648-2 21 9825831 9826011
MIR3687-2 21 9826202 9826263
TEKT4P2 21 9907188 9968594
TEKT4P2 21 9915249 9968594
TPTE 21 10906186 10990943

For example, we see that the Gene TEKT4P2 lies on Chromosome 21 from basepairs 9907189 to 9968594. Note that the same gene can have more than one entry (as is the case here). You may prefer to reduce to just one annotation per gene, but in our experience, it is fine to keep them all.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

1 - Basic gene-based analysis

We can run a gene-based analysis using the commands

./ldak.out --cut-genes genes --bfile human --genefile anns.txt
./ldak.out --calc-genes-reml genes --pheno quant.pheno --bfile human --ignore-weights YES --power -.25
./ldak.out --join-genes-reml genes

The final association testing results are saved in genes/remls.all, while estimates of the genetic contributions of each gene are saved in the files genes/remls.all.sp, genes/remls.all.bim and genes/remls.all.fam (the latter files can be used for Clumping). We can find the most significant gene by sorting based on the likelihood ratio test statistic (Column 10)

head -n 1 genes/remls.all
Gene_Name Gene_Chr Gene_Start Gene_End Length Heritability SD Null_Likelihood Alt_Likelihood LRT_Stat LRT_P_Raw LRT_P_Perm Her_Liability SD_Liability

sort -n -k 10 genes/remls.all |tail -n 1
ABCC13 21 15646120 15673692 20 0.096276 0.052390 -603.7353 -590.3308 26.8090 1.1230e-07 1.3031e-07 NA NA

We see that the most significant gene is ABCC13. Its estimated heritability is 0.10 (SD 0.05), while its test statistic is 27. There are two p-value, 1.1e-7 and 1.3e-7. We prefer the second p-value (Column 12), which was obtained via permutations (note that this p-value is stochastic, so will vary between runs).
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

2 - Gene-based analysis including a genome-wide kinship matrix

We recommend computing the kinship matrix assuming a thinned version of the GCTA Model, for which we use the scripts

./ldak.out --thin le --bfile human --window-prune .05 --window-cm 1
./ldak.out --calc-kins-direct le --bfile human --ignore-weights YES --power -1 --extract le.in

We then repeat the analysis above, except adding this kinship matrix when testing each set of predictors.

./ldak.out --cut-genes genes2 --bfile human --genefile anns.txt
./ldak.out --calc-genes-reml genes2 --pheno quant.pheno --bfile human --ignore-weights YES --power -.25 --grm le
./ldak.out --join-genes-reml genes2

The final association testing results are saved in genes2/remls.all, while estimates of the genetic contributions of each gene are saved in the files genes2/remls.all.sp, genes2/remls.all.bim and genes2/remls.all.fam (the latter files can be used for Clumping). Again we sort genes based on the likelihood ratio test statistic (Column 10)

head -n 1 genes2/remls.all
Gene_Name Gene_Chr Gene_Start Gene_End Length Heritability SD Null_Likelihood Alt_Likelihood LRT_Stat LRT_P_Raw LRT_P_Perm Her_Liability SD_Liability

sort -n -k 10 genes2/remls.all |tail -n 1
ABCC13 21 15646120 15673692 20 0.087546 0.050512 -599.7448 -588.0336 23.4224 6.5027e-07 1.4419e-04 NA NA

We see that the most significant gene remains ABCC13. However, now its estimated heritability is 0.09 (SD 0.05), its test statistic is 23 and its (permuted) p-value is 1.3e-4.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

3 - Basic chunk-based analysis

We can run a chunk-based analysis using the commands

./ldak.out --cut-genes chunks --bfile human --chunks-bp 100000
./ldak.out --calc-genes-reml chunks --pheno quant.pheno --bfile human --ignore-weights YES --power -.25
./ldak.out --join-genes-reml chunks

Here we have used 100kb, overlapping chunks. The final association testing results are saved in chunks/remls.all, while estimates of the genetic contributions of each chunk are saved in the files chunks/remls.all.sp, chunks/remls.all.bim and chunks/remls.all.fam (the latter files can be used for Clumping). Again we sort genes based on the likelihood ratio test statistic (Column 10)

head -n 1 chunks/remls.all
Gene_Name Gene_Chr Gene_Start Gene_End Length Heritability SD Null_Likelihood Alt_Likelihood LRT_Stat LRT_P_Raw LRT_P_Perm Her_Liability SD_Liability

sort -n -k 10 chunks/remls.all |tail -n 1
Chunk_19 21 15550000 15650000 25 0.192006 0.086320 -603.7353 -580.9097 45.6511 7.0653e-12 3.1972e-13 NA NA

We see that the most significant chunk is on Chromosome 21 from basepairs 15,550,000 to 15,650,000. Its estimated heritability is 0.19 (SD 0.09), while its test statistic is 46 and its (permuted) p-value is 3.2e-12.