Gene-based Analysis

LDAK is able to perform set-based association analysis, using either individual-level data or summary statistics. We typically use this feature for gene-based testing. This is often more powerful than Single-Predictor Analysis, both because it reduces the multiple testing (for human data, there will be ~20,000 genes, compared to ~1M SNPs), and because there are biological reasons why we should jointly analyse predictors within the same gene. An alternative is to perform a chunk-based association analysis, where each chunk is, for example, a 100kb section of the genome (this has the advantage of not excluding non-genic predictors).

To test whether a set of predictors is associated with the phenotype, LDAK calculates whether their estimated heritability is significantly greater than zero. The naive way to do this, would be to first Calculate Kinships across each set of predictors, then regress the phenotype on each kinship matrix using REML. However, this would be very computationally intensive. Fortunately, when the sets are small (e.g., less than 1000 predictors), this analysis can be performed very efficiently.

Here, we explain how to perform set-based tests of associations. There are three steps: the first divides predictors into sets, the second tests each set for association with the phenotype, while the third joins results across sets.

Always read the screen output, which suggests arguments and estimates memory usage.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Divide predictors into sets:

The main argument is --cut-genes <folder>.

This requires the options

--genefile <genefile>, --chunks <float> or --chunks-bp <integer> - to specify the sets of predictors (see below).

--bfile/--gen/--sp/--speed <datastem> - to specify the genetic data files (see File Formats).

There are three ways to specify predictor sets. The first is to provide gene annotations using --genefile <genefile>. The first four columns of <genefile> should specify the name, chromosome, start and end basepairs of each gene. If there is a fifth column, its value should be either + or -, indicating whether the gene is on the forward or backwards strand. Note that the basepairs should provide 0-start, half-open positions (this is the format used by Browser Extensible Data). For example, if Gene ABC is on Chromosome 7 and contains basepairs 1-10, inclusive, the corresponding row of <genefile> would be "ABC 7 0 10" (if providing strand orientation, the row would be either "ABC 7 0 10 +" or "ABC 7 0 10 -").

I obtain gene annotations using the Tables feature on Genome Browser. Click here, then select the group "Genes and Gene Predictions" and the track you desire (e.g., "UCSC Genes"). Make sure "genome" is ticked (next to region), then choose "selected fields from primary and related tables" and enter a filename. Having clicked "get output", make sure you tick the fields "name", "chrom", "strand" and either "txStart" and "txEnd" or (my preference) "cdsStart" and "cdsEnd".

By default, LDAK will only consider predictors inside genes. To change this, you can use --gene-buffer <integer>, --up-buffer <integer> or --down-buffer <integer>. For example, if you use --gene-buffer 500, each set will contain predictors inside or within 500bp of a gene. Note that you can only use --up-buffer <integer> or --down-buffer <integer> if <genefile> provides the orientation of the gene (i.e., has five columns). If you add --overlap NO, LDAK will ensure that each predictor is only allocated to the nearest gene.

Instead of providing gene annotations, you can use --chunks <float> to divide predictors into chunks of weighting approximately <float> (you can use --weights <weightsfile> to provide predictor weightings, else each predictor will get weighting one). Alternatively, you can use --chunks-bp <integer> to divide predictors into chunks of length <integer> base pairs. By default, LDAK will construct overlapping chunks. For example, if you use --chunk-bp 1000, the first three sets on a chromosome will span basepairs 1-1000, basepairs 501-1500 and basepairs 1001-2000. However, if you add --overlap NO, then the first three sets on a chromosome will instead span basepairs 1-1000, basepairs 1001-2000 and basepairs 2001-3000.

If you add --partition-length or --by-chr YES, LDAK will divide the genome into partitions, which can be analyzed separately.

If you have predictor p-values (e.g., from Single-Predictor Analysis), you can provide these using the option --pvalues <pvalues>. The file <pvalues> should have one row per predictor and two columns, which provide predictor names then p-values. LDAK will then report the minimum p-value for each gene/chunk, and also Bonferroni-corrected p-values.

The main output file is <folder>/genes.details, which details which predictors are in each set (and perhaps the minimum and Bonferroni-corrected p-value for each set). The file <folder>/genes.predictors lists which predictors are within one or more sets. If you provided gene annotations, then the distance from each predictor to the nearest gene will be reported in the file <folder>/genes.distance (negative values indicate the predictor is within a gene).
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Test each set for association:

The main argument is --calc-genes-reml <folder>.

This requires the options

--pheno <phenofile> or --summary <sumsfile> - to specify phenotypes (in PLINK format) or summary statistics (see Summary Statistics for details of how these should be formated). If analysing individual-level data, you should use --pheno <phenofile>. Note that samples without a phenotype will be excluded, while if <phenofile> contains more than one phenotype, you should specify which to use with --mpheno <integer>. If analysing summary statistics, you should use --summary <sumsfile>.

--bfile/--gen/--sp/--speed <datastem> - to specify the genetic data files (see File Formats). Note that if you are analysing summary statistics, the genetic data is used as a Reference Panel.

--weights <weightsfile> or --ignore-weights YES - to specify the predictor weightings (if using a weightsfile, this should have two columns, that provide predictor names then weightings). In general, we recommend using --ignore-weights YES.

--power <float> - to specify how predictors are scaled. In general, we recommend using --power -0.25.

If you used --partition-length or --by-chr YES when dividing predictors into sets, you must run this step multiple times, each time using --partition <integer> to specify which partition to analyse.

When analysing individual-level data, you can use --grm <kinfile> to provide a kinship matrix. LDAK will then include this when performing REML (in addition to the kinship matrix calculated across the set of predictors being tested). We recommend doing this when your dataset contains related samples and/or population structure (it is an effective way to guard against confouding due to cryptic relatedness). To construct the kinship matrix, we suggest Calculating Kinships assuming a thinned version of the GCTA Model, that restricts to SNPs in approximate linkage equilibrium (see below).

For each set of predictors, LDAK computes the likelihood ratio test statistic T, equal to twice the difference between the alternative likelihood (the model fit when the predictors in the set are allowed to contribute heritability) and the null likelihood (the model fit when the predictors are assumed to not contribute heritability). LDAK first computes a p-value assuming the distribution of T is 0.5 G(0.5,0.5), half a gamma distribution with shape and rate 0.5 (i.e., half a chi-squared distribution with one degree of freedom). However, as explained by Listgarten el al., this assumption is conservative. Therefore LDAK computes a second (more accurate) p-value by assuming T is distributed c G(a,b), where the shape, rate and fraction parameters (a, b & c) are estimated via permutations. By default, LDAK performs one permutation for each predictor set. To change this, use --gene-permutations <integer>.

When using summary statistics, LDAK will by default ignore alleles with ambiguous alleles (those with alleles A & T or C & G) to protect against possible strand errors. If you are confident that these are correctly aligned, you can force LDAK to include them by adding --allow-ambiguous YES.

When analysing individual-level data, you can use --covar <covarfile> to provide covariates (in PLINK format) as fixed effect in the regression. Further, if the data contains predictors that are definitely associated with the phenotype, you can include these as fixed effects using --top-preds <toppredsfile>.

To increase computational stability, prior to testing each set, LDAK prunes its predictors. When analysing individual-level data, it prunes so that no pair of predictors has squared correlation 0.98; when analysing summary statistics, it prunes so that no pair of predictors has squared correlation above 0.5. To change this threshold, use --gene-prune <float>.

If the phenotype is binary, you can use --prevalence <float> to specify the population prevalence. LDAK will then additionally report estimates on the liability scale. Note that if you are using summry statistics, you must also use --ascertainment <float> to specify the proportion of cases in the association study from which the summary statistics came.

If you add --permute YES,  the phenotypic values will be shuffled. This is useful if wishing to perform permutation analysis to see the distribution of p-values or test statistics when there is no true signal.

The results will be saved in the file <folder>/remls.1 (or in the files <folder>/remls.1, <folder>/remls.2, ... if you have multiple partitions).
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Join results across sets:

The main argument is --join-genes-reml <folder>.

This requires no options.

The joined results will be saved in <folder>/remls.all. For each set, this file reports the estimated heritability and two p-values (for reasons explained above, we recommend that you use the second p-value, obtained using permutations).
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Example:

Here we use the binary PLINK files human.bed, human.bim and human.fam, the phenotype quant.pheno, and the annotations file anns.txt from the Test Datasets. The file anns.txt contains annotations for human genes on Chromosomes 21 & 22

head -n 5 anns.txt
MIR3648-2 21 9825831 9826011
MIR3687-2 21 9826202 9826263
TEKT4P2 21 9907188 9968594
TEKT4P2 21 9915249 9968594
TPTE 21 10906186 10990943

For example, we see that the Gene TEKT4P2 lies on Chromosome 21 from basepairs 9907189 to 9968594. Note that the same gene can have more than one entry (as is the case here). You may prefer to reduce to just one annotation per gene, but in our experience, it is fine to keep them all.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

1 - Basic gene-based analysis using individual-level data

We can run a gene-based analysis using the commands

./ldak.out --cut-genes genes --bfile human --genefile anns.txt
./ldak.out --calc-genes-reml genes --pheno quant.pheno --bfile human --ignore-weights YES --power -.25
./ldak.out --join-genes-reml genes

The final results are saved in genes/remls.all. We can find the most significant gene by sorting based on the likelihood ratio test statistic (Column 10)

head -n 1 genes/remls.all
Gene_Name Gene_Chr Gene_Start Gene_End Length Heritability SD Null_Likelihood Alt_Likelihood LRT_Stat LRT_P_Raw LRT_P_Perm Her_Liability SD_Liability

sort -n -k 10 genes/remls.all |tail -n 1
ABCC13 21 15646120 15673692 20 0.096276 0.052390 -603.7353 -590.3308 26.8090 1.1230e-07 1.3031e-07 NA NA

We see that the most significant gene is ABCC13. Its estimated heritability is 0.10 (SD 0.05), while its test statistic is 27. There are two p-value, 1.1e-7 and 1.3e-7. We prefer the latter (Column 12), which was obtained via permutations (note that this p-value is stochastic, so will vary between runs).
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

2 - Gene-based analysis including a genome-wide kinship matrix

We recommend computing the kinship matrix assuming a thinned version of the GCTA Model, for which we use the scripts

./ldak.out --thin le --bfile human --window-prune .05 --window-cm 1
./ldak.out --calc-kins-direct le --bfile human --ignore-weights YES --power -1 --extract le.in

We then repeat the analysis above, except adding this kinship matrix when testing each set of predictors.

./ldak.out --cut-genes genes2 --bfile human --genefile anns.txt
./ldak.out --calc-genes-reml genes2 --pheno quant.pheno --bfile human --ignore-weights YES --power -.25 --grm le
./ldak.out --join-genes-reml genes2

The final results are saved in genes2/remls.all. Again we sort genes based on the likelihood ratio test statistic (Column 10)

head -n 1 genes2/remls.all
Gene_Name Gene_Chr Gene_Start Gene_End Length Heritability SD Null_Likelihood Alt_Likelihood LRT_Stat LRT_P_Raw LRT_P_Perm Her_Liability SD_Liability

sort -n -k 10 genes2/remls.all |tail -n 1
ABCC13 21 15646120 15673692 20 0.087546 0.050512 -599.7448 -588.0336 23.4224 6.5027e-07 1.4419e-04 NA NA

We see that the most significant gene remains ABCC13. However, now its estimated heritability is 0.09 (SD 0.05), its test statistic is 23 and its (permuted) p-value is 1.3e-4.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

3 - Basic gene-based analysis using summary statistics

Although we have individual-level data, we now pretend we are using summary statistics. Therefore, we will first create summary statistics by running

./ldak.out --linear quant --bfile human --pheno quant.pheno

The summary statistics are saved in quant.summaries (already in the format required by LDAK). For more details on this command, see Single-Predictor Analysis. Then we will use the genetic data files as the reference panel.

We now repeat the analysis above, except providing summary statistics instead of phenotypes

./ldak.out --cut-genes genes3 --bfile human --genefile anns.txt
./ldak.out --calc-genes-reml genes3 --summary quant.summaries --bfile human --ignore-weights YES --power -.25 --allow-ambiguous YES
./ldak.out --join-genes-reml genes3

Here, we added --allow-ambiguous YES in the second command, so that LDAK does not exclude SNPs with alleles A & T or C & G (because the summary statistics were obtained from the genetic data we are using as a reference panel, we know the strands must be consistent). The final results are saved in genes3/remls.all. Again we sort genes based on the likelihood ratio test statistic (Column 10)

head -n 1 genes3/remls.all
Gene_Name Gene_Chr Gene_Start Gene_End Length Heritability SD Null_Likelihood Alt_Likelihood LRT_Stat LRT_P_Raw LRT_P_Perm Her_Liability SD_Liability

sort -n -k 10 genes3/remls.all |tail -n 1
ABCC13 21 15646120 15663706 12 0.068706 0.039202 -599.4063 -586.3585 26.0956 1.6246e-07 5.3744e-04 NA NA

The most significant gene remains ABCC13. Its estimated heritability is 0.07 (SD 0.04), its test statistic is 26 and its (permuted) p-value is 5.4e-4. Note that these values differ from those obtained using individual-level data, because prior to testing sets, LDAK has pruned its predictors more strongly (if you added --gene-prune 0.98 in the second command, the results would be almost identical).
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

4 - Basic chunk-based analysis using individual-level data

We can run a chunk-based analysis using the commands

./ldak.out --cut-genes chunks --bfile human --chunks-bp 100000
./ldak.out --calc-genes-reml chunks --pheno quant.pheno --bfile human --ignore-weights YES --power -.25
./ldak.out --join-genes-reml chunks

Here we have used 100kb, overlapping chunks. The final results are saved in chunks/remls.all. Again we sort genes based on the likelihood ratio test statistic (Column 10)

head -n 1 chunks/remls.all
Gene_Name Gene_Chr Gene_Start Gene_End Length Heritability SD Null_Likelihood Alt_Likelihood LRT_Stat LRT_P_Raw LRT_P_Perm Her_Liability SD_Liability

sort -n -k 10 chunks/remls.all |tail -n 1
Chunk_19 21 15550000 15650000 25 0.192006 0.086320 -603.7353 -580.9097 45.6511 7.0653e-12 3.1972e-13 NA NA

We see that the most significant chunk is on Chromosome 21 from basepairs 15,550,000 to 15,650,000. Its estimated heritability is 0.19 (SD 0.09), while its test statistic is 46 and its (permuted) p-value is 3.2e-12.