Single-Predictor Analysis

Here we explain how to perform one-predictor-at-a-time analysis, using linear regression (either classical or mixed-model) or logistic regression (only classical). When analysing a binary phenotype, it is theoretically better to use logistic regression, but in practice, linear regression usually suffices (and is substantially faster).

Always read the screen output, which suggests arguments and estimates memory usage.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Linear regression:

The main argument is --linear <outfile>.

This requires the options

--bfile/--gen/--sp/--speed <datastem> - to specify the genetic data files (see File Formats).

--pheno <phenofile> - to specify phenotypes (in PLINK format). Samples without a phenotype will be excluded. If <phenofile> contains more than one phenotype, specify which should be used with --mpheno <integer>.
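
As a rough illustration of how --mpheno relates to the file layout, the sketch below counts the phenotypes in a file. It assumes the usual PLINK phenotype layout (two ID columns followed by one column per phenotype, no header line), that missing values are written NA, and that phenotypes are counted from 1; the file multi.pheno and its contents are made up for this example.

```shell
# Hypothetical phenotype file: family ID, individual ID, then one column per phenotype
cat > multi.pheno <<'EOF'
fam1 ind1 1.2 0
fam2 ind2 0.7 1
fam3 ind3 NA 1
EOF

# Number of phenotypes = number of columns minus the two ID columns
num_pheno=$(awk 'NR == 1 { print NF - 2 }' multi.pheno)
echo "multi.pheno contains $num_pheno phenotypes"

# To analyse the second phenotype one would then run something like
# (needs LDAK and genetic data whose sample IDs match the phenotype file):
# ./ldak.out --linear second --bfile human --pheno multi.pheno --mpheno 2
```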

By default, LDAK will perform classical linear regression. To instead perform mixed-model linear regression, you should first Calculate Kinships, then provide the kinship matrix using --grm <kinfile>. Typically, this kinship matrix is computed assuming a thinned version of the GCTA Model, which restricts to SNPs in approximate linkage equilibrium. Note that it is common to use leave-one-chromosome-out (LOCO) analysis to avoid proximal contamination (explained in these papers by Lippert et al. and Yang et al.). To perform LOCO analysis, you should use --chr <integer> to analyse one chromosome at a time, each time providing a kinship matrix calculated across all other chromosomes. For more details, see the example below.

You can use --covar <covarfile> to provide covariates (in PLINK format) as fixed effects in the regression. If the data contains predictors that are definitely associated with the phenotype, you can include these as fixed effects using --top-preds <toppredslist>. This is useful if wishing to perform a conditional analysis (i.e., to find secondary associations).

To perform weighted linear regression, use --sample-weights <sampleweightfile> (this is not possible if providing a kinship matrix). The file <sampleweightfile> should have three columns, where each row provides two sample IDs followed by a positive float. Note that by default, LDAK will use the sandwich estimator of the effect size variance (see this page for an explanation); to instead revert to the standard estimator of variance, add --sandwich NO.
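
The three-column layout described above can be produced with a short script. This is only a sketch: the file demo.fam, its sample IDs and the constant weight of 1.0 are invented for illustration, and in practice you would derive the weights from your own study design.

```shell
# Hypothetical sample list in PLINK .fam layout (the first two columns are the sample IDs)
cat > demo.fam <<'EOF'
fam1 ind1 0 0 1 -9
fam2 ind2 0 0 2 -9
fam3 ind3 0 0 1 -9
EOF

# Write the two sample IDs followed by a positive weight; here every sample gets 1.0
awk '{ print $1, $2, "1.0" }' demo.fam > weights.txt

# Sanity check: every row must have three columns and a positive weight
awk 'NF != 3 || $3 <= 0 { bad = 1 } END { exit bad + 0 }' weights.txt && echo "weights.txt looks valid"

# The weights file would then be passed to LDAK along the lines of:
# ./ldak.out --linear weighted --bfile human --pheno quant.pheno --sample-weights weights.txt
```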

If the phenotype is binary, you can use --prevalence <float> to specify the population prevalence. LDAK will then additionally report estimates on the liability scale.

If you add --permute YES, the phenotypic values will be shuffled. This is useful if wishing to perform permutation analysis to see the distribution of p-values or test statistics when there is no true signal.

The main output file is <outfile>.assoc, which provides full results for each predictor. <outfile>.summaries provides the results necessary for performing analyses using SumHer, while <outfile>.pvalues contains just the p-values. <outfile>.coeff provides estimates of the fixed effects, while <outfile>.score contains prediction models corresponding to six different p-value thresholds.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Logistic regression:

The main argument is --logistic <outfile>.

This requires the options

--bfile/--gen/--sp/--speed <datastem> - to specify the genetic data files (see File Formats).

--pheno <phenofile> - to specify phenotypes (in PLINK format). The phenotype must be binary (either cases 1 and controls 0, or cases 2, controls 1). Samples without a phenotype will be excluded. If <phenofile> contains more than one phenotype, specify which should be used with --mpheno <integer>.
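
Before running logistic regression, it can be worth checking that the phenotype uses one of the two accepted codings. The sketch below does this for a made-up file; it assumes the phenotype sits in column 3 (for phenotype k of a multi-phenotype file the column would be k+2) and that missing values are written NA.

```shell
# Hypothetical binary phenotype file: family ID, individual ID, phenotype
cat > demo.binary.pheno <<'EOF'
fam1 ind1 1
fam2 ind2 2
fam3 ind3 NA
fam4 ind4 1
EOF

# Collect the distinct non-missing values in the phenotype column
values=$(awk '$3 != "NA" { print $3 }' demo.binary.pheno | sort -u | tr '\n' ' ')

# Accept only controls 0 / cases 1, or controls 1 / cases 2
case "$values" in
  "0 1 "|"1 2 ") coding="ok" ;;
  *) coding="unexpected" ;;
esac
echo "Distinct values: ${values}(coding $coding)"
```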

You can use --covar <covarfile> to provide covariates (in PLINK format) as fixed effects in the regression. If the data contains predictors that are definitely associated with the phenotype, you can include these as fixed effects using --top-preds <toppredslist>. This is useful if wishing to perform a conditional analysis (i.e., to find secondary associations).

If you add --permute YES, the phenotypic values will be shuffled. This is useful if wishing to perform permutation analysis to see the distribution of p-values or test statistics when there is no true signal.

The main output file is <outfile>.assoc, which provides full results for each predictor. <outfile>.summaries provides the results necessary for performing analyses using SumHer, while <outfile>.pvalues contains just the p-values. <outfile>.coeff provides estimates of the fixed effects, while <outfile>.score contains prediction models corresponding to six different p-value thresholds.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Example:

Here we use the binary PLINK files human.bed, human.bim and human.fam, the phenotypes quant.pheno and binary.pheno, and the covariates covar.covar from the Test Datasets.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

1  - Classical linear regression.

We regress quant.pheno on each SNP in the genetic data by running

./ldak.out --linear single --bfile human --pheno quant.pheno

The main results are saved in single.assoc. To repeat this including the covariates, we run

./ldak.out --linear single2 --bfile human --pheno quant.pheno --covar covar.covar

The main results are saved in single2.assoc. The most significant SNP is 21:15603999 (P=5e-17 without covariates, 2e-14 with covariates). We can perform a conditional analysis including this SNP as a covariate by running

echo 21:15603999 > top.txt
./ldak.out --linear single3 --bfile human --pheno quant.pheno --top-preds top.txt

The main results are saved in single3.assoc.
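
Rather than typing the top SNP by hand, its name could be pulled from a results file. The helper below is a hypothetical sketch, not part of LDAK: it assumes a whitespace-separated file with exactly one header line, and it takes the positions of the predictor-name and p-value columns as arguments because the exact column layout should be checked against the header of your own output file.

```shell
# top_pred FILE NAME_COL P_COL
# Prints the predictor with the smallest p-value, skipping the header line and NA entries
top_pred() {
  awk -v n="$2" -v p="$3" '
    NR > 1 && $p != "NA" && (!found || $p + 0 < min) { min = $p + 0; best = $n; found = 1 }
    END { if (found) print best }
  ' "$1"
}
```

If, for instance, single.pvalues holds predictor names in column 1 and p-values in column 2 (an assumption to verify), then top_pred single.pvalues 1 2 > top.txt would create the list used for the conditional analysis.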
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

2  - Mixed-model linear regression.

We first need to create a kinship matrix. We recommend doing this assuming a thinned version of the GCTA Model, so we obtain a list of predictors in approximate linkage equilibrium, then calculate kinships using only these predictors

./ldak.out --thin le --bfile human --window-prune .05 --window-cm 1
./ldak.out --calc-kins-direct le --bfile human --ignore-weights YES --power -1 --extract le.in

The list of thinned predictors is saved in le.in, and the kinship matrix is saved with stem le. We perform mixed-model linear regression by running

./ldak.out --linear single4 --bfile human --pheno quant.pheno --grm le

The main results are saved in single4.assoc.

To perform LOCO analysis, we must first create a kinship matrix for each chromosome which is constructed using only predictors on other chromosomes (a complementary kinship matrix). We can do this using the following script

#First we create per-chromosome kinship matrices
for j in {21..22}; do
./ldak.out --calc-kins-direct le$j --bfile human --ignore-weights YES --power -1 --extract le.in --chr $j
done

#Then we join these to obtain a genome-wide kinship matrix
rm -f list.All
for j in {21..22}; do echo "le$j" >> list.All; done
./ldak.out --add-grm leAll --mgrm list.All

#Finally, we subtract the per-chromosome kinship matrices from the genome-wide matrix
for j in {21..22}; do
echo "leAll" > list.$j
echo "le$j" >> list.$j
./ldak.out --sub-grm leN$j --mgrm list.$j
done

The complementary kinship matrices are saved with stems leN21 and leN22. Note that in this script, we loop from 21 to 22, because our example dataset contains only these two chromosomes; usually you would loop from 1 to 22. Now we can perform the mixed-model linear regression for each chromosome in turn by running

for j in {21..22}
do ./ldak.out --linear loco$j --bfile human --pheno quant.pheno --grm leN$j --chr $j
done

The main results are saved in loco21.assoc and loco22.assoc.
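
Because LOCO analysis produces one results file per chromosome, it is often convenient to combine them. The helper below is a hypothetical sketch, not an LDAK feature; it assumes every results file begins with a single header line and keeps only the header of the first file.

```shell
# merge_assoc OUTFILE INFILE [INFILE ...]
# Concatenates the input files, keeping the header line from the first file only
merge_assoc() {
  out=$1
  shift
  awk 'FNR == 1 && NR != 1 { next } { print }' "$@" > "$out"
}
```

With the output stems used in the loop above, this could be called as merge_assoc loco.all.assoc loco21.assoc loco22.assoc.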
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

3  - Classical logistic regression.

We regress binary.pheno on each SNP in the genetic data by running

./ldak.out --logistic single5 --bfile human --pheno binary.pheno

The main results are saved in single5.assoc.