Single-Predictor Analysis

This page explains how to perform classical linear or logistic regression; if you are interested in mixed-model versions, you should instead use LDAK-KVIK.

Here we explain how to perform one-predictor-at-a-time analysis, using linear regression (either classical or mixed-model) or logistic regression (only classical). When analysing a binary phenotype, it is theoretically better to use logistic regression, but in practice, linear regression usually suffices (and is substantially faster).

Always read the screen output, which suggests arguments and estimates memory usage.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Linear regression:

The main argument is --linear <outfile>.

This requires the options

--bfile/--gen/--sp/--speed <datastem> - to specify the genetic data files (see File Formats).

--pheno <phenofile> - to specify phenotypes (in PLINK format). Samples without a phenotype will be excluded. If <phenofile> contains more than one phenotype, specify which should be used with --mpheno <integer>.
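Before choosing a value for --mpheno, it can help to count how many phenotypes the file holds. The following is a minimal sketch, assuming the PLINK phenotype layout (two ID columns followed by one column per phenotype) with no header line; the file toy.pheno, its values and the output stem out2 are made up for illustration.

```shell
# Toy phenotype file (made-up values): FID IID pheno1 pheno2
workdir=$(mktemp -d)
cat > "$workdir/toy.pheno" <<'EOF'
fam1 ind1 0.52 1.10
fam2 ind2 -0.31 0.75
fam3 ind3 1.27 0.08
EOF

# Number of phenotypes = number of columns minus the two ID columns
n_pheno=$(awk 'NR == 1 { print NF - 2 }' "$workdir/toy.pheno")
echo "toy.pheno contains $n_pheno phenotype(s)"

# With more than one phenotype, one of them is selected with --mpheno, e.g.
#   ./ldak.out --linear out2 --bfile human --pheno toy.pheno --mpheno 2
```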

You can use --covar <covarfile> to provide covariates (in PLINK format) as fixed effects in the regression. If the data contain predictors that are definitely associated with the phenotype, you can include these as fixed effects using --top-preds <toppredslist>. This is useful when performing a conditional analysis (i.e., to find secondary associations).

If you add --spa-test YES, LDAK will recompute p-values for the most associated predictors using a SaddlePoint Approximation.

It remains possible to add --grm <kinfile>, in which case LDAK will perform mixed-model linear regression. However, we instead recommend using LDAK-KVIK, which is usually faster and more powerful.

To perform a within-family analysis, add --families YES (for more details of this analysis, see Howe et al.). Note that LDAK will infer families based on the 1st column of the fam file (the FID).
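Because families are inferred from the FID column, a quick tally of family sizes can confirm that the fam file is coded as you expect before adding --families YES. This is only a sketch: the file toy.fam and its contents are made up, and it assumes the standard six-column PLINK fam layout.

```shell
# Toy fam file (made-up): FID IID PID MID SEX PHENO
workdir=$(mktemp -d)
cat > "$workdir/toy.fam" <<'EOF'
famA a1 0 0 1 -9
famA a2 0 0 2 -9
famA a3 a1 a2 1 -9
famB b1 0 0 1 -9
famC c1 0 0 2 -9
famC c2 0 0 2 -9
EOF

# Count samples per FID, then count how many FIDs have at least two members
n_multi=$(awk '{ size[$1]++ }
               END { n = 0; for (f in size) if (size[f] > 1) n++; print n }' "$workdir/toy.fam")
echo "Families with two or more members: $n_multi"
```

If every FID is unique (for example, because FID was set equal to IID), this count is zero and the fam file would need recoding first.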

To perform a trio analysis, add --trios YES. Note that LDAK will infer trios based on the 3rd and 4th column of the fam file (the PID and MID).
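Since trios are inferred from the PID and MID columns, it can be worth counting how many samples have both parents listed in the fam file. The sketch below uses a made-up file toy.fam and assumes the standard six-column PLINK fam layout with 0 denoting an unknown parent; whether LDAK applies exactly this rule when forming trios is not stated here, so treat the count as a sanity check only.

```shell
# Toy fam file (made-up): FID IID PID MID SEX PHENO
workdir=$(mktemp -d)
cat > "$workdir/toy.fam" <<'EOF'
famA a1 0 0 1 -9
famA a2 0 0 2 -9
famA a3 a1 a2 1 -9
famB b1 0 0 1 -9
famB b2 b1 0 2 -9
EOF

# First pass: record every IID. Second pass: count samples whose PID and MID
# both match an IID in the file (matching on IID only, ignoring FID).
n_trios=$(awk 'NR == FNR { seen[$2] = 1; next }
               ($3 in seen) && ($4 in seen) { n++ }
               END { print n + 0 }' "$workdir/toy.fam" "$workdir/toy.fam")
echo "Samples with both parents present: $n_trios"
```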

To perform weighted linear regression, use --sample-weights <sampleweightfile>. The file <sampleweightfile> should have three columns, where each row provides two sample IDs followed by a positive float. Note that by default, LDAK will use the sandwich estimator of the effect size variance (see this page for an explanation); to instead revert to the standard estimator of variance, add --sandwich NO.
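As an illustration of the three-column layout described above, the sketch below writes a small weights file and checks that every row has two IDs followed by a positive number. The file name toy.weights, its values and the output stem weighted are made up for this example.

```shell
# Toy sample-weights file (made-up values): FID IID weight
workdir=$(mktemp -d)
cat > "$workdir/toy.weights" <<'EOF'
fam1 ind1 1.0
fam2 ind2 0.5
fam3 ind3 2.25
EOF

# Count rows that break the format: not exactly three columns, or a third
# column that is not a positive number
n_bad=$(awk 'NF != 3 || $3 + 0 <= 0 { bad++ } END { print bad + 0 }' "$workdir/toy.weights")
echo "Malformed rows: $n_bad"

# If the file is clean, it could be passed to LDAK like this:
#   ./ldak.out --linear weighted --bfile human --pheno quant.pheno --sample-weights toy.weights
```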

If you add --permute YES, the phenotypic values will be shuffled. This is useful for permutation analysis, which shows the distribution of p-values or test statistics when there is no true signal.

When performing a standard analysis, LDAK produces five output files: <outfile>.assoc contains the main results; <outfile>.summaries contains summary statistics (in the format required for use with SumHer and MegaPRS); <outfile>.pvalues contains p-values (useful if you wish to Thin Predictors); <outfile>.coeff contains estimates of the fixed effects; <outfile>.score contains simple prediction models corresponding to six different p-value thresholds.
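One common follow-up is to pull the most significant predictor out of <outfile>.pvalues, for example to build a list for --top-preds. The sketch below is hedged: it assumes the file has a single header line followed by two columns (predictor name, then p-value), which you should confirm against your own output, and it runs on a made-up stand-in file rather than real LDAK output.

```shell
# Mock stand-in for <outfile>.pvalues (made-up values; the layout is an assumption)
workdir=$(mktemp -d)
cat > "$workdir/mock.pvalues" <<'EOF'
Predictor P
snp1 0.20
snp2 3e-08
snp3 0.004
EOF

# Skip the header and keep the predictor with the smallest p-value
top_pred=$(awk 'NR > 1 && (best == "" || $2 + 0 < best) { best = $2 + 0; top = $1 }
                END { print top }' "$workdir/mock.pvalues")
echo "Most significant predictor: $top_pred"

# Save it as a one-line list, in the form expected by --top-preds
echo "$top_pred" > "$workdir/top.txt"
```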

When performing a within-family analysis, LDAK produces <outfile>.basic and <outfile>.families, which contain the results from the basic and within-family analyses, respectively.

When performing a trio analysis, LDAK produces <outfile>.basic and <outfile>.trios, which contain the results from the basic and trio analyses, respectively.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Logistic regression:

The main argument is --logistic <outfile>.

This requires the options

--bfile/--gen/--sp/--speed <datastem> - to specify the genetic data files (see File Formats).

--pheno <phenofile> - to specify phenotypes (in PLINK format). The phenotype must be binary (either cases 1 and controls 0, or cases 2, controls 1). Samples without a phenotype will be excluded. If <phenofile> contains more than one phenotype, specify which should be used with --mpheno <integer>.
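To check which of the two accepted codings a phenotype file uses, you can list the distinct values in the phenotype column. This is a minimal sketch on a made-up file toy_binary.pheno; it assumes a single phenotype in the third column, no header line and no missing entries.

```shell
# Toy binary phenotype file (made-up values): FID IID phenotype
workdir=$(mktemp -d)
cat > "$workdir/toy_binary.pheno" <<'EOF'
fam1 ind1 1
fam2 ind2 0
fam3 ind3 1
fam4 ind4 0
EOF

# Collect the distinct values of the third column as a space-separated string
coding=$(awk '{ print $3 }' "$workdir/toy_binary.pheno" | sort -u | tr '\n' ' ')

case "$coding" in
  "0 1 ") echo "Coded as controls 0, cases 1" ;;
  "1 2 ") echo "Coded as controls 1, cases 2" ;;
  *)      echo "Unexpected phenotype values: $coding" ;;
esac
```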

You can use --covar <covarfile> to provide covariates (in PLINK format) as fixed effects in the regression. If the data contain predictors that are definitely associated with the phenotype, you can include these as fixed effects using --top-preds <toppredslist>. This is useful when performing a conditional analysis (i.e., to find secondary associations).

By default, LDAK recomputes p-values for the most associated predictors using a SaddlePoint Approximation; you can stop this by adding --spa-test NO.

If you add --permute YES, the phenotypic values will be shuffled. This is useful for permutation analysis, which shows the distribution of p-values or test statistics when there is no true signal.

LDAK produces five output files: <outfile>.assoc contains the main results; <outfile>.summaries contains summary statistics (in the format required for use with SumHer and MegaPRS); <outfile>.pvalues contains p-values (useful if you wish to Thin Predictors); <outfile>.coeff contains estimates of the fixed effects; <outfile>.score contains simple prediction models corresponding to six different p-value thresholds.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Example:

Here we use the binary PLINK files human.bed, human.bim and human.fam, the phenotypes quant.pheno and binary.pheno, and the covariates covar.covar from the Test Datasets.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

1  - Classical linear regression.

We regress quant.pheno on each SNP in the genetic data by running

./ldak.out --linear single --bfile human --pheno quant.pheno

The main results are saved in single.assoc. To repeat this including the covariates, we run

./ldak.out --linear single2 --bfile human --pheno quant.pheno --covar covar.covar

The main results are saved in single2.assoc. The most significant SNP is 21:15603999 (P=5e-17 without covariates, 2e-14 with covariates). We can perform a conditional analysis including this SNP as a covariate by running

echo 21:15603999 > top.txt
./ldak.out --linear single3 --bfile human --pheno quant.pheno --top-preds top.txt

The main results are saved in single3.assoc.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

2  - Mixed-model linear regression.

We first need to create a kinship matrix. We recommend doing this assuming a thinned version of the GCTA Model, which requires a list of predictors in approximate linkage equilibrium. We obtain this list, then compute the kinship matrix, by running

./ldak.out --thin le --bfile human --window-prune .05 --window-cm 1
./ldak.out --calc-kins-direct le --bfile human --power -1 --extract le.in

The list of thinned predictors is saved in le.in, and the kinship matrix is saved with stem le. We perform mixed-model linear regression by running

./ldak.out --linear single4 --bfile human --pheno quant.pheno --grm le

The main results are saved in single4.assoc.

To perform LOCO (leave-one-chromosome-out) analysis, we must first create, for each chromosome, a kinship matrix constructed using only predictors on the other chromosomes (a complementary kinship matrix). We can do this using the following script

#First we create per-chromosome kinship matrices
for j in {21..22}; do
./ldak.out --calc-kins-direct le$j --bfile human --power -1 --extract le.in --chr $j
done

#Then we join these to obtain a genome-wide kinship matrix
#(rm -f avoids an error if list.All does not yet exist)
rm -f list.All
for j in {21..22}; do echo "le$j" >> list.All; done
./ldak.out --add-grm leAll --mgrm list.All

#Finally, we subtract the per-chromosome kinship matrices from the genome-wide matrix
for j in {21..22}; do
printf '%s\n' leAll "le$j" > list.$j
./ldak.out --sub-grm leN$j --mgrm list.$j
done

The complementary kinship matrices are saved with stems leN21 and leN22. Note that in this script, we loop from 21 to 22, because our example dataset contains only these two chromosomes; usually you would loop from 1 to 22. Now we can perform the mixed-model linear regression for each chromosome in turn by running

for j in {21..22}
do ./ldak.out --linear loco$j --bfile human --pheno quant.pheno --grm leN$j --chr $j
done

The main results are saved in loco21.assoc and loco22.assoc.
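If you would like the per-chromosome results in a single file, one option is to concatenate them while keeping only the first header. The sketch below assumes that each .assoc file begins with exactly one header line; it runs on made-up stand-in files with placeholder contents, and the combined file name loco_all.assoc is hypothetical.

```shell
# Mock stand-ins for the per-chromosome results (placeholder contents, not real LDAK output)
workdir=$(mktemp -d)
printf 'header_line\nchr21_row1\nchr21_row2\n' > "$workdir/loco21.assoc"
printf 'header_line\nchr22_row1\n' > "$workdir/loco22.assoc"

# Keep the header of the first file only, then all data rows from every file
awk 'FNR == 1 && NR != 1 { next } { print }' \
  "$workdir/loco21.assoc" "$workdir/loco22.assoc" > "$workdir/loco_all.assoc"

n_lines=$(wc -l < "$workdir/loco_all.assoc" | tr -d ' ')
echo "Combined file has $n_lines lines (one header plus the data rows)"
```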
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

3  - Classical logistic regression.

We regress binary.pheno on each SNP in the genetic data by running

./ldak.out --logistic single5 --bfile human --pheno binary.pheno

The main results are saved in single5.assoc.