Principal Component Analysis

Here we explain how to compute axes, eigenvalues and predictor loadings from principal component analysis (PCA). We most commonly use PCA as part of Quality Control, either to detect outliers within the data, or to construct population covariates.

Always read the screen output, which suggests arguments and estimates memory usage.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Perform principal component analysis:

The main argument is --pca <outfile>.

This requires the options

--grm <kinfile> - to provide a kinship matrix.

--axes <integer> - to specify the number of axes (20 is usually sufficient).

This produces <outfile>.vectors and <outfile>.values, which contain the leading axes and the corresponding eigenvalues. The file <outfile>.vectors can be used as covariates in subsequent analyses (e.g., when performing REML, Haseman-Elston or PCGC Regression), by adding the option --covar <outfile>.vectors. However, in most cases you will first want to modify this file (e.g., to extract only a subset of axes, or to add other covariates, such as age and sex).
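As a rough sketch of the modification step, the shell function below keeps the two sample ID columns plus the first few axes. It assumes that <outfile>.vectors is whitespace-delimited with no header row, and that it has two ID columns followed by one column per axis; check the layout of your own file before relying on this. The function name keep_axes and the output file name are illustrative only and are not part of LDAK.

```shell
# keep_axes FILE N: print the two ID columns and the first N axes of FILE.
# Assumption: columns 1-2 are sample IDs and columns 3 onwards are axes.
keep_axes() {
  awk -v n="$2" '{
    line = $1 " " $2
    # Append axis columns 3 .. n+2, stopping early if the row is shorter
    for (i = 3; i <= n + 2 && i <= NF; i++) line = line " " $i
    print line
  }' "$1"
}

# Example usage (hypothetical file names):
# keep_axes LDAK-Thin.vectors 5 > top5.covar
```

The reduced file could then be supplied through --covar in place of the full <outfile>.vectors.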
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Calculate predictor loadings:

The main argument is --calc-pca-loads <outfile>.

This requires the options

--pcastem <pcastem> - to provide the stem of results from performing PCA (i.e., LDAK will expect to find the files <pcastem>.vectors and <pcastem>.values).

--grm <kinfile> - to provide the kinship matrix used when performing PCA.

--bfile/--gen/--sp/--speed <datastem> - to specify the genetic data files used to calculate the kinship matrix.

This produces the files <outfile>.load and <outfile>.proj, which contain the predictor loadings and the projections of the genetic data onto these loadings. To project a new dataset onto these axes, supply <outfile>.load as the score file when Calculating Scores.
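To see why such loadings exist, consider a simplified, unweighted setting. This is an assumption made for illustration; the kinship matrix you supply may weight or scale predictors differently. The symbols below are introduced here and do not appear in the LDAK output: X is the n x m matrix of standardized predictor values, K is the kinship matrix, and v_j is an axis (eigenvector of K) with eigenvalue \lambda_j > 0.

```latex
K = \frac{1}{m} X X^{\top}, \qquad K v_j = \lambda_j v_j
\quad\Longrightarrow\quad
v_j = \frac{1}{\lambda_j} K v_j = X w_j,
\qquad w_j = \frac{1}{m \lambda_j} X^{\top} v_j .
```

Under these assumptions, the m entries of w_j act as the predictor loadings for axis j, and multiplying a matrix of predictor values by w_j gives the projection onto that axis.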
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Example:

Here we use the binary PLINK files human.bed, human.bim and human.fam from the Test Datasets and the kinship matrix with stem LDAK-Thin created in the example for Calculate Kinships.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

To compute the top 20 principal component axes, we run

./ldak.out --pca LDAK-Thin --grm LDAK-Thin --axes 20

The top 20 axes are saved in LDAK-Thin.vectors, with the corresponding eigenvalues in LDAK-Thin.values. Each axis can be written as a linear combination of predictor values. We can obtain the predictor loadings by running

./ldak.out --calc-pca-loads LDAK-Thin --pcastem LDAK-Thin --grm LDAK-Thin --bfile human

Here we provided the genetic data with stem human, as these were used to make the kinship matrix. The loadings are saved in LDAK-Thin.load, and the projections of the genetic data onto these loadings in LDAK-Thin.proj.
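When PCA is used for outlier detection, one simple approach is to flag samples that lie far from the mean on a given axis. The sketch below is one possible way to do this and is not an LDAK feature. It assumes that LDAK-Thin.vectors has two ID columns followed by one column per axis, with no header row. The function name flag_outliers is hypothetical, and the threshold (in standard deviations) is a judgment call.

```shell
# flag_outliers FILE AXIS K: print the IDs of samples whose value on axis AXIS
# lies more than K standard deviations from the mean of that axis.
# Assumption: columns 1-2 are sample IDs and axis j is stored in column j + 2.
flag_outliers() {
  awk -v col="$(( $2 + 2 ))" -v k="$3" '
    { id[NR] = $1 " " $2; x[NR] = $col; sum += $col; sumsq += $col * $col }
    END {
      if (NR < 2) exit
      mean = sum / NR
      var = sumsq / NR - mean * mean        # population variance
      sd = sqrt(var > 0 ? var : 0)          # guard against rounding below zero
      for (i = 1; i <= NR; i++) {
        d = x[i] - mean
        if (d < 0) d = -d
        if (sd > 0 && d > k * sd) print id[i]
      }
    }' "$1"
}

# Example usage (threshold chosen for illustration):
# flag_outliers LDAK-Thin.vectors 1 6
```

Repeating the call for each axis of interest gives a list of candidate samples to inspect before deciding whether to exclude them.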