Principal Component Analysis

The page explains how to compute principal component axes, and the corresponding predictor weightings. Often principal component analysis (PCA) is used to investigate population structure in the dataset; at the bottom are instructions on how to perform PCA on a reference dataset (e.g., HapMap or 1000 Genomes), then project your dataset onto these.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Having calculated a kinship matrix, you can compute the leading eigenvectors; these will be the top principal component axes of the predictors across which kinships were calculated.

–pca <output>

which uses the following options (to restrict to a subset of data, see Data Filtering):

–grm <grmstem> – required to provide the kinship matrix to decompose

–axes <integer> (default 20) -how many eigenvectors / principal components are desired.

This produces <output>.vectors and <output>.values, which contain the leading eigenvectors and the corresponding eigenvectors. You can subsequently use –covar <output>.vectors to supply these axes as covariates for REML analyses.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Each principal component axis is a linear combination of predictors. To compute the predictor loadings, use

–calc-pca-loads <output>

which requires the following options:

–grm <grmstem> – to specify the kinship matrix used when performing PCA.

–bfile/–chiamo/–sp/–speed <prefix> – to specify the data files used for calculating the kinshp matrix.

This produces the files <output>.load and <output>.proj which contain the predictor weights and the projections of the data file onto these weights. To project a new dataset onto these weights, supply <output>.load as the score file when profile scoring.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

These arguments can be used to create population informative SNP loadings. Here is a suggested protocol, which supposes you have downloaded the ped and map files from this page, then used PLINK to save these as hapmap.bed, hapmap.bim and hapmap.fam.

awk < data.bim ‘{print $2~’ > all.snps

plink –bfile hapmap –extract all.snps –indep-pairwise 200 10 .05 –out prune

./ldak.out –calc-kinships-direct hapmap –bfile hapmap –extract prune.prune.in –ignore-weights YES

./ldak.out –calc-pca-loads hapmap –grm hapmap –bfile hapmap

./ldak.out –calc-scores data –profile hapmap.load

These scripts prune the HapMap dataset (considering only SNPs present in your data), perform PCA, then project your data onto the corresponding loadings. You can then assess the ethnicity and homogeneity of your data by plotting Columns 3 & 5 of data.profile on top of Columns 3 & 4 of hapmap.proj (or hapmap.vectors).