PCGC Regression

Thanks to Omer Weissbrod for assistance with this page.

PCGC (phenotype-correlation genotype-correlation) Regression is an alternative to REML when estimating heritability for binary traits (i.e., diseases). It was invented by Golan, Lander and Rosset, then subsequently developed further by Weissbrod, Flint and Rossett. Its main advantage is that it allows for the fact that ascertainment (i.e., over-sampling of cases) results in a bimodal liability distribution and generates correlations between the phenotype and covariates.

In addition to the features described in the original paper, the implementation in LDAK is able to incorporate regions and to estimate heritability using only individuals in different cohorts, which should protect against inflation due to genotyping errors (e.g., when including poorly-genotyped SNPs). To perform PCGC Regression, you must first compute one or more kinship matrices (unless only using regions). When computing these, consider using the option --predictor-means to centre and scale predictors based on external estimates of predictor means. Also, if you plan to include covariates, you should first regress the kinship matrices on the covariates.

The argument for PCGC Regression is

--pcgc <output>

There are two required options, and many optional ones.

--pheno <phenofile> - specifies the response (in PLINK format). Individuals without a phenotype will be excluded. If <phenofile> contains more than one phenotype, specify which should be used with --mpheno.

--prevalence <float> - to specify the population prevalence.

--grm <grmstem> or --mgrm <grmlist> - provide one or more kinship matrices. Note that if using covariates, these should be adjusted kinship matrices.

--region-number and --region-prefix - provide one or more regions, in which case you must also specify the datafiles with --bfile/--gen/--sp/--speed <prefix>, use --weights to specify the predictor weightings (or --ignore-weights YES to set them to 1)  and --power to indicate how to scale predictors (we advise using -0.25). By default, LDAK will remove a regional predictor if (effectively) identical to one which remains (correlation squared > 0.98); to change this threshold use --region-prune.

--covar <covarfile> - provide covariates (in PLINK format) as fixed effects in the regression; when calculating heritabilties, the variance explained by these will be discounted.

--top-preds <list_of_predictors> - provide a list of predictors  to include as fixed effects; when calculating heritabilities, the variance explained by these will be added to that explained by the kinship matrices and regions. Usually, these represent a pruned subset of highly-associated predictors with large effects (see the section "Accommodating loci with very large effects" in our recent paper).

--memory-save <YES/NO> (default NO) - by default, LDAK will read into memory all kinship matrices at the start. If there are many kinship matrices, this can require large amounts of memory; therefore, consider adding --memory-save YES, and LDAK will instead read kinship matrices on-the-fly each time they are required.

--permute YES - the phenotypic values will be shuffled. This is useful if wishing to perform permutation analysis to see the distribution of estimates of variance explained obtained when there is no true signal.

The main output files are <output>.pcgc, which contains liability-scale estimates of the proportion of variance explained by each kinship matrix, region and the top predictors, and <output>.share, which provides estimates of the fractions of total variance explained. Note that when covariates are included, for consistency with REML and HE Regression, LDAK discounts their contribution when estimating proportions of variance explained (i.e., the values in <output>.pcgc are estimates of conditional heritabilities). By contrast, the authors of PCGC recommend that the contribution of covariates is included in the denominator; these estimates of marginal heritablities are provided in <output>.pcgc.marginal.

If your samples come from multiple cohorts, you can use --subset-number and --subset-prefix to specify which samples are in each cohort (see Subset Options). Then LDAK will also produce the files <output>.pcgc.within, which contains estimates of heritability based only on pairs of samples in the same cohort, and <output>.pcgc.across, which contains estimates of heritability based only on pairs of samples in different cohorts.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Example: for this we use the phenotype phen_binary.pheno available in the Test Datasets (which we assume corresponds to a disease with prevalence 0.01), the kinship matrices in the folder partitions calculated in Get Kinships, and the kinship matrices adjusted for the covariate file sex.covar calculated in Adjust Kinships.

First we regress the phenotype phen_binary.pheno on the kinship matrix with stem partitions/kinships.all

../ldak.out --pcgc res9 --pheno phen_binary.pheno --grm partitions/kinships.all --prevalence 0.01

To include the covariate file sex.covar, we would add --covar and instead use the adjusted kinship matrix

../ldak.out --pcgc res9b --pheno phen_binary.pheno --grm partitions/kinships.all.sex --covar sex.covar --prevalence 0.01

Next we regress the phenotype on the three kinship matrices listed in partitions/partition.list

../ldak.out --he res10 --pheno phen.pheno --mgrm partitions/partition.list

To include the covariate file sex.covar, instead use

echo "partitions/kinships.1.sex
partitions/kinships.2.sex
partitions/kinships.3.sex" > mlist.txt
../ldak.out --he res10b --pheno phen.pheno --mgrm mlist.txt --covar sex.covar

Finally, we repeat the second example adding subsets (for this we assume that Cohort 1 contains the first 100 samples, with the remainder in Cohort 2)

awk < test.fam '(NR<=100){print $1, $2}' > cohort1
awk < test.fam '(NR>100){print $1, $2}' > cohort2
../ldak.out --pcgc res9b --pheno phen_binary.pheno --grm partitions/kinships.all.sex --covar sex.covar --prevalence 0.01 --subset-number 2 --subset-prefix cohort