Thanks to Omer Weissbrod for assistance with this feature.
PCGC (phenotype-correlation genotype-correlation) Regression is an alternative to REML when estimating heritability for binary traits (i.e., diseases). It was invented by Golan, Lander and Rosset, then subsequently developed further by Weissbrod, Flint and Rossett. PCGC has two main advantages over REML. Firstly, it estimates heritability on the liability scale (which is more appropriate for binary traits). Secondly, it corrects for the fact that ascertainment (i.e., over-sampling of cases) results in a bimodal liability distribution and generates correlations between the phenotype and covariates (which violate estimates of REML).
In addition to the features described in the original paper, the implementation in LDAK incorporates regions and allows for Sample Subsets. Note that the authors of PCGC recommend using external estimates of the predictor means when calculating the genotypic correlations; if you wish to follow this recommendation, you should add the option --predictor-means <meansfile> when Calculating Kinships (in our experience, the impact on results is negligible). Finally, if you plan to include covariates, you should first Adjust Kinships, by regressing each kinship matrix on the covariates and top predictors.
Always read the screen output, which suggests arguments and estimates memory usage.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
The main argument is --pcgc <outfile>
The two required options are
--pheno <phenofile> - specifies phenotypes (in PLINK format). Samples without a phenotype will be excluded. If <phenofile> contains more than one phenotype, specify which should be used with --mpheno <integer>.
--prevalence <float> - to specify the population prevalence.
However, in most cases, you will also use --grm <kinfile> or --mgrm <kinstems> to provide one or more kinship matrices.
To provide regions, use --region-number <integer> and --region-prefix <regprefix>, where the files <regprefix>1, <regprefix>2, ..., list the predictors in each region. You must also specify the genetic data files with --bfile/--gen/--sp/--speed <datastem>, and use --weights <weightsfile> (or --ignore-weights YES) as well as --power <float> to indicate how to scale predictors. By default, LDAK will remove a regional predictor if (effectively) identical to one which remains (correlation squared > 0.98); to change this threshold use --region-prune <float>.
If your samples come from multiple cohorts, you can use --subset-number <integer> and --subset-prefix <subprefix> to specify which samples are in each cohort (see Sample Subsets). LDAK will then perform two additional regressions, first only using pairs of samples in the same cohort, then only using pairs of samples in different cohorts.
You can use --keep <keepfile> and/or --remove <removefile> to restrict to a subset of samples (e.g., to exclude ancestral outliers or relatedness).
Use --covar <covarfile> to provide covariates (in PLINK format) as fixed effects in the regression; when calculating heritabilties, the phenotypic variance explained by these will be discounted.
To include some predictors as fixed effects, use --top-preds <toppredslist>, in which case you must also specify the genetic data files with --bfile/--gen/--sp/--speed <datastem>; when calculating heritabilities, the phenotypic variance explained by these predictors will be added to that explained by the kinship matrices and regions. Usually, <toppredslist> will contain a pruned subset of highly-associated predictors; for more details, see the section "Accommodating loci with very large effects" in our paper Reevaluation of SNP heritability in complex human traits (Nature Genetics, 2017).
By default, LDAK will read into memory all kinship matrices at the start. If there are many kinship matrices, this can require large amounts of memory; therefore, consider adding --memory-save YES, and LDAK will instead read kinship matrices on-the-fly each time they are required.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
The main output files are
<outfile>.pcpc - contains estimates of the heritability contributed by each kinship matrix, region and the top predictors (all on the liability scale). It also reports for each kinship matrix and region the (mega) intensity, which equals its heritability divided by its size (x 1,000,000); this can be useful for assessing the relative importance of the kinship matrices and regions.
<outfile>.share - contains estimates of the fraction of heritability explained by each kinship matrix and region. It also reports for each the enrichment, which equals its estimated fraction divided by its expected fraction (assuming no enrichment).
If you use subsets, then LDAK will also create the files <outfile>.pcgc.within and <outfile>.he.across, which contain estimates based only on pairs of samples in the same cohort and based only on pairs of samples in different cohorts, respectively. Further, LDAK will create the file <outfile>.pcgc.compare, which includes results from a likelihood ratio test that the two sets of estimates are consistent.
Note that when covariates are included, for consistency with REML and HE Regression, LDAK discounts their contribution when estimating proportions of variance explained (i.e., the values in <outfile>.pcgc are estimates of conditional heritabilities). By contrast, the authors of PCGC recommend that the contribution of covariates is included in the denominator; these estimates of marginal heritablities are provided in <outfile>.pcgc.marginal.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
Example:
Here we use the binary PLINK files human.bed, human.bim and human.fam, the phenotype binary.pheno, the covariates covar.covar, the lists of SNPs part1, and the lists of samples ind1 and ind2 from the Test Datasets. We also use the kinship matrix with stem LDAK-Thin created in the example for Calculate Kinships. We assume that the phenotype binary.pheno corresponds to a disease with prevalence 1%.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
To regress the phenotype on the kinship matrix, run
./ldak.out --pcgc pcgc1 --pheno binary.pheno --grm LDAK-Thin --prevalence .01
By viewing pcgc1.pcgc, we see that the estimate of the heritability contributed by the kinship matrix is 0.26 (SD 0.11).
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
We can include a region using the command
./ldak.out --pcgc pcgc2 --pheno binary.pheno --grm LDAK-Thin --prevalence .01 --region-prefix part --region-number 1 --bfile human --ignore-weights YES --power -.25
The estimated heritabilities contributed by the kinship matrix and region are 0.21 (SD 0.12) and 0.06 (SD 0.06), respectively.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
To repeat the first analysis including covariates, we must first regress the kinship matrix on the covariate file
./ldak.out --adjust-grm LDAK-Thin.covar --grm LDAK-Thin --covar covar.covar
The adjusted kinship matrix is saved with stem LDAK-Thin.covar. We can now regress the phenotype on this using the command
./ldak.out --pcgc pcgc3 --pheno binary.pheno --grm LDAK-Thin.covar --prevalence .01 --covar covar.covar --kinship-details NO
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
Finally, to repeat the first analysis including subsets, we can use
./ldak.out --pcgc pcgc4 --pheno binary.pheno --grm LDAK-Thin --prevalence .01 --subset-prefix ind --subset-number 2
The file pcgc4.pcgc matches pcgc1.pcgc, and shows that the estimate of the heritability contributed by the kinship matrix (calculated using all samples) is 0.26 (SD 0.11). The files pcgc4.pcgc.within and pcgc4.pcgc.across show that the estimated heritability is instead 0.34 (SD 0.13) and 0.18 (SD 0.15) if calculated using only samples in the same or in different cohorts. pcgc4.pcgc.compare shows that the difference between these two estimates is not significant (P=0.44).