HE Regression

HE (Haseman-Elston) Regression is a method for estimating heritability, which for large datasets (>10,000 samples) is substantially faster than REML (although the estimates will be less precise). It is also able to estimate heritability using only selected pairs of samples, which provides a way to test for and protect against inflation due to genotyping errors (e.g., if the analysis includes poorly-genotyped SNPs).
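To make the idea concrete, here is a minimal sketch of Haseman-Elston regression with a single kinship matrix. It is an illustration, not LDAK's implementation: the function name he_regression, the optional pair_mask argument and the choice of a regression through the origin are all assumptions made for this example. The phenotype is standardized, and the cross-products of pairs of samples are regressed on the corresponding off-diagonal kinships.

```python
import numpy as np

def he_regression(y, K, pair_mask=None):
    """Illustrative HE estimate from one kinship matrix (hypothetical helper).

    Regresses phenotype cross-products on kinships, with no intercept.
    pair_mask, if given, is a boolean matrix selecting which pairs to use.
    """
    y = np.asarray(y, dtype=float)
    K = np.asarray(K, dtype=float)
    # Standardize the phenotype so that y_i * y_j has expectation h2 * K_ij.
    z = (y - y.mean()) / y.std()
    # Use each distinct pair of samples once (i < j); the diagonal is excluded.
    i, j = np.triu_indices(len(z), k=1)
    if pair_mask is not None:
        keep = np.asarray(pair_mask, dtype=bool)[i, j]
        i, j = i[keep], j[keep]
    products = z[i] * z[j]
    kinships = K[i, j]
    # Least-squares slope through the origin.
    return float(kinships @ products / (kinships @ kinships))

# Toy data in which every cross-product equals exactly 0.5 times the kinship.
y = np.array([1.0, -1.0, 1.0, -1.0])
K = np.outer(y, y) / 0.5
estimate = he_regression(y, K)
print(estimate)
```

Because the regression uses one observation per pair of samples, restricting it to selected pairs only requires dropping the other pairs, which is what pair_mask does in this sketch.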

Note that if covariates and/or top predictors are provided, LDAK will first regress the phenotype on these, then perform HE Regression using the residuals (this contrasts with other analyses, where LDAK includes the covariates and top predictors within the main analysis). For this analysis to be valid, you should first Adjust Kinships, by regressing each kinship matrix on the covariates and top predictors.
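The two regression steps described above can be sketched as follows. This is a sketch under stated assumptions: the helper residual_projector is hypothetical, an intercept is added to the covariates, and the kinship matrix is adjusted by applying the residual projection on both sides, which is one natural reading of "regressing each kinship matrix on the covariates" but is not confirmed here as LDAK's exact procedure. The data are simulated purely for illustration.

```python
import numpy as np

def residual_projector(covariates):
    """Return the matrix that maps a vector to its residuals after
    least-squares regression on the covariates (plus an intercept)."""
    X = np.asarray(covariates, dtype=float)
    X = np.column_stack([np.ones(len(X)), X])
    # X @ pinv(X) is the hat matrix; subtracting it from I gives residuals.
    return np.eye(len(X)) - X @ np.linalg.pinv(X)

rng = np.random.default_rng(0)
covar = rng.normal(size=(6, 2))      # simulated covariates
pheno = rng.normal(size=6)           # simulated phenotype
G = rng.normal(size=(6, 10))         # simulated standardized genotypes
kin = G @ G.T / 10                   # simulated kinship matrix

P = residual_projector(covar)
pheno_resid = P @ pheno              # phenotype with covariates regressed out
kin_adj = P @ kin @ P                # assumed form of the adjusted kinships
```

After this step, neither the residual phenotype nor the adjusted kinship matrix retains any component explained by the covariates, so the two sides of the HE Regression are treated consistently.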

Always read the screen output, which suggests arguments and estimates memory usage.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

The main argument is --he <outfile>.

The only required option is

--pheno <phenofile> - to specify the phenotypes (in PLINK format). Samples without a phenotype will be excluded. If <phenofile> contains more than one phenotype, specify which should be used with --mpheno <integer>.

However, in most cases, you will also use --grm <kinfile> or --mgrm <kinstems> to provide one or more kinship matrices.

To provide regions, use --region-number <integer> and --region-prefix <regprefix>, where the files <regprefix>1, <regprefix>2, ..., list the predictors in each region. You must also specify the genetic data files with --bfile/--gen/--sp/--speed <datastem>, and use --weights <weightsfile> (or --ignore-weights YES) as well as --power <float> to indicate how to scale predictors. By default, LDAK will remove a regional predictor if it is (effectively) identical to one that remains (correlation squared > 0.98); to change this threshold use --region-prune <float>.
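The pruning rule can be illustrated with a small sketch. The function prune_region and the greedy left-to-right order are assumptions for this example; the only detail taken from the text above is the default threshold of 0.98 on the squared correlation.

```python
import numpy as np

def prune_region(genotypes, threshold=0.98):
    """Return the column indices kept after greedy pruning (hypothetical helper).

    A predictor is dropped if its squared correlation with any predictor
    already kept exceeds the threshold. Constant columns are not handled.
    """
    X = np.asarray(genotypes, dtype=float)
    r2 = np.corrcoef(X, rowvar=False) ** 2
    kept = []
    for col in range(X.shape[1]):
        if all(r2[col, k] <= threshold for k in kept):
            kept.append(col)
    return kept

# Five samples, three predictors; the second column duplicates the first.
genotypes = np.array([
    [0, 0, 2],
    [1, 1, 0],
    [2, 2, 1],
    [0, 0, 1],
    [1, 1, 2],
])
kept = prune_region(genotypes)
print(kept)
```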

If your samples come from multiple cohorts, you can use --subset-number <integer> and --subset-prefix <subprefix> to specify which samples are in each cohort (see Sample Subsets). LDAK will then perform two additional regressions, first using only pairs of samples in the same cohort, then using only pairs of samples in different cohorts.
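As an illustration of how pairs of samples divide into these two groups, the sketch below builds boolean masks for within-cohort and across-cohort pairs. The function cohort_pair_masks and the cohort labels are hypothetical; the sketch only shows the bookkeeping, not how LDAK stores subsets.

```python
import numpy as np

def cohort_pair_masks(cohort):
    """Return (within, across) boolean matrices marking pairs i < j
    whose samples share a cohort or come from different cohorts."""
    cohort = np.asarray(cohort)
    same = cohort[:, None] == cohort[None, :]
    # Count each pair once and skip the diagonal.
    upper = np.triu(np.ones(same.shape, dtype=bool), k=1)
    return same & upper, ~same & upper

cohort = np.array([1, 1, 2, 2])      # two samples in each of two cohorts
within, across = cohort_pair_masks(cohort)
print(int(within.sum()), int(across.sum()))
```

With four samples there are six pairs in total; here two fall within a cohort and four span the two cohorts, so every pair is used in exactly one of the two additional regressions.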

Use --covar <covarfile> to provide covariates (in PLINK format) as fixed effects in the regression; when calculating heritabilities, the phenotypic variance explained by these will be discounted.

You can use --keep <keepfile> and/or --remove <removefile> to restrict to a subset of samples (e.g., to exclude ancestral outliers or related samples).

To include some predictors as fixed effects, use --top-preds <toppredsfile>, in which case you must also specify the genetic data files with --bfile/--gen/--sp/--speed <datastem>; when calculating heritabilities, the phenotypic variance explained by these predictors will be added to that explained by the kinship matrices and regions. Usually, <toppredsfile> will contain a pruned subset of highly-associated predictors; for more details, see the section "Accommodating loci with very large effects" in our paper Reevaluation of SNP heritability in complex human traits (Nature Genetics, 2017).

If the phenotype is binary, you can use --prevalence <float> to specify the population prevalence. LDAK will then additionally report heritability estimates on the liability scale (note that for binary traits, it is often preferable to use PCGC Regression).
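For orientation, the sketch below applies the widely used threshold-model transformation from the observed scale to the liability scale. Whether LDAK applies exactly this formula is an assumption; the function observed_to_liability and its case_fraction argument (the proportion of cases in the sample, defaulting to the prevalence) are hypothetical names for this example.

```python
from statistics import NormalDist

def observed_to_liability(h2_obs, prevalence, case_fraction=None):
    """Convert an observed-scale heritability to the liability scale.

    Uses h2_liab = h2_obs * K^2 (1-K)^2 / (z^2 P (1-P)), where K is the
    population prevalence, P the fraction of cases in the sample, and z the
    standard normal density at the liability threshold.
    """
    K = prevalence
    P = K if case_fraction is None else case_fraction
    std_normal = NormalDist()
    threshold = std_normal.inv_cdf(1.0 - K)
    z = std_normal.pdf(threshold)
    return h2_obs * (K * (1 - K)) ** 2 / (z ** 2 * P * (1 - P))

h2_liab = observed_to_liability(0.2, prevalence=0.5)
print(round(h2_liab, 4))
```

When the sample is not ascertained (case_fraction equal to the prevalence), the factor reduces to K(1-K)/z^2, so the liability-scale estimate is always larger than the observed-scale one.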

By default, LDAK will read into memory all kinship matrices at the start. If there are many kinship matrices, this can require large amounts of memory; therefore, consider adding --memory-save YES, and LDAK will instead read kinship matrices on-the-fly each time they are required.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

The main output files are

<outfile>.he - contains estimates of the heritability contributed by each kinship matrix, region and the top predictors. It also reports for each kinship matrix and region the (mega) intensity, which equals its heritability divided by its size (x 1,000,000); this can be useful for assessing the relative importance of the kinship matrices and regions.
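The intensity calculation is simple arithmetic, sketched below. The heritability is taken from the worked example later on this page, while the size of 200,000 is a made-up value used only for illustration.

```python
def mega_intensity(heritability, size):
    """Heritability divided by size, multiplied by 1,000,000."""
    return heritability / size * 1_000_000

# Hypothetical kinship matrix of size 200,000 contributing heritability 0.38.
intensity = mega_intensity(0.38, size=200_000)
print(round(intensity, 2))
```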

<outfile>.share - contains estimates of the fraction of heritability explained by each kinship matrix and region. It also reports for each the enrichment, which equals its estimated fraction divided by its expected fraction (assuming no enrichment).
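The share and enrichment can be sketched in the same way. This example assumes that the expected fraction of a kinship matrix or region is its size divided by the total size, which is one reading of "assuming no enrichment" and is not confirmed here; the sizes are hypothetical, and the function shares_and_enrichments is an illustrative helper.

```python
def shares_and_enrichments(heritabilities, sizes):
    """Return a list of (share, enrichment) tuples, one per component.

    share = component heritability / total heritability
    enrichment = share / (component size / total size)   [assumed definition]
    """
    total_h2 = sum(heritabilities)
    total_size = sum(sizes)
    results = []
    for h2, size in zip(heritabilities, sizes):
        share = h2 / total_h2
        expected = size / total_size
        results.append((share, share / expected))
    return results

# Heritabilities 0.38 (kinship matrix) and 0.21 (region); sizes are made up.
results = shares_and_enrichments([0.38, 0.21], sizes=[90_000, 10_000])
for share, enrichment in results:
    print(round(share, 3), round(enrichment, 3))
```

Under these assumed sizes, the region accounts for about 36% of the heritability from 10% of the total size, so its enrichment is above one, while that of the kinship matrix is below one.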

If you use subsets, then LDAK will also create the files <outfile>.he.within and <outfile>.he.across, which contain estimates based only on pairs of samples in the same cohort and based only on pairs of samples in different cohorts, respectively. Further, LDAK will create the file <outfile>.he.compare, which includes results from a likelihood ratio test that the two sets of estimates are consistent.

If the phenotype is binary and its prevalence is specified, there will be additional output files with the suffix .liab, which provide estimates converted from the observed to the liability scale.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Example:

Here we use the binary PLINK files human.bed, human.bim and human.fam, the phenotype quant.pheno, the covariates covar.covar, the list of SNPs part1, and the lists of samples ind1 and ind2 from the Test Datasets. We also use the kinship matrix with stem LDAK-Thin created in the example for Calculate Kinships.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

To regress the phenotype on the kinship matrix, run

./ldak.out --he he1 --pheno quant.pheno --grm LDAK-Thin

By viewing he1.he, we see that the estimate of the heritability contributed by the kinship matrix is 0.58 (SD 0.16).
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

We can include a region using the command

./ldak.out --he he2 --pheno quant.pheno --grm LDAK-Thin --region-prefix part --region-number 1 --bfile human --ignore-weights YES --power -.25

The estimated heritabilities contributed by the kinship matrix and region are 0.38 (SD 0.16) and 0.21 (SD 0.11), respectively.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

To repeat the first analysis including covariates, we must first regress the kinship matrix on the covariate file

./ldak.out --adjust-grm LDAK-Thin.covar --grm LDAK-Thin --covar covar.covar

The adjusted kinship matrix is saved with stem LDAK-Thin.covar. We can now regress the phenotype on this using the command

./ldak.out --he he3 --pheno quant.pheno --grm LDAK-Thin.covar --covar covar.covar --kinship-details NO
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Finally, to repeat the first analysis including subsets, we can use

./ldak.out --he he4 --pheno quant.pheno --grm LDAK-Thin --subset-prefix ind --subset-number 2

The file he4.he matches he1.he, and shows that the estimate of the heritability contributed by the kinship matrix (calculated using all pairs of samples) is 0.58 (SD 0.16). The files he4.he.within and he4.he.across show that the estimated heritability is instead 0.60 (SD 0.18) if calculated using only pairs of samples in the same cohort, and 0.55 (SD 0.20) if calculated using only pairs of samples in different cohorts. The file he4.he.compare shows that the difference between these two estimates is not significant (P=0.83).