Small Datasets

Heritability analysis generally requires that the individuals are "unrelated" (in practice, this means at most distantly related, with no pair closer than, say, second cousins). This is to ensure that the heritability estimates reflect only the causal variation the predictors in the dataset tag directly (i.e., through local linkage disequilibrium) and not also causal variation the predictors tag indirectly (i.e., because of long-range linkage disequilibrium caused by familial relatedness).
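
One way to enforce the "unrelated" requirement is to prune individuals until no remaining pair has an estimated kinship above a cutoff. The Python sketch below illustrates a simple greedy version of this idea. It is not LDAK's own filtering routine; the function name filter_related, the toy kinship values and the cutoff of 0.05 are all hypothetical choices made for illustration.

```python
def filter_related(kinship, ids, max_rel):
    """Greedily drop individuals until no kept pair has kinship > max_rel.

    kinship is a symmetric list-of-lists of pairwise kinship estimates,
    ids gives the individual labels in the same order.
    """
    keep = set(range(len(ids)))
    while True:
        # For each kept individual, count kept relatives above the cutoff.
        counts = {
            i: sum(1 for j in keep if j != i and kinship[i][j] > max_rel)
            for i in keep
        }
        worst = max(counts, key=counts.get, default=None)
        if worst is None or counts[worst] == 0:
            break  # no related pairs remain
        keep.remove(worst)  # drop the individual with the most relatives
    return [ids[i] for i in sorted(keep)]


# Toy example (hypothetical values): A is closely related to B and C.
ids = ["A", "B", "C", "D"]
kinship = [
    [1.00, 0.50, 0.12, 0.00],
    [0.50, 1.00, 0.01, 0.00],
    [0.12, 0.01, 1.00, 0.02],
    [0.00, 0.00, 0.02, 1.00],
]
unrelated = filter_related(kinship, ids, max_rel=0.05)
print(unrelated)
```

Dropping the individual with the most relatives first tends to retain more samples than removing one member of each related pair at random, although a greedy pass like this is not guaranteed to keep the largest possible unrelated set.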

Furthermore, heritability analyses generally require a large sample size. For example, reliably estimating SNP heritability (standard deviation less than 5%) using a single kinship matrix typically requires at least 7,000 unrelated individuals; if you wish to use multiple kinship matrices (i.e., perform Genomic Partitioning), the required number of samples is even higher.
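
To get a feel for how precision scales with sample size, the Python sketch below uses a widely quoted rule of thumb for unrelated individuals and common SNPs, namely that the standard deviation of a SNP heritability estimate is roughly 316 divided by the number of samples. This rule and its constant are an assumption brought in for illustration and are not taken from the text above. The true precision depends on the SNPs used and the analysis performed, so the numbers should be read as rough guides only.

```python
import math

# Rule-of-thumb constant (an assumption, not part of the text above).
# It corresponds to a variance of about 2e-5 for off-diagonal kinships.
SE_CONSTANT = 316.0


def approx_sd(n_samples):
    """Approximate SD of a SNP heritability estimate for n_samples."""
    return SE_CONSTANT / n_samples


def min_samples(target_sd):
    """Approximate number of samples needed to reach target_sd."""
    return math.ceil(SE_CONSTANT / target_sd)


for n in (2000, 5000, 7000, 10000):
    print(f"n = {n:>6}: approximate SD = {approx_sd(n):.3f}")
```

Under this approximation, 7,000 samples give a standard deviation just below 0.05, whereas 5,000 samples give one above 0.06, which is broadly in line with the sample sizes quoted above.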

We recommend that you do not try to estimate SNP heritability (or perform Genomic Partitioning) if, after careful Quality Control, you have fewer than 5,000 samples (see the exception below). If your sample size only just exceeds this limit (say, you have 5,000-7,000 samples), you should consider increasing the minor allele frequency threshold (e.g., restrict to SNPs with MAF>0.05, instead of those with MAF>0.01), as this will slightly increase the precision of estimates.
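
The effect of raising the MAF threshold can be illustrated with a small Python sketch. In practice you would apply the filter using your genetics software rather than code like this. The function names, the SNP names and the toy genotypes below are hypothetical, and genotypes are assumed to be coded as 0, 1 or 2 copies of one allele, with None for missing values.

```python
def minor_allele_frequency(genotypes):
    """MAF for one SNP from 0/1/2 allele counts (None means missing)."""
    observed = [g for g in genotypes if g is not None]
    if not observed:
        return 0.0
    freq = sum(observed) / (2 * len(observed))  # frequency of counted allele
    return min(freq, 1 - freq)


def filter_by_maf(snps, threshold):
    """Keep only SNPs whose MAF is strictly above threshold."""
    return {
        name: genotypes
        for name, genotypes in snps.items()
        if minor_allele_frequency(genotypes) > threshold
    }


# Toy data for 100 individuals (hypothetical SNP names and genotypes).
snps = {
    "snp_very_rare": [1] * 1 + [0] * 99,   # MAF 0.005
    "snp_rare": [1] * 4 + [0] * 96,        # MAF 0.02
    "snp_common": [1] * 30 + [0] * 70,     # MAF 0.15
}

kept_at_001 = sorted(filter_by_maf(snps, 0.01))
kept_at_005 = sorted(filter_by_maf(snps, 0.05))
print(kept_at_001)
print(kept_at_005)
```

Moving from MAF>0.01 to MAF>0.05 removes the rarer SNPs, whose genotypes vary little across a small sample.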

Note that you can consider relaxing the above sample size requirement if you have multiple phenotypes and are interested in overall patterns rather than results for any single phenotype. This is because, even though individual heritability estimates will be imprecise, the average across many phenotypes can still be informative. For example, this applies to transcriptome data, where expression levels for thousands of genes are recorded for each individual.
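
The reasoning behind averaging can be sketched in Python. If the estimation errors were independent across phenotypes, the standard deviation of the average would shrink with the square root of the number of phenotypes. The independence assumption, the per-phenotype standard deviation of 0.3 and the shared true heritability of 0.2 are all hypothetical simplifications. Real phenotypes such as gene expression levels are correlated and measured on the same individuals, so the gain in practice will be smaller than shown here.

```python
import math
import random
import statistics


def sd_of_average(per_phenotype_sd, n_phenotypes):
    """SD of the mean estimate, assuming independent estimation errors."""
    return per_phenotype_sd / math.sqrt(n_phenotypes)


random.seed(1)
TRUE_MEAN_H2 = 0.2       # hypothetical heritability shared by all phenotypes
PER_PHENOTYPE_SD = 0.3   # hypothetical SD of each estimate (small dataset)
N_PHENOTYPES = 2000

# Simulate noisy, unconstrained estimates (values may fall outside 0 to 1).
estimates = [
    random.gauss(TRUE_MEAN_H2, PER_PHENOTYPE_SD) for _ in range(N_PHENOTYPES)
]
average = statistics.mean(estimates)

print(f"SD of a single estimate: {PER_PHENOTYPE_SD:.3f}")
print(f"SD of the average:       {sd_of_average(PER_PHENOTYPE_SD, N_PHENOTYPES):.4f}")
print(f"simulated average:       {average:.3f}")
```

Each simulated estimate is too noisy to interpret on its own, but the average lands close to the value used to generate the data.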
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

If your dataset is too small to perform heritability analysis, then we suggest you focus on association testing. LDAK is able to perform both single-predictor association analysis and gene-based association analysis, both classically and within a mixed-model framework (the latter is useful when your dataset contains related individuals and/or population structure).