Making Data

LDAK provides commands for converting data from one format to another:

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

LDAK can create or convert datasets using the arguments --make-bed <output>, --make-gen <output>, --make-sp <output>, --make-sped <output> or --make-speed <output>.

To specify the starting dataset, use either --bfile/--gen/--sp/--speed <prefix> to provide a single dataset, or --mbfile/--mgen/--msp/--mspeed <prefix_list> to provide multiple datasets. For the latter, you should use --common-samples <YES/NO> and --common-preds <YES/NO> to specify whether to use all samples / predictors, or only those common to all datasets. To filter predictors based on allele frequency, variance or missingness, see the Data Filtering options.
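As a rough illustration of how the output and input arguments combine, the sketch below assembles (but does not run) a command line from the options named above. The executable name `ldak`, the helper `make_data_command` and the file prefixes are hypothetical; only the option names are taken from this page.

```python
import shlex

OUTPUT_FORMATS = {"bed", "gen", "sp", "sped", "speed"}
INPUT_FORMATS = {"bfile", "gen", "sp", "speed"}


def make_data_command(output, out_format="bed", single=None, multi=None,
                      in_format="bfile", common_samples=None,
                      common_preds=None, executable="ldak"):
    """Build an argument list for a Making Data call; nothing is executed.

    `executable` is an assumed name; adjust it to your local LDAK binary.
    """
    if out_format not in OUTPUT_FORMATS:
        raise ValueError(f"unknown output format: {out_format}")
    if in_format not in INPUT_FORMATS:
        raise ValueError(f"unknown input format: {in_format}")
    if (single is None) == (multi is None):
        raise ValueError("provide exactly one of single= or multi=")

    cmd = [executable, f"--make-{out_format}", output]
    if single is not None:
        # One dataset: --bfile/--gen/--sp/--speed <prefix>
        cmd += [f"--{in_format}", single]
    else:
        # Several datasets: --mbfile/--mgen/--msp/--mspeed <prefix_list>
        for name, value in (("common_samples", common_samples),
                            ("common_preds", common_preds)):
            if value not in ("YES", "NO"):
                raise ValueError(f"{name} must be 'YES' or 'NO'")
        cmd += [f"--m{in_format}", multi,
                "--common-samples", common_samples,
                "--common-preds", common_preds]
    return cmd


if __name__ == "__main__":
    # Hypothetical prefixes, shown only to illustrate the argument order.
    print(shlex.join(make_data_command("converted", single="mydata")))
    print(shlex.join(make_data_command("merged", multi="list.txt",
                                       common_samples="YES",
                                       common_preds="NO")))
```

Returning a list rather than a string keeps each argument separate, so the result could be passed to `subprocess.run` without shell quoting issues.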

As well as filtering or merging datasets, these arguments can be used to convert the genotype codings (see below), or to convert datasets from one format to another. Be aware that some conversions entail a loss of information. For example, bed format only accepts the values 0, 1, 2 or NA, so if you begin with either probabilities or dosages (gen or sp format), it will be necessary to convert to hard genotypes (for which you should use either the option --threshold or --min-prob). When converting data, by default all samples and predictors will be retained, but this can be changed using the options here.
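The loss of information can be seen in a minimal sketch of hard-calling. The rule used here (keep the most probable genotype if its probability reaches a cutoff, otherwise record NA) is an assumption made for illustration; it is not taken from LDAK's definition of --threshold or --min-prob, and the function names are hypothetical.

```python
def expected_dosage(probs):
    """Expected allele count from probabilities (p0, p1, p2) of genotypes 0, 1, 2."""
    p0, p1, p2 = probs
    return p1 + 2 * p2


def hard_call(probs, cutoff):
    """Return the most probable genotype (0, 1 or 2), or None (NA) if it is
    less probable than `cutoff`. Illustrative rule only, not LDAK's own."""
    best = max(range(3), key=lambda g: probs[g])
    return best if probs[best] >= cutoff else None


if __name__ == "__main__":
    confident = (0.02, 0.95, 0.03)
    uncertain = (0.40, 0.35, 0.25)
    for probs in (confident, uncertain):
        # The dosage keeps the uncertainty; the hard call discards it.
        print(probs, expected_dosage(probs), hard_call(probs, cutoff=0.9))
```

Under this rule, a well-imputed genotype collapses to a single count, while an uncertain one becomes missing; in both cases the original probabilities cannot be recovered from the bed file.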
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Genotype codings: --encoding <1/2/3/4> (suitable only for hard count data)
This option determines how 0/1/2 hard genotype calls are recoded:
1 - 0/1/2 - additive model (default)
2 - 0/2/2 - dominant model
3 - 0/0/2 - recessive model
4 - 0/2/0 - heterogeneous model.
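The four codings listed above amount to a lookup table applied to each hard genotype call. The sketch below shows that mapping; the names `ENCODINGS` and `recode` are hypothetical, and treating missing values as unchanged is an assumption.

```python
# For each --encoding value, the recoded values of genotypes 0, 1 and 2.
ENCODINGS = {
    1: (0, 1, 2),  # additive (default)
    2: (0, 2, 2),  # dominant
    3: (0, 0, 2),  # recessive
    4: (0, 2, 0),  # heterogeneous
}


def recode(genotypes, encoding=1):
    """Recode a list of hard calls (0, 1, 2, or None for NA).

    Missing values are passed through unchanged (an assumption)."""
    table = ENCODINGS[encoding]
    return [None if g is None else table[g] for g in genotypes]


if __name__ == "__main__":
    calls = [0, 1, 2, None, 1]
    for code in sorted(ENCODINGS):
        print(code, recode(calls, code))
```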

 

The following five options can be used when Making Data:

--min-maf <float> - tells LDAK to exclude predictors that have minor allele frequency (MAF) less than <float>. For heritability analyses, it is standard to restrict to common SNPs (those with MAF above, say, 0.01 or 0.005), because it is not clear how to model the heritability of rare SNPs.

--max-maf <float> - tells LDAK to exclude predictors that have minor allele frequency (MAF) greater than <float>.

--min-var <float> - tells LDAK to exclude predictors that have variance less than <float>.

--min-obs <float> - tells LDAK to exclude predictors with values recorded for fewer than a proportion <float> of the individuals.

--min-info <float> - tells LDAK to exclude predictors with LDAK information score less than <float>. The LDAK information score estimates the squared correlation between the observed genotypes and the true genotypes; it is formally described in the Methods of our paper "Reevaluation of SNP heritability in complex human traits" (Nature Genetics, 2017).

Note that while --min-var and --min-obs can be used for all types of data, --min-maf and --max-maf can only be used with SNP data (where all genotypes are values within [0,2]), and --min-info can only be used with SNP data that provides genotype probabilities (e.g., the results from imputation).
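To make the first four thresholds concrete, the sketch below computes the corresponding statistics for one predictor and applies the cutoffs. It is an illustration under stated assumptions, not LDAK's implementation: allele frequency is taken as half the mean of the observed values, variance is the population variance of the observed values, and the function names are hypothetical. The information score is omitted because its definition is given only in the paper cited above.

```python
def predictor_stats(values):
    """Summarise one predictor; `values` are genotypes in [0, 2], None for NA."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("predictor has no observed values")
    n = len(observed)
    mean = sum(observed) / n
    freq = mean / 2  # allele frequency, assuming SNP data in [0, 2]
    return {
        "maf": min(freq, 1 - freq),
        "var": sum((v - mean) ** 2 for v in observed) / n,
        "obs": n / len(values),  # proportion of individuals with a value
    }


def passes(stats, min_maf=None, max_maf=None, min_var=None, min_obs=None):
    """Return True if the predictor would be kept under the given cutoffs."""
    if min_maf is not None and stats["maf"] < min_maf:
        return False
    if max_maf is not None and stats["maf"] > max_maf:
        return False
    if min_var is not None and stats["var"] < min_var:
        return False
    if min_obs is not None and stats["obs"] < min_obs:
        return False
    return True


if __name__ == "__main__":
    stats = predictor_stats([0, 1, 2, 1, None, 0, 0, 0])
    print(stats)
    print(passes(stats, min_maf=0.01))  # kept: the SNP is common
    print(passes(stats, min_obs=0.9))   # excluded: 7 of 8 values observed
```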