Data File Formats

LDAK accepts data in many formats. The most commonly used is bed format (binary PLINK), which accommodates (hard-coded) SNP genotypes. Gen format is highly flexible; it is designed for reading genotype probabilities created by IMPUTE2, but can also accommodate, say, the output from other imputation software, haplotypes or non-genotype data (e.g., gene expression). The third option is Sparse Partitioning (SP) format, which simply requires data in a large matrix (rows are predictors, columns are samples).

To filter either samples or predictors, see the Data Filtering options. LDAK is usually applied to SNP data, in which case all predictors take values between 0 and 2 (representing the count of the A allele). However, LDAK can also be applied to other datatypes; for this your data should be in either gen or SP format, and you should use the option --non-SNP YES.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Bed format: --bfile <prefix>
Requires: <prefix>.bed, <prefix>.bim and <prefix>.fam (see the PLINK website for more details).

The bed file contains the SNP genotypes, coded as AA, AB, BB or missing, where A is the first allele and B is the second. Note that the bed file is stored in binary format, so it is not (easily) human readable. The bim file has one row per SNP and six columns, which provide the chromosome, name, genetic distance and physical position of each SNP, as well as the bases corresponding to the A and B alleles (genetic distances are often not provided, in which case the third column values will be zero). The fam file has one row per individual and six columns, which provide the Family ID, the Individual ID, the Paternal and Maternal IDs, Sex and Phenotype. Note that LDAK only uses the first two IDs; the remaining four columns are ignored (so to use sex as a covariate or to provide phenotypic values, these need to be supplied separately).
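As an illustration of the six-column layouts described above, the following is a minimal Python sketch that parses one bim row and extracts the two IDs from one fam row. It assumes whitespace-delimited files in the standard PLINK column order; the names BimRow, parse_bim_line and parse_fam_ids, and the example rows, are hypothetical and not part of LDAK.

```python
from typing import NamedTuple, Tuple


class BimRow(NamedTuple):
    chrom: str           # column 1: chromosome
    name: str            # column 2: predictor (SNP) name
    genetic_dist: float  # column 3: genetic distance (often 0)
    position: int        # column 4: physical position
    a1: str              # column 5: first allele (A)
    a2: str              # column 6: second allele (B)


def parse_bim_line(line: str) -> BimRow:
    """Split one bim row into its six fields."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError("a bim row should have six columns")
    chrom, name, cm, bp, a1, a2 = fields
    return BimRow(chrom, name, float(cm), int(bp), a1, a2)


def parse_fam_ids(line: str) -> Tuple[str, str]:
    """Return only the first two columns (the IDs); the other four are ignored."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError("a fam row should have six columns")
    return fields[0], fields[1]


# Made-up example rows, for illustration only.
example_bim = parse_bim_line("1 rs123 0 752566 G A")
example_ids = parse_fam_ids("FAM1 IND1 0 0 1 -9")
```

Because only the two IDs are kept from the fam file, any sex or phenotype values in columns five and six play no role in this sketch, mirroring the behaviour described above.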
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Gen format: --gen <genfile>
Requires: either --sample <samplefile> or --fam <famfile>, and either --oxford-single-chr <integer> or --bim <bimfile>
Optional: --gen-skip (default 0), --gen-headers (default 5) and --gen-probs (default 3)

The genfile contains predictor values, one row per predictor. --gen-skip indicates how many header rows the file has (typically 0 or 1); --gen-headers indicates how many header columns it has (typically 0 to 5). --gen-probs should be 0, 1, 2, 3 or 4:
0 - haplotypes - predictor values should be "0 0", "0 1", "1 0" or "1 1"
1 - dosages - predictors provide the (expected) number of A alleles
2 - two probs - providing probability of being AA or AB
3 - three probs - providing probabilities of being AA, AB or BB
4 - four probs - providing probabilities of being AA, AB, BB or NA,
where A and B are the A1 and A2 alleles

The default gen format assumes the genfile has no header row, five header columns (blank, SNP ID, BP, A1, A2) and three probabilities, matching the output from IMPUTE2 (see the Oxford Stats webpages for details).
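To make the default layout concrete, here is a minimal Python sketch that splits one gen row into its five header columns and converts each sample's three probabilities into an expected count of the A allele. It assumes whitespace-delimited rows and that the dosage is 2*P(AA) + P(AB); the function name parse_gen_line and the example row are hypothetical, and this is not how LDAK itself reads the file.

```python
def parse_gen_line(line, n_headers=5):
    """Parse one row of a three-probability gen file.

    Returns the header columns and one dosage per sample, where the dosage
    is the expected number of A alleles: 2*P(AA) + P(AB).
    """
    fields = line.split()
    headers = fields[:n_headers]
    probs = [float(x) for x in fields[n_headers:]]
    if len(probs) % 3 != 0:
        raise ValueError("expected three probabilities per sample")
    dosages = []
    for k in range(0, len(probs), 3):
        p_aa, p_ab, _p_bb = probs[k:k + 3]
        dosages.append(2.0 * p_aa + p_ab)
    return headers, dosages


# Made-up row: five header columns, then three samples with three probabilities each.
headers, dosages = parse_gen_line("--- rs123 752566 G A 1 0 0 0 1 0 0.2 0.3 0.5")
```

The first two samples are certain calls (AA and AB), so their dosages are 2 and 1; the third is uncertain and receives a fractional dosage.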
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

SP format: --sp <prefix>
Requires <prefix>.sp, <prefix>.bim, <prefix>.fam.

SP format is a raw format where the main data file (.sp) has one row per predictor and one column per sample. If the genotypes are hard calls, element (i,j) would take value 0/1/2 depending on whether the jth sample is homozygous (major), heterozygous or homozygous (minor) for the ith predictor. When using dosage data, the values correspond to the expected allele count:
data[i,j] = Prob(heterozygous) + 2*Prob(homozygous (minor)). However, the data need not correspond to allele counts, so, for example, gene expression and methylation data can be accommodated, and negative values are allowed.

The value "-99" denotes missing data (this value can be changed with the option --missingvalue <float>, see Extra Information). As with PLINK format, the bim and fam files contain details for the predictors and samples, respectively. Note that in the bim file, the alleles for each SNP are provided in the order minor then major. So if the alleles for a SNP were C then G, then the genotype states 0, 1 and 2 would correspond to GG, CG and CC, respectively.
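The SP layout and the missing-value convention can be sketched in a few lines of Python. This example assumes the .sp file is whitespace-delimited text; the names parse_sp and expected_count and the example values are hypothetical, and missing entries are represented as None purely for illustration.

```python
MISSING_VALUE = -99.0  # default missing-value code described in the text


def parse_sp(text, missing=MISSING_VALUE):
    """Parse SP-style text: one row per predictor, one column per sample."""
    rows = []
    for line in text.strip().splitlines():
        values = [float(x) for x in line.split()]
        rows.append([None if v == missing else v for v in values])
    return rows


def expected_count(p_het, p_hom_minor):
    """Dosage as given in the text: Prob(het) + 2*Prob(homozygous (minor))."""
    return p_het + 2.0 * p_hom_minor


# Two predictors, four samples; the second row holds dosages rather than hard calls.
data = parse_sp("0 1 2 -99\n0.5 1.2 -99 2\n")
```

Here data[i][j] is the value for predictor i and sample j, matching the (i,j) indexing used above.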
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Old Binary SP format: --sped <prefix>
Requires <prefix>.sped, <prefix>.bim, <prefix>.fam

This is the same as SP format, except the main data file is in binary format. Values are stored as floats in row-major order; if there are 1000 samples, then the first 1000 values would correspond to the first predictor, the next 1000 values to the second predictor, and so on. To store each float requires 64 bits (8 bytes), meaning that a dataset containing 5000 samples and 1 million predictors would require 5000 x 1 000 000 x 8 = 40 000 000 000 bytes = 40 GB. This format has been largely superseded by the (new) binary SP format below.
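The row-major layout means the position of any value can be computed directly from its predictor and sample indices. The Python sketch below illustrates this on an in-memory buffer; it assumes 8-byte little-endian floats with no file header, which is an assumption about the on-disk details rather than something stated above, and the names value_offset and read_value are hypothetical.

```python
import struct

BYTES_PER_VALUE = 8  # 64 bits per value, as stated in the text


def value_offset(pred, sample, n_samples):
    """Byte offset of the value for (predictor, sample) in row-major order."""
    return (pred * n_samples + sample) * BYTES_PER_VALUE


def read_value(buffer, pred, sample, n_samples):
    """Read one 8-byte float from the buffer (little-endian is assumed)."""
    return struct.unpack_from("<d", buffer, value_offset(pred, sample, n_samples))[0]


# Build a tiny 2-predictor x 3-sample buffer and read one entry back.
matrix = [[0.0, 1.0, 2.0], [2.0, 0.5, -99.0]]
packed = b"".join(struct.pack("<d", v) for row in matrix for v in row)
value = read_value(packed, pred=1, sample=1, n_samples=3)
```

With 1000 samples, the second predictor starts at byte offset 1000 x 8 = 8000 under these assumptions.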
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Binary SP format: --speed <prefix>
Requires <prefix>.speed, <prefix>.bim, <prefix>.fam

The same as Old Binary SP format, except that the values are stored as truncated floats. Two levels of accuracy are available: short stores each float using 8 bits (1 byte); long stores each float using 16 bits (2 bytes). Therefore, storing a dataset of 5000 samples and 1 million predictors would require either 5 or 10 GB. When creating binary SP datasets, the default is to use short accuracy, but this can be changed using --speed-long <YES/NO>.
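The storage figures quoted for the binary SP formats follow from multiplying the number of values by the bytes per value. A short Python sketch of that arithmetic is given below; the dictionary keys and the function name storage_bytes are hypothetical labels, and any header bytes a real file might contain are ignored.

```python
# Bytes used per stored value, taken from the descriptions in the text.
BYTES_PER_VALUE = {"sped": 8, "speed_long": 2, "speed_short": 1}


def storage_bytes(n_samples, n_predictors, fmt):
    """Approximate size of the main data file: one value per sample per predictor."""
    return n_samples * n_predictors * BYTES_PER_VALUE[fmt]


# Sizes in GB (10^9 bytes) for 5000 samples and 1 million predictors.
sizes_gb = {
    fmt: storage_bytes(5000, 1_000_000, fmt) / 1e9 for fmt in BYTES_PER_VALUE
}
```

Under these assumptions, short accuracy needs one eighth of the space of the old binary format, and long accuracy one quarter.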
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

LDAK can create or convert datasets using the arguments --make-bed <output>, --make-gen <output>, --make-sp <output>, --make-sped <output> or --make-speed <output>.

To specify the starting dataset, use either --bfile/--gen/--sp/--speed <prefix> to provide a single dataset, or --mbfile/--mgen/--msp/--mspeed <prefix_list> to provide multiple datasets. For the latter, you should use --common-samples <YES/NO> and --common-preds <YES/NO> to specify whether to use all samples / predictors, or only those common to all datasets. To filter predictors based on allele frequency, variance or missingness, see the Data Filtering options.

As well as filtering or merging datasets, these arguments can be used to convert the genotype codings (see below), or to convert datasets from one format to another. Be aware that some conversions entail loss of information. For example, bed format only accepts values 0, 1, 2 or NA, so if you begin with either probabilities or dosages (gen or SP format), then it will be necessary to convert to hard genotypes (for which you should use either the option --threshold or --min-prob). When converting data, by default all samples and predictors will be retained, but this can be changed using the Data Filtering options.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Genotype codings: --encoding <1/2/3/4> (suitable only for hard count data)
This option determines how 0/1/2 hard genotype calls are recoded:
1 - 0/1/2 - additive model (default)
2 - 0/2/2 - dominant model
3 - 0/0/2 - recessive model
4 - 0/2/0 - heterogeneous model.
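The four codings amount to a lookup from the original 0/1/2 call to a recoded value. The following Python sketch shows that mapping; the names ENCODINGS and recode are hypothetical, and representing missing calls as None is an assumption made for illustration.

```python
# Recoded values for hard calls 0, 1 and 2 under each --encoding option.
ENCODINGS = {
    1: (0, 1, 2),  # additive (default)
    2: (0, 2, 2),  # dominant
    3: (0, 0, 2),  # recessive
    4: (0, 2, 0),  # heterozygote-only coding (option 4)
}


def recode(genotypes, encoding=1):
    """Recode a list of 0/1/2 hard calls; None (missing) is passed through."""
    table = ENCODINGS[encoding]
    return [None if g is None else table[g] for g in genotypes]


calls = [0, 1, 2, None, 1]
dominant = recode(calls, encoding=2)
recessive = recode(calls, encoding=3)
```

Because the table is indexed by the hard call, this sketch only makes sense for 0/1/2 data, in line with the note that the option is suitable only for hard count data.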
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Most human studies by default use normalised predictors (mean zero and variance one). This typically makes the assumption that all predictors (a priori) contribute equal variance explained, which for SNP data supposes that rarer SNPs have larger effect sizes (to compensate for their lower variance). However, LDAK allows you to consider a range of standardisations:

--power <float> (default -1) - predictor values are multiplied by Var(pred)^(<float>/2). Negative values suppose rarer predictors have larger effect sizes, positive that more common predictors have larger effects.
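The effect of the scaling can be sketched in Python as follows. This example assumes the predictor is first centred and that the variance is computed with divisor n; both are assumptions made for illustration, as is the function name standardise, so this should not be read as LDAK's exact procedure.

```python
def variance(values):
    """Variance with divisor n (an assumption of this sketch)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def standardise(values, power=-1.0):
    """Centre the predictor, then multiply by Var(pred)^(power/2)."""
    mean = sum(values) / len(values)
    var = variance(values)
    if var == 0:
        raise ValueError("predictor has zero variance")
    scale = var ** (power / 2.0)
    return [(v - mean) * scale for v in values]


genotypes = [0, 1, 2, 1, 0, 2]
default_scaled = standardise(genotypes)       # power -1: variance becomes one
unscaled = standardise(genotypes, power=0.0)  # power 0: centred only
```

With power -1 the multiplier is one over the standard deviation, giving the usual normalised predictors; with power 0 the multiplier is one, so predictors with higher variance retain more weight.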

This option can be added whenever reading in data, for example, when calculating kinships, performing REML analysis using regional predictors, or constructing prediction models. In our paper, we found that the default standardisation is best when estimating variance explained (it is most robust to misspecification). However, we have found that higher values (e.g., --power 0) can be beneficial for prediction, which we suppose is because this places more weight on common predictors, which are generally more reliably typed and whose effect sizes are easier to estimate.