Data File Formats

LDAK accepts data in many formats. The most commonly used is bed format (binary PLINK), which accommodates hard SNP genotypes (calls rather than probabilities). Gen format is highly flexible; it is designed for reading genotype probabilities created by IMPUTE2, but can also accommodate, say, the output from other imputation software, haplotypes or non-genotype data (e.g., gene expression). If you are not using bed format, then I’d suggest converting to speed format (binary Sparse Partitioning), as this provides an efficient way to store arbitrary predictors.

To filter either samples or predictors, see the Data Filtering options.

When data record hard genotypes, alternative genotype codings, such as dominant and recessive models, can be considered using the option –encoding (see below).
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Bed format: –bfile <prefix>
Requires: <prefix>.bed, <prefix>.bim and <prefix>.fam (see the PLINK website for more details).

The bed file contains the SNP genotypes, coded as AA, AB, BB or missing, where A is the first allele and B is the second. The bim file has one row per SNP and six columns, which provide the chromosome, name, genetic distance and physical position of each SNP, as well as the bases corresponding to its two alleles (note the genetic distance is not used so can be set to zero). The fam file has one row per individual and six columns, which provide the Family ID, the Individual ID, the Paternal and Maternal IDs, Sex and Phenotype. Note that LDAK only uses the first two IDs; the remaining four columns are ignored (so to use sex as a covariate or to provide phenotypic values, these need to be supplied separately).
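
To make the genotype coding concrete, here is a minimal Python sketch of decoding the contents of a bed file. It is not taken from LDAK; it assumes the standard SNP-major layout described on the PLINK website (three leading bytes 0x6C 0x1B 0x01, then two bits per genotype with four samples per byte, low-order bits first). The names decode_bed and CODES are made up for illustration.

```python
# Illustrative sketch only; layout assumed from the PLINK bed specification.
BED_MAGIC = bytes([0x6C, 0x1B, 0x01])  # magic number plus SNP-major flag

# Two-bit codes: 00 = AA (homozygous first allele), 01 = missing,
# 10 = AB (heterozygous), 11 = BB (homozygous second allele).
CODES = {0b00: "AA", 0b01: None, 0b10: "AB", 0b11: "BB"}


def decode_bed(raw, num_samples):
    """Return one list of genotype calls per SNP (None marks missing)."""
    if raw[:3] != BED_MAGIC:
        raise ValueError("not a SNP-major bed file")
    bytes_per_snp = (num_samples + 3) // 4  # each SNP is padded to whole bytes
    body = raw[3:]
    if len(body) % bytes_per_snp != 0:
        raise ValueError("file length does not match the number of samples")
    snps = []
    for start in range(0, len(body), bytes_per_snp):
        block = body[start:start + bytes_per_snp]
        calls = []
        for j in range(num_samples):
            # Sample j sits in byte j // 4, at bit offset 2 * (j % 4).
            code = (block[j // 4] >> (2 * (j % 4))) & 0b11
            calls.append(CODES[code])
        snps.append(calls)
    return snps


# One SNP, five samples: AA, AB, BB, missing, AB.
example = BED_MAGIC + bytes([0x78, 0x02])
genotypes = decode_bed(example, num_samples=5)
```

The number of samples has to come from the fam file, because the bed file itself does not record it.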
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Gen format: –gen <genfile>
Requires: either –sample <samplefile> or –fam <famfile>, and either –oxford-single-chr <integer> or –bim <bimfile>
Optional: –gen-skip (default 0), –gen-headers (default 5) and –gen-probs (default 3)

The genfile contains predictor values, one row per predictor. –gen-skip indicates how many header rows the file has (typically 0 or 1); –gen-headers indicates how many header columns it has (typically 0 to 5). –gen-probs should be 0, 1, 2, 3 or 4:
0 – haplotypes – predictor values should be “0 0”, “0 1”, “1 0” or “1 1”
1 – dosages – predictors provide the (expected) number of A alleles
2 – two probs – providing probability of being AA or AB
3 – three probs – providing probabilities of being AA, AB or BB
4 – four probs – providing probabilities of being AA, AB, BB or NA,
where A and B are the A1 and A2 alleles

The default gen format assumes the genfile has no header row, five header columns (blank, SNP ID, BP, A1, A2) and provides probabilities of being AA, AB and BB, matching the output from IMPUTE2 (see the Oxford Stats webpages for details).
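
As an illustration of the default layout, the following Python sketch splits one genfile row into its header columns and per-sample probabilities. It is not LDAK code. It assumes the row is whitespace-delimited and that the first ("blank") column holds a non-empty placeholder such as "---"; the function names are hypothetical. The helper expected_a_count shows how three probabilities relate to a dosage (the expected number of A alleles); whether LDAK performs exactly this conversion internally is an assumption.

```python
def parse_gen_line(line, num_headers=5, num_probs=3):
    """Split one genfile row into header fields and per-sample value tuples."""
    fields = line.split()
    headers = fields[:num_headers]
    values = [float(x) for x in fields[num_headers:]]
    if len(values) % num_probs != 0:
        raise ValueError("number of values is not a multiple of num_probs")
    # Group the values into one tuple per sample.
    probs = [tuple(values[k:k + num_probs])
             for k in range(0, len(values), num_probs)]
    return headers, probs


def expected_a_count(prob_aa, prob_ab, prob_bb):
    """Expected number of A alleles given probabilities of AA, AB and BB."""
    return 2.0 * prob_aa + prob_ab


# Made-up row: five header columns, then three probabilities for each of
# three samples.
row = "--- rs1 1000 A G 1 0 0 0 1 0 0.2 0.5 0.3"
headers, probs = parse_gen_line(row)
dosages = [expected_a_count(*p) for p in probs]
```

Here num_headers and num_probs play the roles of –gen-headers and –gen-probs, and any header rows (–gen-skip) would be dropped before calling the parser.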
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

SP format: –sp <prefix>
Requires: <prefix>.sp, <prefix>.bim and <prefix>.fam (see the Sparse Partitioning website for more details).

SP format is a raw format where the main data file (.sp) has one row per predictor and one column per sample. If the genotypes are hard calls, element (i,j) would take value 0/1/2 depending on whether the jth sample is homozygous (major), heterozygous or homozygous (minor) for the ith predictor. When using dosage data, the values correspond to the expected allele count:
data[i,j] = Prob(heterozygous) + 2*Prob(homozygous (minor)). However, the data need not correspond to allele counts, so, for example, gene expression and methylation data can be accommodated, and negative values are allowed.

The value “-99” denotes missing data (this value can be changed with the option –missingvalue <float>, see Extra Information). As with PLINK format, the bim and fam files contain details for the predictors and samples, respectively. Note that in the bim file, the alleles for each SNP are provided in the order minor then major. So if the alleles for a SNP were C then G, then the genotype states 0, 1 and 2 would correspond to GG, CG and CC, respectively.
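
The sketch below shows, in Python, how a row of an .sp file might be read and how a dosage follows from genotype probabilities. It is an illustration rather than LDAK's own reader: it assumes whitespace-separated values, and the function names are hypothetical.

```python
MISSING_VALUE = -99.0  # default missing code described above


def parse_sp_row(line, missing_value=MISSING_VALUE):
    """Read one predictor (one row), returning None for missing entries."""
    values = [float(token) for token in line.split()]
    return [None if v == missing_value else v for v in values]


def dosage(prob_heterozygous, prob_homozygous_minor):
    """Expected minor allele count, as in the formula for data[i,j]."""
    return prob_heterozygous + 2.0 * prob_homozygous_minor


# Made-up row mixing hard calls, a missing entry and non-genotype values.
row = parse_sp_row("0 1 2 -99 0.35 -1.2")
```

Because negative values are allowed, a data set that legitimately contains -99 would need a different missing code, which is the purpose of –missingvalue.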
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Binary SP format: –speed <prefix>
Requires: <prefix>.speed, <prefix>.bim and <prefix>.fam

This is the same as SP format, except the main data file is in binary format. Values are stored as floats in row-major order; if there are 1000 samples, then the first 1000 values would correspond to the first predictor, the next 1000 values to the second predictor, and so on.
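
The row-major layout can be sketched in Python as follows. This is an illustration under stated assumptions, not a specification of the .speed file: it assumes 4-byte little-endian floats and no header bytes, neither of which is confirmed above, and the function names are hypothetical.

```python
import struct

FLOAT_SIZE = 4  # assumption: values are 4-byte (single-precision) floats


def pack_predictors(rows):
    """Concatenate predictors (rows) into one block of packed floats."""
    return b"".join(struct.pack("<%df" % len(row), *row) for row in rows)


def unpack_predictors(raw, num_samples):
    """Recover one list per predictor from a row-major block of floats."""
    if len(raw) % (FLOAT_SIZE * num_samples) != 0:
        raise ValueError("length is not a whole number of predictors")
    total = len(raw) // FLOAT_SIZE
    flat = struct.unpack("<%df" % total, raw)
    # The first num_samples values belong to the first predictor, and so on.
    return [list(flat[k:k + num_samples])
            for k in range(0, total, num_samples)]


# Two predictors measured on three samples.
block = pack_predictors([[0.0, 1.0, 2.0], [-99.0, 0.5, 1.5]])
recovered = unpack_predictors(block, num_samples=3)
```

As with bed format, the number of samples comes from the fam file, so the data file and the fam file must stay in step.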
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

To convert the data to Binary PLINK, SP or Binary SP format, use the arguments –make-bed <output>, –make-sp <output> or –make-speed <output>, respectively; files will be saved with the prefix <output>_out. Note that because Binary PLINK files only allow for hard genotypes (0/1/2), converting from dosage files will result in loss of information. If converting from chiamo format, the most likely state will be recorded, provided this state’s probability is higher than 0.95 (this value can be changed using –threshold). If converting from sp format, dosages <0.1 will be assigned genotype 0, dosages within (0.95,1.05) will be assigned genotype 1 and dosages >1.9 will be assigned genotype 2; otherwise the genotype will be considered missing. When converting data, by default all samples and predictors will be retained, but this can be changed using the Data Filtering options.
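
The two thresholding rules can be written out as a short Python sketch. It only restates the rules in the paragraph above; the function names are hypothetical, and how LDAK treats values lying exactly on a boundary is an assumption (strict inequalities are used here).

```python
def hard_call_from_probs(probs, threshold=0.95):
    """Index of the most likely state, or None if it is not likely enough.

    probs holds the state probabilities in file order (e.g., AA, AB, BB).
    """
    best = max(range(len(probs)), key=lambda k: probs[k])
    return best if probs[best] > threshold else None


def hard_call_from_dosage(dosage):
    """Map a dosage to genotype 0, 1 or 2, or None (missing)."""
    if dosage < 0.1:
        return 0
    if 0.95 < dosage < 1.05:
        return 1
    if dosage > 1.9:
        return 2
    return None


calls = [hard_call_from_dosage(d) for d in (0.05, 1.0, 1.95, 0.5)]
```

A dosage of 0.5 becomes missing, which is the loss of information referred to above: the hard-call file can no longer distinguish it from an untyped genotype.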
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Genotype codings: –encoding <1/2/3/4> (suitable only for hard count data)
This option determines how 0/1/2 hard genotype calls are recoded:
1 – 0/1/2 – additive model (default)
2 – 0/2/2 – dominant model
3 – 0/0/2 – recessive model
4 – 0/2/0 – heterogeneous model.
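
The four codings amount to a lookup table, sketched here in Python. The table is copied from the list above; the names ENCODINGS and recode are hypothetical, and passing missing genotypes through unchanged is an assumption.

```python
# Recoded value for hard genotype calls 0, 1 and 2 under each encoding.
ENCODINGS = {
    1: (0, 1, 2),  # additive (default)
    2: (0, 2, 2),  # dominant
    3: (0, 0, 2),  # recessive
    4: (0, 2, 0),  # fourth model in the list above
}


def recode(genotypes, encoding=1):
    """Recode hard calls (0/1/2); None (missing) is passed through."""
    table = ENCODINGS[encoding]
    return [None if g is None else table[g] for g in genotypes]


dominant = recode([0, 1, 2, None], encoding=2)
```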
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Most human studies by default use normalised predictors (mean zero and variance one). This typically makes the assumption that all predictors (a priori) contribute equal variance explained, which for SNP data supposes that rarer SNPs have larger effect sizes (to compensate for their lower variance). However, LDAK allows you to consider a range of standardisations:

–power <float> (default -1) – predictor values are multiplied by Var(pred)^(<float>/2). Negative values suppose rarer predictors have larger effect sizes, positive values that more common predictors have larger effect sizes.

This option can be added whenever reading in data, for example, when calculating kinships, performing REML analysis or constructing prediction models. In our paper, we found that the default standardisation is best when estimating variance explained (it is most robust to misspecification); however, we have found that higher values (e.g., –power 0) can be beneficial for prediction, which we suppose is because this places more weight on common predictors, which are generally more reliably typed and whose effect sizes are easier to estimate.
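
The effect of the scaling can be seen in a small Python sketch. It is an illustration, not LDAK's implementation: it assumes predictors are centred before scaling, uses the population variance (dividing by n) and ignores missing values. The function name scale_predictor is hypothetical.

```python
def scale_predictor(values, power=-1.0):
    """Centre a predictor and multiply it by Var(pred)^(power/2)."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    if variance == 0:
        raise ValueError("predictor has zero variance")
    factor = variance ** (power / 2.0)
    return [(v - mean) * factor for v in values]


genotypes = [0, 1, 2, 1]  # mean 1, variance 0.5
standardised = scale_predictor(genotypes, power=-1.0)  # variance becomes one
centred = scale_predictor(genotypes, power=0.0)        # centred only
```

With power -1 every predictor ends up with variance one regardless of allele frequency, whereas with power 0 a rare SNP keeps its smaller variance and so carries less weight.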