Make Data

LDAK accepts many types of genetic data files (see File Formats). Here we explain how to remake data files. This can be used to convert between data formats (e.g., Ridge-Predict, Bolt-Predict and BayesR-Predict require that the data are saved in a binary format), to merge data (join two or more data files of the same format) and to reduce data (e.g., restrict to a subset of samples and/or predictors).

Always read the screen output, which suggests arguments and estimates memory usage.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

The main argument determines the output data format:

Use --make-bed <outfile> to save data in Binary PLINK format.

Use --make-sp <outfile> to save data in (original) SP format.

Use --make-sped <outfile> to save data in old binary SP format.

Use --make-speed <outfile> to save data in new binary SP format.

Use --make-gen <outfile> to save data in gen format.

See Genetic Data Formats for details of each format.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

With each of these arguments, you must provide input genetic data. If you have only one input dataset, you should use --bfile <datastem>, --gen <datastem>, --sp <datastem>, --sped <datastem> or --speed <datastem>. If you have multiple genetic datasets (of the same format), you should use --mbfile <datalist>, --mgen <datalist>, --msp <datalist>, --msped <datalist> or --mspeed <datalist>. Note that when using --make-gen <outfile>, you must provide genotype probabilities (i.e., use either --gen <datastem> or --mgen <datalist>).

When providing multiple datasets, the file <datalist> will normally have three columns, providing the names of the files containing predictor values, the names of the files containing predictor annotations and the names of the files containing sample annotations. For example, if you wish to merge two datasets saved in Binary PLINK format with stems data1 and data2, then <datalist> should have two rows, containing "data1.bed data1.bim data1.fam", then "data2.bed data2.bim data2.fam". Note that if <datalist> has only one column, this should contain the stem of each dataset, and LDAK will add the default suffix to each.
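
As a minimal sketch of the two list layouts described above, the commands below write a three-column list and a one-column list for the datasets data1 and data2. The file names datalist.txt and stems.txt and the output stem merged are hypothetical, chosen only for illustration. The merge itself runs only if ldak.out and the data files are present in the current directory.

```shell
# Three-column list: predictor values, predictor annotations, sample annotations.
# (datalist.txt is a hypothetical file name.)
printf '%s\n' \
  "data1.bed data1.bim data1.fam" \
  "data2.bed data2.bim data2.fam" > datalist.txt

# One-column list: only the stems; LDAK adds the default suffixes.
# (stems.txt is a hypothetical file name.)
printf '%s\n' "data1" "data2" > stems.txt

# Merge the two datasets and save the result in binary PLINK format.
# Skipped unless the LDAK executable and the input data are available.
if [ -x ./ldak.out ] && [ -f data1.bed ] && [ -f data2.bed ]; then
  ./ldak.out --make-bed merged --mbfile datalist.txt
fi
```

Either list file could be passed to --mbfile; the three-column form is useful when the files do not share a common stem.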
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

You can use --min-maf <float>, --max-maf <float>, --min-var <float> and --min-obs <float> to filter predictors based on minor allele frequency, variance or proportion of non-missing values. Note that while --min-var <float> and --min-obs <float> can be used for all types of data, --min-maf <float> and --max-maf <float> can only be used with SNP data (i.e., when all predictor values are within [0,2]). Additionally, if the input genetic data contain genotype probabilities, you can use --min-info <float> to filter predictors based on the LDAK information score. This score estimates the squared correlation between the observed genotypes and the true genotypes, and is formally described in the Methods of our paper Reevaluation of SNP heritability in complex human traits, Nature Genetics, 2017.

You can use --keep <keepfile> and/or --remove <removefile> to restrict to a subset of samples, and --extract <extractfile>, --exclude <excludefile>, --chr <integer> and/or --snp <predname> to restrict to a subset of predictors (for more details, see Data Filtering).
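
The filtering arguments can be combined with any of the main arguments. The sketch below assembles one such command for the binary PLINK dataset with stem human. The thresholds 0.01 and 0.95 and the output stem human_filtered are assumptions made for illustration, not recommended values. The command is printed, and run only if ldak.out and human.bed are present.

```shell
# Hypothetical thresholds; choose values appropriate for your data.
MIN_MAF=0.01   # keep predictors with minor allele frequency of at least 0.01
MIN_OBS=0.95   # keep predictors with at least 95% non-missing values

# human_filtered is a hypothetical output stem.
CMD="./ldak.out --make-bed human_filtered --bfile human --min-maf $MIN_MAF --min-obs $MIN_OBS"

# Show the command that would be run.
echo "$CMD"

# Run it only when the LDAK executable and the input data are available.
if [ -x ./ldak.out ] && [ -f human.bed ]; then
  $CMD
fi
```

Sample and predictor subsets can be added in the same way, for example by appending --keep <keepfile> or --extract <extractfile> to the command.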

By default, SNP values will be stored as 0/1/2, indicating the number of A1 alleles (or NA if missing). If you use --encoding DOM, values will be stored as 0/2/2. If you use --encoding REC, values will be stored as 0/0/2. If you use --encoding HET, values will be stored as 0/2/0. If you use --encoding MINOR, LDAK will ensure A1 is the minor allele, while if you use --encoding MISS, the value indicates whether the genotype was NA or not.
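
To make the DOM, REC and HET mappings concrete, the sketch below recodes an A1 allele count of 0, 1 or 2 under each encoding. The function recode is a hypothetical helper written for illustration; it reproduces the mapping described above and is not part of LDAK.

```shell
# Illustration only: map a count of A1 alleles (0, 1 or 2) to the value
# stored under a given encoding. recode is a hypothetical helper, not LDAK code.
recode() {
  # $1 = encoding (DOM, REC or HET), $2 = number of A1 alleles
  case "$1:$2" in
    DOM:1) echo 2 ;;      # dominant: heterozygotes count as 2
    REC:1) echo 0 ;;      # recessive: heterozygotes count as 0
    HET:1) echo 2 ;;      # heterozygous: only heterozygotes count as 2
    HET:2) echo 0 ;;
    *)     echo "$2" ;;   # all other cases keep the original value
  esac
}

# Print the stored values for genotypes 0/1/2 under each encoding.
for enc in DOM REC HET; do
  echo "$enc: $(recode "$enc" 0)/$(recode "$enc" 1)/$(recode "$enc" 2)"
done
```

In LDAK itself the recoding is requested by adding, for example, --encoding DOM to one of the commands on this page.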
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Example:

Here we use the binary PLINK files human.bed, human.bim and human.fam from the Test Datasets.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

We can convert from binary PLINK format to SP format using

./ldak.out --make-sp human.sp --bfile human

The genetic data are now saved in the files human.sp.sp, human.sp.bim and human.sp.fam (note that human.sp.sp is a text file, so SP format can be used to view predictor values saved in binary formats).

We can convert from binary PLINK format to SPEED format using

./ldak.out --make-speed human.speed --bfile human

The genetic data are now saved in the files human.speed.speed, human.speed.bim and human.speed.fam.