Get Weightings

The first task is to calculate weightings for each predictor. This is performed in three steps: cutting the predictors into SECTIONS; calculating weightings for each SECTION; joining the weightings across SECTIONS. The arguments for performing these steps are shown below.

Note that previously (LDAK4), we recommended calculating weightings twice, the second time using only those SNPs with non-zero weights from the first analysis; this was because, with dense data, the number of SNPs in each section would be more than LDAK could handle. With LDAK5, this is no longer required (instead, when cutting, LDAK5 thins duplicate SNPs, which reduces the number of SNPs in each section to a manageable level).

Options in red are REQUIRED; options in purple are OPTIONAL. If you wish to only analyse a subset of the data, see Data Filtering. In all cases, <folder> is the directory in which output files will be written.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

--cut-weights <folder>

--bfile/--chiamo/--sp/--speed <prefix> – specifies the datafiles (see File Formats).

By default, LDAK will first thin predictors, using an r-squared threshold of 0.99. This threshold should suffice, but can be changed using --window-prune <r-squared_threshold>. You can turn off thinning by adding --no-thin YES, or, if you have previously thinned, add --no-thin DONE (in which case, LDAK will expect to find a file called thin.in).
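For example, a possible command using a stricter pruning threshold (the value 0.95 is illustrative only, not a recommendation):

../ldak.out --cut-weights sections --bfile test --window-prune 0.95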

Having thinned the predictors, LDAK will cut them into sections. You can change aspects of the cutting with the options --section-length <num_of_predictors> and --buffer-kb <length_in_kb> or --buffer-length <num_of_predictors> (see Advanced Options). In most cases, the default settings will suffice, but please read the screen output to see whether changes are suggested.
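For instance, a sketch changing the section size and buffer (the values here are illustrative only):

../ldak.out --cut-weights sections --bfile test --section-length 5000 --buffer-kb 250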
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

--calc-weights <folder>

--bfile/--chiamo/--sp/--speed <prefix> – specifies the datafiles (see File Formats).
--section <number> – specifies which SECTION to consider.

By default, the simplex algorithm will run for a maximum of (approximately) 360 minutes or 200,000 iterations, at which point LDAK will give up and instead compute approximate weights (see below). These limits can be changed using --maxtime or --maxiter (see Advanced Options).
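For example, a possible command tightening the time limit and raising the iteration limit (the values are illustrative, and this assumes --maxtime is given in minutes, matching the 360-minute default above):

../ldak.out --calc-weights sections --bfile test --section 1 --maxtime 180 --maxiter 400000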

It is possible to calculate weightings using a crude approximation with the option --quick-weights YES. These crude weights will generally be better than using no weights, but should only be used as a last resort; for example, LDAK uses this approximation when the simplex algorithm fails to complete within the specified time / number of iterations.
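To request the crude approximation directly, for example:

../ldak.out --calc-weights sections --bfile test --section 1 --quick-weights YES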

When the outcome is binary and (subsets of) cases and controls have been genotyped separately, it is advisable to calculate correlations separately over cases and controls by adding --subset-number <integer> and --subset-prefix <prefix>. See Subset Options.
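A sketch of this, supposing cases and controls were genotyped in two batches (the prefix batch is illustrative; see Subset Options for how the corresponding subset files should be named and prepared):

../ldak.out --calc-weights sections --bfile test --section 1 --subset-number 2 --subset-prefix batch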

To model the decay of LD with distance, use --decay YES. Since LDAK3.0, this feature has been turned off by default, but its use is recommended when analysing (highly) related individuals. See LD Decay for more details.
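For example, when analysing highly related individuals:

../ldak.out --calc-weights sections --bfile test --section 1 --decay YES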
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

--join-weights <folder>

No other arguments used.

The combined weights will be stored in the file <folder>/weightsALL. See Tips for a description of this file. If weights are missing for some sections, LDAK will fail to complete, but will write a list of affected sections to <folder>/section_missings.txt. LDAK will list the sections for which approximate weights were computed in <folder>/section_approximations.txt. In my experience, if a section fails to complete when using a cluster, this is usually because the job was allocated to a “slow node”. So I would suggest trying a second time before resorting to using --quick-weights YES.
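One way to re-run only the affected sections is a short loop; this sketch assumes section_missings.txt lists one section number per line:

while read number; do
  ../ldak.out --calc-weights sections --bfile test --section $number
done < sections/section_missings.txt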
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

The first step can take hours; the screen output (or the file <folder>/cut.progress) will indicate how long is left. The second step is readily parallelized (see the example below). With a reasonable sample size (say, >5000), most sections will run in less than an hour. However, when running on a cluster, I recommend allocating 12 hours, as this should be sufficient time for all sections either to finish or for the time limit to apply. The third step should take no more than a couple of minutes.

For the Mac version, the option --workdir becomes mandatory. See Advanced Options.
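For example, assuming --workdir takes the working directory as its argument (the directory /tmp/ldak is illustrative only):

../ldak.out --cut-weights sections --bfile test --workdir /tmp/ldak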
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Example, using the binary PLINK files test.bed, test.bim and test.fam available in the Test Datasets:

../ldak.out --cut-weights sections --bfile test

The software obtains the number of SNPs from test.bim and, using the default section length (3000 predictors), determines that they should be divided into two sections of 2500 SNPs each. A buffer (by default 500 predictors) is added at all internal breakpoints, so in fact weights will be calculated across SNPs 1 to 3000, then 2001 to 5000.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

../ldak.out --calc-weights sections --bfile test --section 1
../ldak.out --calc-weights sections --bfile test --section 2

This step takes about 5 minutes. The top command calculates weights for the first section, the bottom command for the second. By default, the LD decay function is turned off. A possible script for parallelising this step is:

#!/bin/bash
# SGE array job: run one task per section (here sections 1 and 2)
#$ -t 1-2
# SGE sets SGE_TASK_ID to the task number, which we use as the section number
number=$SGE_TASK_ID
../ldak.out --calc-weights sections --bfile test --section $number
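To submit this under SGE, requesting the 12-hour allocation suggested above (the script name calc.sh is illustrative, and the exact resource syntax may vary between clusters):

qsub -l h_rt=12:00:00 calc.sh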
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

../ldak.out --join-weights sections

The final weightings will be stored in sections/weightsALL.
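To take a quick look at the combined weightings (the file format is described under Tips):

head sections/weightsALL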