Pseudo training and test summary statistics mimic the summary statistics we would obtain if we repeated a GWAS using subsets of samples. For example, suppose we have summary statistics from a GWAS of 100,000 samples. We can create pseudo training and test summary statistics that are similar to those we would obtain if we had analysed only the first 90,000 and last 10,000 samples, respectively.
We use pseudo summary statistics to determine suitable model parameters when running MegaPRS. Specifically, we use pseudo training summary statistics to construct prediction models corresponding to a variety of parameters, then use pseudo test summary statistics to measure the accuracy of these models. Please note that when running MegaPRS, it is no longer necessary to generate pseudo summary statistics explicitly, because LDAK now generates them internally when you use the command --mega-prs <outfile> to estimate effect sizes.
The instructions below require a (well-matched) reference panel. Ideally, this should contain at least 2000 samples (otherwise, Reference Panel provides scripts for constructing a smaller panel from 1000 Genomes Project data).
Always read the screen output, which suggests arguments and estimates memory usage.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
The main argument is --pseudo-summaries <outfile>.
The required options are
--bfile/--gen/--sp/--speed <datastem> - to specify the genetic data files (see File Formats).
--summary <sumsfile> - to specify the file containing the summary statistics.
--training-proportion <float> - to specify the fraction of samples assumed to be training (we recommend using --training-proportion 0.9).
You can use --keep <keepfile> and/or --remove <removefile> to restrict to a subset of samples, and --extract <extractfile> and/or --exclude <excludefile> to restrict to a subset of predictors (for more details, see Data Filtering).
By default, LDAK will ignore predictors with ambiguous alleles (those with alleles A & T or C & G) to protect against possible strand errors. If you are confident that these are correctly aligned, you can force LDAK to include them by adding --allow-ambiguous YES.
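To see how many predictors this default would affect, you can list the ambiguous ones in a .bim file before running LDAK. The sketch below only mimics the rule described above (allele pairs A & T or C & G, in either order); it is not LDAK's own code. It assumes the standard six-column PLINK .bim layout with uppercase allele codes in columns 5 and 6, and it builds a small made-up .bim file so that it runs on its own.

```shell
# Build a tiny hypothetical .bim file (columns: chr, id, cM, bp, A1, A2)
bim=$(mktemp)
printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
  1 rs1 0 1000 A G \
  1 rs2 0 2000 A T \
  1 rs3 0 3000 C G \
  1 rs4 0 4000 T C \
  1 rs5 0 5000 G C > "$bim"

# Print the IDs of predictors whose allele pair is A/T, T/A, C/G or G/C
ambiguous=$(awk '{ p = $5 $6 }
  p == "AT" || p == "TA" || p == "CG" || p == "GC" { print $2 }' "$bim")

echo "$ambiguous"
rm -f "$bim"
```

To apply the same check to real data, replace the temporary file with your own .bim file (for example human.bim from the Test Datasets).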
The pseudo training and test summary statistics will be saved in <outfile>.train.summaries and <outfile>.test.summaries.
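As a sketch of how the required options and the optional filters could be combined on one command line, the script below assembles and prints such a command. Only the option names come from the text above; the file names (ref_panel, gwas.summaries, unrelated.samples, qc.predictors) and the output stem pseudo are hypothetical placeholders. The command is printed rather than run, so the script works without LDAK installed.

```shell
# Hypothetical file names; replace with your own
datastem=ref_panel          # stem of the reference panel PLINK files
sumsfile=gwas.summaries     # GWAS summary statistics
keepfile=unrelated.samples  # samples to retain
extractfile=qc.predictors   # predictors to retain

# Required options
cmd="./ldak.out --pseudo-summaries pseudo"
cmd="$cmd --bfile $datastem --summary $sumsfile"
cmd="$cmd --training-proportion 0.9"

# Optional filters
cmd="$cmd --keep $keepfile --extract $extractfile"

# Print the command for inspection before running it
echo "$cmd"
```

With the output stem pseudo, the results would be saved in pseudo.train.summaries and pseudo.test.summaries.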
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
Example:
Here we use the binary PLINK files human.bed, human.bim and human.fam, and the phenotype quant.pheno from the Test Datasets. Although we have individual-level data (genotypes and phenotypes for the same samples), for this example we pretend we are using summary statistics. Therefore, we will first create summary statistics by running
./ldak.out --linear quant --bfile human --pheno quant.pheno
The summary statistics are saved in quant.summaries (already in the format required by LDAK). For more details on this command, see Single-Predictor Analysis. Then we will use the genetic data files as the reference panel.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
We can create pseudo training and test summary statistics by running
./ldak.out --pseudo-summaries quant --bfile human --summary quant.summaries --training-proportion 0.9 --allow-ambiguous YES
Here, we added --allow-ambiguous YES, so that LDAK does not exclude SNPs with alleles A & T or C & G (because the summary statistics were obtained from the genetic data we are using as a reference panel, we know the strands must be consistent). The pseudo training and test summary statistics are saved in quant.train.summaries and quant.test.summaries.
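After the command finishes, a quick check that both output files exist and are non-empty can catch a failed run early. The helper below is a hypothetical convenience function, not part of LDAK; it assumes only the two file names given above and makes no assumption about their contents.

```shell
# Report whether each expected output file exists and is non-empty
check_outputs() {
  stem=$1
  status=0
  for suffix in train.summaries test.summaries; do
    f="$stem.$suffix"
    if [ -s "$f" ]; then
      echo "found $f ($(wc -l < "$f" | tr -d ' ') lines)"
    else
      echo "missing or empty: $f"
      status=1
    fi
  done
  return $status
}

# Check the files written by the example above
check_outputs quant || echo "re-run --pseudo-summaries and read the screen output"
```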