Pseudo Summaries

To run MegaPRS, it is necessary to have two independent sets of summary statistics, one computed from training samples, and one from test samples. On rare occasions, you will have these (e.g., if you have results from L cohorts in a meta-analysis, you could construct training summary statistics by combining results from L-1 cohorts, and use results from the final cohort as the test summary statistics). However, most likely, you will only have a single set of summary statistics, computed using all samples. Here we explain how you can generate pseudo training and test summary statistics.

This step requires a Reference Panel. Note that the samples you use for generating pseudo summary statistics must be different from those used by MegaPRS (either to calculate predictor-predictor correlations or to test different prior distributions). Therefore, we suggest you divide the samples in your reference panel into three, then use one third to generate pseudo summary statistics, one third to calculate predictor-predictor correlations, and one third to test prior distributions.

Always read the screen output, which suggests arguments and estimates memory usage.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

The main argument is --pseudo-summaries <outfile>.

The required options are

--bfile/--gen/--sp/--speed <datastem> - to specify the genetic data files (see File Formats).

--summary <sumsfile> - to specify the file containing the summary statistics.

--training-proportion <float> - to specify the fraction of samples assumed to be training (we recommend using --training-proportion 0.9).

By default, LDAK will ignore predictors with ambiguous alleles (those with alleles A & T or C & G) to protect against possible strand errors. If you are confident that these are correctly aligned, you can force LDAK to include them by adding --allow-ambiguous YES.

The pseudo training and test summary statistics will be saved in <outfile>.train.summaries and <outfile>.test.summaries.
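
To make the role of --training-proportion concrete, the sketch below works out the nominal split of samples that a given proportion implies. The sample size n=100000 is a hypothetical value chosen purely for illustration, and the calculation reflects only the description above (the fraction of samples assumed to be training). It is not a statement about how LDAK performs its internal computations.

```shell
# Hypothetical values for illustration only
n=100000     # assumed number of samples behind the original summary statistics
prop=0.9     # value passed to --training-proportion

# Nominal number of training samples implied by the proportion (rounded to nearest integer)
ntrain=$(awk -v n="$n" -v p="$prop" 'BEGIN{printf "%d", n*p + 0.5}')

# The remaining samples are nominally the test samples
ntest=$((n - ntrain))

echo "training: $ntrain, test: $ntest"
```

With the recommended value of 0.9, most samples are nominally assigned to training and only one tenth to testing.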
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _


Here we use the binary PLINK files human.bed, human.bim and human.fam, and the phenotype quant.pheno from the Test Datasets. Although we have individual-level data (genotypes and phenotypes for the same samples), for this example we pretend we are using summary statistics. Therefore, we will first create summary statistics by running

./ldak.out --linear quant --bfile human --pheno quant.pheno

The summary statistics are saved in quant.summaries (already in the format required by LDAK). For more details on this command, see Single-Predictor Analysis. Then we will use the genetic data files as the reference panel.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

As we only have one reference panel, we begin by dividing its samples into three

awk < human.fam '(NR%3==1){print $0 > "keepa"}(NR%3==2){print $0 > "keepb"}(NR%3==0){print $0 > "keepc"}'
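
Because the three subsets must not share samples, it can be worth checking the split before continuing. The following is a minimal sketch of such a check, run on a made-up ten-sample file so that it works without the Test Datasets. The file toy.fam and the outputs toy.keepa, toy.keepb and toy.keepc are hypothetical names used only here; to check the real split, apply the same counting commands to human.fam, keepa, keepb and keepc.

```shell
# Build a hypothetical 10-sample .fam file (columns: FID IID PID MID SEX PHENO)
for i in 1 2 3 4 5 6 7 8 9 10; do
  echo "fam$i ind$i 0 0 1 -9"
done > toy.fam

# Same modulo-3 split as above, writing to toy.keepa, toy.keepb and toy.keepc
awk '(NR%3==1){print > "toy.keepa"}(NR%3==2){print > "toy.keepb"}(NR%3==0){print > "toy.keepc"}' toy.fam

# Count the samples in each subset (awk avoids the padding some wc versions add)
na=$(awk 'END{print NR}' toy.keepa)
nb=$(awk 'END{print NR}' toy.keepb)
nc=$(awk 'END{print NR}' toy.keepc)
total=$((na + nb + nc))

# Count lines that appear in more than one subset (should be zero)
dups=$(sort toy.keepa toy.keepb toy.keepc | uniq -d | awk 'END{print NR}')

echo "subset sizes: $na $nb $nc; total: $total; shared samples: $dups"
```

The total should equal the number of lines in the original .fam file, and the number of shared samples should be zero. Note that the modulo-3 rule assigns samples by their order in the file, so the subsets differ in size by at most one.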

We will use keepa samples to create pseudo summary statistics, then use keepb and keepc with MegaPRS (to calculate predictor-predictor correlations and to test prior distributions, respectively). We can create pseudo test and training summary statistics by running

./ldak.out --pseudo-summaries quant --bfile human --summary quant.summaries --training-proportion .9 --keep keepa --allow-ambiguous YES

Here, we added --allow-ambiguous YES, so that LDAK does not exclude SNPs with alleles A & T or C & G (because the summary statistics were obtained from the genetic data we are using as a reference panel, we know the strands must be consistent). The pseudo training and test summary statistics are saved in quant.train.summaries and quant.test.summaries.