Pseudo Summaries

To run MegaPRS, it is usually necessary to have two independent sets of summary statistics, one computed from training samples, and one from test samples. Ideally you will be able to generate these directly (e.g., if you have results from L cohorts in a meta-analysis, you can construct training summary statistics by combining results from L-1 cohorts, and use results from the final cohort as the test summary statistics). However, if you only have a single set of summary statistics, computed using all samples, this page explains how you can generate pseudo training and test summary statistics.

Please note that pseudo summary statistics do not work well when created from summary statistics that were subjected to genomic control. This is most problematic for results from meta-analyses that performed genomic control at the cohort level (i.e., separately for each cohort). If you are using summary statistics that have undergone genomic control (and you are unable to reverse this), then you can instead use MegaPRS-Lite, which requires only one set of summary statistics.

This step requires a Reference Panel. Note that the samples you use for generating pseudo summary statistics must be different from those used by MegaPRS (either to calculate predictor-predictor correlations or to test different prior distributions). Therefore, we suggest you divide the samples in your reference panel into three groups, then use one third to generate pseudo summary statistics, one third to calculate predictor-predictor correlations, and one third to test prior distributions.

Always read the screen output, which suggests arguments and estimates memory usage.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

The main argument is --pseudo-summaries <outfile>.

The required options are

--bfile/--gen/--sp/--speed <datastem> - to specify the genetic data files (see File Formats).

--summary <sumsfile> - to specify the file containing the summary statistics.

--training-proportion <float> - to specify the fraction of samples assumed to be training (we recommend using --training-proportion 0.9).

By default, LDAK will ignore predictors with ambiguous alleles (those with alleles A & T or C & G) to protect against possible strand errors. If you are confident that these are correctly aligned, you can force LDAK to include them by adding --allow-ambiguous YES.
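To get a rough idea of how many predictors are affected by this default, you can count the ambiguous allele pairs in a PLINK .bim file yourself. The following is only a sketch: count_ambiguous is a hypothetical helper (not an LDAK command), it assumes the standard six-column .bim layout with alleles in columns 5 and 6, and LDAK's own check may differ in detail.

```shell
# count_ambiguous is a hypothetical helper name, not part of LDAK.
# It counts rows of a PLINK .bim file whose two alleles (columns 5 and 6)
# form a strand-ambiguous pair: A/T, T/A, C/G or G/C.
count_ambiguous() {
    awk '{
        a = toupper($5); b = toupper($6)
        if ((a == "A" && b == "T") || (a == "T" && b == "A") ||
            (a == "C" && b == "G") || (a == "G" && b == "C")) n++
    } END { print n + 0 }' "$1"
}
```

For example, count_ambiguous human.bim prints the number of A & T or C & G predictors in human.bim.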

The pseudo training and test summary statistics will be saved in <outfile>.train.summaries and <outfile>.test.summaries.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _


Here we use the binary PLINK files human.bed, human.bim and human.fam, and the phenotype quant.pheno from the Test Datasets. Although we have individual-level data (genotypes and phenotypes for the same samples), for this example we pretend we are using summary statistics. Therefore, we will first create summary statistics by running

./ldak.out --linear quant --bfile human --pheno quant.pheno

The summary statistics are saved in quant.summaries (already in the format required by LDAK). For more details on this command, see Single-Predictor Analysis. Then we will use the genetic data files as the reference panel.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

As we only have one reference panel, we begin by dividing its samples into three groups

awk < human.fam '(NR%3==1){print $0 > "keepa"}(NR%3==2){print $0 > "keepb"}(NR%3==0){print $0 > "keepc"}'
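If you want to reuse this split for other reference panels, and confirm that no sample is lost or duplicated, the same logic can be wrapped in a small function. This is a sketch only: split_three is a hypothetical helper name (not part of LDAK), and it assumes each line of the .fam file describes a distinct sample.

```shell
# split_three is a hypothetical helper, not part of LDAK.
# Usage: split_three <famfile> <prefix>
# Writes lines 1,4,7,... to <prefix>a, lines 2,5,8,... to <prefix>b and
# lines 3,6,9,... to <prefix>c, then checks the three files together
# contain every line of <famfile> exactly once.
split_three() {
    fam=$1
    prefix=$2
    # Start from empty files so that very small inputs still give three files
    : > "${prefix}a"; : > "${prefix}b"; : > "${prefix}c"
    awk -v p="$prefix" '{
        r = NR % 3
        s = (r == 1) ? "a" : ((r == 2) ? "b" : "c")
        out = p s
        print $0 >> out
    }' "$fam"
    orig=$(awk 'END { print NR }' "$fam")
    total=$(cat "${prefix}a" "${prefix}b" "${prefix}c" | awk 'END { print NR }')
    dups=$(cat "${prefix}a" "${prefix}b" "${prefix}c" | sort | uniq -d | awk 'END { print NR }')
    echo "samples in $fam: $orig; samples across the three files: $total; duplicated lines: $dups"
    [ "$total" -eq "$orig" ] && [ "$dups" -eq 0 ]
}
```

Running split_three human.fam keep creates the same keepa, keepb and keepc files as the awk command above, and returns a non-zero exit status if the check fails.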

We will use keepa samples to create pseudo summary statistics, then use keepb and keepc with MegaPRS (to calculate predictor-predictor correlations and to test prior distributions, respectively). We can create pseudo test and training summary statistics by running

./ldak.out --pseudo-summaries quant --bfile human --summary quant.summaries --training-proportion .9 --keep keepa --allow-ambiguous YES

Here, we added --allow-ambiguous YES, so that LDAK does not exclude SNPs with alleles A & T or C & G (because the summary statistics were obtained from the genetic data we are using as a reference panel, we know the strands must be consistent). The pseudo training and test summary statistics are saved in quant.train.summaries and quant.test.summaries.
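Before passing these files to MegaPRS, it can be worth confirming that both were written. The sketch below only checks that the two output files named above exist and are non-empty, and reports their line counts; check_pseudo_outputs is a hypothetical helper name, and it makes no assumptions about the contents of the files.

```shell
# check_pseudo_outputs is a hypothetical helper, not part of LDAK.
# Usage: check_pseudo_outputs <outfile>
# Confirms <outfile>.train.summaries and <outfile>.test.summaries exist and
# are non-empty, printing the number of lines in each.
check_pseudo_outputs() {
    stem=$1
    for f in "$stem.train.summaries" "$stem.test.summaries"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            return 1
        fi
        echo "$f: $(awk 'END { print NR }' "$f") lines"
    done
}
```

For the example above, this would be run as check_pseudo_outputs quant.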