---------------------------- IMPUTATION ----------------------------

Suppose the genotypes for a cohort are stored in PLINK binary format: data.bed, data.bim and data.fam. First, we performed quality control of samples (based on heterozygosity and missingness) and of SNPs (MAF, call-rate and HWE). To identify ancestry, we compared samples with HapMap data, stored in hapmap.bed, hapmap.bim and hapmap.fam.

------------------------------------------------------------------------------
#Sample quality control - we suggest using pruned, high-quality, autosomal SNPs
plink --bfile data --indep-pairwise 50 10 .2 --maf 0.01 --geno 0.05 --autosome --out autoQC
plink --bfile data --extract autoQC.prune.in --missing --out stats
plink --bfile data --extract autoQC.prune.in --het --out stats
#Based on stats.het and stats.imiss, identify outlying samples
#Save those to retain in the file keepqc.ind

#Check / determine sex (for human samples, this requires Chromosome 23 data)
plink --bfile data --indep-pairwise 50 10 .2 --maf 0.01 --geno 0.05 --chr 23 --out sexQC
plink --bfile data --extract sexQC.prune.in --check-sex --out sex

#Get (non-ambiguous) SNPs in common with HapMap, then compute population axes and projections
#These scripts use awk, installed by default on unix - see www.dougspeed.com/awk for a brief guide
awk '(($5!="A"||$6!="T")&&($5!="T"||$6!="A")&&($5!="C"||$6!="G")&&($5!="G"||$6!="C")){print $4}' \
hapmap.bim > nonamb.snps
awk '(NR==FNR){arr[$1];next}($4 in arr && $1<23){print $2}' nonamb.snps data.bim > hapmap.snps
plink --bfile hapmap --indep-pairwise 50 10 .2 --maf 0.05 --extract hapmap.snps --out hapmap
ldak --calc-kins-direct hapmap --bfile hapmap --extract hapmap.prune.in --ignore-weights YES --power -0.25
ldak --pca hapmap --grm hapmap --axes 5
ldak --calc-pca-loads hapmap --bfile hapmap --grm hapmap --pcastem hapmap
ldak --calc-scores hapmap --bfile data --scorefile hapmap.load --allow-flips YES --keep keepqc.ind --power 0
#Can compare Columns 5 & 7 of hapmap.profile with Columns 3 & 4 of hapmap.vect
#Save the samples to retain in the file keepqcpop.ind

#SNP quality control - will use only the samples in keepqcpop.ind
plink --bfile data --keep keepqcpop.ind --freq --out statsb
plink --bfile data --keep keepqcpop.ind --missing --out statsb
plink --bfile data --keep keepqcpop.ind --hardy --out statsb
#Based on statsb.frq, statsb.lmiss and statsb.hwe, identify poor-quality SNPs
#Commonly used thresholds are MAF>0.01, CR>0.95 & HWE P>1e-6
#Save the SNPs to retain in the file keep.snps

#Remake the data, retaining only good-quality samples and SNPs, and updating sex
plink --bfile data --keep keepqcpop.ind --extract keep.snps --update-sex sex.sexcheck 2 \
--make-bed --out clean
------------------------------------------------------------------------------

Next, it is necessary to ensure SNP annotations are correct and consistent with those in the reference panel. We suggest the following checks.

(i) Update SNP names based on the file RsMergeArch, and exclude those reported as retired in the file SNPHistory (see the LiftOver website for an explanation of these two files).

(ii) If necessary, convert genomic positions to the correct build (our reference panel used Build 37 / hg19). For this we obtained locations from Genome Browser. If a SNP name was not present in Genome Browser, we would initially exclude it, but would re-introduce it if it was present in the reference panel. We would also exclude SNPs whose alleles did not match those recorded in Genome Browser (allowing for strand flips).

(iii) Check that positions and alleles are consistent with the reference panel. In our case, the 1000 Genomes annotations agreed very well with those in Genome Browser, so there were typically at most a handful of SNPs with discordant positions or incompatible alleles.

(iv) Decide whether to retain ambiguous SNPs (those with alleles A & T or C & G).
For a non-ambiguous SNP, if the alleles in the cohort match those in 1000 Genomes, then the strands must also match; for ambiguous SNPs, this is not necessarily the case. Illumina genotyping arrays typically contain very few (<5%) ambiguous SNPs, so we decided to exclude these to be certain of consistency. Affymetrix arrays typically have a higher proportion (10-20%), so we preferred to retain them. In that case, we aligned SNPs to the forward strand (the alignment of 1000 Genomes), then checked that the allele frequencies were highly concordant and that there were no obvious inversions.

We performed the above checks manually in R. However, an alternative is to use the software provided on the LiftOver website. Note that these checks are required even if subsequently using an imputation server.

Next we phased the data using SHAPEIT, then imputed using IMPUTE2. For this we divided the genome into 549 regions (518 autosomal) of approximate size 5Mb (regions are provided in the file allreg.txt). Note that we include instructions for imputing Chromosome X; however, for our analyses we considered only autosomal SNPs. Suppose the genotype data, with annotations updated, are stored in updated.bed, updated.bim and updated.fam. We provide approximate times and memory requirements assuming the cohort contains 5000 individuals.
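Before running imputation, it can help to sanity-check the region file. The awk one-liner below is a minimal sketch: it counts the regions and reports the mean region size, which for allreg.txt should be 549 regions of roughly 5Mb. The two-region demoreg.txt created here is a toy stand-in for illustration only.

```shell
#Sanity-check the region file (columns: chromosome, start bp, end bp)
#demoreg.txt is a toy two-region file; in practice run the awk command on allreg.txt
printf '1 1 5000000\n1 5000001 10000000\n' > demoreg.txt
awk '{n++; tot+=$3-$2+1}END{printf "%d regions, mean size %.1f Mb\n", n, tot/n/1e6}' demoreg.txt
```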
------------------------------------------------------------------------------
#Divide data by chromosome
for i in {1..23}; do plink --bfile updated --chr $i --make-bed --out split$i; done

#Phase each chromosome using SHAPEIT
#genetic_map_chr${i}_combined_b37.txt contains the genetic map for Chromosome i
#The option '--thread' controls the number of cores
#'--effective-size 11418' is recommended for Europeans (for other populations, see the SHAPEIT website)
for i in {1..23}; do
shapeit -B split$i -M genetic_map_chr${i}_combined_b37.txt --thread 8 \
--effective-size 11418 -O phase$i
done
#Phased data for Chromosome i will be stored in phase$i.haps and phase$i.sample
#Using 8 threads, the longest chromosomes take 10-20 hours and require <1Gb per thread

#Impute in regions of approximately 5Mb using IMPUTE2
#allreg.txt contains the chromosome, start and end bp for 549 regions
#Regions 1-518 are autosomal, Regions 520-548 are Chromosome 23, and Regions 519 & 549 are pseudo-autosomal
#1000GP_Phase3_chr$i.hap.gz contains phased genotypes for the 1000 Genomes individuals
#1000GP_Phase3_chr$i.legend.gz contains SNP annotations for these
#SNPs can be excluded based on MAF using the option '-filt_rules_l'; excluding SNPs showing
#no variation across European 1000 Genomes individuals substantially reduces memory requirements
#See the IMPUTE2 website for an explanation of the other options

#For the 518 autosomal regions
for j in {1..518}; do
chr=`awk -v j=$j '(NR==j){print $1}' allreg.txt`
start=`awk -v j=$j '(NR==j){print $2}' allreg.txt`
end=`awk -v j=$j '(NR==j){print $3}' allreg.txt`
impute2 -m genetic_map_chr${chr}_combined_b37.txt \
-h 1000GP_Phase3_chr${chr}.hap.gz -l 1000GP_Phase3_chr${chr}.legend.gz \
-use_prephased_g -known_haps_g phase${chr}.haps -filt_rules_l 'EUR==0' \
-int $start $end -Ne 11418 -allow_large_reg -o_gz -o chunk$j
done

#For the Chromosome 23 regions, it is necessary to add -chrX (and change the genetic map)
for j in {520..548}; do
start=`awk -v j=$j '(NR==j){print $2}' allreg.txt`
end=`awk -v j=$j '(NR==j){print $3}' allreg.txt`
impute2 -m genetic_map_chrX_nonPAR_combined_b37.txt \
-h 1000GP_Phase3_chrX_NONPAR.hap.gz -l 1000GP_Phase3_chrX_NONPAR.legend.gz \
-use_prephased_g -known_haps_g phase23.haps -filt_rules_l 'EUR==0' \
-int $start $end -Ne 11418 -allow_large_reg -o_gz -o chunk$j -chrX
done

#For the pseudo-autosomal regions, it is necessary to add -chrX and -Xpar
#Note that imputation in these regions tends to be difficult, as they contain very few SNPs
j=519
start=`awk -v j=$j '(NR==j){print $2}' allreg.txt`
end=`awk -v j=$j '(NR==j){print $3}' allreg.txt`
impute2 -m genetic_map_chrX_PAR1_combined_b37.txt \
-h 1000GP_Phase3_chrX_PAR1.hap.gz -l 1000GP_Phase3_chrX_PAR1.legend.gz \
-use_prephased_g -known_haps_g phase23.haps -filt_rules_l 'EUR==0' \
-int $start $end -Ne 11418 -allow_large_reg -o_gz -o chunk$j -chrX -Xpar

j=549
start=`awk -v j=$j '(NR==j){print $2}' allreg.txt`
end=`awk -v j=$j '(NR==j){print $3}' allreg.txt`
impute2 -m genetic_map_chrX_PAR2_combined_b37.txt \
-h 1000GP_Phase3_chrX_PAR2.hap.gz -l 1000GP_Phase3_chrX_PAR2.legend.gz \
-use_prephased_g -known_haps_g phase23.haps -filt_rules_l 'EUR==0' \
-int $start $end -Ne 11418 -allow_large_reg -o_gz -o chunk$j -chrX -Xpar

#Imputed data for Region j will be stored in chunk$j.gz and chunk${j}_info
#Regions typically complete in 2-6 hours and require approximately 10Gb memory
------------------------------------------------------------------------------
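Once all regions have completed, the per-region outputs can be combined into per-chromosome files. The sketch below assumes the chunk$j.gz files are gzipped GEN-format outputs (GEN files have no header line, so concatenating regions in order is valid) and that Column 1 of allreg.txt gives each region's chromosome; the toy demoreg.txt and chunk files created here stand in for the real ones, and in practice the outer loop would run over chromosomes 1-22.

```shell
#Combine per-region imputed files into per-chromosome files
#Toy region and chunk files are created here for illustration; in practice use
#allreg.txt and the real chunk$j.gz files, with chr looping over {1..22}
printf '1 1 5000000\n1 5000001 10000000\n2 1 5000000\n' > demoreg.txt
echo "snpA" | gzip > chunk1.gz
echo "snpB" | gzip > chunk2.gz
echo "snpC" | gzip > chunk3.gz
for chr in 1 2; do
for j in `awk -v c=$chr '($1==c){print NR}' demoreg.txt`; do zcat chunk$j.gz; done | gzip > imputed_chr$chr.gen.gz
done
```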