These scripts explain how to construct a reference panel from the 1000 Genomes Project (if you do not have one already), how to format summary statistics for use with SumHer, then how to estimate SNP heritability and confounding.

#####################

Acquire reference panel. Suppose our reference panel is stored in binary PLINK format in the files \verb|ref.bed|, \verb|ref.bim| and \verb|ref.fam|. Ideally, Column 3 of \verb|ref.bim| contains genetic distances (otherwise, replace \verb|--window-cm| with \verb|--window-kb| in the scripts below). The individuals in the reference panel should be ancestrally similar to those used in the GWAS. As we analyzed summary statistics from European-centric GWAS, we used 8\,850 unrelated Caucasian individuals from the Health and Retirement Study,\cite{hrs} whose genotype data are available upon application from dbGaP (accession code phs000428.v2.p2). An alternative is to instead use the 404 non-Finnish Europeans from the 1000 Genomes Project,\cite{1000g} whose data can be obtained as follows.
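As a quick sanity check before choosing between \verb|--window-cm| and \verb|--window-kb|, you can count how many SNPs in \verb|ref.bim| have a non-zero genetic distance in Column 3. This sketch runs the check on a hypothetical two-SNP \verb|example.bim| (the file name and contents are illustrative; point the final command at your own \verb|ref.bim|):

```shell
#Hypothetical example: a .bim file where one of two SNPs has a genetic distance
printf "1 rs1 0 1000 A G\n1 rs2 1.2 2000 A G\n" > example.bim

#Count SNPs with a non-zero genetic distance in Column 3
#If this prints 0 for your ref.bim, the map distances are missing,
#so use --window-kb instead of --window-cm in the scripts below
awk '($3!=0){count++} END{print count+0}' example.bim
```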
#Download sample IDs and extract Europeans
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/integrated_call_samples_v3.20130502.ALL.panel
awk < integrated_call_samples_v3.20130502.ALL.panel '($3=="EUR" && $2!="FIN"){print $1, $1}' > eur.keep

#Download data for each autosome, and convert using PLINK, extracting European individuals and SNPs with MAF>0.01
for j in {1..22}; do
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/\
ALL.chr$j.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
./plink --vcf ALL.chr$j.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz \
--make-bed --out chr$j --maf 0.01 --keep eur.keep
done

#Now join these together, excluding multi-allelic SNPs and those with duplicate positions
rm list.txt; for j in {1..22}; do echo chr$j >> list.txt; done
./ldak5.linux --make-bed ref --mbfile list.txt --exclude-odd YES --exclude-dups YES

#####################

Format summary statistics. For the analyses below, we use summary statistics for Alzheimer's disease\cite{alz_lambert} (stored in the file \verb|IGAP_stage_1.txt|, which can be downloaded from \blue[http://web.pasteur-lille.fr/en/recherche/u744/igap/igap\_download.php]) and for years of education\cite{years_okbay} (stored in \verb|EduYears_Main.txt.gz|, available at \blue[http://ssgac.org]). To use with SumHer, the summary statistic files should contain columns labelled \verb|Predictor|, \verb|A1|, \verb|A2|, \verb|Direction|, \verb|Stat| and \verb|n|. The names and alleles should be consistent with those in the reference panel; \verb|Direction| indicates whether the effect of the A1 allele is positive or negative (for quantitative traits, this can be the effect size; for binary traits, the log odds); \verb|Stat| is the $\chi^2(1)$ test statistic (if only $p$-values are available, these can instead be provided in a column labelled \verb|P|); and \verb|n| is the number of individuals used when testing that SNP.
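If a study reports only $p$-values, the file can carry a \verb|P| column in place of \verb|Stat|. The sketch below shows that variant of the formatting step on a hypothetical two-SNP extract (\verb|demo_stats.txt| and its column layout are illustrative assumptions; check your own file's header to find the $p$-value column):

```shell
#Hypothetical two-SNP extract in an IGAP-like layout (Chr Pos Marker A1 A2 Beta SE P)
printf "Chr Pos Marker A1 A2 Beta SE P\n1 100 rs1 A G 0.1 0.05 0.04\n1 200 rs2 I D 0.2 0.05 0.5\n" > demo_stats.txt

#Produce a P-column version of the SumHer input (the indel rs2 is dropped by the allele filter)
awk < demo_stats.txt '(NR==1){print "Predictor A1 A2 Direction P n"}\
(NR>1 && ($4=="A"||$4=="C"||$4=="G"||$4=="T") && ($5=="A"||$5=="C"||$5=="G"||$5=="T"))\
{print $3, $4, $5, $6, $8, 17008+37154}' > demo.pvals.txt
```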
The predictor names must be unique, so you should check for (and remove) duplicates. Having processed the summary statistic files, we also want to identify SNPs within the major histocompatibility complex (Chromosome\,6:\ 25-34\,Mb), as well as SNPs which individually explain $>$1\% of phenotypic variation, and SNPs in LD with these. If info scores are provided, we recommend excluding SNPs with score $<$0.95.

#Tidy alzheimers summary statistics - the paper tells us there were 17008 cases and 37154 controls
#If possible, compute a chi-squared test statistic (otherwise, could use P)
#For linear regression, can use (beta/SD)^2, for logistic regression, (log.odds/SD)^2
awk < IGAP_stage_1.txt '(NR>1){snp=$3;a1=$4;a2=$5;dir=$6;stat=($6/$7)*($6/$7);n=17008+37154}(NR==1)\
{print "Predictor A1 A2 Direction Stat n"}(NR>1 && (a1=="A"||a1=="C"||a1=="G"||a1=="T") \
&& (a2=="A"||a2=="C"||a2=="G"||a2=="T")){print snp, a1, a2, dir, stat, n}' > alz.txt

#Check for duplicates using the unix functions sort and uniq
#If this command results in no output, it means all predictors are unique
awk < alz.txt '{print $1}' | sort | uniq -d | head
#If there are duplicates, we can remove them using
mv alz.txt alz2.txt; awk '!seen[$1]++' alz2.txt > alz.txt

#Tidy years education summary statistics - the paper tells us the total sample size was 328917
gunzip -c EduYears_Main.txt.gz | awk '(NR>1){snp=$1;a1=$4;a2=$5;dir=$7;stat=($7/$8)*($7/$8);n=328917}\
(NR==1){print "Predictor A1 A2 Direction Stat n"}(NR>1 && $10!="NA" && (a1=="A"||a1=="C"||a1=="G"||a1=="T") \
&& (a2=="A"||a2=="C"||a2=="G"||a2=="T")){print snp, a1, a2, dir, stat, n}' > years.txt
#Check for duplicates - here there are none
awk < years.txt '{print $1}' | sort | uniq -d | head

#Get list of MHC SNPs (from reference panel)
awk < ref.bim '($1==6 && $4>25000000 && $4<34000000){print $2}' > mhc.snps

#Identify large-effect SNPs (those explaining more than 1% of phenotypic variance)
#These commands use the fact that the variance explained by each SNP is stat/(stat+n),
#so stat/(stat+n)>0.01 is equivalent to stat>n/99
awk < alz.txt '(NR>1 && $5>$6/99){print $1}' > alz.big
awk < years.txt '(NR>1 && $5>$6/99){print $1}' > years.big
#Note there are no large-effect SNPs for years education

#Find SNPs tagging the alzheimers large-effect loci (within 1cM and correlation squared >.1)
./ldak5.linux --remove-tags alz --bfile ref --top-preds alz.big --window-cm 1 --min-cor .1

#Create exclusion files, containing mhc and (for alzheimers) SNPs tagging large-effect SNPs
cat mhc.snps alz.out > alz.excl
cat mhc.snps > years.excl

#####################

Estimate SNP heritability. To estimate $\htsnp$ there are two steps: first we compute a (1-part) tagfile, which contains $q_j + \sum_{l \in N_j} q_l r^2_{jl}$ for each SNP; then we perform the regression to estimate $\htsnp$. To compute an LDAK tagfile, we must first compute LDAK weights; this can take a few hours, but can be efficiently parallelized.

#Calculate LDAK SNP weights; here, we calculate weights separately for each chromosome then merge
#The on-screen instructions explain how to further parallelize (use --calc-weights not --calc-weights-all)
for j in {1..22}; do
./ldak5.linux --cut-weights alz_chr$j --bfile ref --extract alz.txt --chr $j
./ldak5.linux --calc-weights-all alz_chr$j --bfile ref --extract alz.txt --chr $j
done
#Merge weights across chromosomes
cat alz_chr{1..22}/weights.short > alz.weights

#Calculate the (1-part) LDAK tagfile
./ldak5.linux --calc-tagging alz_ldak --bfile ref --extract alz.txt --weights alz.weights --power -.25 --window-cm 1

#Perform the regression (remember to exclude the MHC / large-effect SNPs)
#Pay attention to the screen output (for this example it says I should add --check-sums NO)
./ldak5.linux --sum-hers alz_ldak --tagfile alz_ldak.tagging --summary alz.txt --exclude alz.excl
#Estimates of SNP heritability will be in alz_ldak.hers

#To instead use the GCTA Model, use --ignore-weights YES and --power -1 when computing the tagfile
#This takes longer (as it uses all SNPs), but we can calculate separately for each chromosome then merge
for j in {1..22}; do
./ldak5.linux --calc-tagging alz_gcta$j --bfile ref --extract alz.txt --ignore-weights YES \
--power -1 --window-cm 1 --chr $j
done
#Join the tagfiles across chromosomes
rm list.txt; for j in {1..22}; do echo "alz_gcta$j.tagging" >> list.txt; done
./ldak5.linux --join-tagging alz_gcta --taglist list.txt
#Then perform the regression (again will have to add --check-sums NO)
./ldak5.linux --sum-hers alz_gcta --tagfile alz_gcta.tagging --summary alz.txt --exclude alz.excl

#####################

Estimate confounding bias. We recommend estimating the scaling factor $C$, but SumHer can also estimate the LDSC intercept $1+A$. There is no need to compute a new tagfile; just add \verb|--genomic-control YES| or \verb|--intercept YES| when performing the regression.

#Estimate the scaling factor C
./ldak5.linux --sum-hers alz_ldak.gcon --tagfile alz_ldak.tagging --summary alz.txt --exclude alz.excl \
--genomic-control YES
#The scaling factor will be in alz_ldak.gcon.extra

#Estimate the intercept 1+A
./ldak5.linux --sum-hers alz_gcta.cept --tagfile alz_gcta.tagging --summary alz.txt --exclude alz.excl \
--intercept YES
#The intercept will be in alz_gcta.cept.extra
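To build intuition for the regression step: under the GCTA Model the tagging $q_j + \sum_{l \in N_j} q_l r^2_{jl}$ reduces to the LD score $\ell_j$, and the LDSC-style model gives $\mathrm{E}[S_j] = 1 + n h^2 \ell_j / M$ for $M$ SNPs. The toy sketch below (an illustrative assumption, not SumHer's actual weighted solver, and \verb|toy.txt| is a made-up file) simulates noiseless statistics under this model and recovers $h^2$ by least squares with the intercept fixed at 1, where the slope is $\sum_j (S_j-1)\ell_j / \sum_j \ell_j^2$ and $h^2 = \mathrm{slope} \times M / n$:

```shell
#Simulate noiseless statistics for M=5 SNPs with h2=0.1 and n=54162
#Columns of toy.txt: tagging l_j, test statistic S_j
awk 'BEGIN{for(l=1;l<=5;l++) print l, 1+54162*0.1*l/5}' > toy.txt

#Recover h2 from the (tagging, statistic) pairs; prints 0.1
awk -v n=54162 '{num+=($2-1)*$1; den+=$1*$1; M++} END{print num/den*M/n}' toy.txt
```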