1000 Genomes Project

The 1000 Genomes Project sequenced 2504 samples from 26 populations across the world. We regularly use these data to infer ancestry (see Quality Control for more details). Ancestry-specific subsets of these data can also be used as a Reference Panel for SumHer (although in general, we recommend that the reference panel contains at least 2000 samples).
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Here are some scripts for constructing a 1000 Genomes dataset; to run them you will need to install PLINK1.9 (the scripts assume the PLINK and LDAK executables are in the current directory, as ./plink and ./ldak.out). Note that inferring ancestry requires only common SNPs, so we restrict to variants with minor allele frequency above 0.01.

Download sample IDs

wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/integrated_call_samples_v3.20130502.ALL.panel
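
Before continuing, it can be worth confirming that the panel file downloaded completely. The following is a minimal sketch, assuming the file has one header line followed by one row per sample with the columns sample, pop, super_pop and gender; the function name count_by_superpop is our own and is not part of PLINK or LDAK.

```shell
# Count samples per super-population in a 1000 Genomes panel file.
# Assumes one header line, then columns: sample, pop, super_pop, gender.
count_by_superpop() {
    awk 'NR>1 && NF>=4 {n[$3]++} END {for (p in n) print p, n[p]}' "$1" | sort
}

panel=integrated_call_samples_v3.20130502.ALL.panel
if [ -f "$panel" ]; then
    # One line per super-population, followed by the total number of samples
    count_by_superpop "$panel"
    awk 'NR>1 && NF>=4' "$panel" | wc -l
fi
```

The total printed on the last line should match the 2504 samples mentioned above.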

Download data for each autosome, and convert using PLINK, extracting SNPs with MAF>0.01

for j in {1..22}; do
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr$j.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
./plink --make-bed --out chr$j --maf 0.01 --vcf ALL.chr$j.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
done
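
The downloads are large, so an interrupted transfer can leave a chromosome without complete output. The sketch below lists any missing or empty files before the join step; it assumes the outputs are named chr1 to chr22 as in the loop above, and the function name missing_chr_files is hypothetical.

```shell
# Report any autosome whose PLINK output is missing or empty.
# Assumes files named chr1.bed/.bim/.fam ... chr22.bed/.bim/.fam
# in the directory given as the first argument (default: current directory).
missing_chr_files() {
    local dir=${1:-.} j ext
    for j in $(seq 1 22); do
        for ext in bed bim fam; do
            [ -s "$dir/chr$j.$ext" ] || echo "chr$j.$ext"
        done
    done
}

# No output means that all 66 files are present and non-empty
missing_chr_files .
```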

Join these together (LDAK will automatically exclude multi-allelic SNPs)

rm -f list.txt; for j in {1..22}; do echo chr$j >> list.txt; done
./ldak.out --make-bed all --mbfile list.txt --exclude-same YES --exclude-dups YES

The genotype data will now be stored in binary PLINK format in the files all.bed, all.bim and all.fam.

Insert population information and sex into the fam file. Replace predictor names with generic names of the form Chr:BP (not required, but I find this format more convenient). It is useful to save the original names.

awk '(NR==FNR){arr[$1]=$2"_"$3;ars[$1]=1+($4=="female");next}{$2=arr[$1];$5=ars[$1];print $0}' integrated_call_samples_v3.20130502.ALL.panel all.fam > clean.fam
awk < all.bim '{$2=$1":"$4;print $0}' > clean.bim
awk < all.bim '{print $1":"$4, $2}' > 1000g.names
cp all.bed clean.bed
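
To see what the first awk command does, here is a toy run on two illustrative rows laid out like the panel file and a PLINK fam file. The file names toy.panel, toy.fam and toy.clean.fam are made up for this example, and the awk program is the same one used above.

```shell
# Toy demonstration of the fam-file rewrite, using two illustrative rows
# in the same layout as the panel file and a PLINK fam file.
demo=$(mktemp -d)
printf 'sample\tpop\tsuper_pop\tgender\nHG00096\tGBR\tEUR\tmale\nHG00097\tGBR\tEUR\tfemale\n' > "$demo/toy.panel"
printf 'HG00096 HG00096 0 0 0 -9\nHG00097 HG00097 0 0 0 -9\n' > "$demo/toy.fam"

# First file: store pop_superpop and a sex code (1 = not female, 2 = female) per sample.
# Second file: overwrite column 2 with the label and column 5 with the sex code.
awk '(NR==FNR){arr[$1]=$2"_"$3;ars[$1]=1+($4=="female");next}{$2=arr[$1];$5=ars[$1];print $0}' "$demo/toy.panel" "$demo/toy.fam" > "$demo/toy.clean.fam"
cat "$demo/toy.clean.fam"
# prints:
# HG00096 GBR_EUR 0 0 1 -9
# HG00097 GBR_EUR 0 0 2 -9
```

Note that the second column now holds the population label rather than a unique ID, so samples are distinguished by the first column, and that any gender value other than female is coded as 1.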

Download and insert genetic distances

wget https://www.dropbox.com/s/slchsd0uyd4hii8/genetic_map_b37.zip
unzip genetic_map_b37.zip

for j in {1..22}; do
./plink --bfile clean --chr $j --cm-map genetic_map_b37/genetic_map_chr@_combined_b37.txt --make-bed --out map$j
done

cat map{1..22}.bim | awk '{print $2, $3}' > map.all
awk '(NR==FNR){arr[$1]=$2;next}{print $1, $2, arr[$2], $4, $5, $6}' map.all clean.bim > 1000g.bim
cp clean.bed 1000g.bed
cp clean.fam 1000g.fam
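
A quick sanity check on the merged bim file is sketched below. If a SNP name in clean.bim had no match in map.all, the awk command above would print an empty third column, leaving that row with five fields instead of six. The function name count_bad_bim_rows is our own.

```shell
# Count rows of a bim file that do not have exactly six fields.
# A SNP without a genetic distance would have an empty third column,
# which awk's default whitespace splitting sees as a five-field row.
count_bad_bim_rows() {
    awk 'NF!=6 {bad++} END {print bad+0}' "$1"
}

if [ -f 1000g.bim ]; then
    # Expect 0 if every SNP received a genetic distance
    count_bad_bim_rows 1000g.bim
fi
```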

If these scripts have run successfully, your reference panel is saved in binary PLINK format in the files 1000g.bed, 1000g.bim and 1000g.fam (you can delete the files with prefixes ALL, chr, all, clean and map).
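
As mentioned at the top of the page, ancestry-specific subsets can be used as a reference panel. One possible way to extract such a subset is sketched below. It relies on the pop_superpop labels inserted into the second column of the fam file, and uses the PLINK option --keep, which takes a file of family and individual IDs. The function name make_keep_file and the output names eur.keep and 1000g.eur are our own choices; EUR is one of the five super-population codes used in the panel file.

```shell
# Write columns 1 and 2 of a fam file for every sample whose second column
# ends in _<SUPERPOP>, where <SUPERPOP> is the second argument.
# Relies on the pop_superpop labels inserted into the fam file earlier.
make_keep_file() {
    awk -v sp="$2" '{n=split($2,a,"_")} a[n]==sp {print $1, $2}' "$1"
}

if [ -f 1000g.fam ] && [ -x ./plink ]; then
    make_keep_file 1000g.fam EUR > eur.keep
    ./plink --bfile 1000g --keep eur.keep --make-bed --out 1000g.eur
fi
```

Bear in mind that a single super-population will contain fewer than the 2000 samples recommended above.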