1000 Genomes Project

The 1000 Genomes Project sequenced 2504 samples from 26 populations across the world. We regularly use these data to infer ancestry (see Quality Control for more details). Ancestry-specific subsets of these data can also be used as a Reference Panel for SumHer (although in general, we recommend that the reference panel contains at least 2000 samples).
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Here are some scripts for constructing a 1000 Genomes dataset; to run them you will need to install both PLINK1.9 and PLINK2. Note that these scripts restrict to variants with minor allele frequency above 0.01 (which suffices when inferring ancestry or calculating taggings).

Download raw files

wget https://www.dropbox.com/s/y6ytfoybz48dc0u/all_phase3.pgen.zst
wget https://www.dropbox.com/s/odlexvo8fummcvt/all_phase3.pvar.zst
wget https://www.dropbox.com/s/6ppo144ikdzery5/phase3_corrected.psam

Use PLINK2 to decompress the pgen and pvar files

./plink2 --zst-decompress all_phase3.pgen.zst > all_phase3.pgen
./plink2 --zst-decompress all_phase3.pvar.zst > all_phase3.pvar

Use PLINK2 to convert to binary PLINK format, restricting to autosomal SNPs with MAF>0.01 (and excluding duplicates and SNPs with name ".")

echo "." > exclude.snps
./plink2 --make-bed --out raw --pgen all_phase3.pgen --pvar all_phase3.pvar --psam phase3_corrected.psam --maf 0.01 --autosome --snps-only just-acgt --max-alleles 2 --rm-dup exclude-all --exclude exclude.snps
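Once the conversion has finished, it can be worth confirming that the output looks as expected. The following is a minimal sketch (the function name summarise_fileset is made up for illustration); it counts the lines of the fam and bim files and the number of variants still named ".". For the full panel, the sample count should be 2504 and the number of unnamed variants should be 0.

```shell
# Report the number of samples and variants in a binary PLINK fileset,
# plus the number of variants whose name (column 2 of the bim file) is ".".
# Hypothetical helper; the argument is the file prefix (e.g. raw).
summarise_fileset() {
  prefix="$1"
  samples=$(awk 'END {print NR}' "$prefix.fam")
  variants=$(awk 'END {print NR}' "$prefix.bim")
  unnamed=$(awk '$2 == "." {n++} END {print n+0}' "$prefix.bim")
  echo "samples=$samples variants=$variants unnamed=$unnamed"
}
```

For example, summarise_fileset raw reads raw.fam and raw.bim.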

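The genotype data are now stored in binary PLINK format in the files raw.bed, raw.bim and raw.fam. Next, insert population information and sex into the fam file: the commands below move the sample ID into column 1, write a population label (columns 5 and 6 of the psam file, joined by "_") into column 2, and copy sex (column 4 of the psam file) into column 5. Also replace predictor names with generic names of the form Chr:BP (not required, but I find this format more convenient), saving the original names in 1000g.names.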

awk '(NR==FNR){arr[$1]=$5"_"$6;ars[$1]=$4;next}{$1=$2;$2=arr[$1];$5=ars[$1];print $0}' phase3_corrected.psam raw.fam > clean.fam
awk < raw.bim '{$2=$1":"$4;print $0}' > clean.bim
awk < raw.bim '{print $1":"$4, $2}' > 1000g.names
cp raw.bed clean.bed
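Because the new names depend only on chromosome and position, two variants that had different original names but share a position would end up with the same Chr:BP name. Whether this happens in practice depends on the data, so the following is only a sketch of how one might check (the function name duplicate_names is made up for illustration).

```shell
# Print each name (column 2 of a bim file) that occurs more than once.
# Hypothetical helper; prints nothing if all names are unique.
duplicate_names() {
  awk '{count[$2]++} END {for (name in count) if (count[name] > 1) print name}' "$1" | sort
}
```

For example, duplicate_names clean.bim > duplicate.snps saves any repeated names to a file; if the file is non-empty, those predictors could be removed with --exclude before continuing.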

Download genetic distances, then insert these using PLINK1.9

wget https://www.dropbox.com/s/slchsd0uyd4hii8/genetic_map_b37.zip
unzip genetic_map_b37.zip
./plink1.9 --bfile clean --cm-map genetic_map_b37/genetic_map_chr@_combined_b37.txt --make-bed --out 1000g
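To confirm that genetic distances were inserted, one option is to count how many predictors have a non-zero value in column 3 of the bim file. This is a minimal sketch (the function name count_with_cm is made up for illustration); if the count is zero, the most likely cause is that the path given to --cm-map does not match the unzipped files.

```shell
# Count variants with a non-zero genetic distance (column 3 of a bim file).
# Hypothetical helper; the argument is the path to the bim file.
count_with_cm() {
  awk '$3 != 0 {n++} END {printf "%d of %d variants have a genetic distance\n", n, NR}' "$1"
}
```

For example, count_with_cm 1000g.bim.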

If these scripts have run successfully, your reference panel is saved in binary PLINK format in the files 1000g.bed, 1000g.bim and 1000g.fam (you can delete the files with prefixes raw and clean).
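To make an ancestry-specific subset, you can pick out samples using the labels in the second column of 1000g.fam. The following is a sketch, not a definitive recipe: the function name make_keep is made up for illustration, and it assumes that columns 5 and 6 of the psam file contain the super-population and population, so that the labels have the form SuperPop_Population (for example EUR_GBR).

```shell
# Write the two ID columns of every sample whose label (column 2 of the
# fam file) starts with the given super-population code followed by "_".
# Hypothetical helper; assumes labels of the form SuperPop_Population.
make_keep() {
  fam="$1"
  superpop="$2"
  awk -v sp="$superpop" 'index($2, sp "_") == 1 {print $1, $2}' "$fam"
}
```

For example, make_keep 1000g.fam EUR > eur.keep, followed by ./plink1.9 --bfile 1000g --keep eur.keep --make-bed --out 1000g.eur, would save the European samples with prefix 1000g.eur. Bear in mind that a subset of this kind will have far fewer than the 2000 samples recommended for a reference panel.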