This page contains some files that might be useful when running LDAK.
If you are looking for the data used in the examples on this website, see Test Datasets, while if you would like the annotations required to construct the BLD-LDAK Heritability Model, you should see BLD-LDAK Annotations.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
1000 Genomes Project Reference Panels:
I generated reference panels for non-Finnish European, American, South Asian, East Asian and African individuals (in that order) by following the scripts on Reference Panel. Each panel contains between 7.5M and 14.8M SNPs (all with MAF>0.01) and between 347 and 661 individuals. Note that these datasets are relatively small; ideally, your reference panel should contain at least 2000 individuals.
If you are using a UNIX operating system, you can download a file by typing wget followed by its web address (you can obtain the web address of the files below by right-clicking on the corresponding links). Having downloaded a set of files, you should extract it. On a UNIX operating system, you can extract a set of files by typing tar -xzvf followed by the file name.
404 non-Finnish European individuals
347 American individuals
489 South Asian individuals
504 East Asian individuals
661 African individuals
Each set of files contains genetic data in bed format (see File Formats), and details of the original SNP names (when constructing the reference panels, I converted the SNP names to the form Chr:BP, using positions from the GRCh37/hg19 assembly).
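As a rough illustration of the download and extraction steps described above, the sketch below wraps wget and tar -xzvf in two small shell functions. The address in PANEL_URL is a hypothetical placeholder, not a real link; replace it with the address copied from one of the links above. The function names are likewise only illustrative.

```shell
# Hypothetical placeholder address: replace with one copied from the links above.
PANEL_URL="https://example.com/path/to/panel.tgz"

# Download one file into the current directory.
download_panel() {
  wget "$1"
}

# Extract a downloaded archive (-x extract, -z gunzip, -v list files, -f archive name).
extract_panel() {
  tar -xzvf "$1"
}

# Usage, once PANEL_URL points to a real file:
#   download_panel "$PANEL_URL"
#   extract_panel "$(basename "$PANEL_URL")"
```

By default, wget saves the file under the last component of its address, which is why the usage lines pass basename of the address to the extraction step.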
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
Gene Annotation Files:
RefSeq Annotations (Assembly GRCh37)
RefSeq Annotations (Assembly GRCh38)
I obtained these gene annotations from the UCSC Genome Browser Table Browser. Having picked the species and genome assembly, I selected the group "Genes and Gene Predictions" and the track "NCBI RefSeq". I made sure "genome" was ticked (next to region), then chose the output format "selected fields from primary and related tables" and entered the output filename "refseq". Having clicked "get output", I ticked the fields "chrom", "strand", "txStart", "txEnd" and "name2" (you may prefer to instead tick "chrom", "strand", "cdsStart", "cdsEnd" and "name2"). Finally, I clicked "get output" again to download the file (its size was about 3 MB).
I then processed the downloaded file using the following UNIX commands (you can read more about awk here). Note that the {1..22} brace expansion requires a shell that supports it, such as bash.
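# Start from an empty output file (-f avoids an error if refseq.clean does not exist yet)
rm -f refseq.clean
for j in {1..22}; do
  # For chromosome j: compute gene lengths, keep the longest entry per gene name, then sort by start position
  awk -v j="$j" '($1=="chr"j){print $5, j, $3, $4, $4-$3, $2}' refseq | sort -r -n -k 5,5 | awk '(!seen[$1]++){print $1, $2, $3, $4, $6}' | sort -n -k 3,3 >> refseq.clean
done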
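The file refseq.clean is now in the format required by LDAK, with one gene per line (name, chromosome, start, end and strand). Note that these commands keep only genes on chromosomes 1 to 22, so they exclude genes on the remaining chromosomes, including a few genes with unusual chromosome names (e.g., gene "LOC389831" has chromosome "chr7_gl000195_random"). Further, for genes with the same name on a given chromosome, they keep only the longest. Finally, they ensure that genes are ordered by chromosome, then by start basepair.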
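To see what these commands do, the following sketch runs the same pipeline on a small made-up table. The file names toy.refseq and toy.clean and all rows are invented for illustration (only the names "LOC389831" and "chr7_gl000195_random" come from the text above, here with made-up coordinates). The loop uses seq instead of {1..22} so that it also works in shells without brace expansion.

```shell
# Build a toy input with the five selected fields (tab-separated).
# All rows are invented for illustration.
printf '%s\t%s\t%s\t%s\t%s\n' \
  '#chrom' strand txStart txEnd name2 \
  chr1 + 1000 5000 GENEA \
  chr1 + 1200 9000 GENEA \
  chr1 - 500 800 GENEB \
  chr2 + 300 700 GENEC \
  chr7_gl000195_random + 100 200 LOC389831 \
  chrX - 100 900 GENEX > toy.refseq

# Same pipeline as above, applied to the toy file.
rm -f toy.clean
for j in $(seq 1 22); do
  awk -v j="$j" '($1=="chr"j){print $5, j, $3, $4, $4-$3, $2}' toy.refseq \
    | sort -r -n -k 5,5 \
    | awk '(!seen[$1]++){print $1, $2, $3, $4, $6}' \
    | sort -n -k 3,3 >> toy.clean
done

cat toy.clean
```

On this input, toy.clean should contain three lines: "GENEB 1 500 800 -", "GENEA 1 1200 9000 +" and "GENEC 2 300 700 +". The header line and the genes on chrX and chr7_gl000195_random are dropped, and only the longer of the two GENEA entries is kept.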