In our recent publication (link to appear here), we used TetraHer to identify 118 ICD10 diseases with significant heritability (P<0.05/229). Click here to see details of these diseases. This page explains how to obtain phenotypes for these 118 ICD10 diseases, using data from the UK Biobank.
The following scripts use the file icd10.sig (which can be downloaded by clicking here), and the UNIX program ukbconv (which can be downloaded by clicking here). They also assume you have applied for, received, and decrypted UK Biobank phenotype data (see this page for instructions). In this case, you will have a file called ukb23456.enc_ukb, where 23456 will be replaced by your run ID.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
Step 1. We converted the phenotype data using ukbconv and the following two commands
ukbconv ukb23456.enc_ukb csv -oukb23456
mv ukb23456.enc_ukb.csv ukb23456.csv
These commands took about 20 minutes, and the result is a text file called ukb23456.csv.
Step 2. We constructed a file containing ICD10 codes for all UK Biobank individuals (in our case, this file has 502,487 rows and 214 columns). We did this using awk and the following three commands
head -n 1 ukb23456.csv | awk -v FS="\",\"" '{for(j=1;j<=NF;j++) \
{split($j,a,"-");if(a[1]==41270){print j}}}' > extract
W=`wc -l extract | awk '{print $1}'`
awk -v W=$W -v FS="\",\"" '(NR==FNR){a[$1];next}{printf "%s", $1;for (j in a){if($j=="") \
{$j="NA"};printf " %s", $j};printf "\n";}' extract ukb23456.csv | sed s/"\""/""/g > icd10.raw
The first command scans the first row of ukb23456.csv (which contains the field IDs) for columns whose name starts with 41270 (the field ID for ICD10 codes), and saves the indexes of these columns in a file called extract. In our case, this found 213 columns. The second command counts the number of lines in the file extract, and saves this to the variable W. The third command reads all rows of ukb23456.csv, printing out the first column (which contains individual IDs), plus the columns indexed in the file extract, with empty entries replaced by NA. Note that awk does not guarantee the order in which "for (j in a)" visits the columns, so the ICD10 columns may not appear in their original order; this does not matter, because the phenotypes depend only on which codes each individual has. In total, the three commands took about 20 minutes, and the output file is called icd10.raw. If you look at this file, you will see most elements are NA. This is OK, and reflects that most individuals have very few (e.g., 0, 1 or 2) of the ICD10 codes.
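If you would like to check the extraction logic before running it on the full data, the following is a minimal sketch on a toy file. The file names (toy.csv, toy.extract, toy.raw, toy.tally) and the values are hypothetical, and the sketch assumes every entry of the csv file is enclosed in double quotes, as the commands above do. Unlike the commands above, it stores the column indexes in a numbered array, so the columns are printed in file order, and it strips quotes inside awk instead of using sed.

```shell
# Toy stand-in for ukb23456.csv (hypothetical values, far fewer columns than the real file)
cat > toy.csv << 'EOF'
"eid","31-0.0","41270-0.0","41270-0.1"
"1001","1","A009",""
"1002","0","",""
"1003","1","I10","E119"
EOF

# Find the indexes of the columns whose field ID is 41270
head -n 1 toy.csv | awk -v FS="\",\"" '{for(j=1;j<=NF;j++){split($j,a,"-");if(a[1]==41270){print j}}}' > toy.extract

# Print the ID column plus the indexed columns, replacing empty entries with NA
awk -v FS="\",\"" '
(NR==FNR){n++; col[n]=$1; next}      # first file: store column indexes in file order
{
  id=$1; gsub(/"/,"",id); printf "%s", id
  for(k=1;k<=n;k++){
    v=$(col[k]); gsub(/"/,"",v)      # remove any leftover quote character
    if(v==""){v="NA"}
    printf " %s", v
  }
  printf "\n"
}' toy.extract toy.csv > toy.raw

cat toy.raw

# Tally how many individuals have 0, 1, 2, ... ICD10 codes
awk '(NR>1){c=0;for(j=2;j<=NF;j++){if($j!="NA"){c++}};n[c]++}END{for(c in n){print c, n[c]}}' toy.raw | sort -n > toy.tally
cat toy.tally
```

For this toy file, toy.raw contains a header row followed by the rows "1001 A009 NA", "1002 NA NA" and "1003 I10 E119". Running the same tally on icd10.raw is one way to confirm that most individuals have very few ICD10 codes.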
Step 3. The ICD10 codes are hierarchical. For example, if someone is recorded as having the Level 4 code A009, they also (automatically) have the Level 3 code A00 (the parent of code A009), and the Level 2 code A00-A09 (the parent of code A00). Therefore, to identify cases for each ICD10-defined disease, we must find not only all individuals with the corresponding code, but also all individuals with any of its child codes. Each row of the file icd10.sig corresponds to one ICD10 disease, and lists its name followed by its child codes. You can construct a phenotype for each of the 118 ICD10 diseases using the following loop
for j in {1..118}; do
name=`awk -v num=$j < icd10.sig '(NR==num){print $1}'`
awk -v num=$j < icd10.sig '(NR==num){for(j=1;j<=NF;j++){print $j}}' > pick
echo $j $name
awk '(NR==FNR){a[$1];next}(FNR>1){phen=0;for(j=2;j<=NF;j++){if($j in a) \
{phen=1}}; print $1, $1, phen}' pick icd10.raw > $name.pheno
done
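Within the loop, the first command obtains the name of the disease. The second command extracts the ICD10 codes (both self and child) corresponding to the chosen disease, and saves them in a file called pick. The third command generates the phenotype file, which has one row per individual and three columns: the individual ID (printed twice), then 1 if the individual has any of the codes in pick, and 0 otherwise. In total, the loop takes about 20 minutes (to make all 118 phenotypes).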
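To see how the child codes determine case status, here is a minimal sketch for a single toy disease. The file names (toy.sig, toy_icd10.raw, toy.pick, toy.counts) and the values are hypothetical, and the sketch assumes each row of icd10.sig has the disease name first, followed by its child codes, which is what the loop above relies on.

```shell
# Toy stand-in for one row of icd10.sig: the disease name, then its child codes (hypothetical)
echo "A00 A000 A001 A009" > toy.sig

# Toy stand-in for icd10.raw (header row, then one row per individual)
cat > toy_icd10.raw << 'EOF'
eid 41270-0.0 41270-0.1
1001 A009 NA
1002 NA NA
1003 I10 E119
EOF

# Same three commands as in the loop, applied to the first (only) row of toy.sig
name=$(awk '(NR==1){print $1}' toy.sig)
awk '(NR==1){for(j=1;j<=NF;j++){print $j}}' toy.sig > toy.pick
awk '(NR==FNR){a[$1];next}(FNR>1){phen=0;for(j=2;j<=NF;j++){if($j in a){phen=1}};print $1, $1, phen}' toy.pick toy_icd10.raw > toy.$name.pheno

cat toy.$name.pheno

# Count controls (phenotype 0) and cases (phenotype 1)
awk '{n[$3]++}END{print "controls", n[0]+0, "cases", n[1]+0}' toy.$name.pheno | tee toy.counts
```

Individual 1001 is a case for A00 because they have the child code A009, even though A00 itself never appears in their row. Running the final count on each real phenotype file is a quick way to spot diseases with unexpectedly few cases.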
The phenotype files are now ready to be used with LDAK.