In January 2016, Dr Jeffrey Tomkins posted an article on ICR’s website claiming that the genetic gap between humans and chimpanzees is getting wider.
This time he cites a PLoS paper titled “Origins of De Novo Genes in Human and Chimpanzee”, and makes the following comment:
“In yet another recent research report, scientists describe 634 orphan genes in humans and 780 in chimpanzees. In other words, we now have a new set of 1,307 genes that are completely different between humans and chimpanzees.”
Now, it’s not the elementary math error that bothers me here (634 + 780 is 1,414, not 1,307, and I can’t say I’m surprised any more by how sloppy Tomkins can be); it is the claim that these genes are “completely different” and that they “are found in no other type of creature and therefore have no evolutionary history”.
The authors have kindly made their data available, so I thought I might test Tomkins’ claim, and – as a few people have suggested – I’ll describe the process so that any of you can replicate the results if you are so inclined.
If you’re not so inclined you can probably skip to the end.
Step 1 – Where Are These Genes?
If you scroll down about halfway in the Ruiz-Orera paper, you’ll see links to two GTF files:
- http://dx.doi.org/10.6084/m9.figshare.1604892 (human)
- http://dx.doi.org/10.6084/m9.figshare.1604893 (chimpanzee)
For now we’ll just take the human genes and try to find a homolog for each one in the chimpanzee genome. So click on the human link and download the file, hsa_denovo.gtf.
If you have a look inside this file you’ll see the coordinates for each coding sequence / exon … but no DNA sequence. Sad Panda.
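To give a feel for what is in a GTF file, here is a minimal Python sketch that pulls the coordinates out of each line. It relies only on the standard GTF layout (nine tab-separated columns, with 1-based inclusive start and end positions). The two example lines, the `Feature` type and the `parse_gtf` name are made up for illustration and are not taken from hsa_denovo.gtf.

```python
from typing import Iterable, Iterator, NamedTuple


class Feature(NamedTuple):
    chrom: str
    kind: str    # feature type, e.g. "exon" or "CDS"
    start: int   # 1-based, inclusive (GTF convention)
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"


def parse_gtf(lines: Iterable[str]) -> Iterator[Feature]:
    """Yield one Feature per data line, skipping comments and blank lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 9:
            continue  # not a well-formed GTF record
        chrom, _source, kind, start, end, _score, strand = fields[:7]
        yield Feature(chrom, kind, int(start), int(end), strand)


# Invented example records, NOT real entries from the paper's data.
example = [
    'chr1\texample\texon\t1000\t1100\t.\t+\t.\tgene_id "g1"; transcript_id "t1";',
    'chr1\texample\texon\t2000\t2019\t.\t-\t.\tgene_id "g2"; transcript_id "t2";',
]

features = list(parse_gtf(example))
for f in features:
    # Length is end - start + 1 because both ends are inclusive.
    print(f.chrom, f.kind, f.start, f.end, f.strand, f.end - f.start + 1)
```

The second record is only 20 bases long, which is the kind of short feature that matters when deciding what is worth searching for.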
Step 2 – Get The DNA Sequence. Obviously.
Now this part isn’t trivial. I wrote some code in C++ that took this GTF file as input, parsed out the coordinates for each sequence, carved that sequence from the human genome, and then created one FASTA file per chromosome.
If you’re interested in looking at the code, you will find it here:
I decided to exclude any sequences shorter than 30 base pairs because short sequences will almost certainly find a match somewhere on the corresponding chimpanzee chromosome, and that would only serve to inflate the final result.
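For anyone who would rather not read C++, the carving step can be sketched in a few lines of Python. This is not the code referred to above, only an illustration under stated assumptions: the chromosome is already loaded as one string, coordinates are 1-based and inclusive as in GTF, and minus-strand features are reverse-complemented (BLAST searches both strands, so that last choice should not affect the search). The names `extract`, `to_fasta` and `MIN_LENGTH`, and the toy sequence, are hypothetical.

```python
MIN_LENGTH = 30  # the cut-off described above: skip anything shorter

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract(chrom_seq: str, start: int, end: int, strand: str) -> str:
    """Cut [start, end] (1-based, inclusive) out of a chromosome string."""
    piece = chrom_seq[start - 1:end]
    return reverse_complement(piece) if strand == "-" else piece


def to_fasta(records, width: int = 60) -> str:
    """Format (name, sequence) pairs as FASTA, dropping short sequences."""
    out = []
    for name, seq in records:
        if len(seq) < MIN_LENGTH:
            continue
        out.append(f">{name}")
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + ("\n" if out else "")


# Toy 80-base "chromosome", invented for the example.
toy = "AAACCCGG" * 10
records = [
    ("long_feature", extract(toy, 1, 40, "+")),   # 40 bases, kept
    ("short_feature", extract(toy, 1, 8, "-")),   # 8 bases, filtered out
]
fasta = to_fasta(records)
print(fasta, end="")
```

Only `long_feature` survives the filter. A real run would also need a FASTA reader for the genome, which is left out here to keep the sketch short.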
Step 3 – BLAST Away!
For each FASTA file we run a BLAST against the corresponding chimpanzee chromosome – ‘run.sh’ takes care of this by calling ‘blast_chrome.sh’ for each FASTA file. Depending on how powerful your machine is, this part might take a few hours. At the end you should have 24 CSV files with the results – one for each of the 22 autosomes, and then two more for the X and Y chromosomes.
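If you want a rough idea of what one BLAST call per chromosome looks like, here is a hedged Python sketch that only builds and prints the commands. It assumes NCBI BLAST+ (`blastn`) is installed and uses its `-query`, `-subject`, `-outfmt` and `-out` options, with output format 10 (comma-separated). The file names and the chosen output columns are my own assumptions, not what the scripts above actually use. Note too that human chromosome 2 corresponds to chimpanzee chromosomes 2A and 2B, so a simple one-to-one file naming like this would need adjusting there.

```python
import shlex

# Assumed column choice for the CSV output; all are standard blastn specifiers.
OUTFMT = "10 qseqid sseqid pident length nident qlen bitscore"


def chromosome_names():
    """Human chromosome labels: 22 autosomes plus X and Y."""
    return [str(n) for n in range(1, 23)] + ["X", "Y"]


def blast_command(query_fasta: str, subject_fasta: str, out_csv: str):
    """Build a blastn command comparing a query FASTA to one subject FASTA."""
    return [
        "blastn",
        "-query", query_fasta,
        "-subject", subject_fasta,
        "-outfmt", OUTFMT,
        "-out", out_csv,
    ]


# Hypothetical file naming; print the commands rather than running them.
commands = [
    blast_command(f"human_chr{c}.fa", f"chimp_chr{c}.fa", f"results_chr{c}.csv")
    for c in chromosome_names()
]
for cmd in commands:
    print(shlex.join(cmd))
```

To actually run a command, pass the list to `subprocess.run(cmd, check=True)`.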
Step 4 – Make Sense Of The Results
The ‘analyse.sh’ script reads in these CSV files and prints out some statistics for each chromosome. The two most important columns are the number of nucleotides I queried and the number of identical nucleotides I actually found. I also count the number of queries I submitted and how many results I got.
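The bookkeeping for one chromosome can be sketched in Python as follows. This is an illustration and not the ‘analyse.sh’ script itself. It assumes each CSV row has the columns qseqid, sseqid, pident, length, nident, qlen, bitscore, that only the best-scoring hit per query is counted, and that queries with no hit still count towards the nucleotides queried. The function name and the toy data are invented.

```python
import csv
import io

# Assumed column order of the BLAST CSV output.
COLUMNS = ["qseqid", "sseqid", "pident", "length", "nident", "qlen", "bitscore"]


def summarise(csv_text: str, query_lengths: dict) -> dict:
    """Tally queries, hits, nucleotides queried and identical nucleotides."""
    best = {}  # qseqid -> (bitscore, nident) of the highest-scoring hit
    for row in csv.DictReader(io.StringIO(csv_text), fieldnames=COLUMNS):
        score, nident = float(row["bitscore"]), int(row["nident"])
        if row["qseqid"] not in best or score > best[row["qseqid"]][0]:
            best[row["qseqid"]] = (score, nident)
    queried = sum(query_lengths.values())
    identical = sum(nident for _score, nident in best.values())
    return {
        "queries": len(query_lengths),
        "hits": len(best),
        "queried_nt": queried,
        "identical_nt": identical,
        "percent_identical": 100.0 * identical / queried if queried else 0.0,
    }


# Invented toy data: three queries, one of which (g3) finds nothing.
query_lengths = {"g1": 100, "g2": 50, "g3": 40}
toy_csv = (
    "g1,chr1,97.000,100,97,100,170\n"
    "g1,chr1,90.000,50,45,100,60\n"
    "g2,chr1,96.000,50,48,50,85\n"
)
stats = summarise(toy_csv, query_lengths)
print(stats)
```

Counting unmatched queries in the denominator is the stricter choice, since a gene with no hit at all drags the percentage down instead of silently disappearing.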
Drum Roll, Please.
Who would have thunk it? These human genes which supposedly had no evolutionary history have corresponding sequence in the chimpanzee genome which is – on average – about 95.41% identical.
But why is this post critical of Tomkins and not of the PLoS paper? I suggest reading the PLoS paper first, then reading Tomkins’ article, and then letting me know in the comments just how sloppy Tomkins is in his interpretations and conclusions.