Dr Jeffrey Tomkins is a Research Associate at the Institute for Creation Research. He has published a number of articles in the Answers Research Journal on the (alleged!) similarities between the human and chimpanzee genomes. In his “Comprehensive Analysis of Chimpanzee and Human Chromosomes Reveals Average DNA Similarity of 70%” of February 2013, he cites a study by Progetto Cosmo (an Intelligent Design organisation) that concludes “only an average 63% DNA identity (similarity) genome-wide”. It is this Progetto Cosmo study that is the subject of this post.
An Entirely Useless Test
The study can be found at http://progettocosmo.altervista.org/index.php?option=content&task=view&id=130. It uses two methods to compare the genomes of humans and chimpanzees. The first method is a complete “pairwise comparison (equality test)”, which takes each human chromosome in turn and compares it to the corresponding chimpanzee chromosome, each as a single string of nucleotides. Obviously, if there is an indel in the first few nucleotides of the chromosome, then this throws off the entire alignment: the method makes no attempt to realign the chromosomes afterwards. Rather unsurprisingly, this method shows only a 26% identity, which is roughly what you would expect if you were comparing two entirely random strings of DNA. Why the author thought this would be a useful test to perform is not clear to me.
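To see why a position-by-position equality test collapses like this, here is a minimal Python sketch (not the study’s code; the random sequence and the indel position are invented for illustration) of what a single early insertion does to the measured identity:

```python
import random

random.seed(0)

# Two "chromosomes" that are identical except for a single
# one-base insertion near the start of the second copy.
human = "".join(random.choice("ACGT") for _ in range(10000))
chimp = human[:5] + "A" + human[5:]          # one early indel

# Position-by-position equality test, with no attempt to realign
# after the indel.
n = min(len(human), len(chimp))
matches = sum(human[i] == chimp[i] for i in range(n))
identity = matches / n

# After the indel, every position is effectively compared against an
# unrelated base, so identity collapses to roughly 25%.
print(f"identity: {identity:.0%}")
```

Everything downstream of the indel is compared against what is effectively a random base, so the figure this test produces says nothing about how related the two sequences actually are.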
The Flawed Methodology
The second method involves randomly selecting 10,000 sequences of 30 base pairs in length from the chimpanzee chromosome, and trying to find a match in the corresponding human chromosome:
This additional test searches for shared 30-base-long patterns between two chromosomes. It might seem arbitrary to choose 30 base matches. It is arbitrary, as any other number would, but if the genomes were really 9x% identical as they say also a 30-base patterns comparison (or any other n-base patterns comparison) should get 9x% results.
This is completely incorrect, and the reason why becomes apparent when you look closely at the method of comparison. The author briefly discusses the BLAST tool, but decides that “a BLAST research has less sense”. For those who have some programming experience, the author makes his code available, and the line that does the actual comparison is line 84 of test2.pl:
$matchB++ if ($dataB =~ m|$pattern|);
The ‘m’ is Perl’s match operator, and with a plain pattern like this it searches for exact matches only. In other words, if only 29 out of the 30 nucleotides are identical, the match operator will report no match at all. Contrast this with the BLAST algorithm, which, for a sequence with a single point mutation, would report a 96.7% similarity.
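The difference is easy to demonstrate. In this Python sketch (the two 30-base sequences are made up for illustration), an exact substring search, like the Perl match above, rejects a sequence with a single point mutation outright, while counting identical positions, as an alignment-based tool would, scores it at 96.7%:

```python
import re

pattern = "ACGTACGTACGTACGTACGTACGTACGTAC"   # 30-base query (invented)
target  = "ACGTACGTACGTACGTACGTACGTACGTAT"   # same, with one point mutation

# An exact substring search, like `$dataB =~ m|$pattern|` in Perl:
exact_match = re.search(re.escape(pattern), target) is not None
print(exact_match)                           # False: 29/30 is still "no match"

# Counting identical positions instead scores the similarity:
identical = sum(a == b for a, b in zip(pattern, target))
print(f"{identical / len(pattern):.1%}")     # 96.7%
```

One binary test throws the sequence away; the other tells you how similar it actually is.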
It then becomes obvious that if you increase the length of the sequence from 30 base pairs to, say, 60 base pairs, then you are far more likely to encounter a point mutation, and therefore less likely to get a match. This directly contradicts the author’s statement that the choice of 30 base pairs is “arbitrary” and that “any other n-base patterns comparison should get 9x% results”.
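The point can be made precise. If point mutations hit each base independently with probability p, an n-base window is an exact match only when all n bases escape mutation, so the probability of a match is (1 − p)^n, which shrinks as n grows. A quick check in Python, using an illustrative 1.5% rate:

```python
# With per-base mutation probability p, an n-base window is an exact
# match only if all n bases escaped mutation: P(match) = (1 - p)**n.
p = 0.015                      # 1.5% mutation rate (98.5% identity)
for n in (30, 60, 100):
    print(n, round((1 - p) ** n, 3))
# 30 -> 0.635, 60 -> 0.404, 100 -> 0.221
```

Far from every window length giving “9x% results”, the match rate falls off exponentially with window length even for nearly identical genomes.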
So if this method found that only 63% of 30 base pair sequences found a match within the corresponding chromosome, then we can try to work out how frequent these mutations must be across the genomes to achieve that 63% result.
An Excel Simulation
I have written some VBA code in Excel that performs a Monte Carlo simulation. My method creates a large array of boolean (“True”/“False”) values, and then randomly changes a particular number of those values to mirror the effect of point mutations. So, for example, I might have an array filled with 100,000 “False” values, and then change 1,000 of those values to “True”. This is effectively a 1% mutation rate (or, conversely, a 99% identity).
It’s important to note that the “mutations” are not uniformly spaced – they are randomly spaced, and therefore you might see clusters of mutations close together as well as large sequences where there are no mutations at all.
I then choose a random index into the array, and check whether I encounter any mutations in the next 30 spaces. If I encounter a mutation, then, like the Progetto Cosmo study, I discard it as a possible match. If there are no mutations, then it is an exact match. If I do this repeatedly (say, 10,000 times) then I have a good estimate of the probability of finding an exact match for a 30 base pair sequence at a given mutation rate.
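My original code is VBA, but the same procedure can be sketched in Python (the function name and defaults are mine; the array size and trial count follow the description above):

```python
import random

def exact_match_rate(mutation_rate, genome_size=100_000,
                     window=30, trials=10_000, seed=42):
    """Estimate how often a `window`-base sequence contains no
    mutations, mirroring the exact-match test described above."""
    rng = random.Random(seed)
    mutated = [False] * genome_size
    # Scatter point mutations at random (not evenly spaced) positions.
    n_mutations = round(genome_size * mutation_rate)
    for i in rng.sample(range(genome_size), n_mutations):
        mutated[i] = True
    hits = 0
    for _ in range(trials):
        start = rng.randrange(genome_size - window)
        if not any(mutated[start:start + window]):
            hits += 1          # no mutation in the window: exact match
    return hits / trials

print(exact_match_rate(0.01))    # ~0.74
print(exact_match_rate(0.015))   # ~0.63
```

Because the mutations are scattered at random, clusters and long clean stretches both occur, just as in the spreadsheet version.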
What I found is that with a 1% mutation rate (equivalent to 99% identity) I found an exact match approximately 74% of the time. If I used a mutation rate of 1.5%, I found an exact match approximately 63% of the time. This matches the 63% identity found in the Progetto Cosmo study, and, by working in reverse, I can conclude that to achieve such a result, the genomes must be approximately 1.5% dissimilar, or 98.5% identical.
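Working backwards analytically gives the same figure: setting (1 − p)^30 = 0.63 and solving for the per-base mutation rate p:

```python
# Solving (1 - p)**30 = 0.63 for the per-base mutation rate p:
p = 1 - 0.63 ** (1 / 30)
print(f"mutation rate = {p:.2%}")   # 1.53%, i.e. ~98.5% identity
```

So a 63% exact-match rate over 30-base windows is exactly what ~98.5% genome identity predicts.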
The image below is a screenshot of an Excel table showing the correspondence between the mutation rate and the similarity values you would calculate using the Progetto Cosmo methodology. The simulation was run ten times for each mutation rate, and each simulation (that is, each cell in the ten columns) used an array of 100,000 values and performed 10,000 tests.
The VBA code behind this Excel sheet can be found here.
In conclusion, the Progetto Cosmo study’s result of 63% identity is what everyone else would call 98.5% identity. So rather than dispelling “the 99%-identity myth”, the author actually confirms it.
Caveat: this result of 98.5% is higher than the current consensus of around 95%–96% identity between the human and chimpanzee genomes. This discrepancy exists because neither the Progetto Cosmo study nor my simulation takes indels into account. Once indels are accounted for, the result matches the consensus value of 95%–96%.