This is the third post in a short series in response to an article by Tomkins and Bergman in their ongoing effort to downplay the genetic similarity between humans and chimpanzees. The subject of this post is a paper by The International Chimpanzee Chromosome 22 Consortium (led by H. Watanabe) in which they report a sequence difference (excluding indels) of 1.44%.
Here is what Tomkins and Bergman say about the paper:
The authors state a nucleotide substitution rate of 1.44% in aligned areas, but do not give similarity estimates to include indels. While indels are omitted from the alignment similarity, the authors indicate that there were 82,000 of them and provide a histogram that graphically shows the size distribution based on binned data groupings. Oddly, no data for average indel size or total indel length was provided. Likewise, the number of sequence gaps were given, but nothing about cumulative gap size. Despite the fact that supposedly well-sequenced orthologous chromosomal regions are being compared, specific data that would allow one to calculate overall DNA similarities are conspicuously absent. Based on an estimate using the limited graphical data provided regarding base substitutions and indels, a rough and fairly conservative estimate of about 80 to 85% overall similarity can be inferred (table 1).
Firstly, they seem to have somehow confused the “nearly 68,000 insertions or deletions” mentioned in the abstract with the number “82,000” above. Not sure how that slipped through the rigorous peer-review process at CMI, but mistakes can and do happen.
The second issue is how Tomkins and Bergman actually came up with this “80 to 85% overall similarity”. As they say, the authors of the paper don’t give explicit numbers on cumulative indel length, only some graphs from which you can try to estimate the totals. Tomkins and Bergman make use of that graphical data, and to come up with an 80 to 85% figure on a chromosome arm that is approximately 33.3Mbp long, they are effectively saying that, once the 1.44% substitution difference is subtracted from the 15 to 20% total difference, indels make up the remaining 13.56 to 18.56%, or between 4.52Mbp and 6.18Mbp of that sequence.
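For anyone who wants to check where the 4.52Mbp and 6.18Mbp figures come from, here is a minimal sketch of the arithmetic. It assumes the overall difference is simply the substitution difference plus the indel difference, both taken as fractions of the 33.3Mbp arm; the function and constant names are my own and not from either paper.

```python
# Figures quoted in the post.
ARM_LENGTH_BP = 33.3e6      # approximate length of the chromosome arm
SUBSTITUTION_DIFF = 0.0144  # 1.44% substitution difference


def implied_indel_bp(overall_similarity,
                     arm_length_bp=ARM_LENGTH_BP,
                     substitution_diff=SUBSTITUTION_DIFF):
    """Indel length (in bp) needed to reach a given overall similarity.

    Assumes total difference = substitution difference + indel difference.
    """
    total_diff = 1.0 - overall_similarity
    indel_fraction = total_diff - substitution_diff
    return indel_fraction * arm_length_bp


for similarity in (0.85, 0.80):
    mbp = implied_indel_bp(similarity) / 1e6
    print(f"{similarity:.0%} similarity implies about {mbp:.2f} Mbp of indels")
```

The two ends of the claimed range land on roughly 4.52Mbp and 6.18Mbp respectively.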
Did they pull that number out of their (shared) asshole?
Yes they did.
If you’re interested you can actually see the graphical data here; see in particular Figures 2 and 3:
While it’s not exactly easy to reverse engineer these figures to work out the total length of indels, it can be done with a reasonable degree of accuracy. In fact, I have painstakingly reproduced these figures by zooming in pixel-by-pixel, and recreating the underlying data. Here are my reproductions – feel free to compare them to the originals at the link above:
Based on my reproductions, I calculate that Figure 2 (which graphs indels less than 500bp long) has a cumulative total of around 562kbp of indels, while Figure 3 (which graphs indels between 500bp and 5000bp) has a cumulative total of around 119kbp of indels.
As for indels larger than 5,000bp, a nucmer analysis yields 27 indels with a cumulative length of 366kbp. Adding the three categories together (562kbp + 119kbp + 366kbp) gives a total of 1,047kbp, or just over 1Mbp.
Unfortunately we’re still nowhere near Tomkins and Bergman’s “rough and fairly conservative estimate” of 4.52 to 6.18Mbp. They seem to be off by a factor of somewhere between 4 and 6. But like I said, mistakes happen. Especially when you’re a moron.
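If you want to see how far apart the two estimates are, the comparison can be sketched as below. The three indel totals are the ones given in this post; the claimed range is the 4.52 to 6.18Mbp implied by the 80 to 85% figure, expressed in kbp. The variable names are just labels I made up.

```python
# Cumulative indel lengths from this post, in kbp.
INDEL_KBP = {
    "under 500bp (Figure 2)": 562,
    "500bp to 5,000bp (Figure 3)": 119,
    "over 5,000bp (nucmer)": 366,
}

total_kbp = sum(INDEL_KBP.values())

# Indel totals implied by the 85% and 80% similarity claims, in kbp.
claimed_low_kbp, claimed_high_kbp = 4520, 6180

factor_low = claimed_low_kbp / total_kbp
factor_high = claimed_high_kbp / total_kbp

print(f"Measured indel total: {total_kbp} kbp")
print(f"Claimed total is {factor_low:.1f}x to {factor_high:.1f}x larger")
```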
I’m sure you guys are able to do the simple math, but if we have a substitution difference of 1.44%, and an indel difference of 1,047kbp in a sequence of 33.3Mbp (equal to 3.14%), we have an overall similarity – including indels – of 95.42%.
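For the sake of completeness, here is that simple math as a short sketch. It uses the same assumption as the rest of this post, namely that the indel difference is the cumulative indel length divided by the length of the arm, and that it simply adds to the substitution difference. The function name is hypothetical.

```python
def overall_similarity(substitution_diff, indel_bp, sequence_bp):
    """Overall similarity as a fraction, counting substitutions and indels.

    Assumes the two kinds of difference are additive.
    """
    indel_diff = indel_bp / sequence_bp
    return 1.0 - substitution_diff - indel_diff


indel_pct = 1047e3 / 33.3e6 * 100
similarity_pct = overall_similarity(0.0144, 1047e3, 33.3e6) * 100

print(f"Indel difference: {indel_pct:.2f}%")
print(f"Overall similarity: {similarity_pct:.2f}%")
```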
By the way, if you’re really, really keen, you can actually download my spreadsheet that reverse engineers the figures from the paper in question. Enjoy.