Tomkins Human-Chimp DNA – Watanabe 2004

This is the third post in a short series in response to an article by Tomkins and Bergman in their ongoing effort to downplay the genetic similarity between humans and chimpanzees. The subject of this post is a paper by The International Chimpanzee Chromosome 22 Consortium (led by H. Watanabe) in which they report a sequence difference (excluding indels) of 1.44%.

Here is what Tomkins and Bergman say about the paper:

The authors state a nucleotide substitution rate of 1.44% in aligned areas, but do not give similarity estimates to include indels. While indels are omitted from the alignment similarity, the authors indicate that there were 82,000 of them and provide a histogram that graphically shows the size distribution based on binned data groupings. Oddly, no data for average indel size or total indel length was provided. Likewise, the number of sequence gaps were given, but nothing about cumulative gap size. Despite the fact that supposedly well-sequenced orthologous chromosomal regions are being compared, specific data that would allow one to calculate overall DNA similarities are conspicuously absent. Based on an estimate using the limited graphical data provided regarding base substitutions and indels, a rough and fairly conservative estimate of about 80 to 85% overall similarity can be inferred (table 1).


Firstly, they seem to have somehow confused the “nearly 68,000 insertions or deletions” mentioned in the abstract with the number “82,000” above. Not sure how that slipped through the rigorous peer-review process at CMI, but mistakes can and do happen.

The second issue, though, is how Tomkins and Bergman actually came up with this “80 to 85% overall similarity”. As they say, the authors of the paper don’t give explicit numbers on cumulative indel length, only some graphs from which you can try to estimate the totals. Tomkins and Bergman make use of that graphical data, and to come up with an 80 to 85% figure on a chromosome arm that is approximately 33.3Mbp long, they are effectively saying that indels make up between 4.52Mbp and 6.18Mbp of that sequence.
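For anyone who wants to check where the 4.52Mbp and 6.18Mbp figures come from, here is a minimal sketch of the arithmetic. It assumes the overall difference is simply the substitution difference plus the indel fraction, and the function name is mine, not anything from the paper.

```python
# Back out the indel length implied by a claimed overall similarity.
# Assumption: overall difference = substitution difference + indel fraction.
ARM_LENGTH_BP = 33_300_000   # approximate length of the chromosome arm
SUBSTITUTION_DIFF = 0.0144   # 1.44% substitution difference from the paper


def implied_indel_bp(claimed_similarity: float) -> float:
    """Indel length (in bp) needed to reach the claimed overall similarity."""
    indel_fraction = (1.0 - claimed_similarity) - SUBSTITUTION_DIFF
    return indel_fraction * ARM_LENGTH_BP


for claimed in (0.85, 0.80):
    mbp = implied_indel_bp(claimed) / 1e6
    print(f"{claimed:.0%} similarity implies {mbp:.2f} Mbp of indels")
```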

Did they pull that number out of their (shared) asshole?

Why, yes.

Yes they did.

If you’re interested you can actually see the graphical data here; see in particular Figures 2 and 3:

While it’s not exactly easy to reverse engineer these figures to work out the total length of indels, it can be done with a reasonable degree of accuracy. In fact, I have painstakingly reproduced these figures by zooming in pixel-by-pixel, and recreating the underlying data. Here are my reproductions – feel free to compare them to the originals at the link above:

Based on my reproductions, I calculate that Figure 2 (which graphs indels less than 500bp long) has a cumulative total of around 562kbp of indels, while Figure 3 (which graphs indels between 500bp and 5000bp) has a cumulative total of around 119kbp of indels.

As for indels larger than 5,000bp, a nucmer analysis yields 27 indels with a cumulative length of 366kbp. That gives a grand total of 562 + 119 + 366 = 1,047kbp, or just over 1Mbp.

Unfortunately we’re still nowhere near Tomkins and Bergman’s “rough and fairly conservative estimate” of 4.52 to 6.18Mbp. They seem to be off by a factor of somewhere between 4 and 6. But like I said, mistakes happen. Especially when you’re a moron.

I’m sure you guys are able to do the simple math, but if we have a substitution difference of 1.44%, and an indel difference of 1,047kbp in a sequence of 33.3Mbp (equal to 3.14%), we have an overall similarity – including indels – of 95.42%.
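If you would rather not do the simple math by hand, here is a short sketch of it. The three indel totals are my own estimates from earlier in this post, and the variable names are just for illustration.

```python
# Overall similarity including indels, using the estimates from this post.
ARM_LENGTH_KBP = 33_300      # approximately 33.3Mbp
SUBSTITUTION_PCT = 1.44      # reported substitution difference

# Cumulative indel lengths in kbp (estimates, not published figures)
indels_kbp = {
    "under 500bp (Figure 2)": 562,
    "500bp to 5,000bp (Figure 3)": 119,
    "over 5,000bp (nucmer)": 366,
}

total_indel_kbp = sum(indels_kbp.values())
indel_pct = 100 * total_indel_kbp / ARM_LENGTH_KBP
overall_similarity_pct = 100 - SUBSTITUTION_PCT - indel_pct

print(f"Total indels: {total_indel_kbp:,} kbp ({indel_pct:.2f}%)")
print(f"Overall similarity: {overall_similarity_pct:.2f}%")
```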

By the way, if you’re really, really keen, you can actually download my spreadsheet that reverse engineers the figures from the paper in question. Enjoy.


Tomkins Human-Chimp DNA – Ebersberger 2002

This is the second post in a short series in response to an article by Tomkins and Bergman in their ongoing effort to downplay the genetic similarity between humans and chimpanzees. The subject of this post is a paper by Ingo Ebersberger and colleagues, published in 2002, in which they report a sequence difference (excluding indels) of 1.24%.

But let’s first look at what Tomkins and Bergman say about this paper:

“Researchers selected two-thirds of the total sequence for more detailed analyses. One-third of the chimp sequence would not align to the human genome and was discarded. […] Not surprisingly, they report only a 1.24% difference in only highly similar aligned areas between human and chimp. A more realistic sequence similarity based on the researchers’ own numbers for discarded data in the alignments alone is not more than 65%.”


Tomkins and Bergman seem to be taking this one-third of chimp sequence that supposedly does not align (thereby giving a maximum identity of 66.6%) and then knocking off the 1.24% substitution rate to get their 65% approximation.

Now, let’s have a look at what the paper actually says:

“Twenty-eight percent of the total amount of sequence was excluded from the analysis, since the entire sequence, or parts of it, displayed more than one match in the human genome that was not due to known families of repeated sequences. For 7% of the chimpanzee sequences, no region with similarity could be detected in the human genome.”


More than one match? Really? But I thought Tomkins and Bergman said it would not align at all? Surely an honest mistake by these two competent researchers.

But what about the 7% that legitimately couldn’t find a match?

I think I have a reasonable explanation for that. In the Materials and Methods section, Ebersberger mentions that they are using a “draft version of the human genome (freeze August 6, 2001)”.

It’s hard to find solid data on that version of the human genome (known as hg8) but as best I can tell, it was somewhere between 90% and 95% complete. So when the Ebersberger paper was published, around 5%-10% of the human genome wasn’t yet complete. It’s therefore not surprising that a sizeable percentage of the chimpanzee sequences could not be found in it.

Can we compare these chimp sequences to GRCh38?

Unfortunately, no. I contacted Ingo Ebersberger in late 2014 and asked if he still had the data:

“I don’t think that I will be able to find the original data and no other co-author will have them.”


Tomkins Human-Chimp DNA – Britten 2002

Way back in 2012, Jeff Tomkins and Jerry Bergman teamed up to “re-evaluate” the published scientific literature related to the overall genetic difference between humans and chimpanzees. Their article can be found here:

Their claim, in a nutshell, is that the results of many of these papers are overstated for various reasons, and Tomkins and Bergman have taken it upon themselves to correct the results.

Britten, 2002

One such study is by Roy Britten, in 2002, and can be found here:

As you can see, Britten reported an overall DNA similarity of around 95.2% for his small sample, which was made up of the only 5 completed chimpanzee BACs in existence at the time.

In the paper, you will notice that Britten only reports on around 779kb of sequence, while Tomkins and Bergman quite rightly point out that the sum of the lengths of those BACs is around 846kb. Does the excluded sequence align to the human genome? Tomkins and Bergman say no, and use that to scale back the overall DNA similarity to 87.7%.
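The 87.7% figure looks like a straight proportional scale-back. The sketch below is my assumed reconstruction of that arithmetic (every base of excluded sequence is treated as a complete mismatch), not something Tomkins and Bergman spell out.

```python
# Assumed reconstruction: treat all excluded sequence as 0% identical.
BRITTEN_SIMILARITY_PCT = 95.2   # reported by Britten for the aligned sequence
REPORTED_KB = 779               # sequence Britten reports on
BAC_TOTAL_KB = 846              # sum of the five BAC lengths

scaled_pct = BRITTEN_SIMILARITY_PCT * REPORTED_KB / BAC_TOTAL_KB
print(f"Scaled-back similarity: {scaled_pct:.1f}%")
```

The whole argument therefore hinges on whether the excluded sequence really fails to align.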

But did they actually check?

No. Of course they didn’t. But I did! Now, Britten doesn’t really give a good explanation for why some of the sequence was excluded, and unfortunately he passed away at the age of 92 not long after Tomkins and Bergman published their article.

So here are the alignment results from the nucmer/MUMmer software package when I compared the five BACs to the human genome.

AC006582 – 186,092bp

As you can see from the S2 and E2 columns, the entire length of the BAC aligns to the human genome.

    [S1]     [E1] |   [S2]   [E2] | [LEN 1] [LEN 2] | [% IDY] 
20789281 20810917 | 186092 164450 |   21637   21643 |   98.05 
20813855 20821723 | 164445 156609 |    7869    7837 |   96.71 
20822132 20891846 | 156608  87016 |   69715   69593 |   98.06 
20891781 20896388 |  87017  82414 |    4608    4604 |   98.72 
67928585 67976610 |      1  48038 |   48026   48038 |   98.32 
67976609 68010902 |  48159  82411 |   34294   34253 |   98.46

You can also see from the S1 and E1 columns that this BAC matches to two locations on human chromosome 12, and Britten makes note of this in his paper. You could also check the synteny map from (reproduced below) clearly showing the relevant breakpoints.


So what is the overall percentage identity of this BAC to the human genome? Well, first of all we need to work out the number of nucleotides that match. We do this by multiplying the percentage identity of each match (“% IDY”) by the length of that match (“LEN 2”):

(21,643 * 98.05%) + (7,837 * 96.71%) + (69,593 * 98.06%) + ... = 182,545 nt

But do we use the overall length of the BAC as the divisor in our equation, or do we use the length of the human sequence that it spans? If we want to be as conservative as possible, we’ll use the maximum of the two. As above, the length of the BAC is 186,092bp, while the combined length of the two syntenic regions of human chromosome 12 (columns S1 and E1) is 189,426bp.

Therefore our overall similarity for this BAC is 96.37%.
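Here is a small sketch of that calculation for AC006582, using the rows from the nucmer table above. The helper names are mine, and splitting the rows into two groups (first four, last two) is a hard-coded assumption based on the two locations on chromosome 12.

```python
# Rows for BAC AC006582: (S1, E1, S2, E2, LEN1, LEN2, %IDY)
ROWS = [
    (20789281, 20810917, 186092, 164450, 21637, 21643, 98.05),
    (20813855, 20821723, 164445, 156609, 7869, 7837, 96.71),
    (20822132, 20891846, 156608, 87016, 69715, 69593, 98.06),
    (20891781, 20896388, 87017, 82414, 4608, 4604, 98.72),
    (67928585, 67976610, 1, 48038, 48026, 48038, 98.32),
    (67976609, 68010902, 48159, 82411, 34294, 34253, 98.46),
]
BAC_LENGTH = 186_092


def matched_nucleotides(rows):
    """Sum of LEN 2 multiplied by % IDY for every match."""
    return sum(row[5] * row[6] / 100 for row in rows)


def region_span(rows):
    """Length of human sequence spanned by a group of matches (S1 to E1)."""
    return max(row[1] for row in rows) - min(row[0] for row in rows) + 1


matched = matched_nucleotides(ROWS)
# Two syntenic regions on human chromosome 12: rows 1-4 and rows 5-6
human_span = region_span(ROWS[:4]) + region_span(ROWS[4:])
divisor = max(BAC_LENGTH, human_span)   # the conservative choice
identity_pct = 100 * matched / divisor

print(f"Matched: {matched:,.0f} nt, divisor: {divisor:,} bp, "
      f"identity: {identity_pct:.2f}%")
```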

AC007214 – 154,685bp

    [S1]     [E1] |   [S2]   [E2] | [LEN 1] [LEN 2] | [% IDY] 
20810108 20816825 | 147956 154685 |    6718    6730 |   97.50 
67987724 67997814 | 147926 137755 |   10091   10172 |   98.04 
67997818 68060869 | 137491  74530 |   63052   62962 |   98.45 
68060856 68124237 |  70532   7232 |   63382   63301 |   98.28 
68124247 68131497 |   7257      1 |    7251    7257 |   97.95

How many nucleotides did we match?

(6,730 * 97.50%) + (10,172 * 98.04%) + (62,962 * 98.45%) + ... = 147,841 nt

By looking at the S2 and E2 columns, you’ll note that there is quite a large indel – approximately 4,000 bases long, which Britten has already taken into account in his results. So, once again, the “missing” sequence that Tomkins and Bergman claim does not align, clearly does align. I wonder if we will see that trend continue?

To calculate our identity, what is the longer span of the two? The chimpanzee BAC is 154,685 bases, and this is longer than the corresponding sequence on the human chromosome, which is only 150,491 bp.

Therefore our overall similarity for this BAC is 95.58%.

AC097335 – 148,984bp

     [S1]      [E1] |   [S2]   [E2] | [LEN 1] [LEN 2] | [% IDY]
122584064 122621080 |      1  37111 |   37017   37111 |   97.40 
122621388 122657017 |  37102  72697 |   35630   35596 |   97.56 
122657439 122694830 |  72687 110041 |   37392   37355 |   97.74 
122694662 122718458 | 110000 133714 |   23797   23715 |   97.51 
122722001 122737262 | 133712 148984 |   15262   15273 |   98.00

How many nucleotides did we match?

(37,111 * 97.40%) + (35,596 * 97.56%) + (37,355 * 97.74%) + ... = 145,476 nt

And for our maximum span in this case, it is the human spanning sequence – 153,199bp – that is longer than the chimpanzee BAC – 148,984bp.

Therefore our overall similarity for this BAC is 94.96%.

AC096630 – 160,603bp

    [S1]     [E1] |   [S2]   [E2] | [LEN 1] [LEN 2] | [% IDY] 
23929738 23942644 |      1  12893 |   12907   12893 |   97.54 
23942646 23943694 |  13054  14094 |    1049    1041 |   97.34 
23944199 23988365 |  14096  58297 |   44167   44202 |   98.29 
23987821 23990092 |  59219  61499 |    2272    2281 |   96.15 
23990410 24090254 |  60891 160595 |   99845   99705 |   98.10

How many nucleotides did we match?

(12,893 * 97.54%) + (1,041 * 97.34%) + (44,202 * 98.29%) + ... = 157,039 nt

Now, this one needs a little adjustment. Notice that there is some overlap between S2 and E2 on the 4th and 5th results? We don’t want to double-count any nucleotide matches, so we need to scale this back a little. The weighted percentage identity across all five matches is 98.0746% and there are 609 bp of overlap. We’ll reduce our nucleotide matches by 597, down to 156,442 nt.

Our longest span in this case is – by a narrow margin – the chimpanzee BAC at 160,603bp.

Therefore our overall similarity for this BAC is 97.41%.
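The overlap adjustment for AC096630 can be sketched like this. The overlap finder only compares neighbouring matches in BAC coordinates, which is a simplification that happens to be enough for these rows, and the function names are my own.

```python
# Rows for BAC AC096630: (S1, E1, S2, E2, LEN1, LEN2, %IDY)
ROWS = [
    (23929738, 23942644, 1, 12893, 12907, 12893, 97.54),
    (23942646, 23943694, 13054, 14094, 1049, 1041, 97.34),
    (23944199, 23988365, 14096, 58297, 44167, 44202, 98.29),
    (23987821, 23990092, 59219, 61499, 2272, 2281, 96.15),
    (23990410, 24090254, 60891, 160595, 99845, 99705, 98.10),
]
BAC_LENGTH = 160_603


def bac_overlap(rows):
    """Total bp shared by neighbouring matches in BAC coordinates (S2/E2)."""
    intervals = sorted((min(r[2], r[3]), max(r[2], r[3])) for r in rows)
    overlap = 0
    for (_, prev_end), (start, _) in zip(intervals, intervals[1:]):
        overlap += max(0, prev_end - start + 1)
    return overlap


raw_matched = sum(r[5] * r[6] / 100 for r in ROWS)
weighted_identity = raw_matched / sum(r[5] for r in ROWS)
overlap_bp = bac_overlap(ROWS)

# Remove the double-counted matches, scaled by the weighted identity
adjusted = raw_matched - overlap_bp * weighted_identity
# The BAC is the longer of the two spans here, so it is the divisor
identity_pct = 100 * adjusted / BAC_LENGTH

print(f"Overlap: {overlap_bp} bp, adjusted matches: {adjusted:,.0f} nt, "
      f"identity: {identity_pct:.2f}%")
```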

AC093572 – 195,652bp

Well, this is our last BAC, and as we’ve seen, all the previous BACs have aligned completely to the human genome. I think we’re getting close to figuring out whether Tomkins and Bergman’s “re-evaluation” is valid! I’d give a spoiler alert, but the answer is literally the next thing you’ll see.

    [S1]     [E1] |   [S2]   [E2] | [LEN 1] [LEN 2] | [% IDY] 
24117704 24126528 | 195644 186806 |    8825    8839 |   99.24 
24126542 24130967 | 186870 182455 |    4426    4416 |   99.30 
24133114 24222222 | 182455  93337 |   89109   89119 |   98.60 
24221913 24239932 |  92705  74808 |   18020   17898 |   97.56 
24239920 24251188 |  73900  62572 |   11269   11329 |   96.85 
24250869 24252922 |  61415  59363 |    2054    2053 |   98.54 
24253250 24270233 |  59964  42959 |   16984   17006 |   97.93 
24270373 24290547 |  42953  22803 |   20175   20151 |   98.17 
24290532 24312416 |  21904      1 |   21885   21904 |   98.40

How many nucleotides did we match?

(8,839 * 99.24%) + (4,416 * 99.30%) + (89,119 * 98.60%) + ... = 189,474 nt

Again in this case, we need to scale back that number of nucleotides because there is significant overlap (668bp) in the results. Our reduced nucleotide match is now 188,817 nt. The longer of the two spans is the chimpanzee BAC: 195,652bp.

Therefore our overall similarity for this BAC is 96.51%.


Surprise, surprise, Tomkins and Bergman didn’t bother to back up their assertions and they’ve been shown to be wrong. For completeness, here is the table that calculates the weighted average percentage identity of all five BACs.

BAC         Matches    MaxSpan    Identity
AC006582    182,545    189,426    96.37%
AC007214    147,841    154,685    95.58%
AC097335    145,476    153,199    94.96%
AC096630    156,442    160,603    97.41%
AC093572    188,817    195,652    96.51%
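The weighted average itself is just total matches divided by total span. Here is a minimal sketch using the Matches and MaxSpan values from the rows above; the dictionary layout is simply my choice for illustration.

```python
# (matches, max span) per BAC, taken from the table rows above
BACS = {
    "AC006582": (182_545, 189_426),
    "AC007214": (147_841, 154_685),
    "AC097335": (145_476, 153_199),
    "AC096630": (156_442, 160_603),
    "AC093572": (188_817, 195_652),
}

total_matches = sum(matches for matches, _ in BACS.values())
total_span = sum(span for _, span in BACS.values())
weighted_identity_pct = 100 * total_matches / total_span

print(f"Total matches: {total_matches:,} nt")
print(f"Total span:    {total_span:,} bp")
print(f"Weighted average identity: {weighted_identity_pct:.2f}%")
```

Weighting by span means the longer BACs count for more than the shorter ones, which is what you want when the question is overall similarity.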






DNA Contamination – Laying Some Groundwork

In Dr Tomkins’ November 2016 paper, he claims that there is widespread human DNA contamination in the chimpanzee genome, and that this contamination inflates the overall percentage similarity between the two genomes.

The paper can be found here:

Of particular interest is this quote:

“When blasting chimpanzee trace reads onto an allegedly accurate representation of the chimpanzee genome, one would expect alignment identities of 99.9 to 100%.”

No, this is decidedly not the case, and since Dr Tomkins has previously written on the topic of genome assembly he must know that this is not the case.

When genomes are assembled, there is usually an enormous amount of sequence data used – many times more than the size of the genome itself. So, for example, if the human genome is approximately 3 billion bases and we have obtained 18 billion bases of total sequence, we would say that we have a “6-fold redundant assembly”. What this means is that – on average – every base pair in the genome is covered 6 times.

Why do we need this redundancy? Simply because the individual reads themselves are not “99.9 to 100%” accurate. Genome assembly software attempts to overlap all these individual reads in order to come to a consensus on what the actual sequence is.

So, let’s say that for a particular nucleotide position in the sequence, four of those overlapping reads think that it is an Adenine (an “A”) while the fifth read thinks that it is a Guanine (“G”), and the sixth read thinks it is a Thymine (“T”). The software makes an educated guess that the true nucleotide value at that position is indeed an “A”, and considers the other two reads as having an error at that position.

But in reality, it is much more than a simple “majority rules” decision, because each base in each read also has a corresponding quality score. This quality score (usually referred to as a “Phred score”) is a measure of the confidence of the DNA sequencing machine in recording each base. You can read more about Phred scores elsewhere (Wikipedia is usually a good starting point), but for the purposes of this series of posts it is enough to know that a Phred score of 10 is low quality (implies an accuracy of 90%) while a Phred score of 60 is very high quality (implies an accuracy of 99.9999%).
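To make that a bit more concrete, here is a toy sketch. The Phred conversion is the standard formula, but the consensus function is a deliberately crude stand-in (it just adds up the Phred scores behind each base) and is not how any real assembler works. The read qualities in the example are invented.

```python
from collections import defaultdict


def phred_to_accuracy(q: float) -> float:
    """Convert a Phred score to the implied probability the base is correct."""
    return 1.0 - 10 ** (-q / 10)


def consensus_base(calls):
    """Toy consensus: pick the base with the highest total Phred score.

    `calls` is a list of (base, phred_score) pairs for one position.
    """
    weights = defaultdict(float)
    for base, q in calls:
        weights[base] += q
    return max(weights, key=weights.get)


# Six overlapping reads at one position (hypothetical quality scores)
calls = [("A", 30), ("A", 25), ("A", 40), ("A", 35), ("G", 10), ("T", 8)]

print(f"Q10 accuracy: {phred_to_accuracy(10):.4%}")
print(f"Q60 accuracy: {phred_to_accuracy(60):.4%}")
print(f"Consensus base: {consensus_base(calls)}")
```

Note that with this kind of weighting, a single high-quality read can outvote several low-quality ones, which is why it is not a simple head count.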

Testing Dr Tomkins’ claim

We have three sources of data that we can use to test Dr Tomkins’ claim of human DNA contamination.

  1. The current human genome assembly.
  2. The current chimpanzee genome assembly.
  3. The raw chimpanzee trace read data sets used by Dr Tomkins.

What should we expect to see if we compare a raw chimpanzee trace read to the human genome assembly? What should we expect to see if we compare a raw chimpanzee trace read to the chimpanzee assembly?

With Dr Tomkins’ quote at the top of this post in mind – is he able to produce a single trace read from the archives that is a “99.9 to 100%” match to the human genome, but a lesser match to the chimpanzee genome? If so, did that read make it into the chimpanzee genome? If human DNA contamination is indeed a problem, then this should be a fairly trivial challenge.

Stay tuned.

Chromosome 2 Fusion – Dicentric Inactivation

From Jeff Tomkins’ most recent paper that supposedly addresses the criticisms leveled at his work by myself and others:

“Another problem with the alleged cryptic centromere is its short length. The cryptic centromere site is extremely small compared to a real centromere — it is only 41,608 bases in length […] – a fraction of the size of human centromeres that range in length between 250,000 and 5,000,000 bases (Aldrup-Macdonald and Sullivan 2014). Thus, if this was in fact a relic centromere of an ancient chromosome fusion, its size should be greater than six times its current length at the minimum.”

From “Debunking the Debunkers: A Response to Criticism and Obfuscation Regarding Refutation of the Human Chromosome 2 Fusion”

No, the only problem here is Jeff Tomkins’ lack of familiarity with the literature and/or his complete unwillingness to research the topic.

In 2012, an excellent review paper discussed centromere inactivation in dicentric chromosomes (i.e. those chromosomes with two centromeres due to a fusion event). You can read the paper here: Dicentric chromosomes: unique models to study centromere function and inactivation.

Here are some pertinent quotes (emphasis is mine):

“[In budding yeast], the dicentrics could be stabilized if one of the centromeres underwent breakage and recombination that physically deleted one centromere.”

In fission yeast:

“Another ~10 % of the dicentrics remained fused, and the cells divided normally. These dicentrics were stabilized because one of the two centromeres had been physically deleted.”

But most importantly in humans:

“The remaining dicentric fusions underwent centromere inactivation between 4 days and 20 weeks after formation. […] Using semi-quantitative FISH, it was observed that the alpha satellite array of the inactive centromere became reduced in size after centromere inactivation. These results suggested that one mechanism of dicentric stabilization and centromere inactivation in humans involves partial deletion of the alpha satellite array.”

As you can see, centromere inactivation is a well-studied phenomenon, and in humans in particular it has been seen to result in the inactivated centromere being partially deleted. This is precisely what we see in human chromosome 2.

DNA Contamination – The Implied Claim

In Jeff Tomkins’ most recent attempt to play down the genetic similarity between humans and chimpanzees he suggests that there is widespread human DNA contamination in the current chimpanzee genome, and that:

“Sequences […] from the seemingly less contaminated data sets indicate that the chimpanzee genome is approximately 85% identical overall to human.”

Analysis of 101 Chimpanzee Trace Read Data Sets: Assessment of Their Overall Similarity to Human and Possible Contamination With Human DNA

We can do some rough calculations here. Let’s say the current chimpanzee assembly contains 90% chimpanzee DNA (which Tomkins claims is only 85% identical to human DNA) and 10% human DNA (which is obviously 100% identical to itself). We can then work out an approximate “observed similarity” that we would see if we were to compare this supposedly contaminated chimpanzee genome to the human genome:

(90% x 85%) + (10% x 100%) = 86.5%

We can generalise this formula to work out the “observed similarity” given the relative amounts of chimpanzee DNA and human contaminant that found their way into the assembly:

(X% x 85%) + (Y% x 100%) = Z%


X% = Percentage of chimpanzee DNA
Y% = Percentage of human DNA (equal to 100% - X%)
Z% = Observed similarity

We can then rearrange this formula to work out X% (and therefore Y%) given the “observed similarity”:

X% = (Z% - 100%) / (85% - 100%)

If the “observed similarity” is somewhere around 98%, it follows that the current chimpanzee assembly is actually composed of 87% human DNA. That is clearly absurd.

Even a very conservative “observed similarity” of 95% implies that a full two thirds of the chimpanzee assembly isn’t chimpanzee DNA at all. I’m sure I’m not the only one that thinks this is more than a little far-fetched.
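Here is the same formula as a quick sketch, in case you want to plug in your own numbers. The 85% figure is Tomkins’ claim, the function names are mine, and everything is expressed as fractions rather than percentages.

```python
CHIMP_IDENTITY = 0.85   # Tomkins' claimed chimp-to-human identity
HUMAN_IDENTITY = 1.00   # human contaminant is identical to human


def observed_similarity(chimp_fraction: float) -> float:
    """Z: similarity we would measure for a given mix of chimp and human DNA."""
    human_fraction = 1.0 - chimp_fraction
    return chimp_fraction * CHIMP_IDENTITY + human_fraction * HUMAN_IDENTITY


def chimp_fraction_for(observed: float) -> float:
    """X: fraction of real chimp DNA implied by an observed similarity."""
    return (observed - HUMAN_IDENTITY) / (CHIMP_IDENTITY - HUMAN_IDENTITY)


for observed in (0.98, 0.95):
    human = 1.0 - chimp_fraction_for(observed)
    print(f"Observed {observed:.0%} implies {human:.0%} human contaminant")
```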

Chromosome 2 Fusion – What do subtelomeres look like?

As usual, we’ll start with a quote from Jeff Tomkins:

Second, the fusion-like sequence was very degenerate and only 70% similar to what one would expect of a pristine fusion sequence of the same size. Even if you assume an evolutionary timeline of up to six million years since the fusion event occurred, the data do not match up with known mutation rates or the variability found in human DNA.

Jeff Tomkins – More DNA Evidence Against Human Chromosome Fusion

The implicit assumption underlying this statement is that Jeff Tomkins believes that the sequences we find at the fusion site were – no more than six million years ago – pristine, perfect telomere repeats, and that they have since mutated into the “degenerate” arrays we see today. This assumption is utterly wrong-headed.

What he is ignoring is the fact that these “degenerate” arrays are found immediately adjacent to the telomeres in virtually all of the human chromosomes. What the chromosome 2 fusion sequence looks like to any reasonable, well-informed person is two chromosomes whose telomeres have been depleted to the point where these subtelomeric “degenerate” arrays are exposed, and the telomeres are no longer protecting the chromosomes from fusion.

First, here are some “degenerate” TTAGGG repeats, found at the “end” of some of our chromosomes:


Now here are some of the reverse motifs – CCCTAA – found at the “beginning” of our chromosomes:


As you can see, these are not perfect telomeric repeats, and it is quite easy to visualise the resulting sequence if one of these “degenerate” forward arrays fused head-to-head with one of these “degenerate” reverse arrays.
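If you want to play with the idea yourself, here is a toy sketch of a head-to-head fusion. The two arrays below are made up for illustration (they are not the real subtelomeric sequences), and the only real biology in the code is that CCCTAA is the reverse complement of TTAGGG.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


# Hypothetical "degenerate" forward array at the end of one chromosome
forward_array = "TTAGGG" "TTAGGG" "TTGGGG" "TTAGGG" "TAGGG" "TTAGGG"

# Hypothetical "degenerate" array from the other chromosome, which reads
# as CCCTAA-like repeats once it is joined head-to-head
reverse_array = reverse_complement("TTAGGG" "TTAGGGG" "TTAGGG" "CTAGGG")

fused = forward_array + reverse_array
print(fused)
```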

It would look something like the picture that Jeff Tomkins himself has provided:


Funny that.