Author: roohif

DNA Contamination – Laying Some Groundwork

In Dr Tomkins’ November 2016 paper, he claims that there is widespread human DNA contamination in the chimpanzee genome, and that this contamination inflates the overall percentage similarity between the two genomes.

The paper can be found here:

Of particular interest is this quote:

“When blasting chimpanzee trace reads onto an allegedly accurate representation of the chimpanzee genome, one would expect alignment identities of 99.9 to 100%.”

No, this is decidedly not the case, and since Dr Tomkins has previously written on the topic of genome assembly, he must know that this is not the case.

When genomes are assembled, there is usually an enormous amount of sequence data used – many times more than the size of the genome itself. So, for example, if the human genome is approximately 3 billion bases and we have obtained 18 billion bases of total sequence, we would say that we have a “6-fold redundant assembly”. What this means is that – on average – every base pair in the genome is covered 6 times.
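The coverage arithmetic is simple enough to sketch in a couple of lines of Python. The numbers are the round figures from the example above, and the function name is only illustrative.

```python
def fold_coverage(total_sequenced_bases: int, genome_size: int) -> float:
    """Average number of times each base in the genome is covered."""
    return total_sequenced_bases / genome_size

# Round figures from the example: ~3 billion base genome, 18 billion bases of reads.
print(fold_coverage(18_000_000_000, 3_000_000_000))  # 6.0
```

Keep in mind this is an average: some positions will be covered by more reads than this and some by fewer.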

Why do we need this redundancy? Simply because the individual reads themselves are not “99.9 to 100%” accurate. Genome assembly software attempts to overlap all these individual reads in order to come to a consensus on what the actual sequence is.

So, let’s say that for a particular nucleotide position in the sequence, four of those overlapping reads think that it is an Adenine (an “A”) while the fifth read thinks that it is a Guanine (“G”), and the sixth read thinks it is a Thymine (“T”). The software makes an educated guess that the true nucleotide value at that position is indeed an “A”, and considers the other two reads as having an error at that position.

But in reality, it is much more than a simple “majority rules” decision, because each base in each read also has a corresponding quality score. This quality score (usually referred to as a “Phred score”) is a measure of the confidence of the DNA sequencing machine in recording each base. You can read more about Phred scores elsewhere (Wikipedia is usually a good starting point), but for the purposes of this series of posts it is enough to know that a Phred score of 10 is low quality (implies an accuracy of 90%) while a Phred score of 60 is very high quality (implies an accuracy of 99.9999%).
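To make the idea concrete, here is a minimal Python sketch of the Phred conversion and of a quality-weighted vote at one position. The quality scores are made up, and summing Phred scores per base is my own simplification for illustration; real assemblers use more elaborate models.

```python
from collections import defaultdict

def phred_to_error(q: float) -> float:
    """Probability that a base call with Phred score q is wrong: 10^(-q/10)."""
    return 10 ** (-q / 10)

def weighted_consensus(calls):
    """Pick the base with the highest total Phred score at one position.

    `calls` is a list of (base, phred_score) pairs, one per overlapping read.
    Summing the scores is a simplification used only for illustration.
    """
    totals = defaultdict(float)
    for base, q in calls:
        totals[base] += q
    return max(totals, key=totals.get)

# The six reads from the example, with hypothetical quality scores.
column = [("A", 40), ("A", 35), ("A", 30), ("A", 38), ("G", 12), ("T", 9)]
print(weighted_consensus(column))  # A

# Quality can outvote a simple majority: two shaky "A" calls vs one confident "G".
print(weighted_consensus([("A", 5), ("A", 5), ("G", 40)]))  # G
```

The second call is the reason a plain head count is not enough: two low-confidence reads should not overrule one high-confidence read.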

Testing Dr Tomkins’ claim

We have three sources of data that we can use to test Dr Tomkins’ claim of human DNA contamination.

  1. The current human genome assembly.
  2. The current chimpanzee genome assembly.
  3. The raw chimpanzee trace read data sets used by Dr Tomkins.

What should we expect to see if we compare a raw chimpanzee trace read to the human genome assembly? What should we expect to see if we compare a raw chimpanzee trace read to the chimpanzee assembly?

With Dr Tomkins’ quote at the top of this post in mind – is he able to produce a single trace read from the archives that is a “99.9 to 100%” match to the human genome, but a lesser match to the chimpanzee genome? If so, did that read make it into the chimpanzee genome? If human DNA contamination is indeed a problem, then this should be a fairly trivial challenge.

Stay tuned.

Chromosome 2 Fusion – Dicentric Inactivation

From Jeff Tomkins’ most recent paper that supposedly addresses the criticisms leveled at his work by me and others:

“Another problem with the alleged cryptic centromere is its short length. The cryptic centromere site is extremely small compared to a real centromere — it is only 41,608 bases in length […] – a fraction of the size of human centromeres that range in length between 250,000 and 5,000,000 bases (Aldrup-Macdonald and Sullivan 2014). Thus, if this was in fact a relic centromere of an ancient chromosome fusion, its size should be greater than six times its current length at the minimum.”

From “Debunking the Debunkers: A Response to Criticism and Obfuscation Regarding Refutation of the Human Chromosome 2 Fusion”

No, the only problem here is Jeff Tomkins’ lack of familiarity with the literature and/or his complete unwillingness to research the topic.

In 2012, an excellent review paper discussed centromere inactivation in dicentric chromosomes (i.e. those chromosomes with two centromeres due to a fusion event). You can read the paper here: Dicentric chromosomes: unique models to study centromere function and inactivation.

Here are some pertinent quotes (emphasis is mine):

“[In budding yeast], the dicentrics could be stabilized if one of the centromeres underwent breakage and recombination that physically deleted one centromere.”

In fission yeast:

“Another ~10 % of the dicentrics remained fused, and the cells divided normally. These dicentrics were stabilized because one of the two centromeres had been physically deleted.”

But most importantly in humans:

“The remaining dicentric fusions underwent centromere inactivation between 4 days and 20 weeks after formation. […] Using semi-quantitative FISH, it was observed that the alpha satellite array of the inactive centromere became reduced in size after centromere inactivation. These results suggested that one mechanism of dicentric stabilization and centromere inactivation in humans involves partial deletion of the alpha satellite array.”

As you can see, centromere inactivation is a well-studied phenomenon, and in humans in particular it has resulted in the inactivated centromere being partially deleted. This is precisely what we see in human chromosome 2.

DNA Contamination – The Implied Claim

In Jeff Tomkins’ most recent attempt to play down the genetic similarity between humans and chimpanzees he suggests that there is widespread human DNA contamination in the current chimpanzee genome, and that:

“Sequences […] from the seemingly less contaminated data sets indicate that the chimpanzee genome is approximately 85% identical overall to human.”

Analysis of 101 Chimpanzee Trace Read Data Sets: Assessment of Their Overall Similarity to Human and Possible Contamination With Human DNA

We can do some rough calculations here. Let’s say the current chimpanzee assembly contains 90% chimpanzee DNA (which Tomkins claims is only 85% identical to human DNA) and 10% human DNA (which is obviously 100% identical to itself). We can then work out an approximate “observed similarity” that we would see if we were to compare this supposedly contaminated chimpanzee genome to the human genome:

(90% x 85%) + (10% x 100%) = 86.5%

We can generalise this formula to work out the “observed similarity” given the relative amounts of chimpanzee DNA and human contaminant that found their way into the assembly:

(X% x 85%) + (Y% x 100%) = Z%


X% = Percentage of chimpanzee DNA
Y% = Percentage of human DNA (equal to 100% - X%)
Z% = Observed similarity

We can then rearrange this formula to work out X% (and therefore Y%) given the “observed similarity”:

X% = (Z% - 100%) / (85% - 100%)

If the “observed similarity” is somewhere around 98%, it follows that the current chimpanzee assembly is actually composed of 87% human DNA. That is clearly absurd.

Even a very conservative “observed similarity” of 95% implies that a full two thirds of the chimpanzee assembly isn’t chimpanzee DNA at all. I’m sure I’m not the only one that thinks this is more than a little far-fetched.
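If you would like to check these numbers yourself, the rearranged formula fits in a few lines of Python. This is only a sketch of the arithmetic above; the function name is mine, and the 85% default is Tomkins’ claimed figure, not an established one.

```python
def implied_chimp_fraction(observed: float, claimed_similarity: float = 0.85) -> float:
    """Solve observed = X * claimed_similarity + (1 - X) * 1.0 for X."""
    return (observed - 1.0) / (claimed_similarity - 1.0)

for observed in (0.865, 0.95, 0.98):
    chimp = implied_chimp_fraction(observed)
    print(f"observed {observed:.1%}: chimp {chimp:.1%}, human contaminant {1 - chimp:.1%}")
```

The first value is the 90/10 sanity check from earlier; the other two are the 95% and 98% cases.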

Chromosome 2 Fusion – What do subtelomeres look like?

As usual, we’ll start with a quote from Jeff Tomkins:

Second, the fusion-like sequence was very degenerate and only 70% similar to what one would expect of a pristine fusion sequence of the same size. Even if you assume an evolutionary timeline of up to six million years since the fusion event occurred, the data do not match up with known mutation rates or the variability found in human DNA.

Jeff Tomkins – More DNA Evidence Against Human Chromosome Fusion

The implicit assumption underlying this statement is that Jeff Tomkins believes that the sequences we find at the fusion site were – no more than six million years ago – pristine, perfect telomere repeats, and that they have since mutated into the “degenerate” arrays we see today. This assumption is utterly wrong-headed.

What he is ignoring is the fact that these “degenerate” arrays are found immediately adjacent to the telomeres in virtually all of the human chromosomes. What the chromosome 2 fusion sequence looks like to any reasonable, well-informed person is two chromosomes whose telomeres have been depleted to the point where these subtelomeric “degenerate” arrays are exposed, and the telomeres are no longer protecting the chromosomes from fusion.

First, here are some “degenerate” TTAGGG repeats, found at the “end” of some of our chromosomes:


Now here are some of the reverse motifs – CCCTAA – found at the “beginning” of our chromosomes:


As you can see, these are not perfect telomeric repeats, and it is quite easy to visualise the resulting sequence if one of these “degenerate” forward arrays fused head-to-head with one of these “degenerate” reverse arrays.

It would look something like the picture that Jeff Tomkins himself has provided:


Funny that.
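The head-to-head join described above can also be sketched with a toy example in Python. The repeat units below are made up for illustration and are not real subtelomeric sequence; the only point is that a TTAGGG-like array from one chromosome end, joined to the end of another chromosome, reads as TTAGGG-like units followed by CCCTAA-like units on a single strand.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq))

# Toy "degenerate" forward array: mostly TTAGGG with a few variant units.
forward_array = "TTAGGG" + "TTGGGG" + "TTAGGG" + "TGAGGG" + "TTAGGG"

# The end of the other chromosome, also written as a TTAGGG-like array.
other_end = "TTAGGG" + "TTAGGG" + "TTGGGG" + "TTAGGG"

# Joined head-to-head, the second array is read as its reverse complement.
fusion = forward_array + reverse_complement(other_end)
print(fusion)
```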

Chromosome 2 Fusion – What should we expect?

Man, if I had a dollar for every time somebody told me that the fusion site doesn’t look like what we would expect it to look like, I’d have twenty-something million dollars. So, I thought I might download some DNA sequences for known mammalian fusions and show you what they look like.

Now the sequences I’m about to show you are from the Indian muntjac species, and if I may quote from a paper published in 2008:

Indian muntjac (Muntiacus muntjak vaginalis) has an extreme mammalian karyotype, with only six and seven chromosomes in the female and male, respectively. Chinese muntjac (Muntiacus reevesi) has a more typical mammalian karyotype, with 46 chromosomes in both sexes.

Comparative sequence analyses reveal sites of ancestral chromosomal fusions in the Indian muntjac genome

So clearly there have been a bunch of fusions in this Indian muntjac species and – luckily for us – the authors decided to sequence these fusion sites to see what they looked like. Now, remember these are not telomere-to-telomere fusions; they are telomere-to-satellite fusions, so they only correspond to one side of the human chromosome 2 fusion site.

“In chromosome fusion events that occur in nature in living mammals—a very rare event—the DNA signature always involves satDNA producing a DNA signature that occurs as either satDNA-satDNA or satDNA-teloDNA sequence.”

More DNA Evidence Against Human Chromosome Fusion


This is what fusions actually look like.








In the Immortal Words of Dr Tomkins

First, the sequence was only about 800 bases long—not the 10,000 bases or more you would expect if two 5,000-base (or larger) telomeres fused together.

Second, the fusion-like sequence was very degenerate and only 70% similar to what one would expect of a pristine fusion sequence of the same size.

More DNA Evidence Against Human Chromosome Fusion

Who here agrees with him?

Chromosome 2 Fusion – The Cryptic Centromere

This is a brief tutorial on how one goes about demonstrating the existence of a cryptic centromere on human chromosome 2. It is in response to this point from Jeff Tomkins:

“The purported cryptic centromere on human chromosome 2, like the fusion site, is in a very different location to that predicted by a fusion event.”

New Research Undermines Key Argument for Human Evolution

So, first of all I need to mention that Jeff Tomkins implicitly admits that there is such a putative centromere, but his objection is that it is not where it should be. Nevertheless I’ll show you how to find it and then show that it is where it is expected to be.

So what are we looking for?

“The DNA evidence in question is based on the fact that human, great-ape, and other mammalian centromeres are composed of a highly variable class of DNA sequence that is repeated over and over called alpha-satellite or alphoid DNA. Alphoid DNA, although found in centromeric areas, is not unique to centromeres and is even highly variable between homologous regions throughout the same mammalian genome.”

So basically what we are looking for is a large cluster of these alphoid sequences. As Tomkins states, alphoid sequences are not unique to centromeres, but we shouldn’t find large clusters elsewhere on the chromosome.

BLAST away!

So let’s get a list of all the alphoid sequences that we can find on chromosome 2:

[glenn@macha] cat alphoid.fa
>gi|117911456|emb|CS444613.1| Sequence 51 from Patent WO2006110680

… and then …

[glenn@macha] blastn -query alphoid.fa -subject /Users/glenn/Data/hg19/chr2.fa \
 -outfmt '10 sstart send pident nident length evalue' -out alphoid.csv \
 -task blastn -dust no -soft_masking false -word_size 7 -evalue 1e-30

This command will search chromosome 2 for anything that looks like an alphoid sequence, and write the results to a file named alphoid.csv, and this is what the file looks like after it has been sorted:


That first field (“sstart” in our command above) is where the matching DNA starts on chromosome 2. So if you look at the file in its entirety, you’ll see that there are 483 matches for this alphoid sequence across chromosome 2, and the vast majority – all but 2 of those 483 matches – are clustered around two locations.

The first location is around the 92Mb mark – and this corresponds to the beginning of the active centromere; the second location is around the 133Mb mark.
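If you would rather not eyeball the sorted file, a few lines of Python will count the hits per megabase for you. This is a sketch that assumes the column order from the -outfmt string above; the function names and the example rows are made up for illustration.

```python
import csv
from collections import Counter

def hits_per_megabase(rows):
    """Count BLAST hits per 1 Mb bin of the subject sequence.

    Each row follows the column order of the -outfmt string used above:
    sstart, send, pident, nident, length, evalue.
    """
    bins = Counter()
    for row in rows:
        if not row:
            continue  # skip blank lines
        sstart, send = int(row[0]), int(row[1])
        # Minus-strand hits are reported with sstart > send, so use the
        # leftmost coordinate when binning.
        bins[min(sstart, send) // 1_000_000] += 1
    return bins

def load_hits(path):
    """Read the CSV written by blastn into a list of rows."""
    with open(path, newline="") as handle:
        return list(csv.reader(handle))

# Made-up rows, only to show the shape of the output. For the real data:
#   hits_per_megabase(load_hits("alphoid.csv"))
example_rows = [
    ["92300000", "92300170", "95.3", "163", "171", "2e-70"],
    ["92300900", "92300730", "94.7", "162", "171", "9e-69"],
    ["133000500", "133000670", "90.1", "154", "171", "3e-55"],
]
for mb, count in sorted(hits_per_megabase(example_rows).items()):
    print(f"{mb} Mb: {count} hit(s)")
```

On the real file, the two clusters should show up as a handful of adjacent bins holding nearly all of the hits.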

Could this be our centromere?

Well it certainly is a cluster of alphoid sequences, but is it in the right place? Let’s have a look at the genes either side of this cluster:


What you should be looking at here are all the genes that precede the cryptic centromere (from PLEKHB2 down to ANKRD30BL) and their corresponding position on chimpanzee chromosome 2B. Now a couple of the corresponding chimpanzee genes are found on scaffolds (the ones beginning with AACZ or GL), but for the genes that have been placed on the chromosome, you can see that they are all around the 132Mb mark.

For the genes on the other side of the cryptic centromere (GPR39 and LYPD1) you’ll notice that the corresponding genes on chimpanzee chromosome 2B are found near the 136Mb mark.

And what, pray tell, is in that gap between 132Mb and 136Mb on chimpanzee chromosome 2B? The centromere!

To recap

  1. On human chromosome 2 there are two clusters of alphoid sequences.
  2. One of those clusters is the current active centromere.
  3. The other cluster corresponds well to the centromere on chimpanzee chromosome 2B.

I’m gonna say it’s our cryptic centromere …

Chromosome 2 Fusion – It’s a Binding Site. Whoopty-frikkin-do.

Both Jeff Tomkins:

“Clearly, the putative 800 base fusion site is not a degenerate fusion sequence, but a transcriptionally functional and active DNA binding motif read on the minus strand inside the DDX11L2 gene.”

Alleged Human Chromosome 2 “Fusion Site” Encodes an Active DNA Binding Domain Inside a Complex and Highly Expressed Gene—Negating Fusion

… and Cornelius Hunter over at Darwin’s God:

“Genes shouldn’t be there, regardless of expression level, and TFs shouldn’t be binding there.”

The Naked Ape: BioLogos on Human Chromosome Two

… seem to make much of the fact that the fusion site on Human chromosome 2 contains a transcription factor binding site.

The Evidence

In Tomkins’ paper, he posts an image of the UCSC Genome Browser that shows the evidence for transcription factor binding activity at the fusion site. I’ve reproduced the image below, but zoomed in on the relevant sections:


The first section (in red) shows where the 798 base pair fusion site sits on the chromosome. The next section shows the two DDX11L2 transcripts, transcribed from right to left. As you can see, the longer transcript completely encompasses the fusion site. So far, so good.

But what about the green bumps and the grey bars?

The Green Bumps

Without getting into the details of ChIP-seq, the bumps seen here are a good proxy for the relative strength of the binding site. A high peak signifies a strong and/or frequent bond, while a low peak signifies a weak and/or infrequent bond. The particular transcription factor here is named CTCF, and the highest peak we can see in the image above is 0.0222.

But how big is 0.0222? What do we have to compare it to? Back on the Wikipedia page for CTCF, it says that there are “anywhere between 15,000 – 40,000 CTCF binding sites” in the human genome. That would imply that there are somewhere between 1,200 and 3,100 CTCF binding sites on chromosome 2 alone. Maybe the binding site in the fusion sequence is counted among them?
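The chromosome 2 estimate is just a proportion, and can be sketched as follows. The two lengths are round figures that I am assuming here (chromosome 2 at roughly 243 Mb out of a genome of roughly 3,100 Mb), and the calculation also assumes that binding sites are spread evenly across the genome.

```python
CHR2_MB = 243       # approximate length of chromosome 2 in Mb (assumption)
GENOME_MB = 3_100   # approximate length of the human genome in Mb (assumption)

def expected_sites_on_chr2(genome_wide_sites: int) -> float:
    """Scale a genome-wide count by chromosome 2's share of the genome."""
    return genome_wide_sites * CHR2_MB / GENOME_MB

for total in (15_000, 40_000):
    print(f"{total} genome-wide -> about {round(expected_sites_on_chr2(total))} on chromosome 2")
```

That lands in the same range as the rounded figures quoted above.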

No. Not even close. Fortunately the ENCODE data behind those green bumps is freely available. If you’re really keen, you can download the file here:

(Cell Line = H1-hESC; Antibody Target = CTCF (07-729); View = Peaks)

A cursory glance will tell you that a value of 0.0222 doesn’t even make it into the published data – the minimum value to be counted as a peak by ENCODE is 0.1000. If you’re not good at math, that’s 4.5 times taller than Dr Tomkins’ biggest green bump.

And how many peaks are there on chromosome 2 taller than 0.1000, you ask? Almost 6,000 of them. And how tall are they? Well, if I were to take the 1,000 tallest peaks on chromosome 2, the average height would be 2.1188. Yup, that’s almost one hundred times higher than the tallest peak in the fusion site. Can you see that little red pixel? That’s the binding site.


Kinda puts things into perspective, doesn’t it?
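For anyone who wants to check the ratios, here they are in Python. The three values are the ones quoted above; the variable names are mine.

```python
fusion_site_peak = 0.0222  # tallest CTCF peak at the fusion site
encode_min_peak = 0.1000   # smallest value in the published peak data
top_1000_mean = 2.1188     # mean of the 1,000 tallest peaks on chromosome 2

print(f"{encode_min_peak / fusion_site_peak:.1f}x")  # 4.5x
print(f"{top_1000_mean / fusion_site_peak:.1f}x")    # 95.4x
```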

The Grey Bars

This is where things get a lot more interesting. The grey bars at the bottom of the image represent the collated signals for all the different transcription factors. In a similar fashion to the CTCF data above, a black bar represents a strong and/or frequent bond, while a grey bar represents a weak and/or infrequent bond.

If you dig down into the data, you’ll see that the transcription factor that causes the bar to be black is RNA Polymerase II. Great. But what is such a strong signal doing in a region that was supposedly caused by a fusion of sub-telomeric DNA? Both Dr Tomkins and Dr Hunter seem to think that transcription factor binding sites and sub-telomeric DNA are mutually exclusive.

No. No they are not.

Let’s move away from DDX11L2 for a minute and have a look at DDX11L1. It’s at the beginning of chromosome 1.


Would you like to see the DNA from the binding site immediately upstream? Here it is:

>hg19_wgEncodeRegTfbsClusteredV2_Pol2-4H8 range=chr1:10134-10362

Does it remind you of anything? Telomere repeats, perhaps?

Let’s look at DDX11L5 on chromosome 9. I wonder what the binding site sequence looks like there.

>hg19_wgEncodeRegTfbsClusteredV2_Pol2-4H8 range=chr9:9965-10327

Yup. Looks awfully like telomere repeats. The transcription factor binding site for DDX11L9 – right at the end of chromosome 15 – looks like this:

>hg19_wgEncodeRegTfbsClusteredV2_Pol2-4H8 range=chr15:102521116-102521200

Should I go on, or are Dr Tomkins and Dr Hunter willing to concede the point?
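If you want to put a rough number on “looks like telomere repeats”, one option is to measure how much of a sequence is covered by the canonical hexamers. This is a minimal sketch: the function name and the exact-match criterion are my own assumptions, and degenerate repeats would need a more tolerant match than this.

```python
import re

def telomere_repeat_fraction(seq: str) -> float:
    """Fraction of bases covered by exact TTAGGG or CCCTAA hexamers."""
    seq = seq.upper()
    if not seq:
        return 0.0
    covered = [False] * len(seq)
    # The lookahead lets overlapping occurrences be found as well.
    for match in re.finditer(r"(?=(TTAGGG|CCCTAA))", seq):
        for i in range(match.start(), match.start() + 6):
            covered[i] = True
    return sum(covered) / len(seq)

# Toy inputs, not the sequences from the browser.
print(telomere_repeat_fraction("CCCTAACCCTAA"))  # 1.0
print(telomere_repeat_fraction("CCCTAAACGTAC"))  # 0.5
```

Paste in the sequence for any of the binding sites above and compare the result with a stretch of ordinary genomic DNA.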