In his November 2016 paper, Dr Tomkins claims that there is widespread human DNA contamination in the chimpanzee genome, and that this contamination inflates the overall percentage similarity between the two genomes.
The paper can be found here:
Of particular interest is this quote:
“When blasting chimpanzee trace reads onto an allegedly accurate representation of the chimpanzee genome, one would expect alignment identities of 99.9 to 100%.”
No, this is decidedly not the case, and since Dr Tomkins has previously written on the topic of genome assembly, he must know that it is not.
When genomes are assembled, there is usually an enormous amount of sequence data used – many times more than the size of the genome itself. So, for example, if the human genome is approximately 3 billion bases and we have obtained 18 billion bases of total sequence, we would say that we have a “6-fold redundant assembly”. What this means is that – on average – every base pair in the genome is covered 6 times.
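The arithmetic behind that figure is a single division. As a minimal sketch (the function name `fold_coverage` is only illustrative, and the numbers are the round figures used above):

```python
def fold_coverage(total_sequenced_bases: int, genome_size: int) -> float:
    """Average number of times each base in the genome is covered by reads."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_sequenced_bases / genome_size


# Round figures from the example above: ~3 billion bases in the genome,
# 18 billion bases of total sequence obtained.
coverage = fold_coverage(18_000_000_000, 3_000_000_000)
print(f"{coverage:.0f}-fold redundant assembly")
```

Bear in mind that this is an average: some regions of a real genome end up covered by many more reads than this, and others by fewer.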
Why do we need this redundancy? Simply because the individual reads themselves are not “99.9 to 100%” accurate. Genome assembly software attempts to overlap all these individual reads in order to come to a consensus on what the actual sequence is.
So, let’s say that for a particular nucleotide position in the sequence, four of the six overlapping reads think that it is an adenine (an “A”), while the fifth read thinks that it is a guanine (“G”) and the sixth read thinks it is a thymine (“T”). The software makes an educated guess that the true nucleotide at that position is indeed an “A”, and treats the other two reads as having an error at that position.
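That simple vote can be sketched in a few lines. This is only an illustration of the “majority rules” idea for a single position, not how any particular assembler is implemented, and the name `majority_base` is made up for the example:

```python
from collections import Counter


def majority_base(calls):
    """Return the most frequently called base at one position."""
    if not calls:
        raise ValueError("no base calls at this position")
    counts = Counter(calls)
    base, _count = counts.most_common(1)[0]
    return base


# The six overlapping reads from the example: four say A, one G, one T.
calls_at_position = ["A", "A", "A", "A", "G", "T"]
print(majority_base(calls_at_position))
```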
But in reality, it is much more than a simple “majority rules” decision, because each base in each read also has a corresponding quality score. This quality score (usually referred to as a “Phred score”) is a measure of the confidence of the DNA sequencing machine in recording each base. You can read more about Phred scores elsewhere (Wikipedia is usually a good starting point: https://en.wikipedia.org/wiki/Phred_quality_score#Definition), but for the purposes of this series of posts it is enough to know that the scale is logarithmic: every 10 points means a tenfold drop in the chance of an error. A Phred score of 10 is low quality (implies an accuracy of 90%) while a Phred score of 60 is very high quality (implies an accuracy of 99.9999%).
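To make the numbers concrete, here is a small sketch of the standard Phred relationship (error probability = 10^(-Q/10)), followed by a deliberately simplified quality-weighted vote. The weighting scheme (adding up the Phred scores behind each candidate base) and the function names are assumptions chosen for illustration; real assembly software uses more sophisticated models.

```python
from collections import defaultdict


def phred_to_error_probability(q: float) -> float:
    """Standard Phred definition: P(error) = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def phred_to_accuracy(q: float) -> float:
    """Probability that the base call is correct."""
    return 1 - phred_to_error_probability(q)


def quality_weighted_base(calls):
    """Pick the base with the highest summed Phred score.

    `calls` is a list of (base, phred_score) pairs for one position.
    Summing scores is an illustrative simplification, not a real
    assembler's algorithm.
    """
    if not calls:
        raise ValueError("no base calls at this position")
    totals = defaultdict(float)
    for base, q in calls:
        totals[base] += q
    return max(totals, key=totals.get)


for q in (10, 20, 30, 60):
    print(f"Phred {q}: accuracy {phred_to_accuracy(q):.4%}")

# Two low-quality reads say A, one very high-quality read says G.
# A plain majority vote would choose A; the weighted vote chooses G.
print(quality_weighted_base([("A", 10), ("A", 10), ("G", 60)]))
```

The last line shows why quality matters: a single very confident read can reasonably outweigh two unreliable ones.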
Testing Dr Tomkins’ claim
We have three sources of data that we can use to test Dr Tomkins’ claim of human DNA contamination.
- The current human genome assembly.
- The current chimpanzee genome assembly.
- The raw chimpanzee trace read data sets used by Dr Tomkins.
What should we expect to see if we compare a raw chimpanzee trace read to the human genome assembly? What should we expect to see if we compare a raw chimpanzee trace read to the chimpanzee assembly?
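Both questions come down to a percent-identity figure for an alignment. As a rough sketch of how such a figure is calculated (identical positions divided by alignment length, with gaps counted as non-matching), assuming the read and the reference have already been aligned and padded with “-” for gaps; the function name and the short sequences are invented for illustration and are not real trace data:

```python
def percent_identity(aligned_read: str, aligned_reference: str) -> float:
    """Percentage of alignment columns where read and reference agree.

    Both strings must be the same length, with '-' marking gaps.
    Gap columns count toward the alignment length but never as matches.
    """
    if len(aligned_read) != len(aligned_reference):
        raise ValueError("aligned sequences must be the same length")
    if not aligned_read:
        raise ValueError("alignment is empty")
    matches = sum(
        1
        for a, b in zip(aligned_read, aligned_reference)
        if a == b and a != "-"
    )
    return 100.0 * matches / len(aligned_read)


# Invented 10-base example with a single mismatch.
print(percent_identity("ACGTACGTAC", "ACGTACGAAC"))
```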
With Dr Tomkins’ quote at the top of this post in mind – is he able to produce a single trace read from the archives that is a “99.9 to 100%” match to the human genome, but a lesser match to the chimpanzee genome? If so, did that read make it into the chimpanzee genome? If human DNA contamination is indeed a problem, then this should be a fairly trivial challenge.