Exactly how similar is the human genome to the chimpanzee genome? Are we 99% identical? 97%? 95%? 88%? Or only 70%? And is the “Myth of 1%” actually a myth? Well, yes and no – it depends on how you calculate your result.
To fully understand the different percentages that are thrown around, you first need to know how the DNA of one species is compared to another species.
BLAST+ is a software package that is commonly used to compare DNA. The way it usually works is that you have a small DNA sequence that you want to search for inside a larger sequence, be that in the same species or another species. In BLAST+, the small sequence is called the Query Sequence, while the larger sequence – the search space – is called the Subject Sequence. You can also think of the Query Sequence as the “needle” and the Subject Sequence as the “haystack”.
BLAST+ uses some clever programming tricks to find stretches of DNA inside the Subject Sequence that aren’t necessarily identical to the Query Sequence, but are almost identical. For example, let’s say your Query Sequence is:
And somewhere inside your Subject Sequence, you have a perfect alphabet:
BLAST+ is clever enough to work out that the G has been swapped for an S, and the Q has been swapped for an H. It will tell you that it found a match in the Subject Sequence that is 26 letters long, but only 24 of those letters match, which means it is 92.3% identical.
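The arithmetic behind that 92.3% can be checked with a few lines of Python. This is only a minimal sketch of the identity calculation, not of how BLAST+ finds the match. The query string is a hypothetical one, built by taking the alphabet and swapping G for S and Q for H as described above, and `percent_identity` is a name made up for this illustration.

```python
def percent_identity(query: str, subject: str) -> float:
    """Percent of aligned positions where both sequences carry the same letter."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    matches = sum(q == s for q, s in zip(query, subject))
    return 100.0 * matches / len(query)


# Hypothetical query: the alphabet with G replaced by S and Q replaced by H.
query = "ABCDEFSHIJKLMNOPHRSTUVWXYZ"
subject = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

# 24 of the 26 letters match.
print(f"{percent_identity(query, subject):.1f}%")
```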
BLAST+ is also able to work out when a letter has been added or removed. If we add a letter to our Query Sequence, then BLAST+ will introduce a space into the Subject Sequence, and align them as follows:
Since the alignment is now 27 letters long rather than 26, with 2 differences, 25 of the 27 positions match and the result is now 92.6% identical.
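Here is a small sketch of how a gapped alignment is scored, assuming the gap is written as a `-` character and counts as a difference. The two aligned strings are hypothetical, chosen only so that the counts agree with the figures above (27 columns, 2 differences): one swapped letter and one extra D in the query. `gapped_identity` is an illustrative name, not a BLAST+ function.

```python
GAP = "-"


def gapped_identity(aligned_query: str, aligned_subject: str):
    """Return (matches, columns, percent identity) for two aligned strings.

    A column containing a gap character never counts as a match.
    """
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    columns = len(aligned_query)
    matches = sum(
        q == s and q != GAP for q, s in zip(aligned_query, aligned_subject)
    )
    return matches, columns, 100.0 * matches / columns


# Hypothetical alignment: S in place of G, plus an extra D in the query
# that forces a gap in the subject.
aligned_query = "ABCDEFSHIJKLMNOPDQRSTUVWXYZ"
aligned_subject = "ABCDEFGHIJKLMNOP-QRSTUVWXYZ"

matches, columns, identity = gapped_identity(aligned_query, aligned_subject)
print(f"{matches}/{columns} = {identity:.1f}%")
```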
A good method to estimate an overall percentage identity for the human and chimpanzee genomes is to take a very large number of small sequences – chosen at random – from one species, and look for the best match for each sequence in the genome of the other species. If you take an average of all those results, then you should have a fairly reliable figure for the overall percentage identity.
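The sampling idea can be sketched on toy data. The code below is an illustration under loud assumptions: the "genomes" are short made-up strings, the best match is found by brute force without gaps (real tools such as BLAST+ use heuristics and allow gaps), and the function names are hypothetical.

```python
import random


def best_ungapped_identity(fragment: str, genome: str) -> float:
    """Best percent identity of `fragment` against any same-length window of `genome`."""
    n = len(fragment)
    best = 0.0
    for start in range(len(genome) - n + 1):
        window = genome[start:start + n]
        matches = sum(a == b for a, b in zip(fragment, window))
        best = max(best, 100.0 * matches / n)
    return best


def estimate_identity(source: str, target: str, n_samples: int,
                      length: int, seed: int = 0) -> float:
    """Average the best-match identity of randomly chosen fragments of `source`."""
    rng = random.Random(seed)  # fixed seed so the toy run is repeatable
    scores = []
    for _ in range(n_samples):
        start = rng.randrange(len(source) - length + 1)
        fragment = source[start:start + length]
        scores.append(best_ungapped_identity(fragment, target))
    return sum(scores) / len(scores)


# Toy data: `target` is `source` with a single letter changed.
source = "ACGTTGCAAGCTTAGGCTAACGTAGCTAGGATCCGATTACA"
target = source[:10] + "T" + source[11:]

print(estimate_identity(source, source, n_samples=20, length=10))
print(estimate_identity(source, target, n_samples=20, length=10))
```

Comparing a sequence against itself should always give 100, which makes a handy sanity check for any pipeline built along these lines.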
Now that we’re armed with some basic knowledge, let’s look at some of the published material.
Jeff Tomkins’ Epic Fail
Back in February of 2013, Jeff Tomkins published a paper claiming that the overall similarity between the DNA of humans and chimpanzees was a lowly 70%. In early 2014, I set out to replicate the results of this study and soon discovered that Tomkins had succumbed to a bug in the BLAST+ software. The bug made itself known when the user submitted a large number of query sequences all at once. If, for example, you submitted 100,000 query sequences that were each 500 bases long, then BLAST+ may have only returned matches for around 75,000 query sequences. However, if you had submitted them one at a time, then you would receive matches for all 100,000 query sequences.
Obviously this has the effect of drastically understating the true percentage identity, and, in a paper I submitted to Answers Research Journal in September 2014, I demonstrated exactly that. I also showed that if you correct for the effects of the bug, you will get a result of approximately 96.9%.
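One way to see how dropped queries can drag an average down is sketched below. It rests on an assumption that the text above does not spell out: that queries with no returned match are counted as 0% identical. The query IDs and identity values are invented for illustration, and both function names are hypothetical.

```python
def missing_queries(submitted_ids, hit_ids):
    """IDs that were submitted but came back with no match at all."""
    return sorted(set(submitted_ids) - set(hit_ids))


def mean_identity(best_identity_by_query, submitted_ids, count_missing_as_zero):
    """Average best-match identity, with or without penalising missing queries."""
    if count_missing_as_zero:
        # Assumption: a query with no reported match contributes 0%.
        total = sum(best_identity_by_query.get(q, 0.0) for q in submitted_ids)
        return total / len(submitted_ids)
    found = [best_identity_by_query[q] for q in submitted_ids
             if q in best_identity_by_query]
    return sum(found) / len(found)


# Invented data: four queries submitted, only three returned a match.
submitted = ["q1", "q2", "q3", "q4"]
best = {"q1": 97.0, "q2": 97.0, "q3": 97.0}

print(missing_queries(submitted, best))        # queries that silently vanished
print(mean_identity(best, submitted, True))    # missing counted as 0%
print(mean_identity(best, submitted, False))   # missing left out of the average
```

Checking the submitted IDs against the returned IDs, as `missing_queries` does, is a cheap way to notice that a batch run has lost some of its queries.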
Tomkins Comes Clean. Sort Of.
Fast forward more than a year to October 2015. After countless attempts on my part to get a response from Jeff Tomkins (via the journal’s editor, Andrew Snelling), he published a ‘retraction of sorts’ in which he acknowledged the glitch in the software, and that it had affected his results. Rather unsurprisingly though, my paper – which prompted his response paper – was rejected without any reasons given. In this most recent paper, Tomkins used three different software packages – BLAST+, LASTZ and nucmer – in order to find a consensus between them. Both BLAST+ and nucmer gave him a result around 88%, while the LASTZ result was only 73%. Based on his comments in the paper, even Jeff Tomkins didn’t put much faith in the LASTZ result.
Mind The Gap!
So why did his new BLAST+ analysis give a result of only 88% and not something around 97%? That can be explained by his use of the ungapped parameter of BLAST+. Remember up near the top when I said that BLAST+ was clever enough to work out when a letter was added or removed? Well, you can tell BLAST+ not to do that by adding this ungapped parameter. So, using the same example above, BLAST+ does just fine up to the point at which there is a D where the P should be, but you’ve just told it you are not allowed to put a gap in there, so BLAST+ throws in the towel.
This is fine if you want alignments with no gaps, but the problem comes when Jeff Tomkins comes to calculate a percentage identity for that particular match. Since the query sequence was 27 letters long, and the software gave up after 16 letters (only one of which was different), he chooses to report only a 55.6% identity for that query sequence (15 matching letters out of 27), intentionally ignoring the 10 identical letters on the other side of the [added | removed] letter.
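The difference comes down to which denominator you divide by. The sketch below just reruns the arithmetic with the counts given above (27-letter query, 16 aligned letters, 15 of them matching); `identity_over` is a made-up helper name.

```python
def identity_over(matches: int, denominator: int) -> float:
    """Percent identity for a given number of matches and a chosen denominator."""
    return 100.0 * matches / denominator


query_length = 27    # full length of the query sequence
aligned_length = 16  # letters aligned before the ungapped search stopped
matches = 15         # aligned letters that were identical

# Dividing by the aligned length describes the match that was actually found.
print(f"over alignment: {identity_over(matches, aligned_length):.2f}%")

# Dividing by the full query length counts every unaligned letter as a mismatch.
print(f"over query:     {identity_over(matches, query_length):.2f}%")
```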
Tomkins has been aware of this issue since mid-2014, and the fact that he employed the same methodology in his most recent paper says loud and clear that he is not at all interested in the truth.
But what about the other 88% result?
Yeah, I wondered about that as well, so I downloaded the MUMmer package and ran nucmer with exactly the same parameters as Jeff Tomkins did in his paper.
A little bit of background here before I explain what I found. Virtually all geneticists would be aware of the huge amount of repetition in the human genome. These are mostly repeat elements, but there are also quite a lot of genes that have been duplicated over the course of our evolution; be they active genes or pseudogenes. The point is that when you submit a query sequence, often you will get multiple matches across the genome. Generally, the match with the highest percentage similarity is the ‘syntenic match’ (that is, found in approximately the same location on the corresponding chromosome). If you want to compare apples with apples, then the obvious thing to do is only take the best match into your calculation.
But what did Jeff Tomkins do? He took the average of all the matches. So, if his query sequence contained an Alu repeat motif – of which there are many thousands across the genome – then not only would nucmer return the ‘syntenic match’, it would also return many hundreds (if not thousands) of poorer matches across the corresponding chromosome. This brings the average similarity down from 97% to around 88%.
This methodology has the absurd consequence that if a human chromosome is compared to the very same human chromosome, then you can conclude that human DNA is only 89% similar to itself!
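The self-comparison paradox is easy to reproduce on toy numbers. In the sketch below, every query has a perfect 100% match (as it would when a chromosome is compared to itself), plus a few weaker hits standing in for repeats elsewhere. All of the values are invented for illustration and the function names are hypothetical.

```python
def mean_best_hit(hits_by_query):
    """Average of the single best identity per query."""
    best = [max(hits) for hits in hits_by_query.values()]
    return sum(best) / len(best)


def mean_all_hits(hits_by_query):
    """Average over every hit, so repeats are counted many times."""
    every_hit = [h for hits in hits_by_query.values() for h in hits]
    return sum(every_hit) / len(every_hit)


# Invented self-comparison: each query matches itself perfectly,
# and some also hit weaker repeat copies elsewhere.
hits = {
    "q1": [100.0, 85.0, 82.0, 80.0],
    "q2": [100.0],
    "q3": [100.0, 84.0],
}

print(f"best hit only: {mean_best_hit(hits):.1f}%")
print(f"all hits:      {mean_all_hits(hits):.1f}%")
```

Taking only the best hit per query gives the sensible answer of 100% for a sequence compared with itself, while averaging every hit does not.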
Once again, Jeff Tomkins is very much aware of this problem, but neither he nor Answers Research Journal has retracted the paper.
Fine, but are we 97% or 99% identical?
Like I said, that depends on how you calculate your result; in particular, how you choose to account for insertions and deletions (‘indels’). This figure of 99% – which does indeed date back to the 1970s – does not take indels into account. The difficulty with indels is that they can be anywhere from one nucleotide to several thousand nucleotides long. Should we give each of those thousand nucleotides in a large indel the same weight as a single substitution? Or should we give each indel as a whole the same weight as a single nucleotide substitution, even though the latter occurs much more frequently?
In my results, I tend to take the more conservative approach and effectively treat an indel that is 100 letters long as if it were 100 separate substitutions. If I take this approach, I will get an overall result of around 97%.
If I take into account the relative frequencies of indels to single substitutions – and assume that each indel happened as a single event – then I will support an overall result of around 98%.
If I need to calculate the actual mutation rate – say, for molecular clock calculations – then I will use a figure that excludes indels entirely, and that figure is 99%.
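The three ways of counting can be put side by side in code. This is a sketch of the bookkeeping only: the counts below are invented toy numbers, not genome data, so they are not meant to reproduce the 97%, 98% and 99% figures, and the function names are hypothetical.

```python
def identity_per_base_indels(matches, substitutions, indel_lengths):
    """Every letter inside an indel counts as one difference."""
    differences = substitutions + sum(indel_lengths)
    return 100.0 * matches / (matches + differences)


def identity_per_event_indels(matches, substitutions, indel_lengths):
    """Each indel counts as a single difference, whatever its length."""
    differences = substitutions + len(indel_lengths)
    return 100.0 * matches / (matches + differences)


def identity_substitutions_only(matches, substitutions):
    """Indels are left out of the calculation entirely."""
    return 100.0 * matches / (matches + substitutions)


# Invented toy counts.
matches = 9700
substitutions = 100
indel_lengths = [1, 4, 95, 100]  # four indel events, 200 letters in total

print(f"{identity_per_base_indels(matches, substitutions, indel_lengths):.2f}%")
print(f"{identity_per_event_indels(matches, substitutions, indel_lengths):.2f}%")
print(f"{identity_substitutions_only(matches, substitutions):.2f}%")
```

Whatever the input, the per-base figure can never be higher than the per-event figure, which in turn can never be higher than the substitutions-only figure.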
So, no, the 1% is not actually a myth. The figure you use depends very much on the context in which you use it.