Jeff Tomkins’ Orphan Genes – Mind The Gap!

In January 2016, Dr Jeffrey Tomkins posted an article on ICR’s website claiming that the genetic gap between humans and chimpanzees is getting wider.

This time he cites a PLoS paper titled “Origins of De Novo Genes in Human and Chimpanzee”, and makes the following comment:

In yet another recent research report, scientists describe 634 orphan genes in humans and 780 in chimpanzees. In other words, we now have a new set of 1,307 genes that are completely different between humans and chimpanzees.

Now, it’s not the elementary math error that bothers me here (634 + 780 is 1,414, not 1,307, though I can’t say I’m surprised any more by how sloppy Tomkins can be); it is the claim that these genes are “completely different” and that they “are found in no other type of creature and therefore have no evolutionary history”.

The authors have kindly made their data available, so I thought I might test Tomkins’ claim, and – as a few people have suggested – I’ll describe the process so that any of you can replicate the results if you are so inclined.

If you’re not so inclined you can probably skip to the end.

Step 1 – Where Are These Genes?

If you scroll down about halfway in the Ruiz-Orera paper, you’ll see links to two GTF files:

For now we’ll just take the human genes and try to find a homolog for each one in the chimpanzee genome. So click on the human link and download the file – hsa_denovo.gtf.

If you have a look inside these files you’ll see the coordinates for each coding sequence / exon … but no DNA sequence. Sad Panda.

Step 2 – Get The DNA Sequence. Obviously.

Now this part isn’t trivial. I wrote some code in C++ that took this GTF file as input, parsed out the coordinates for each sequence, carved that sequence from the human genome, and then created one FASTA file per chromosome.

If you’re interested in looking at the code, you will find it here:

https://github.com/NambyPamby/MindTheGap

I decided to exclude any sequences shorter than 30 base pairs because short sequences will almost certainly find a match somewhere on the corresponding chimpanzee chromosome, and that would only serve to inflate the final result.
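
For anyone who would rather not read the C++, the extraction step can be sketched in a few lines of Python. This is not the code from the repository above, just a minimal illustration under some assumptions: GTF coordinates are 1-based and inclusive (the usual GTF convention), the chromosome sequence is already loaded as a string, and minus-strand features are reverse-complemented (a choice made for this sketch; BLAST searches both strands anyway). The names `parse_gtf`, `extract`, `fasta_records` and `MIN_LENGTH` are made up for the example.

```python
from typing import Iterable, Iterator, NamedTuple

MIN_LENGTH = 30  # skip sequences shorter than 30 base pairs, as described above


class Feature(NamedTuple):
    chrom: str
    kind: str    # e.g. "exon" or "CDS"
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"


def parse_gtf(lines: Iterable[str]) -> Iterator[Feature]:
    """Yield one Feature per data line of a GTF file."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # blank line or comment
        cols = line.split("\t")
        if len(cols) < 9:
            continue  # not a well-formed GTF record
        yield Feature(cols[0], cols[2], int(cols[3]), int(cols[4]), cols[6])


_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def extract(chrom_seq: str, feat: Feature) -> str:
    """Carve the feature's sequence out of its chromosome."""
    seq = chrom_seq[feat.start - 1:feat.end]  # convert to 0-based slicing
    if feat.strand == "-":
        seq = seq.translate(_COMPLEMENT)[::-1]  # reverse complement
    return seq


def fasta_records(chrom_seq: str, feats: Iterable[Feature]) -> Iterator[str]:
    """Yield FASTA-formatted records, dropping anything under MIN_LENGTH."""
    for feat in feats:
        seq = extract(chrom_seq, feat)
        if len(seq) >= MIN_LENGTH:
            yield f">{feat.chrom}:{feat.start}-{feat.end}({feat.strand})\n{seq}\n"
```

Writing the records for each chromosome to its own file gives the one-FASTA-per-chromosome layout used in the next step.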

Step 3 – BLAST Away!

For each FASTA file we run a BLAST against the corresponding chimpanzee chromosome – ‘run.sh’ takes care of this by calling ‘blast_chrome.sh’ for each FASTA file. Depending on how powerful your machine is, this part might take a few hours. At the end you should have 24 CSV files with the results – one for each of the 22 autosomes, and then two more for the X and Y chromosomes.
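
A command along the following lines would do a similar job. Treat it as a sketch rather than a copy of the scripts: the file names are hypothetical, and the choice of output columns is an assumption, not necessarily what ‘blast_chrome.sh’ uses. The blastn options themselves (`-query`, `-subject`, `-outfmt`, `-out`) are standard, and output format 10 is comma-separated values.

```python
import shlex

# Assumed column layout for the CSV output: query id, subject id, percent
# identity, alignment length, number of identical bases, query length.
OUTFMT = "10 qseqid sseqid pident length nident qlen"


def blast_command(query_fasta, subject_fasta, out_csv):
    """Build a blastn command comparing one FASTA file to one chromosome."""
    return [
        "blastn",
        "-query", query_fasta,      # the human sequences for this chromosome
        "-subject", subject_fasta,  # the corresponding chimpanzee chromosome
        "-outfmt", OUTFMT,
        "-out", out_csv,
    ]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    cmd = blast_command("human_chr1.fa", "chimp_chr1.fa", "chr1.csv")
    print(shlex.join(cmd))
```

The list form can be handed straight to `subprocess.run` once BLAST+ is installed; printing it first makes it easy to check what is about to be run.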

Step 4 – Make Sense Of The Results

The ‘analyse.sh’ script reads in these CSV files and prints out some statistics for each chromosome. The two most important columns are the number of nucleotides I queried and the number of identical nucleotides I actually found. I also count the number of queries I submitted and how many results I got.
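
The bookkeeping described above might look something like the following in Python. It is only a sketch: the column order (query id, subject id, percent identity, alignment length, identical bases, query length) is an assumption, as is the rule of keeping the hit with the most identical bases for each query. The function name `summarise` is made up for the example.

```python
import csv
import io


def summarise(csv_text, query_lengths):
    """Summarise one chromosome's BLAST results.

    csv_text      -- rows of qseqid,sseqid,pident,length,nident,qlen
                     (assumed layout)
    query_lengths -- {query id: length} for every query submitted,
                     including those that returned no hit at all
    """
    best = {}  # query id -> identical bases in its best hit
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue
        qseqid, nident = row[0], int(row[4])
        if nident > best.get(qseqid, 0):
            best[qseqid] = nident

    queried = sum(query_lengths.values())
    identical = sum(best.values())
    return {
        "queries": len(query_lengths),
        "queries_with_hits": len(best),
        "nucleotides_queried": queried,
        "nucleotides_identical": identical,
        "percent_identity": 100.0 * identical / queried if queried else 0.0,
    }
```

Note that queries with no hit still count towards the nucleotides queried, so a missing match lowers the percentage rather than being silently ignored.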

Drum Roll, Please.

Who would have thunk it? These human genes which supposedly had no evolutionary history have corresponding sequence in the chimpanzee genome which is – on average – about 95.41% identical.

MindTheGap

But why is this post critical of Tomkins and not of the PLoS paper? I suggest reading the PLoS paper first, then reading Tomkins’ article, and then letting me know in the comments just how sloppy Tomkins is in his interpretations and conclusions.

Where did that guy get his data from?

Over at Evolution News and Views, I got a shout out from Ann Gauger with a tone that somewhat suggests that my data might be wrong. This post is for her.

Dear Ann,

BLASTN 2.2.30+


Query= 8 dna:chromosome chromosome:Galgal4:8:17537065:17580564:1

Length=43500

Subject= 1 dna:chromosome chromosome:GRCh38:1:78700000:78800000:-1

Length=100001


 Score =   291 bits (322),  Expect = 7e-79
 Identities = 361/483 (75%), Gaps = 22/483 (5%)
 Strand=Plus/Plus

Query  620    AATTATGAAAGCATACTTTTC-AGTGGTATTCCAGAGAAAGGACTTGCAAGAACTGGAAT  678
              || ||||||||  ||| |||| ||||   ||||| | | || ||||  |||| |||| ||
Sbjct  10940  AAATATGAAAG--TACATTTCTAGTGTATTTCCACA-ACAGTACTTAGAAGACCTGGGAT  10996

Query  679    AAGGATAAGAAGTGAAGTGGAAATTAGTGGTATTGGACCAAAACTTTGTCTTATTAGGGT  738
                  | || |||| |||| ||| | | ||| |  |  |  ||||   || |||||| |||
Sbjct  10997  GTAAACAAAAAGTAAAGTAGAAGTCACTGGCACAGATCTGAAACCAAGTTTTATTAAGGT  11056

Query  739    AAGAAAATTTATTCAATTTGAAAGGTAAAATTCTCTTGATACCAGTTTGTTGGGTTTTTT  798
              |||||||| ||| ||||||||||| |  ||||  |||        |||| |    |||| 
Sbjct  11057  AAGAAAATATATCCAATTTGAAAGCTGGAATTACCTTC-------TTTGATAACGTTTTC  11109

Query  799    TTTTTAAGCTTTTGGGAAGTAATTAAGTTTCATCATATGTTGTGCTTACTCAGGCAGAAT  858
                   ||||||| |  || |||||||||||||||||||||| |||||||||| |||||||
Sbjct  11110  -----AAGCTTTGGACAAATAATTAAGTTTCATCATATGTTTTGCTTACTCATGCAGAAT  11164

Query  859    GTAACTAACACTACTGTTTTTTTATT-CAGTGCTCTAAATTCTATTTG-CACTTT-GCCA  915
              ||||||||  ||  | |||||||| | |||    |||||||| ||||  |||| | ||||
Sbjct  11165  GTAACTAAGTCTTTTTTTTTTTTAATGCAGAAGCCTAAATTCCATTTCACACTGTAGCCA  11224

Query  916    GGTAATTCTCAGCTCAAGCCAACCTTGGGCTTGAAGGATTTCTTCTGCTTTGTGGCCAGG  975
              |  |||||||||||||||| |||||||||||||| |||||||||||||||||||  ||||
Sbjct  11225  GACAATTCTCAGCTCAAGCTAACCTTGGGCTTGAGGGATTTCTTCTGCTTTGTGCTCAGG  11284

Query  976    GAGACAATGGAATGTAATTTGAAATGCACAGTAATTTGTTATTGGATCAATCCAATTGTT  1035
              |||||||| ||   ||||||  ||||||||| | ||| ||    ||||||| ||||||| 
Sbjct  11285  GAGACAATAGAGCATAATTTTGAATGCACAGCAGTTTATTCCAAGATCAATTCAATTGTA  11344

Query  1036   CC-AAACTGTACAACCTAGGATTATTTAATCAACTGATTTCGTAGCCAGCAAACGAAAGG  1094
              || ||||  || ||| |  |||||||||||||||| ||||| ||  ||| |||| ||| |
Sbjct  11345  CCAAAACCATATAACTTTAGATTATTTAATCAACTTATTTCATAAGCAG-AAAC-AAATG  11402

Query  1095   CAA  1097
              |||
Sbjct  11403  CAA  11405


 Score =   223 bits (246),  Expect = 3e-58
 Identities = 287/390 (74%), Gaps = 31/390 (8%)
 Strand=Plus/Plus

Query  24111  TTTTTCTTGTTAAAGGACCAGAGCTGGCTCTTGCAGCCTATTTTTACAGTACCGTGTGAT  24170
              |||||  |||||||  ||||| ||   |||| || |||||||||||   | | || | ||
Sbjct  41719  TTTTTAATGTTAAAACACCAGCGCCAACTCTGGCTGCCTATTTTTATTATGCTGTATAAT  41778

Query  24171  TCTGCAGACATTGACATGTGTCACCTGTGATGCAGCTACATTTGTCG-GCTCTCTGTGCT  24229
              ||  |||  || ||||||||||||||||| |||||  ||||||| |  ||||| ||||||
Sbjct  41779  TCCACAGGTATCGACATGTGTCACCTGTGTTGCAGTCACATTTGGCCAGCTCTTTGTGCT  41838

Query  24230  CAACAGGGAGGAATCGATCTTCTACTTTCATTAGGTGGCAGGAGTAGACTATTGGCATAA  24289
              ||  ||||| | ||| ||||||  ||||||||| |||| || | ||||| ||||||||||
Sbjct  41839  CACTAGGGAAGCATCAATCTTCAGCTTTCATTAAGTGGTAGAAATAGACCATTGGCATAA  41898

Query  24290  AAAAT--ACTAAAAAAAAAAATGGAAAAGAAACCCCAGGGCTTTCTGCTTGGAGAC--CC  24345
              |||||    |  ||||| ||||||   ||||          ||||||| ||| |||  | 
Sbjct  41899  AAAATTATTTTTAAAAATAAATGG--GAGAA---------TTTTCTGCCTGGGGACTACA  41947

Query  24346  AGACTGCTGTTCAGTGGTCATTTGAATTATTGAATGGGATTAAAATAAAAGCCATTTC--  24403
                ||| ||||||  | || ||||||||||||||||||  |  ||||||||||||||||  
Sbjct  41948  CCACTACTGTTCTCTAGTAATTTGAATTATTGAATGGACT--AAATAAAAGCCATTTCTA  42005

Query  24404  --CTTTTT----TTTGTCCCTTTACTCAGATGAGCCATCTGAAATGCAAGTTGATTTGT-  24456
                ||||||    ||| |   ||| | ||||  |   | ||||||||||||||||||| | 
Sbjct  42006  TTCTTTTTATTCTTTTTTTTTTTGCCCAGACAA---ACCTGAAATGCAAGTTGATTTTTT  42062

Query  24457  -ATTTTCTTTTATCCCCTCAGCTTGTTAGG  24485
               |   |||||||||||||||||||||||||
Sbjct  42063  AAAAATCTTTTATCCCCTCAGCTTGTTAGG  42092


 Score =   219 bits (242),  Expect = 4e-57
 Identities = 269/362 (74%), Gaps = 19/362 (5%)
 Strand=Plus/Plus

Query  41677  AAAATTATGAAAGCTTTACCCATTCTCATTATGCAAACTTAATCTAAATGGATGTCCTAA  41736
              ||||| | | ||||||| | ||||||  ||| |||||| ||||||||| | | |  ||||
Sbjct  84967  AAAATCAGGGAAGCTTTGCACATTCTAGTTACGCAAACGTAATCTAAACGAAGGCTCTAA  85026

Query  41737  TATTCTTCCAGATGCAACGATAAACCTCCAACTATCTAAGAATATTTATTGGGAGGATGA  41796
              || |||||   || |   |||||||||||||  |  | ||||| |||||||| |||||||
Sbjct  85027  TACTCTTCTCTATACCTTGATAAACCTCCAAACAGTTGAGAATTTTTATTGGAAGGATGA  85086

Query  41797  GTATTATTATTCCATTTAGATTATTTCAGCATTAAGGGATATGGCTTATTCAAGCTGCTG  41856
               |  |||||||||||||||           ||||| |||||||||||||||| ||||| |
Sbjct  85087  ATCATATTATTCCATTTAG-----------ATTAAAGGATATGGCTTATTCAGGCTGCAG  85135

Query  41857  TTGATAAAGCAATGTGGTAAGGTTAATACTGCAACTGAC-AATGCTCTGCCAGATTTCAC  41915
               | |||||||   || || || ||||| ||  ||||||| |||||  |||||||||| | 
Sbjct  85136  ATAATAAAGCCGCGTAGTCAGCTTAATGCTAAAACTGACAAATGCGGTGCCAGATTTGAA  85195

Query  41916  AATATATGGCAAACTTTAATTAGAAGTTTATGAACCTCTGAAAATTCTCCGAAGGGCTTA  41975
              || |||  | ||| ||||||| ||||||||||||  ||||||||  |||| ||||| |||
Sbjct  85196  AACATACAGGAAATTTTAATTGGAAGTTTATGAAAGTCTGAAAA-ACTCCAAAGGGTTTA  85254

Query  41976  TCCTTCAGGATGAACTT-CGACAAA--TAGTCAGCTGAAATATGCAGTGATATGCA-GCA  42031
              |||||    || |||||   |||||  |||  |||| | ||||    ||||||||| |||
Sbjct  85255  TCCTTAGAAATAAACTTAAAACAAAATTAG--AGCTAATATATTAGCTGATATGCAGGCA  85312

Query  42032  AT  42033
              ||
Sbjct  85313  AT  85314


 Score =   102 bits (112),  Expect = 7e-22
 Identities = 219/321 (68%), Gaps = 40/321 (12%)
 Strand=Plus/Plus

Query  42618  AATGCATTATGTACAGTCTGCACTGCTTAATAAATATGTTGTTCATTAAATAAGTATTCA  42677
              ||||||||||||||||||||   ||||||||||||   ||  |||| |||||   |||| 
Sbjct  85882  AATGCATTATGTACAGTCTG---TGCTTAATAAATGCATTACTCATAAAATATACATTCG  85938

Query  42678  TGTGGTCTCCCTCTTTATTTTTCCATATCAACAAAACAATCAGACCAACTATAATATTAT  42737
                 |  ||||          || | |||||| | | ||||||  |  |  |||| |||||
Sbjct  85939  CAAGTCCTCC----------TTTCTTATCAATAGAGCAATCAAGCTGAGCATAAGATTAT  85988

Query  42738  CCAGAAATTCTGCTTCTTTTTAT-CTGAAATATAATTATAGCAGTCCTCTCTTAAAATTA  42796
                ||||||||  || |||  ||| |  ||||||||| | |  | |||||| ||||||| |
Sbjct  85989  TGAGAAATTCAACTCCTTCCTATACAAAAATATAATCACATTA-TCCTCTTTTAAAATCA  86047

Query  42797  TGTTCACTAGGTGATGAAAGGAAA----ACATGATTACAGCTACTGCTAACATTCCATTG  42852
              || |||      ||||||| ||||    | ||| ||  ||||     ||| ||| |||||
Sbjct  86048  TGCTCA------GATGAAAAGAAACAAGAGATGGTT--AGCT-----TAATATTTCATTG  86094

Query  42853  TGTAAAGAATTTCATATTTAGCATACTAAAGACACAGTAAGCATTTGTTTTCTTTTATGT  42912
              |  |||| |||| ||||||| | |||| || | ||  ||  ||||| |||||||| ||||
Sbjct  86095  TAAAAAGCATTTTATATTTAACCTACTCAAAATAC--TA--CATTT-TTTTCTTTCATGT  86149

Query  42913  TTTCTGGTA---AAGAGAAAA  42930
              || |  | |   |||||||||
Sbjct  86150  TTCCCAGCATAGAAGAGAAAA  86170


 Score = 57.2 bits (62),  Expect = 3e-08
 Identities = 45/54 (83%), Gaps = 0/54 (0%)
 Strand=Plus/Plus

Query  21281  AATACACTCTGGAAAACTATTGTAGCCAGATGCCAAACAAATGAAAACAGATGG  21334
              |||| || |||  |||| |||||| | | || ||||||||||||||||||||||
Sbjct  40402  AATATACCCTGTGAAACAATTGTAACTAAATACCAAACAAATGAAAACAGATGG  40455


 Score = 51.8 bits (56),  Expect = 1e-06
 Identities = 173/264 (66%), Gaps = 14/264 (5%)
 Strand=Plus/Plus

Query  19077  CATTTCATAAGTTGGTATTATAGAATATGACTG-AATGTAAATGATATAATAGTAGCAAT  19135
              ||||  || || || || || ||||| | |  | |||| ||||||| |    |  |||||
Sbjct  36099  CATTAGATGAGCTGATACTACAGAATGTTAAAGGAATGCAAATGATCTGGCTGGTGCAAT  36158

Query  19136  AAGAAAAATAGCATAGTCACTGCGCATTGAGCA--GGTACTTATAATTTTGCCAATTAAT  19193
                ||||||||  ||| |  |   | | ||| ||  || |||  ||||| ||    |||||
Sbjct  36159  TGGAAAAATATTATAATTTCCATGAAATGAACAAAGGGACTC-TAATTATGTATGTTAAT  36217

Query  19194  AGTTCAAAAAGGCAGAATATTCTGTTTATGGCTGTTTTATAATATTGGTTTTGTAGTTGA  19253
              ||  |||| | ||| | | ||||| |||||   |||| || | ||  ||||| | |||||
Sbjct  36218  AGGACAAAGATGCACAGTGTTCTGCTTATGATGGTTTAATGACATCAGTTTTATGGTTGA  36277

Query  19254  TTTTTCATATA-ATCTTGTATCAG--------TTGTGTTATGCAATTAGTAAACTGATAC  19304
                ||||  | |  |||  ||||||        |  | ||| |||||||||||  |||| |
Sbjct  36278  GATTTCCAACATTTCTCATATCAGACTTTATATCATATTAGGCAATTAGTAACTTGATTC  36337

Query  19305  AATCTGCA-AATGTGGCTTTAAAA  19327
              |  ||||| |  |||| |||||||
Sbjct  36338  ACACTGCACAGGGTGGTTTTAAAA  36361

You’re welcome.

Chromosome 2 Fusion – Dead in a Day?

Ian Juby has stated explicitly that DDX11L2 is “critical for life” both here and here, where he refers to a paper written by Dr Jeffrey Tomkins in 2013.

But just how important is this DDX11L2 gene anyway? Well, we can get some clues just by looking at the name.

A rose by any other name?

Let’s break down the name of the gene itself – DDX – 11 – L – 2

DDX is short for “DEAD Box” – which is an RNA Helicase gene. Helicase genes are incredibly important, since these are the machines that unwind the DNA or RNA so that other molecular machines can read the underlying information.

There is an enormous family of DDX genes across our genome – DDX1, DDX2A, DDX2B, DDX3X, DDX3Y, DDX4, DDX5, DDX6, DDX10, DDX11, DDX17, DDX18, DDX19A, DDX19B, DDX20, DDX21, DDX23, DDX24, DDX25, DDX27, DDX28, DDX31, DDX39A, DDX39B, DDX41, DDX42, DDX43, DDX46, DDX47, DDX48, DDX49, DDX50, DDX51, DDX52, DDX53, DDX54, DDX55, DDX56, DDX58, DDX59 and DDX60.

Yup, that’s a grand total of 41 DEAD Box Helicase protein-coding genes in the human genome.

DDX11 is just one of those 41 DEAD Box Helicase protein-coding genes. There’s nothing obviously special about DDX11 that makes it stand out from all the other DDX genes.

L stands for “Like”. Back in 2009, Valerio Costa reported on a transcript family with 18 members whose nucleotide sequences bear a strong resemblance to DDX11. In other words, each member looks “Like” DDX11, but it does not code for a protein – it’s a pseudogene. Note that all of these sequences are found in subtelomeric regions. Except for one. Can you guess which one?

2. Yup, DDX11L2 isn’t found near telomeres in the modern human genome, it’s all by itself in the middle of chromosome 2 – make of that what you will. To break things down even further, there are actually two transcripts for this pseudogene – NR_024004.1, which is the longer of the two transcripts, and straddles the putative fusion site – and NR_024005.2, the shorter transcript which does not cross the fusion site. Even if we hypothetically split chromosome 2 at the fusion point, the DDX11L2 pseudogene would still exist. All we would lose is this longer transcript.

So just to recap, what we are talking about here is a solitary alternatively-spliced transcript from a pseudogene that is fairly similar to 17 other pseudogenes, and those pseudogenes as a whole bear some resemblance to an actual protein-coding gene – DDX11 – which is itself part of a larger family of 41 genes.

Now we have a rough idea of where this DDX11L2 pseudogene sits in the scheme of things, let’s talk hard numbers.

Droppin’ English. Express Yourself.

Gene expression data is often a good guide to the relative importance of a particular gene or transcript. If you had a bacterial infection, would you go to the doctor to get some antibiotics (with enough penicillin molecules to go around killing off all the bacteria) or would you go the homoeopathic option, where the solution has been diluted so many times that there is virtually no active ingredient left?

So, let’s look at how frequently DDX11L2 is expressed. If you click on this link you’ll see that cells in the testes express DDX11L2 at a rate of 3.905 RPKM (Reads per Kilobase per Million mapped reads), in the pituitary gland at 0.936 RPKM, in the prostate at 0.893 RPKM and in the spleen at 0.882 RPKM.

If you then go up a step and look at the expression data for DDX11 itself, you’ll see that in the testes it is expressed at a rate of 8.477 RPKM, in the pituitary gland at 8.068 RPKM, in the prostate at 9.212 RPKM and in the spleen at 10.047 RPKM.

Already we can see that the DDX11 gene expression levels dwarf those of DDX11L2, but don’t forget we have 40 other DDX genes being transcribed as well! Let’s look at some other DDX genes:

                  DDX1    DDX3X    DDX5    DDX6    DDX11  DDX11L2
Testes           33.980  50.016  138.808   8.816   8.477    3.905
Pituitary Gland  35.394  34.213  321.470   7.100   8.068    0.936
Prostate         24.214  36.621  240.220   9.998   9.212    0.893
Spleen           23.889  50.052  347.019  10.811  10.047    0.882

Wow, DDX5 is expressed almost 400 times more often in the spleen than DDX11L2!
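
If you want to check the ratios yourself, the table is small enough to type into a few lines of Python. The numbers are copied from the table above; the function name `fold_difference` is just a label for this sketch.

```python
# RPKM values copied from the table above.
RPKM = {
    "Testes":          {"DDX1": 33.980, "DDX3X": 50.016, "DDX5": 138.808,
                        "DDX6": 8.816, "DDX11": 8.477, "DDX11L2": 3.905},
    "Pituitary Gland": {"DDX1": 35.394, "DDX3X": 34.213, "DDX5": 321.470,
                        "DDX6": 7.100, "DDX11": 8.068, "DDX11L2": 0.936},
    "Prostate":        {"DDX1": 24.214, "DDX3X": 36.621, "DDX5": 240.220,
                        "DDX6": 9.998, "DDX11": 9.212, "DDX11L2": 0.893},
    "Spleen":          {"DDX1": 23.889, "DDX3X": 50.052, "DDX5": 347.019,
                        "DDX6": 10.811, "DDX11": 10.047, "DDX11L2": 0.882},
}


def fold_difference(tissue, gene, reference="DDX11L2"):
    """How many times more highly `gene` is expressed than `reference`."""
    return RPKM[tissue][gene] / RPKM[tissue][reference]


if __name__ == "__main__":
    for tissue in RPKM:
        ratio = fold_difference(tissue, "DDX5")
        print(f"{tissue}: DDX5 / DDX11L2 = {ratio:.0f}x")
```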

But remember how I said there were two transcripts? Well, the expression data above are from GTEx, and according to the locus given for DDX11L2, it only gives data for the shorter transcript – the one that does not overlap the fusion site.

If we look at AceView we can actually get a breakdown of how frequently the introns are sequenced in RNA-seq studies. The intron that corresponds to the fusion site was sequenced 682 times, while the intron common to both transcripts was sequenced a total of 3,186 times, implying that the longer transcripts make up only around 21.4% of the total DDX11L2 transcripts.

So, what are we to make of these numbers?

When Tomkins claims that DDX11L2 is a “highly expressed gene“, we have to ask “highly expressed relative to what exactly?” According to the AceView link above, this gene is expressed at “only 26.8% of the average gene“. If you then take into account the fact that the transcript that spans the fusion site makes up only 21.4% of transcripts for this gene, then it is expressed only 5.7% as frequently as an average gene.
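
The arithmetic behind those two percentages is short enough to spell out. The sketch below uses only the figures quoted above (682 and 3,186 intron reads from AceView, and 26.8% of the average gene); the variable names are made up for the example.

```python
# Figures quoted in the text above.
FUSION_SITE_INTRON_READS = 682   # intron unique to the longer transcript
SHARED_INTRON_READS = 3186       # intron common to both transcripts
GENE_VS_AVERAGE = 0.268          # DDX11L2 relative to an average gene

# Share of DDX11L2 transcripts that span the fusion site.
long_transcript_share = FUSION_SITE_INTRON_READS / SHARED_INTRON_READS

# Expression of the fusion-spanning transcript relative to an average gene.
long_transcript_vs_average = GENE_VS_AVERAGE * long_transcript_share

if __name__ == "__main__":
    print(f"Longer transcript share:   {long_transcript_share:.1%}")
    print(f"Relative to average gene:  {long_transcript_vs_average:.1%}")
```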

If you were to hypothetically split this chromosome in half at the fusion site, you wouldn’t be “dead in a day” as Ian Juby likes to say, you would just lose a very lowly expressed transcript of a pseudogene. That pseudogene is part of a family of 17 other similar pseudogenes (DDX11L), which as a group, bear some resemblance to an actual protein-coding gene (DDX11). That protein-coding gene is then part of a much larger group of protein-coding genes (DDX).

Chromosome 2 Fusion – The Low Hanging Fruit

Jeffrey Tomkins has written a number of articles in previous years that attempt to cast doubt on the claim that human chromosome 2 is the result of a head-to-head fusion of two ancestral chromosomes.

For the sake of brevity, I will address only a few of the more egregious errors that Dr Tomkins made in his articles; I will address the others when I have the time and the inclination.

Comparative Scale, huh.

In this article Dr Tomkins posts a diagram of the fusion supposedly drawn to scale. The desired effect here is obviously to have people believe that the human chromosome 2 doesn’t align to its chimpanzee counterparts. Here is Dr Tomkins’ diagram:

TomkinsToScale

And here is my diagram:

FusionToScale

The difference is that the PostScript code used to produce my diagram is freely available, and from that you are able to verify the genome coordinates I have used.

Dr Tomkins also claims here that the combined chimpanzee chromosomes are some 10% larger than the human chromosomes. However, according to the most recent chimpanzee assembly – named “panTro4” – the combined length of chimpanzee chromosomes 2A and 2B is 247.5Mbp while human chromosome 2 is 243.2Mbp. This is a difference of only 4.3Mbp, or 1.8%. It should be noted that the centromeres in the chimpanzee assembly are manually placed on the chromosome and are of an arbitrary fixed length of 3Mbp. This introduces some uncertainty in the true length of combined chimpanzee chromosomes. Human centromeres are known to range in length from 0.3Mbp to 5.0Mbp, and if the centromere on chimpanzee chromosome 2B is (in reality) at the lower end of that range, then the size difference would be easily less than 1%.
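
Those percentages are easy to reproduce. The sketch below uses the lengths quoted above and, as in the text, adjusts only one placeholder centromere (the one on 2B); the function names are made up for the example.

```python
HUMAN_CHR2_MBP = 243.2             # human chromosome 2
CHIMP_2A_2B_MBP = 247.5            # chimpanzee 2A + 2B combined (panTro4)
PLACEHOLDER_CENTROMERE_MBP = 3.0   # arbitrary fixed length in the assembly


def percent_difference(chimp_mbp, human_mbp=HUMAN_CHR2_MBP):
    """How much larger the chimpanzee total is, as a percentage of human."""
    return 100.0 * (chimp_mbp - human_mbp) / human_mbp


def adjusted_chimp_length(true_centromere_mbp):
    """Swap one 3Mbp placeholder centromere for an assumed true length."""
    return CHIMP_2A_2B_MBP - PLACEHOLDER_CENTROMERE_MBP + true_centromere_mbp


if __name__ == "__main__":
    print(f"As assembled:        {percent_difference(CHIMP_2A_2B_MBP):.1f}%")
    shorter = adjusted_chimp_length(0.3)  # lower end of the human range
    print(f"0.3Mbp centromere:   {percent_difference(shorter):.1f}%")
```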

What should we see at the fusion site?

Tomkins predicts that under the fusion model “thousands of intact TTAGGG motifs in tandem should exist” yet he is fully aware that the function of telomeres is to prevent such fusions in the first place. To expect thousands of intact telomere motifs at the fusion site is to expect that intact telomeres somehow failed.

The most parsimonious explanation is that the telomeres were already missing (or severely shortened) and this allowed the fusion to occur. Lab experiments where components of the telomere nucleoprotein complex have been disabled demonstrate the ease with which head-to-head fusions occur when the telomeres are depleted.

Moteefs y’all

Jeffrey Tomkins says explicitly that forward telomere motifs (‘TTAGGG’) should only be found on the left side of the fusion site, and reverse telomere motifs (‘CCCTAA’) should only be found on the right of the fusion site.

Yet by my calculations, any 6 base pair sequence should occur entirely by chance approximately every 4,096 base pairs (4 ^ 6 = 4,096). Dr Tomkins produced a table showing the breakdown of forward and reverse motifs found both to the left and to the right of the fusion site.

On the RP11-395L14 BAC used by Tomkins, there are 108,569 base pairs to the left of the fusion site. Mathematically, I would expect about 26 reverse telomere motifs to be present (108,569 / 4,096 ≈ 26.5) while Dr Tomkins expects zero – there are 18. To the right of the fusion site are 68,167 base pairs. I would expect about 17 forward telomere motifs (68,167 / 4,096 ≈ 16.6), Dr Tomkins expects zero – there are 18.
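
Both halves of that comparison can be sketched in Python: the expected number of chance occurrences, and a simple counter you could run over each side of the BAC sequence. The expectation assumes all four bases are equally likely and independent, which is a simplification; the function names are made up for the example.

```python
def expected_by_chance(region_length, motif_length=6):
    """Expected chance occurrences of one specific motif in a region,
    assuming the four bases are equally likely and independent."""
    return region_length / 4 ** motif_length


def count_motif(sequence, motif):
    """Count occurrences of `motif` in `sequence`, overlaps included."""
    count = 0
    pos = sequence.find(motif)
    while pos != -1:
        count += 1
        pos = sequence.find(motif, pos + 1)
    return count


if __name__ == "__main__":
    print(f"Left of fusion site:  {expected_by_chance(108_569):.1f} expected")
    print(f"Right of fusion site: {expected_by_chance(68_167):.1f} expected")
```

To repeat the count on the real data, load the BAC sequence as a string, split it at the fusion site, and call `count_motif(left, "CCCTAA")` and `count_motif(right, "TTAGGG")`.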

In your hands

As indicated by the title, these are just the low-hanging fruit; claims made by Tomkins that can be addressed quite easily. Some of the other claims are a little more complex and/or take more time to address. If there is a particular claim that anyone would like me to address, please let me know in the comments.

 

Is 1% a myth?

Exactly how similar is the human genome to the chimpanzee genome? Are we 99% identical? 97%? 95%? 88%? Or only 70%? And is the “Myth of 1%” actually a myth? Well, yes and no – it depends on how you calculate your result.

To fully understand the different percentages that are thrown around, you first need to know how the DNA of one species is compared to another species.

BLAST+

BLAST+ is a software package that is commonly used to compare DNA. The way it usually works is that you have a small DNA sequence that you want to search for inside a larger sequence, be that in the same species or another species. In BLAST+, the small sequence is called the Query Sequence, while the larger sequence – the search space – is called the Subject Sequence. You can also think of the Query Sequence as the “needle” and the Subject Sequence as the “haystack”.

BLAST+ uses some clever programming tricks to find stretches of DNA inside the Subject Sequence that aren’t necessarily identical to the Query Sequence, but are almost identical. For example, let’s say your Query Sequence is:

ABCDEFSHIJKLMNOPHRSTUVWXYZ

And somewhere inside your Subject Sequence, you have a perfect alphabet:

ABCDEFGHIJKLMNOPQRSTUVWXYZ

BLAST+ is clever enough to work out that the G has been swapped for an S, and the Q has been swapped for an H. It will tell you that it found a match in the Subject Sequence that is 26 letters long, but only 24 of those letters match, which means it is 92.3% identical.

BLAST+ is also able to work out when a letter has been added or removed. If we add a letter to our Query Sequence, then BLAST+ will introduce a space into the Subject Sequence, and align them as follows:

ABCDEZGHIJKLMNOPDQRSTUVWXYZ
ABCDEFGHIJKLMNOP_QRSTUVWXYZ

Since the length of the alignment is now 27 letters and not 26 (with 2 differences) the result is now 92.6% identical.
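
The percentage calculation for both alphabet examples can be sketched as follows. This only scores an alignment that has already been made (with `_` marking a gap, as above); it does not do the searching or aligning that BLAST+ does. The function name `percent_identity` is made up for the example.

```python
def percent_identity(query_aln, subject_aln, gap="_"):
    """Percentage of alignment columns where both sequences agree.

    Both strings must be the same length; `gap` marks an inserted space.
    """
    if len(query_aln) != len(subject_aln):
        raise ValueError("aligned sequences must have the same length")
    matches = sum(
        1 for q, s in zip(query_aln, subject_aln) if q == s and q != gap
    )
    return 100.0 * matches / len(query_aln)


if __name__ == "__main__":
    # Two substitutions, no gaps: 24 of 26 columns match.
    print(round(percent_identity("ABCDEFSHIJKLMNOPHRSTUVWXYZ",
                                 "ABCDEFGHIJKLMNOPQRSTUVWXYZ"), 1))
    # One substitution and one gap: 25 of 27 columns match.
    print(round(percent_identity("ABCDEZGHIJKLMNOPDQRSTUVWXYZ",
                                 "ABCDEFGHIJKLMNOP_QRSTUVWXYZ"), 1))
```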

A good method to estimate an overall percentage identity for the human and chimpanzee genomes is to take a very large number of small sequences – chosen at random – from one species, and look for the best match for each sequence in the genome of the other species. If you take an average of all those results, then you should have a fairly reliable figure for the overall percentage identity.

Now we’re armed with some basic knowledge, let’s look at some of the published material.

Jeff Tomkins’ Epic Fail

Back in February of 2013, Jeff Tomkins published a paper claiming that the overall similarity between the DNA of humans and chimpanzees was a lowly 70%. In early 2014, I set out to replicate the results of this study and soon discovered that Tomkins had succumbed to a bug in the BLAST+ software. The bug made itself known when the user submitted a large number of query sequences all at once. If, for example, you submitted 100,000 query sequences that were each 500 bases long, then BLAST+ may have only returned matches for around 75,000 query sequences. However, if you had submitted them one at a time, then you would receive matches for all 100,000 query sequences.

Obviously this has the effect of drastically understating the true percentage identity, and, in a paper I submitted to Answers Research Journal in September 2014, I demonstrated exactly that. I also showed that if you correct for the effects of the bug, that you will get a result of approximately 96.9%.

Tomkins Comes Clean. Sort Of.

Fast forward more than a year to October 2015. After countless attempts on my part to get a response from Jeff Tomkins (via the journal’s editor, Andrew Snelling), he published a ‘retraction of sorts’ in which he acknowledged the glitch in the software, and that it had affected his results. Rather unsurprisingly though, my paper – which prompted his response paper – was rejected without any reasons given. In this most recent paper, Tomkins uses three different software packages – BLAST+, LASTZ and nucmer – in order to find a consensus between them. Both BLAST+ and nucmer gave him a result of around 88%, while the LASTZ result was only 73%. Based on his comments in the paper, even Jeff Tomkins didn’t put much faith in the LASTZ result.

Mind The Gap!

So why did his new BLAST+ analysis give a result of only 88% and not something around 97%? That can be explained by his use of the ungapped parameter of BLAST+. Remember up near the top when I said that BLAST+ was clever enough to work out when a letter was added or removed? Well, you can tell BLAST+ not to do that by adding this ungapped parameter. So, using the same example above, BLAST+ does just fine up to the point at which there is a D where the Q should be, but you’ve just told it that it is not allowed to put a gap in there, so BLAST+ throws in the towel.

This is fine if you want alignments with no gaps, but the problem comes when Jeff Tomkins comes to calculate a percentage identity for that particular match. Since the query sequence was 27 letters long, and the software gave up after 16 letters (only one of which was different) he chooses to report only a 55.6% identity for that query sequence, intentionally ignoring the 10 identical letters on the other side of the [added | removed] letter.
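
Here is a toy model of that calculation in Python, using the 27-letter example from earlier. It is a deliberate simplification: it just stops at the first gap column, whereas real ungapped BLAST+ extension works on scores, and the function name `ungapped_report` is made up for the example. It shows the two ways of scoring the same truncated match.

```python
def ungapped_report(query_aln, subject_aln, gap="_"):
    """Score the stretch of an alignment that precedes the first gap.

    Returns the segment length, the matches within it, and the identity
    measured two ways: against the segment and against the whole query.
    """
    query_length = sum(1 for c in query_aln if c != gap)
    segment = matches = 0
    for q, s in zip(query_aln, subject_aln):
        if q == gap or s == gap:
            break  # no gaps allowed, so the match ends here
        segment += 1
        matches += (q == s)
    return {
        "segment_length": segment,
        "matches": matches,
        "identity_of_segment": 100.0 * matches / segment if segment else 0.0,
        "identity_of_whole_query": 100.0 * matches / query_length,
    }


if __name__ == "__main__":
    report = ungapped_report("ABCDEZGHIJKLMNOPDQRSTUVWXYZ",
                             "ABCDEFGHIJKLMNOP_QRSTUVWXYZ")
    for key, value in report.items():
        print(key, round(value, 1))
```

Dividing the 15 matches by the full 27-letter query gives the 55.6% figure, while the 16 letters that were actually aligned are almost 94% identical.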

Tomkins has been aware of this issue since mid-2014, and the fact that he employed the same methodology in his most recent paper says loud and clear that he is not at all interested in the truth.

But what about the other 88% result?

Yeah I wondered about that as well, so I downloaded the MUMmer package and ran nucmer with exactly the same parameters as Jeff Tomkins did in his paper.

A little bit of background here before I explain what I found. Virtually all geneticists would be aware of the huge amount of repetition in the human genome. These are mostly repeat elements, but there are also quite a lot of genes that have been duplicated over the course of our evolution; be they active genes or pseudogenes. The point is that when you submit a query sequence, often you will get multiple matches across the genome. Generally, the match with the highest percentage similarity is the ‘syntenic match‘ (that is, found in approximately the same location on the corresponding chromosome). If you want to compare apples with apples, then the obvious thing to do is only take the best match into your calculation.

But what did Jeff Tomkins do? He took the average of all the matches. So, if his query sequence contained an Alu repeat motif – of which there are many thousands across the genome – then not only would nucmer return the ‘syntenic match’, it would also return many hundreds (if not thousands) of poorer matches across the corresponding chromosome. This brings the average similarity down from 97% to around 88%.

This methodology has the absurd consequence that if a human chromosome is compared to the very same human chromosome, then you can conclude that human DNA is only 89% similar to itself!
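
The difference between the two averaging methods is easy to see with a toy example. The percentages below are invented purely for illustration (they are not nucmer output), and the function names are made up for the sketch.

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values)


def best_match_average(hits):
    """Average only the best match for each query sequence."""
    return mean(max(identities) for identities in hits.values())


def all_matches_average(hits):
    """Average every match returned, good or bad."""
    return mean(i for identities in hits.values() for i in identities)


if __name__ == "__main__":
    # Invented numbers: query q1 contains a repeat, so it picks up
    # extra, poorer matches elsewhere on the chromosome.
    toy_hits = {
        "q1": [97.0, 80.0, 78.0],
        "q2": [98.0],
    }
    print("best match only:", best_match_average(toy_hits))
    print("all matches:    ", all_matches_average(toy_hits))
```

The more repeat copies a query picks up, the further the all-matches average drifts below the best-match average, even when the sequence being compared is identical to itself.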

Once again, Jeff Tomkins is very much aware of this problem, but neither he nor Answers Research Journal have retracted the paper.

Fine, but are we 97% or 99% identical?

Like I said, that depends on how you calculate your result; in particular, how you choose to account for insertions and deletions (‘indels’). This figure of 99% – which does indeed date back to the 1970s – does not take indels into account. The difficulty with indels is that they can be anywhere from one nucleotide to several thousand nucleotides long. Should we give each of those thousand nucleotides in a large indel the same weight as a single substitution? Or should we give each indel as a whole the same weight as a single nucleotide substitution, even though the latter occurs much more frequently?

In my results, I tend to take the more conservative approach and effectively treat an indel that is 100 letters long as if it were 100 separate substitutions. If I take this approach, I will get an overall result of around 97%.

If I take into account the relative frequencies of indels to single substitutions – and assume that each indel happened as a single event – then I will support an overall result of around 98%.

If I need to calculate the actual mutation rate – say, for molecular clock calculations – then I will use a figure that excludes indels entirely, and that figure is 99%.
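
The three ways of counting can be put side by side in a short Python sketch. The counts used in the example are toy numbers invented for illustration, not real genome data, and the function names are made up; the point is only to show how the same alignment yields three different percentages.

```python
def identity_per_base(matches, substitutions, indel_lengths):
    """Every nucleotide inside an indel counts as a separate difference."""
    indel_bases = sum(indel_lengths)
    return 100.0 * matches / (matches + substitutions + indel_bases)


def identity_per_event(matches, substitutions, indel_lengths):
    """Each indel counts as one difference, whatever its length."""
    indel_events = len(indel_lengths)
    return 100.0 * matches / (matches + substitutions + indel_events)


def identity_substitutions_only(matches, substitutions):
    """Indels are left out of the calculation entirely."""
    return 100.0 * matches / (matches + substitutions)


if __name__ == "__main__":
    # Toy alignment: 9,700 matching bases, 100 substitutions and
    # 18 indels covering 200 bases in total.
    toy_indels = [1] * 10 + [5] * 6 + [80] * 2
    print(f"per base:   {identity_per_base(9700, 100, toy_indels):.1f}%")
    print(f"per event:  {identity_per_event(9700, 100, toy_indels):.1f}%")
    print(f"subs only:  {identity_substitutions_only(9700, 100):.1f}%")
```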

So, no, the 1% is not actually a myth. The figure you use depends very much on the context in which you use it.

Questions Anyone? Class Dismissed.

How big is the chimpanzee genome?

Jeff Tomkins has published another paper in Answers Research Journal on the DNA similarity between humans and chimpanzees. Now there is quite the story behind this paper, and I’m sure I’ll address that in another post. For now, I just want to concentrate on the very last paragraph:

“And second, the majority of flow cytometry studies of chimpanzee nuclei along with the cytogenetic analysis of chromosomes indicate a genome size difference of about 8%, with the chimpanzee genome having a significantly larger amount of heterochromatic DNA compared to human (Formenti et al. 1983; Pellicciari et al. 1982, 1988, 1990a, 1990b; Seuanez et al. 1977).”

I’ve always been a bit skeptical of claims that the chimpanzee genome is bigger than the human genome, simply because I have the latest versions of the genomes on my hard drive. A chromosome by chromosome breakdown of each genome shows that the chimpanzee genome is actually slightly smaller than the human genome. There is also a handy website – http://www.genomesize.com – that collects references for the sizes of hundreds of genomes.

It’s important to point out that the DNA content of organisms – although measured in picograms – is very rarely “weighed” in the sense of putting it on a set of scales. The method used in these studies involves staining the DNA with a special compound that glows when excited by ultra-violet light. These studies are really only reporting how much a certain sample glows more than or less than a given standard. Older studies tended to use a value for the human genome of 7.30pg to calibrate their results, while newer studies tend to use a value of 7.00pg.

The citation given on this website for the standard human genome is from a paper in 2005 written by Awtar Krishan et al. This study measured the genome size of a whole bunch of animals from the Miami Metro Zoo and used two samples from a human male to calibrate the results. There were three chimpanzee samples measured: 7.22pg (for the female), and 6.77pg (for two males).

So, straight away we can say that, at least in this study, the chimpanzee male genome is about 3.3% smaller than the human male genome (6.77pg against the 7.00pg standard). And since we know roughly how big the X and Y chromosomes are in humans (156 million base pairs and 57 million base pairs respectively), and how much they would weigh in picograms, we can calculate the weight of a human female diploid cell: 7.10pg. So it appears that the female chimpanzee genome is about 1.7% larger than the human female genome.
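
The female estimate can be reproduced in a few lines of Python. The one number not given in the text is the conversion between picograms and base pairs; the sketch assumes the commonly used factor of roughly 978 million base pairs per picogram, and it takes 7.00pg as the human male diploid standard. The function names are made up for the example.

```python
MBP_PER_PG = 978.0             # assumed conversion: ~978 Mbp per picogram
HUMAN_MALE_DIPLOID_PG = 7.00   # human male standard (assumed here)
X_MBP = 156.0                  # human X chromosome, from the text
Y_MBP = 57.0                   # human Y chromosome, from the text


def mbp_to_pg(mbp):
    """Convert millions of base pairs to picograms of DNA."""
    return mbp / MBP_PER_PG


def human_female_diploid_pg():
    """A female cell has a second X where a male cell has a Y."""
    return HUMAN_MALE_DIPLOID_PG + mbp_to_pg(X_MBP - Y_MBP)


def percent_larger(sample_pg, reference_pg):
    """How much larger `sample_pg` is, as a percentage of the reference."""
    return 100.0 * (sample_pg - reference_pg) / reference_pg


if __name__ == "__main__":
    female = human_female_diploid_pg()
    print(f"Human female diploid cell: {female:.2f}pg")
    print(f"Chimpanzee female (7.22pg): {percent_larger(7.22, female):.1f}% larger")
```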

So let’s look at the papers that Tomkins cites to support that final paragraph:

Formenti et al. 1983

The actual title of this paper is “Variazioni del Contenuto Nucleare in DNA Negli Hominoidea”, and it was published in the Italian-language journal “Antropologia Contemporanea“. A search online for the abstract of this paper yielded precisely zero results. The paper is cited by genomesize.com as giving a result of 3.63pg, but we do not know if this was for a male or a female chimpanzee. If it was for a male, then the chimpanzee genome is 3.7% larger. If it was for a female, then the chimpanzee genome is only 2.2% larger. Given our result above, it seems a little more likely than not that this was from a female chimpanzee.

Pellicciari et al. 1982

This paper can be found here, although unfortunately for most lay-people, it is behind a paywall.

The chimpanzee figure given in this paper is 8.03pg against a standard human genome size of 7.30pg, which is a full 10% larger. But things get a little interesting when you consider:

“Table 2 reports, for each species examined, the Feulgen-DNA contents (in pg and in percentage as compared to man) and the morphological data of the karyotype (chromosome number, 2n and fundamental number, FN) drawn from the literature (De Boer, 1974; Chiarelli et al., 1979). Data so far unpublished are marked with an asterisk. Previously published data have been recalculated on the basis of the Feulgen-DNA content of the control species included in the corresponding lots of preparations.”

And since chimpanzee is one of the entries not marked with an asterisk, the authors appear to be citing a previously published figure, presumably from Chiarelli et al., 1979 (Comparative Karyology of Primates). Delving into this source, we see that it is actually a symposium, featuring previously published papers:

“A general account of the literature available on primate chromosomes up to 1972 has been collected in the last part of this book.” (p27).

Taken in combination with the following quote from the original paper (Pellicciari et al., 1982):

“Finally, in Hominoidea (Fig. 1 e), Pongidae have a variable Feulgen-DNA content, the value being higher than in man. This last finding, previously published by us (Manfredi Romanini, 1972) has been recently confirmed by Seuanez et al. (1977) by microinterferometric studies on spermatozoa, even though the content sequence observed by this author is different from ours (Homo < Gorilla < Pan < Pongo in Manfredi Romanini, 1972; Homo < Pan < Pongo < Gorilla, in Seuanez et al., 1977).”

… and it appears that the original source for this figure actually dates back to 1972. Chasing that source down, we see the following results (in arbitrary units, which are then scaled to picograms using the standard human genome size at the time):

Pan troglodytes: 13.60 ± 4.60; Homo sapiens: 12.36 ± 0.11

These standard deviations are enormous! On these numbers, it’s possible that the chimpanzee genome could be 27% smaller than the human genome.
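To see where that figure comes from, here is a small sketch of the calculation. It simply takes the chimpanzee mean plus or minus one standard deviation and compares each end of that range against the human mean; the names are illustrative, and treating one standard deviation as the "possible" range is an assumption about how the number was derived.

```python
# Values in arbitrary units, as quoted from the 1972 source.
chimp_mean, chimp_sd = 13.60, 4.60
human_mean, human_sd = 12.36, 0.11


def percent_vs_human(value):
    """Percentage difference of value relative to the human mean."""
    return 100.0 * (value - human_mean) / human_mean


at_mean = percent_vs_human(chimp_mean)            # headline difference
low = percent_vs_human(chimp_mean - chimp_sd)     # one SD below the chimp mean
high = percent_vs_human(chimp_mean + chimp_sd)    # one SD above the chimp mean

print(f"at the mean: {at_mean:+.1f}%")
print(f"one SD range: {low:+.1f}% to {high:+.1f}%")
```

The headline 10% difference sits inside a one-standard-deviation range that runs from roughly 27% smaller to roughly 47% larger, which is why the figure tells us so little.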

Pellicciari et al. 1988

This abstract can be found on PubMed, but I cannot find the full text online anywhere. With regard to Tomkins’ citation of it, the title is already a huge cause for concern: “Genome size and constitutive heterochromatin in Hylobates muelleri and Symphalangus syndactylus and in their viable hybrid“. The title makes no mention of measuring chimpanzee DNA, and neither does the abstract: “Genome size was measured […] in six species of the family Hylobatidae and in a hybrid of the gibbon (Hylobates muelleri) and siamang (Symphalangus syndactylus)“. However, genomesize.com comes to the rescue: according to its citation, the figure used in this paper was 3.63pg (against a human standard of 3.50pg). Given that it is near certain that no new measurement of the chimpanzee genome took place in this study, it seems more likely that this paper cites a previously published figure – and based on a comparison of the authors of both papers, it’s likely they are citing Formenti et al., 1983.

Pellicciari et al. 1990a

This paper is also behind a paywall, but the abstract can be found here. This paper uses flow cytometry to measure the genome size and the following results were obtained:

Pan troglodytes: 7.85pg ± 0.40pg; Homo sapiens: 7.30pg ± 0.35pg

So again we have reasonably large standard deviations in the samples, but taken at face value, the chimpanzee genome is 7.5% larger than the human genome.

Pellicciari et al. 1990b

This paper is also on PubMed here, and again from the abstract, it is quite clear that this paper does not perform a new measurement of the chimpanzee genome. “Measurements were performed by microfluorometry on […] man, gorilla and mouse“. Heterochromatic DNA was also measured in “man, gorilla and mouse“. Karyotypes were stained in “man, gorilla and mouse“.

Seuanez et al. 1977

Again, this paper is behind a paywall. It does not give any quantitative measurements, but I have reproduced an original graph from it here. The method of measuring genome size involves weighing the spermatozoa – “Total Dry Mass” (“TDM”) – then extracting the DNA and weighing the remainder – “Dry Mass After Extraction” (“DMAE”). The difference, of course, is the “Dry Mass of Extracted DNA” (“DNA-DM”).
[Figure: graph reproduced from Seuanez et al., 1977]

In this graph, the small circles represent human DNA, while the small squares represent chimpanzee DNA. As you can see, one human data point is clearly higher than both chimpanzee data points and the other human data point is clearly lower than both chimpanzee data points. Given such high variance in the results and the fact that the authors did not publish the actual figures behind the graph, it is difficult to draw anything useful from this paper (other than the fact that it conflicts significantly with the ordering of primate genome sizes in Pellicciari et al., 1982, and could therefore count as evidence that the true measurement taken in that paper would be at the lower end of the range).

So where does this leave us?

Tomkins offers an impressive-looking list of six papers in support of his position. Two of these can be discarded instantly – Pellicciari et al., 1988 and Pellicciari et al., 1990b – simply because they are obviously referring to previously published results rather than taking new measurements.

We have Pellicciari et al., 1982, for which the primary source is actually a paper from 1972, which claims a 10% size difference, but with an enormous standard deviation – so high that the human genome could quite easily be larger than the chimpanzee genome. Then we have Seuanez et al., 1977, from which no actual figures can be drawn, although it does highlight just how much variance is evident in previous results. Following that is Formenti et al., 1983, which – at least according to genomesize.com – does not support an 8% difference, but perhaps only a 2%-3% difference. And the last of these four is Pellicciari et al., 1990a, which supports a 7.5% difference, but also has quite a large standard deviation.

Contrast this with the most recent study – Krishan et al., 2005 – which shows that the genomes are approximately the same size, and it’s difficult to see how Dr Tomkins can claim a majority in any sense of the word.

Without good reason to the contrary, I’m inclined to give more weight to more recent studies, and give less weight to older studies – particularly those with such enormous variance in their stated results.

The Cytochrome B “equidistance problem”

  • A represents the split between Fungi and Animalia
  • B represents the split between Ecdysozoa and Deuterostomia
  • C represents the split between Echinodermata and Chordata
  • D represents the split between Reptilia and Mammalia
  • E represents the split between Carnivora and Primates