Chromosome 2 Fusion – What should we expect?

Man, if I had a dollar for every time somebody told me that the fusion site doesn’t look like what we would expect it to look like, I’d have twenty-something million dollars. So, I thought I might download some DNA sequences for known mammalian fusions and show you what they look like.

Now the sequences I’m about to show you are from the Indian muntjac species, and if I may quote from a paper published in 2008:

Indian muntjac (Muntiacus muntjak vaginalis) has an extreme mammalian karyotype, with only six and seven chromosomes in the female and male, respectively. Chinese muntjac (Muntiacus reevesi) has a more typical mammalian karyotype, with 46 chromosomes in both sexes.

Comparative sequence analyses reveal sites of ancestral chromosomal fusions in the Indian muntjac genome

So clearly there have been a bunch of fusions in this Indian muntjac species and – luckily for us – they decided to sequence these fusion sites to see what they looked like. Now, remember these are not telomere-to-telomere fusions, these are telomere-to-satellite fusions, so they only correspond to one side of the human chromosome 2 fusion site.

“In chromosome fusion events that occur in nature in living mammals—a very rare event—the DNA signature always involves satDNA producing a DNA signature that occurs as either satDNA-satDNA or satDNA-teloDNA sequence.”

More DNA Evidence Against Human Chromosome Fusion

Behold!

This is what fusions actually look like.

http://www.ncbi.nlm.nih.gov/nuccore/DP000824

AGAGATCTAGTTTTCCACCAAAGATTTAAAATATCTCTCTGACCTCCTTTTTTTTGGGGAGGGGGGGGAG
GGTTTGAAGTTTTCATTTGCCCAAGGTTGAGGTCTCAGGCAAAGTAGGCAGTGTTTTCAAGGAAAGGGTT
CGGGTTCGGGTTCGGGTTCGGGTTCGGGTTCGGGTTAGGGTTCGGGTTAGGGTTAGGGTTCAGGTTAGGG
TTAGGGTTCGGGTTAGGGTTCGGGTTAGGGTTCGGGTTTAGGGTTAGGGTTCGGGTTAGGGTTAGGGGTT
AGGGTTAGGGATAGGGTTAGGGTTAGGGTTAGGGGTTAGGGTTCGGGTTAGGGTTAGGGTTCGGGTTAGG
TTTGGGGTTAGGGTTAGGGTTAGGGGTTAGGGTTAGGGTTAGGTTTAGGGTTTAGGGTTAGGGTTACGGT
TAGGGTTAGGGTTAGGGTTAGGGTTTAGGGTTAGTGTTAGGGTTAGGGTTAGGGTCAGGGTTAGGGTTTA
GGGTTAGGGTTAGGGTTAGGGTTAGGGTTTAGGGTTAGGGTTTAGGGTTAGGGTTAGGGTTAGGGTTAGG
GTTAGGGGTTAGGGTTAGGGTTAGGGTTAGGGATTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGT
GATAAGGCATTCTCTAGTGTCTCCCCAGGGGCTTCAGACTTCCCTTCATCTTGTGACACGTAATCCAGCT
TGTACTCAAGTCAGTGCAGGAATTCCGGCCTGATTTCCAGTCAGGGCATTTCAGGGTCGATTCCACTGGA

http://www.ncbi.nlm.nih.gov/nuccore/DP000825

TAGAATATCATGGCCATCAAACGTCGGGTCATCTTTCTTTCCTGATGGCATAGTTTTACCGCTGAATTAT
ACACAGATCAGCTGACAAGGTGATGTGAACCTGCGGAAGGAAGGATCACTAACGTGGTTCGGGAAAGGGG
TTTGGGTTCGGGTTCGGGTTCGGGTTCGGGTTCGGGTTCGGGTTAGGGTTCGGGTTAGGGTTAGGGTTAG
GGTTAGGGTTTGGGTTTGGGTTAGGGTTAGGGTTCGGGTTCGGGTTAGGGTTAGGGTTCGGGTTAGGGTT
CGGGTTAGGGTTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTCGGGTTCGGGTTCAGGTTAGGGTTAGG
GTTAGGGTTAGCGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGGTTAGTGTTAGGGTTAGGGTTAGGCT
TAGGGTTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTCAGGGTTTAGGGTTAGGGTTAGGGTTAGGGTTA
GGGTTTAGGGTTTAGGGTTTAGGGTTTAGGGGTTAGGGTTAGGGTAAGGTTTAGGGTTAGGGGTTAGGGT
TAGGGTTAGCGTTAGGGTTAGGGTTAGGGTTAGGGGTTAGGGTAGGGTTCGGGTTAGGGTTAGGGTTAGG
GTTAGGTTTCGGGTTAGGGTTCGGCTTAGGGTTCGGGTTAGGGTTTAGGGTTAGGGTTTAGGGTTAGGGT
TAGGGTTAGGGTTCGGGTTCGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGCTCCTTTCAAGT
TTGATTGCGAGCACGGAATTGCTCTGCACGCAGTGCAGGGGAATCGGGCCTCATCTCATGGTGAGGGGGA
AGTCTCATGGTTTTTCTCGAGTTGCGGCCGGAACCTGGGATATATTCTCGAGTTACGACAGGGATGGCCC

http://www.ncbi.nlm.nih.gov/nuccore/DP000826

CTAGGTCAGGTCATCTTTCCTTTCTGTAGTCATAGTTAGACAGATCAGCTGATAAAAACCTTGGATTGTG
TCTGTGGGGTGGATTTGCTTGTCATTGTGCTTCAGCTGGCCAAGGTAGCAACGCCCGCGCCCCTCCCACG
GAAGTAGGAGTGGGGGCGGGGCGCACCGGGGGAGTGTCGGGCCGGGTTAGGGCTAGGGCTTAGGGCTTAG
GGCTTAGGGCTTAGGGCTTAGGGCTGAGGTTGGGGTTAAGGCTTAGGGTTAGGGCTGGGGTTGGGGTTAG
GGCTTAGGGCTTAGGGCTAGGGTTAGGGTTGGGCTTAGGGCTAGGGTTAGGGCTTAGGGCTTAGGGCTAG
GGCTTAGGGCTAGGGTTAGGGGTTAGGGTTAGGGTTAGGGTTAGAGGGTTAGGGTTAGGGTTAGGGTTAG
GGTTAGGGGTTAGGGTTAGGGTTAGGGTTAGGGTTACGGTTAGCGTTAGGGTTAGGGTTAGGGGTTAGGG
TTAGGGTTAGGGTTAGGGTTACGGTTAGCGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTTAGGGTTA
GGGTTAGGTTTAGGGTTAGGGTTTAGGGTTAGGGTTAGGGTTAGTGTTAGGGTTAGGGTCTGGATTGTCT
CCAGAGCCATCCCGCCTTCCCCATCAAACATGACAAGTGGCTTGACTTCCTTTAGACAGTTCCGAAGATT
CCTCGAGAATATCGTCGTCACAAGTCTAGAGGAACACCAAGTTCAGCACAGCAACTCGAGAAAAGCTCCG

http://www.ncbi.nlm.nih.gov/nuccore/DP000827

CTAGCTACCAGTCATTTAGAATAATTTACTGGATGCCGTCAACCACATTCTGTTTAGAGTGTACATATGC
AAATTGAATACAAGAAAAAAAAAACAACGAAACAGAACCCACTCATCTGGTTTTAAACCAAGATCATAAT
CACTATATCTTTCTCAATATATGAGATGTTAGTGAAAAATAATGTTAGGGTTAGGGGTTAGGGTTAGGGT
TAGGGTTGGGGTTGGGGTTAGGGTTAGGGTTAGGGTTAGGGTTCGGGGTTAGGGGTTAGGGTTAGGGGTT
AGGGTTAGGGTAGGGTTAGGGTTAGGGGTTAGGGTTAGGGGTTAGGGGTTAGGGGTTAGGGTTAGGGTTA
GGGGTAAGGGGTTAGGGTTAGGCGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTTGGGTTAGGGTTAG
GGTTAGGGTTACAGTTAGGGTTAGGGTTAGAGTCAGTTAGGGTTAGGGTTAGGGTTTAGGGTTAGGGTTA
GGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGCTTGATTCCCACGTGATTCCCTCCTGATTCC
CACGTGATTCCCACGTGATTCCCACGTGATTCCCACGTCACTCCCACGTGATTCCCACGTGATTCCCACG
TGACTGAGACGTGATTCCCACGTGATTCCCACGTGATTCCCACGCGATTTCCACATGATTCCCACGTGAT

http://www.ncbi.nlm.nih.gov/nuccore/DP000828

GCTGCATATAGGCCCAGAGTTCTGAGATGGTGGAATGAACCAACTGGACACTTAAAGAGACACTCCAAGT
GGATTAGAGAGACTGACTGCTCTAGGGATTGGGCTAGGGCTCGGGCTTGGGCTCGGGTTTGGGGTTCGGG
TTCGGGGTTCGGGTTCAGGCTCGGGCTCGGGTTTGGGGTTCGGGTTCGGGTTAGGGTTCGAGTTAGGGTT
CGGGTTCGGGTTCGGGTTAGGGTTCGGGTTCGGGTTCGGGTTCGGGTTAGGGTTAGGGTTCGGGTTCGGG
TTAGGGGTTAGGGTTAGGGTTAGGGCTAGGGTTAGGGTTAGGGTTAGGGTTAGGGGTTAGGGTTAGGGTT
AGGGTTAGGGGTTAGGGTTAGGGTTAGGGGTTAGGGTTAGGGGTTAGGGTTAGGGTTAGGGTTTAGGGTT
AGGGTTAGGGTTAAGGTTAGGGTTAGGGGTTAGGGTTAGGGTTCGGGTTCGGGTTCGGGTTAGGGTTAGG
GTTAGGGTTAGGGTTAGTGTTAGGGTTAGGGTTAGGTTTAGGGTTAGTGTTAGGGTTTAGGGTTAGGGTT
AGGGTTAGGGTTAGGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGCTTAGGGATTAGGGTTAGGGTTAG
GGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGGTTAGGGTTAGGGTTAGCGTTTAGGTTTAGGATTAGGT
TTAGGGTTAAGGTTAGGGTTAGGGTTTGGGTTAGGGTTAGCTCCCAGACGTACAGCTGCGAGAAGCCTCC
TCTTGTGGTGCTTGTGGACAGTTGGCATTCCTCTTGATTTGAAGCTAGGAAATCAGCCCTCACCTCGAGA
TGATTGGCGGAACACGGAGCTCTTTCTGCTTGCTGCAGTGACCTCAGTTTCCATCTTGACTTGAGACAGT

http://www.ncbi.nlm.nih.gov/nuccore/DP000829

CCGTGTAAACACACAGCCTTCCTAGGCCTTATGCCTCCTCGTCCATAAGGGGATGGTGGTTTTTCTGCTC
ATGGGGTGGGGGGAGGGCACACCTATGGCCACAGCACTGGCTCCATGGACAGCATGGTTCCTTGGGGGCC
TGGAGCCCCACAACTGATAAGACTGACACAAGAGCTGACATTAGGGTTCGGGTTCGGGTCAGGGTTCGGG
TCAGGGTTCGGGTCAGGGGTTAGGGTCAGGGTTAGGGTCAGGGGTTAGGGTCAGGGTTAGGGTTAGGGTT
AGGGGTTAGGGTCAGGGTCAGGGTCAGGGTCAGGGTCAGGGGTTAGGGTCAGGGTCAGGGTCAGGGTCAG
GGTCAGGGTTAGGGGTTAGGGTTAGGGTTATGGTTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGGTTAGG
GTTAGGGTTAGGGTTTAGGGTTAGGGGTTAGGGTTAGGGTTAGGGTTAGGGTTTAGGGTTAGGGTTTAGG
GTTAGGGTTAGGGTTAGGGTTTAGGGTTAGAGTTAAGGTTAGGGTTAGGGTTAGGGGTTAGGGTCAGGGT
CAGGGTCAGGGGTTAGGGTCAGGGTCAGGGTCAGGGTCAGGTTAGGGTTAGGGTTAGGGTTAGGGTTTAG
GGTTAGGGTTAGGGTTAGGGTTTAGGGTTAGGGTTTAGGGTTAGGATAAGGGTTAGGGTTTAGGGTTAGG
GTTAGGGTTAGGGTTTAGGGTTTAGGGTTTAGGGTTAGGGTTAGGGTTAGGTTTAGTGTTTAGGGTTAGG
GTTAGGGTTAGGTTTAGGTTAGGGTTAGGGTTAGGGTTTGGGTTAGGGTTAGGGTTAGGGTTAGGGGTTA
GGGTTAGGGTTAGGCTTTAGGGTTAGGGTTAGGGTTTAGGATTAGGGTTAGGGTTAGGGTTAGGGTTAGG
GTTAGGGTTTAGGGTTAGGGTTAGGGTTACGGTATAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTTAGGG
TTAGGTTTAGGGTTTGGGTTAGGGTTAGGGTTAGGGTGGAGGCCGCAAATTCAACCTCCCTCAACCAGAC
CTACAGCTGCGAGAAGCCTCCTCTTGTGGTGCTTGTGGACAGTTGGCATTCCTCTTGATTTGAAGCCAGG
AAATCAGCTCTCACCTCGAGATGATTCGGAATACACGGAGCTGTTTCTGGTTGGTGAAGTGACTTTAGGA

http://www.ncbi.nlm.nih.gov/nuccore/DP000830

CGTAGGAATGTGCCCATGAGATGAAAATATTGTCCTGAGTTGAAACGGAAGTTAAATTCAAAATCACGTT
GATGAGGCCAAGCAGCAGGAGTTGATTATGTGACTTCATCCTGAAGGGGCTCATCCGGGATCTTCTTTTC
CTTCTTCTTCCTGAGGATGATCTTAGTCTCTTGGGGGTCAGTGGAGTCCCTCTCTGCAGCCTGTTAGGGT
TAGTGTTAGGGTTAGGGTTAGGGTTTAGGGATAGAGTTAGGGTTAGGGTTAGGGTTAGGGTTAGTTAGGG
TTAGGGTTAGGGTTAGGATTAGGGTTTAGGGTAAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTT
AGGGTTAGGGTTAGGGTTAGTAATGGATGGGAGGCCGCCTGTCGAGAAAGGGCAGGGAGCTAGGGCTTTC
TCTAGTGTCTCCGCAGGGGCTTCAGACATCCCTTCATCTTGTGAGATGAAATCCAGCTTGCACTCAGGTC
ACTGCAGGAATTCCGGCCTGATTTCGTGTCAGGGCATCTCGGGATCGATTCCACTGGAGCTCGCAAATTC
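If you want to put a number on how "clean" these repeats are, one rough way is to count how many bases sit inside exact copies of the canonical telomere hexamer TTAGGG. The following is only a sketch: the function name is mine, the test string is the third line of the DP000824 excerpt above, and exact-match counting deliberately ignores degenerate variants such as TTCGGG, so it understates how telomere-like the sequence really is.

```python
import re

def perfect_repeat_fraction(seq, unit="TTAGGG"):
    """Fraction of bases in seq covered by exact, non-overlapping copies of unit."""
    seq = seq.upper()
    if not seq:
        return 0.0
    hits = re.findall(re.escape(unit), seq)
    return len(hits) * len(unit) / len(seq)

# Third line of the DP000824 excerpt shown above
line = "CGGGTTCGGGTTCGGGTTCGGGTTCGGGTTCGGGTTAGGGTTCGGGTTAGGGTTAGGGTTCAGGTTAGGG"
print(f"{perfect_repeat_fraction(line):.2f} of this line is exact TTAGGG")
```

Run it over each full excerpt and you get a quick feel for how far real, known fusion sites have drifted from a pristine repeat.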

In the Immortal Words of Dr Tomkins

First, the sequence was only about 800 bases long—not the 10,000 bases or more you would expect if two 5,000-base (or larger) telomeres fused together.

Second, the fusion-like sequence was very degenerate and only 70% similar to what one would expect of a pristine fusion sequence of the same size.

More DNA Evidence Against Human Chromosome Fusion

Who here agrees with him?

Chromosome 2 Fusion – The Cryptic Centromere

This is a brief tutorial on how one goes about demonstrating the existence of a cryptic centromere on human chromosome 2. It is in response to this point from Jeff Tomkins:

“The purported cryptic centromere on human chromosome 2, like the fusion site, is in a very different location to that predicted by a fusion event.”

New Research Undermines Key Argument for Human Evolution

So, first of all I need to mention that Jeff Tomkins implicitly admits that there is such a putative centromere, but his objection is that it is not where it should be. Nevertheless I’ll show you how to find it and then show that it is where it is expected to be.

So what are we looking for?

“The DNA evidence in question is based on the fact that human, great-ape, and other mammalian centromeres are composed of a highly variable class of DNA sequence that is repeated over and over called alpha-satellite or alphoid DNA. Alphoid DNA, although found in centromeric areas, is not unique to centromeres and is even highly variable between homologous regions throughout the same mammalian genome.”

So basically what we are looking for is a large cluster of these alphoid sequences. As Tomkins states, alphoid sequences are not unique to centromeres, but we shouldn’t find large clusters elsewhere on the chromosome.

BLAST away!

So let’s get a list of all the alphoid sequences that we can find on chromosome 2:

[glenn@macha] cat alphoid.fa
>gi|117911456|emb|CS444613.1| Sequence 51 from Patent WO2006110680
CATTCTCAGAAACTTCTTTGTGATGTGTGCATTCAACTCACAGAGTTGAACCTTCCTTTTCATAGAGCAG
TTTTGAAACACTCTTTTTGTAGAATCTGCAAGTGGATATTTGGACCGCTTTGAGGCCTTCGTTGGAAACG
GGAATATCTTCATATAAAAACTAGACAGAAG

… and then …

[glenn@macha] blastn -query alphoid.fa -subject /Users/glenn/Data/hg19/chr2.fa \
 -outfmt '10 sstart send pident nident length evalue' -out alphoid.csv \
 -task blastn -dust no -soft_masking false -word_size 7 -evalue 1e-30

This command will search chromosome 2 for anything that looks like an alphoid sequence, and write the results to a file named alphoid.csv, and this is what the file looks like after it has been sorted:

70658558,70658701,86.806,125,144,1.10e-42
92272684,92272854,84.211,144,171,1.74e-46
92272855,92273025,88.304,151,171,2.95e-56
92273026,92273194,80.117,137,171,4.38e-35
92273195,92273363,85.294,145,170,1.74e-46
92273366,92273535,80.588,137,170,8.46e-38
92273537,92273707,85.380,146,171,3.37e-49
92274458,92274566,89.091,98,110,6.51e-33
92274567,92274738,83.721,144,172,7.42e-45
92274739,92274909,85.380,146,171,3.37e-49
92274910,92275079,84.706,144,170,1.43e-47
92275081,92275250,85.965,147,171,9.65e-50
...
...
...

That first field (“sstart” in our command above) is where the matching DNA starts on chromosome 2. So if you look at the file in its entirety, you’ll see that there are 483 matches for this alphoid sequence across chromosome 2, and the vast majority – all but 2 of those 483 matches – are clustered around two locations.
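If you would rather not eyeball 483 rows, a few lines of Python can bin the hits by position. This is a sketch, assuming the column order given by the -outfmt string above (sstart first); the function name is made up and the sample holds just three rows copied from the sorted output.

```python
import csv
import io
from collections import Counter

def hits_per_megabase(csv_text):
    """Count BLAST hits per 1 Mb bin, keyed by the bin's start in Mb."""
    bins = Counter()
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue
        sstart = int(row[0])  # first field: where the hit starts on chr2
        bins[sstart // 1_000_000] += 1
    return bins

# Three rows copied from the sorted output above
sample = """70658558,70658701,86.806,125,144,1.10e-42
92272684,92272854,84.211,144,171,1.74e-46
92272855,92273025,88.304,151,171,2.95e-56
"""
for mb, n in sorted(hits_per_megabase(sample).items()):
    print(f"{mb} Mb: {n} hit(s)")
```

Feed it the whole of alphoid.csv instead of the sample and the clusters should stand out as the bins with the big counts.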

The first location is around the 92Mb mark – and this corresponds to the beginning of the active centromere; the second location is around the 133Mb mark.

Could this be our centromere?

Well it certainly is a cluster of alphoid sequences, but is it in the right place? Let’s have a look at the genes on either side of this cluster:

CentromereSynteny
http://grch37.ensembl.org/Homo_sapiens/Location/Synteny?db=core&r=2%3A132000000-134000000&otherspecies=Pan_troglodytes

What you should be looking at here are all the genes that precede the cryptic centromere (from PLEKHB2 down to ANKRD30BL) and their corresponding position on chimpanzee chromosome 2B. Now a couple of the corresponding chimpanzee genes are found on scaffolds (the ones beginning with AACZ or GL), but for the genes that have been placed on the chromosome, you can see that they are all around the 132Mb mark.

For the genes on the other side of the cryptic centromere (GPR39 and LYPD1) you’ll notice that the corresponding genes on chimpanzee chromosome 2B are found near the 136Mb mark.

And what pray tell is in that gap between 132Mb and 136Mb on chimpanzee chromosome 2B? The centromere!

To recap

  1. On human chromosome 2 there are two clusters of alphoid sequences.
  2. One of those clusters is the current active centromere.
  3. The other cluster corresponds well to the centromere on chimpanzee chromosome 2B.

I’m gonna say it’s our cryptic centromere …

Chromosome 2 Fusion – It’s a Binding Site. Whoopty-frikkin-do.

Both Jeff Tomkins:

“Clearly, the putative 800 base fusion site is not a degenerate fusion sequence, but a transcriptionally functional and active DNA binding motif read on the minus strand inside the DDX11L2 gene.”

Alleged Human Chromosome 2 “Fusion Site” Encodes an Active DNA Binding Domain Inside a Complex and Highly Expressed Gene—Negating Fusion

… and Cornelius Hunter over at Darwin’s God:

“Genes shouldn’t be there, regardless of expression level, and TFs shouldn’t be binding there.”

The Naked Ape: BioLogos on Human Chromosome Two

… seem to make much of the fact that the fusion site on human chromosome 2 contains a transcription factor binding site.

The Evidence

In Tomkins’ paper, he posts an image of the UCSC Genome Browser that shows the evidence for transcription factor binding activity at the fusion site. I’ve reproduced the image below, but zoomed in on the relevant sections:

genome_browser_fusion

The first section (in red) shows where the 798 base pair fusion site sits on the chromosome. The next section shows the two DDX11L2 transcripts, transcribed from right to left. As you can see, the longer transcript completely encompasses the fusion site. So far, so good.

But what about the green bumps and the grey bars?

The Green Bumps

Without getting into the details of ChIP-seq, the bumps seen here are a good proxy for the relative strength of the binding site. A high peak signifies a strong and/or frequent bond, while a low peak signifies a weak and/or infrequent bond. The particular transcription factor here is named CTCF, and the highest peak we can see in the image above is 0.0222.

But how big is 0.0222? What do we have to compare it to? Back on the Wikipedia page for CTCF, it says that there are “anywhere between 15,000 – 40,000 CTCF binding sites” in the human genome. That would imply that there are somewhere between 1,200 and 3,100 CTCF binding sites on chromosome 2 alone. Maybe the binding site in the fusion sequence is counted among them?

No. Not even close. Fortunately the ENCODE data behind those green bumps is freely available. If you’re really keen, you can download the file here:

https://genome.ucsc.edu/cgi-bin/hgFileUi?db=hg19&g=wgEncodeOpenChromChip

(Cell Line = H1-hESC; Antibody Target = CTCF (07-729); View = Peaks)

A cursory glance will tell you that a value of 0.0222 doesn’t even make it into the published data – the minimum value to be counted as a peak by ENCODE is 0.1000. If you’re not good at math, that’s 4.5 times taller than Dr Tomkins’ biggest green bump.

And how many peaks are there on chromosome 2 taller than 0.1000, you ask? Almost 6,000 of them. And how tall are they? Well, if I were to take the 1,000 tallest peaks on chromosome 2, the average height would be 2.1188. Yup, that’s almost one hundred times higher than the tallest peak in the fusion site. Can you see that little red pixel? That’s the binding site.

BindingSitesOnChr2

Kinda puts things into perspective, doesn’t it?
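For anyone who wants to check the arithmetic, here it is spelled out. The three constants come straight from the text above; the helper mean_of_top is only a sketch of how one might average the tallest peaks once the signal values have been pulled out of the downloaded file (reading that file is not shown, and its exact layout is not assumed here).

```python
def mean_of_top(values, n):
    """Average of the n largest values (all of them if there are fewer than n)."""
    top = sorted(values, reverse=True)[:n]
    if not top:
        raise ValueError("no values given")
    return sum(top) / len(top)

FUSION_PEAK = 0.0222      # tallest CTCF bump at the fusion site (from the text)
ENCODE_MIN_PEAK = 0.1000  # smallest value in the published peaks (from the text)
TOP_1000_MEAN = 2.1188    # mean of the 1,000 tallest chr2 peaks (from the text)

print(f"threshold vs fusion peak: {ENCODE_MIN_PEAK / FUSION_PEAK:.1f}x")
print(f"top-1000 mean vs fusion peak: {TOP_1000_MEAN / FUSION_PEAK:.1f}x")
```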

The Grey Bars

This is where things get a lot more interesting. The grey bars at the bottom of the image represent the collated signals for all the different transcription factors. In a similar fashion to the CTCF data above, a black bar represents a strong and/or frequent bond, while a grey bar represents a weak and/or infrequent bond.

If you dig down into the data, you’ll see that the transcription factor that causes the bar to be black is RNA Polymerase II. Great. But what is such a strong signal doing in a region that was supposedly caused by a fusion of sub-telomeric DNA? Both Dr Tomkins and Dr Hunter seem to think that transcription factor binding sites and sub-telomeric DNA are mutually exclusive.

No. No they are not.

Let’s move away from DDX11L2 for a minute and have a look at DDX11L1. It’s at the beginning of chromosome 1.

DDX11L1

Would you like to see the DNA from the binding site immediately upstream? Here it is:

>hg19_wgEncodeRegTfbsClusteredV2_Pol2-4H8 range=chr1:10134-10362
ACCCTAACCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCTAAC
CCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC
CTAACCCTAACCCTAAACCCTAAACCCTAACCCTAACCCTAACCCTAACC
CTAACCCCAACCCCAACCCCAACCCCAACCCCAACCCCAACCCTAACCCC
TAACCCTAACCCTAACCCTACCCTAACCC

Does it remind you of anything? Telomere repeats, perhaps?
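If the CCCTAA pattern doesn't ring a bell straight away, read it off the other strand. A minimal sketch (the function name is mine; the test string is the first line of the chr1 sequence above):

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.upper().translate(COMPLEMENT)[::-1]

# First line of the chr1:10134-10362 sequence shown above
upstream = "ACCCTAACCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCTAAC"
flipped = reverse_complement(upstream)

print(reverse_complement("CCCTAA"))  # the repeat unit, read on the other strand
print(flipped.count("TTAGGG"), "exact TTAGGG copies in those first 50 bases")
```

CCCTAA on one strand is TTAGGG on the other, which is the canonical telomere repeat.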

Let’s look at DDX11L5 on chromosome 9. I wonder what the binding site sequence looks like there.

>hg19_wgEncodeRegTfbsClusteredV2_Pol2-4H8 range=chr9:9965-10327
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTAACCCTAACCCTA
ACCCTAACCCAACCCCACCCCAACCCCAACCCCAACCCAACCCTAACCCT
AACCCTAACCCAACCCTAACCCTAACCCTAACCCAACCCTCACCCTCACC
CTCACCCTCACCCTCACCCTCACCCTCACCCTAACCCTACCCTAACCCCT
AACCCCTAACCCCTAACCCCTAACCCTTAACCCTAACCCTAACCCTACCC
TAACCCTAACCCTAACCCCTAACCCCTAACCCCTAACCCTAACCCTAACC
CTAACCCTAACCCCTAACCCCTAACCTCTAACCCTAAACCCTAAACCCTA
AACCCTAAACCCT

Yup. Looks awfully like telomere repeats. The transcription factor binding site for DDX11L9 – right at the end of chromosome 15 – looks like this:

>hg19_wgEncodeRegTfbsClusteredV2_Pol2-4H8 range=chr15:102521116-102521200
TAGGGTTAGGGTTAGGGTTGTTAGGGTTAGGGTTAGGGTTAGGGTTGTTA
GGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGGT

Should I go on, or are Dr Tomkins and Dr Hunter willing to concede the point?

Jeff Tomkins’ Orphan Genes – Mind The Gap!

In January 2016, Dr Jeffrey Tomkins posted an article on ICR’s website claiming that the genetic gap between humans and chimpanzees is getting wider.

This time he cites a PLoS paper titled “Origins of De Novo Genes in Human and Chimpanzee”, and makes the following comment:

In yet another recent research report, scientists describe 634 orphan genes in humans and 780 in chimpanzees. In other words, we now have a new set of 1,307 genes that are completely different between humans and chimpanzees.

Now, it’s not the elementary math error that bothers me here (634 plus 780 is 1,414, not 1,307, and I can’t say I’m surprised any more by how sloppy Tomkins can be); it is the claim that these genes are “completely different” and that they “are found in no other type of creature and therefore have no evolutionary history”.

The authors have kindly made their data available, so I thought I might test Tomkins’ claim, and – as a few people have suggested – I’ll describe the process so that any of you can replicate the results if you are so inclined.

If you’re not so inclined you can probably skip to the end.

Step 1 – Where Are These Genes?

If you scroll down about halfway in the Ruiz-Orera paper, you’ll see links to two GTF files.

For now we’ll just take the human genes and try to find a homolog for each in the chimpanzee genome. So click on the human link and download the file – hsa_denovo.gtf.

If you have a look inside this file you’ll see the coordinates for each coding sequence / exon … but no DNA sequence. Sad Panda.

Step 2 – Get The DNA Sequence. Obviously.

Now this part isn’t trivial. I wrote some code in C++ that took this GTF file as input, parsed out the coordinates for each sequence, carved that sequence from the human genome, and then created one FASTA file per chromosome.

If you’re interested in looking at the code, you will find it here:

https://github.com/NambyPamby/MindTheGap

I decided to exclude any sequences shorter than 30 base pairs because short sequences will almost certainly find a match somewhere on the corresponding chimpanzee chromosome, and that would only serve to inflate the final result.
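For those who don't read C++, the core of this step fits in a few lines of Python. To be clear, this is not the code in the repository, just a sketch of the same idea under a few assumptions: GTF records are tab-separated with 1-based inclusive start and end in columns 4 and 5, the chromosome is already loaded as one long string, and strand is ignored. The record and the toy chromosome are made up.

```python
def parse_gtf_line(line):
    """Return (chrom, start, end, strand) from one GTF record, or None for comments and blanks."""
    if not line.strip() or line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    return fields[0], int(fields[3]), int(fields[4]), fields[6]

def carve(chrom_seq, start, end, min_len=30):
    """Cut a 1-based inclusive interval out of a chromosome string; None if shorter than min_len."""
    piece = chrom_seq[start - 1:end]
    return piece if len(piece) >= min_len else None

# Hypothetical record and toy 'chromosome', just to show the coordinate handling
record = "chr1\ttoy\texon\t11\t50\t.\t+\t.\tgene_id \"demo\";"
chrom, start, end, strand = parse_gtf_line(record)
toy_chromosome = "A" * 10 + "ACGT" * 10 + "T" * 10
print(carve(toy_chromosome, start, end))
```

The min_len check is where the 30 base pair cut-off lives; anything shorter is simply dropped before it ever reaches BLAST.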

Step 3 – BLAST Away!

For each FASTA file we run a BLAST against the corresponding chimpanzee chromosome – ‘run.sh’ takes care of this by calling ‘blast_chrome.sh’ for each FASTA file. Depending on how powerful your machine is, this part might take a few hours. At the end you should have 24 CSV files with the results – one for each of the autosomes, and then two more for the X and Y chromosomes.

Step 4 – Make Sense Of The Results

The ‘analyse.sh’ script reads in these CSV files and prints out some statistics for each chromosome. The two most important columns are the number of nucleotides I queried and the number of identical nucleotides I actually found. I also count the number of queries I submitted and how many results I got.
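The headline figure is just identical nucleotides divided by queried nucleotides. Here is a rough Python sketch of that calculation; it is not the ‘analyse.sh’ script itself, the function names are mine, and the tallies are invented purely to show the arithmetic.

```python
def percent_identity(queried_nt, identical_nt):
    """Identical nucleotides as a percentage of nucleotides queried."""
    return 100.0 * identical_nt / queried_nt if queried_nt else 0.0

def summarise(per_chromosome):
    """per_chromosome maps a chromosome name to (queried_nt, identical_nt)."""
    total_queried = sum(q for q, _ in per_chromosome.values())
    total_identical = sum(i for _, i in per_chromosome.values())
    return percent_identity(total_queried, total_identical)

# Made-up tallies, NOT real results
toy = {"chr1": (1000, 900), "chr2": (500, 450)}
for name, (q, i) in toy.items():
    print(f"{name}: {percent_identity(q, i):.2f}%")
print(f"overall: {summarise(toy):.2f}%")
```

Summing the nucleotides before dividing weights each chromosome by how much sequence was actually queried, rather than averaging 24 percentages as if they were all the same size.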

Drum Roll, Please.

Who would have thunk it? These human genes which supposedly had no evolutionary history have corresponding sequence in the chimpanzee genome which is – on average – about 95.41% identical.

MindTheGap

But why is this post critical of Tomkins and not of the PLoS paper? I suggest reading the PLoS paper first, then reading Tomkins’ article, and let me know in the comments just how sloppy Tomkins is in his interpretations and conclusions.

Where did that guy get his data from?

Over at Evolution News and Views, I got a shout out from Ann Gauger with a tone that somewhat suggests that my data might be wrong. This post is for her.

Dear Ann,

BLASTN 2.2.30+


Query= 8 dna:chromosome chromosome:Galgal4:8:17537065:17580564:1

Length=43500

Subject= 1 dna:chromosome chromosome:GRCh38:1:78700000:78800000:-1

Length=100001


 Score =   291 bits (322),  Expect = 7e-79
 Identities = 361/483 (75%), Gaps = 22/483 (5%)
 Strand=Plus/Plus

Query  620    AATTATGAAAGCATACTTTTC-AGTGGTATTCCAGAGAAAGGACTTGCAAGAACTGGAAT  678
              || ||||||||  ||| |||| ||||   ||||| | | || ||||  |||| |||| ||
Sbjct  10940  AAATATGAAAG--TACATTTCTAGTGTATTTCCACA-ACAGTACTTAGAAGACCTGGGAT  10996

Query  679    AAGGATAAGAAGTGAAGTGGAAATTAGTGGTATTGGACCAAAACTTTGTCTTATTAGGGT  738
                  | || |||| |||| ||| | | ||| |  |  |  ||||   || |||||| |||
Sbjct  10997  GTAAACAAAAAGTAAAGTAGAAGTCACTGGCACAGATCTGAAACCAAGTTTTATTAAGGT  11056

Query  739    AAGAAAATTTATTCAATTTGAAAGGTAAAATTCTCTTGATACCAGTTTGTTGGGTTTTTT  798
              |||||||| ||| ||||||||||| |  ||||  |||        |||| |    |||| 
Sbjct  11057  AAGAAAATATATCCAATTTGAAAGCTGGAATTACCTTC-------TTTGATAACGTTTTC  11109

Query  799    TTTTTAAGCTTTTGGGAAGTAATTAAGTTTCATCATATGTTGTGCTTACTCAGGCAGAAT  858
                   ||||||| |  || |||||||||||||||||||||| |||||||||| |||||||
Sbjct  11110  -----AAGCTTTGGACAAATAATTAAGTTTCATCATATGTTTTGCTTACTCATGCAGAAT  11164

Query  859    GTAACTAACACTACTGTTTTTTTATT-CAGTGCTCTAAATTCTATTTG-CACTTT-GCCA  915
              ||||||||  ||  | |||||||| | |||    |||||||| ||||  |||| | ||||
Sbjct  11165  GTAACTAAGTCTTTTTTTTTTTTAATGCAGAAGCCTAAATTCCATTTCACACTGTAGCCA  11224

Query  916    GGTAATTCTCAGCTCAAGCCAACCTTGGGCTTGAAGGATTTCTTCTGCTTTGTGGCCAGG  975
              |  |||||||||||||||| |||||||||||||| |||||||||||||||||||  ||||
Sbjct  11225  GACAATTCTCAGCTCAAGCTAACCTTGGGCTTGAGGGATTTCTTCTGCTTTGTGCTCAGG  11284

Query  976    GAGACAATGGAATGTAATTTGAAATGCACAGTAATTTGTTATTGGATCAATCCAATTGTT  1035
              |||||||| ||   ||||||  ||||||||| | ||| ||    ||||||| ||||||| 
Sbjct  11285  GAGACAATAGAGCATAATTTTGAATGCACAGCAGTTTATTCCAAGATCAATTCAATTGTA  11344

Query  1036   CC-AAACTGTACAACCTAGGATTATTTAATCAACTGATTTCGTAGCCAGCAAACGAAAGG  1094
              || ||||  || ||| |  |||||||||||||||| ||||| ||  ||| |||| ||| |
Sbjct  11345  CCAAAACCATATAACTTTAGATTATTTAATCAACTTATTTCATAAGCAG-AAAC-AAATG  11402

Query  1095   CAA  1097
              |||
Sbjct  11403  CAA  11405


 Score =   223 bits (246),  Expect = 3e-58
 Identities = 287/390 (74%), Gaps = 31/390 (8%)
 Strand=Plus/Plus

Query  24111  TTTTTCTTGTTAAAGGACCAGAGCTGGCTCTTGCAGCCTATTTTTACAGTACCGTGTGAT  24170
              |||||  |||||||  ||||| ||   |||| || |||||||||||   | | || | ||
Sbjct  41719  TTTTTAATGTTAAAACACCAGCGCCAACTCTGGCTGCCTATTTTTATTATGCTGTATAAT  41778

Query  24171  TCTGCAGACATTGACATGTGTCACCTGTGATGCAGCTACATTTGTCG-GCTCTCTGTGCT  24229
              ||  |||  || ||||||||||||||||| |||||  ||||||| |  ||||| ||||||
Sbjct  41779  TCCACAGGTATCGACATGTGTCACCTGTGTTGCAGTCACATTTGGCCAGCTCTTTGTGCT  41838

Query  24230  CAACAGGGAGGAATCGATCTTCTACTTTCATTAGGTGGCAGGAGTAGACTATTGGCATAA  24289
              ||  ||||| | ||| ||||||  ||||||||| |||| || | ||||| ||||||||||
Sbjct  41839  CACTAGGGAAGCATCAATCTTCAGCTTTCATTAAGTGGTAGAAATAGACCATTGGCATAA  41898

Query  24290  AAAAT--ACTAAAAAAAAAAATGGAAAAGAAACCCCAGGGCTTTCTGCTTGGAGAC--CC  24345
              |||||    |  ||||| ||||||   ||||          ||||||| ||| |||  | 
Sbjct  41899  AAAATTATTTTTAAAAATAAATGG--GAGAA---------TTTTCTGCCTGGGGACTACA  41947

Query  24346  AGACTGCTGTTCAGTGGTCATTTGAATTATTGAATGGGATTAAAATAAAAGCCATTTC--  24403
                ||| ||||||  | || ||||||||||||||||||  |  ||||||||||||||||  
Sbjct  41948  CCACTACTGTTCTCTAGTAATTTGAATTATTGAATGGACT--AAATAAAAGCCATTTCTA  42005

Query  24404  --CTTTTT----TTTGTCCCTTTACTCAGATGAGCCATCTGAAATGCAAGTTGATTTGT-  24456
                ||||||    ||| |   ||| | ||||  |   | ||||||||||||||||||| | 
Sbjct  42006  TTCTTTTTATTCTTTTTTTTTTTGCCCAGACAA---ACCTGAAATGCAAGTTGATTTTTT  42062

Query  24457  -ATTTTCTTTTATCCCCTCAGCTTGTTAGG  24485
               |   |||||||||||||||||||||||||
Sbjct  42063  AAAAATCTTTTATCCCCTCAGCTTGTTAGG  42092


 Score =   219 bits (242),  Expect = 4e-57
 Identities = 269/362 (74%), Gaps = 19/362 (5%)
 Strand=Plus/Plus

Query  41677  AAAATTATGAAAGCTTTACCCATTCTCATTATGCAAACTTAATCTAAATGGATGTCCTAA  41736
              ||||| | | ||||||| | ||||||  ||| |||||| ||||||||| | | |  ||||
Sbjct  84967  AAAATCAGGGAAGCTTTGCACATTCTAGTTACGCAAACGTAATCTAAACGAAGGCTCTAA  85026

Query  41737  TATTCTTCCAGATGCAACGATAAACCTCCAACTATCTAAGAATATTTATTGGGAGGATGA  41796
              || |||||   || |   |||||||||||||  |  | ||||| |||||||| |||||||
Sbjct  85027  TACTCTTCTCTATACCTTGATAAACCTCCAAACAGTTGAGAATTTTTATTGGAAGGATGA  85086

Query  41797  GTATTATTATTCCATTTAGATTATTTCAGCATTAAGGGATATGGCTTATTCAAGCTGCTG  41856
               |  |||||||||||||||           ||||| |||||||||||||||| ||||| |
Sbjct  85087  ATCATATTATTCCATTTAG-----------ATTAAAGGATATGGCTTATTCAGGCTGCAG  85135

Query  41857  TTGATAAAGCAATGTGGTAAGGTTAATACTGCAACTGAC-AATGCTCTGCCAGATTTCAC  41915
               | |||||||   || || || ||||| ||  ||||||| |||||  |||||||||| | 
Sbjct  85136  ATAATAAAGCCGCGTAGTCAGCTTAATGCTAAAACTGACAAATGCGGTGCCAGATTTGAA  85195

Query  41916  AATATATGGCAAACTTTAATTAGAAGTTTATGAACCTCTGAAAATTCTCCGAAGGGCTTA  41975
              || |||  | ||| ||||||| ||||||||||||  ||||||||  |||| ||||| |||
Sbjct  85196  AACATACAGGAAATTTTAATTGGAAGTTTATGAAAGTCTGAAAA-ACTCCAAAGGGTTTA  85254

Query  41976  TCCTTCAGGATGAACTT-CGACAAA--TAGTCAGCTGAAATATGCAGTGATATGCA-GCA  42031
              |||||    || |||||   |||||  |||  |||| | ||||    ||||||||| |||
Sbjct  85255  TCCTTAGAAATAAACTTAAAACAAAATTAG--AGCTAATATATTAGCTGATATGCAGGCA  85312

Query  42032  AT  42033
              ||
Sbjct  85313  AT  85314


 Score =   102 bits (112),  Expect = 7e-22
 Identities = 219/321 (68%), Gaps = 40/321 (12%)
 Strand=Plus/Plus

Query  42618  AATGCATTATGTACAGTCTGCACTGCTTAATAAATATGTTGTTCATTAAATAAGTATTCA  42677
              ||||||||||||||||||||   ||||||||||||   ||  |||| |||||   |||| 
Sbjct  85882  AATGCATTATGTACAGTCTG---TGCTTAATAAATGCATTACTCATAAAATATACATTCG  85938

Query  42678  TGTGGTCTCCCTCTTTATTTTTCCATATCAACAAAACAATCAGACCAACTATAATATTAT  42737
                 |  ||||          || | |||||| | | ||||||  |  |  |||| |||||
Sbjct  85939  CAAGTCCTCC----------TTTCTTATCAATAGAGCAATCAAGCTGAGCATAAGATTAT  85988

Query  42738  CCAGAAATTCTGCTTCTTTTTAT-CTGAAATATAATTATAGCAGTCCTCTCTTAAAATTA  42796
                ||||||||  || |||  ||| |  ||||||||| | |  | |||||| ||||||| |
Sbjct  85989  TGAGAAATTCAACTCCTTCCTATACAAAAATATAATCACATTA-TCCTCTTTTAAAATCA  86047

Query  42797  TGTTCACTAGGTGATGAAAGGAAA----ACATGATTACAGCTACTGCTAACATTCCATTG  42852
              || |||      ||||||| ||||    | ||| ||  ||||     ||| ||| |||||
Sbjct  86048  TGCTCA------GATGAAAAGAAACAAGAGATGGTT--AGCT-----TAATATTTCATTG  86094

Query  42853  TGTAAAGAATTTCATATTTAGCATACTAAAGACACAGTAAGCATTTGTTTTCTTTTATGT  42912
              |  |||| |||| ||||||| | |||| || | ||  ||  ||||| |||||||| ||||
Sbjct  86095  TAAAAAGCATTTTATATTTAACCTACTCAAAATAC--TA--CATTT-TTTTCTTTCATGT  86149

Query  42913  TTTCTGGTA---AAGAGAAAA  42930
              || |  | |   |||||||||
Sbjct  86150  TTCCCAGCATAGAAGAGAAAA  86170


 Score = 57.2 bits (62),  Expect = 3e-08
 Identities = 45/54 (83%), Gaps = 0/54 (0%)
 Strand=Plus/Plus

Query  21281  AATACACTCTGGAAAACTATTGTAGCCAGATGCCAAACAAATGAAAACAGATGG  21334
              |||| || |||  |||| |||||| | | || ||||||||||||||||||||||
Sbjct  40402  AATATACCCTGTGAAACAATTGTAACTAAATACCAAACAAATGAAAACAGATGG  40455


 Score = 51.8 bits (56),  Expect = 1e-06
 Identities = 173/264 (66%), Gaps = 14/264 (5%)
 Strand=Plus/Plus

Query  19077  CATTTCATAAGTTGGTATTATAGAATATGACTG-AATGTAAATGATATAATAGTAGCAAT  19135
              ||||  || || || || || ||||| | |  | |||| ||||||| |    |  |||||
Sbjct  36099  CATTAGATGAGCTGATACTACAGAATGTTAAAGGAATGCAAATGATCTGGCTGGTGCAAT  36158

Query  19136  AAGAAAAATAGCATAGTCACTGCGCATTGAGCA--GGTACTTATAATTTTGCCAATTAAT  19193
                ||||||||  ||| |  |   | | ||| ||  || |||  ||||| ||    |||||
Sbjct  36159  TGGAAAAATATTATAATTTCCATGAAATGAACAAAGGGACTC-TAATTATGTATGTTAAT  36217

Query  19194  AGTTCAAAAAGGCAGAATATTCTGTTTATGGCTGTTTTATAATATTGGTTTTGTAGTTGA  19253
              ||  |||| | ||| | | ||||| |||||   |||| || | ||  ||||| | |||||
Sbjct  36218  AGGACAAAGATGCACAGTGTTCTGCTTATGATGGTTTAATGACATCAGTTTTATGGTTGA  36277

Query  19254  TTTTTCATATA-ATCTTGTATCAG--------TTGTGTTATGCAATTAGTAAACTGATAC  19304
                ||||  | |  |||  ||||||        |  | ||| |||||||||||  |||| |
Sbjct  36278  GATTTCCAACATTTCTCATATCAGACTTTATATCATATTAGGCAATTAGTAACTTGATTC  36337

Query  19305  AATCTGCA-AATGTGGCTTTAAAA  19327
              |  ||||| |  |||| |||||||
Sbjct  36338  ACACTGCACAGGGTGGTTTTAAAA  36361

You’re welcome.

Chromosome 2 Fusion – Dead in a Day?

Ian Juby has stated explicitly that DDX11L2 is “critical for life” both here and here, where he refers to a paper written by Dr Jeffrey Tomkins in 2013.

But just how important is this DDX11L2 gene anyway? Well, we can get some clues just by looking at the name.

A rose by any other name?

Let’s break down the name of the gene itself – DDX – 11 – L – 2

DDX is short for “DEAD Box” – which is an RNA Helicase gene. Helicase genes are incredibly important, since these are the machines that unwind the DNA or RNA so that other molecular machines can read the underlying information.

There is an enormous family of DDX genes across our genome – DDX1, DDX2A, DDX2B, DDX3X, DDX3Y, DDX4, DDX5, DDX6, DDX10, DDX11, DDX17, DDX18, DDX19A, DDX19B, DDX20, DDX21, DDX23, DDX24, DDX25, DDX27, DDX28, DDX31, DDX39A, DDX39B, DDX41, DDX42, DDX43, DDX46, DDX47, DDX48, DDX49, DDX50, DDX51, DDX52, DDX53, DDX54, DDX55, DDX56, DDX58, DDX59 and DDX60.

Yup, that’s a grand total of 41 DEAD Box Helicase protein-coding genes in the human genome.

DDX11 is just one of those 41 DEAD Box Helicase protein-coding genes. There’s nothing obviously special about DDX11 that makes it stand out from all the other DDX genes.

L stands for “Like”. Back in 2009, Valerio Costa reported on a family of transcripts with 18 members whose nucleotide sequence bears a strong resemblance to DDX11. In other words, it looks “Like” DDX11, but it does not code for a protein – it’s a pseudogene. Note that all of these sequences are found in subtelomeric regions. Except for one. Can you guess which one?

2. Yup, DDX11L2 isn’t found near telomeres in the modern human genome, it’s all by itself in the middle of chromosome 2 – make of that what you will. To break things down even further, there are actually two transcripts for this pseudogene – NR_024004.1, which is the longer of the two transcripts, and straddles the putative fusion site – and NR_024005.2, the shorter transcript which does not cross the fusion site. Even if we hypothetically split chromosome 2 at the fusion point, the DDX11L2 pseudogene would still exist. All we would lose is this longer transcript.

So just to recap, what we are talking about here is a solitary alternatively spliced transcript from a pseudogene that is fairly similar to 17 other pseudogenes, and those pseudogenes as a whole bear some resemblance to an actual protein-coding gene – DDX11 – which is itself part of a larger family of 41 genes.

Now we have a rough idea of where this DDX11L2 pseudogene sits in the scheme of things, let’s talk hard numbers.

Droppin’ English. Express Yourself.

Gene expression data is often a good guide to the relative importance of a particular gene or transcript. If you had a bacterial infection, would you go to the doctor to get some antibiotics (with enough penicillin molecules to go around killing off all the bacteria) or would you go the homoeopathic option, where the solution has been diluted so many times that there is virtually no active ingredient left?

So, let’s look at how frequently DDX11L2 is expressed. If you click on this link you’ll see that cells in the testes express DDX11L2 at a rate of 3.905 RPKM (Reads per Kilobase per Million mapped reads), in the pituitary gland at 0.936 RPKM, in the prostate at 0.893 RPKM and in the spleen at 0.882 RPKM.

If you then go up a step and look at the expression data for DDX11 itself, you’ll see that in the testes it is expressed at a rate of 8.477 RPKM, in the pituitary gland at 8.068 RPKM, in the prostate at 9.212 RPKM and in the spleen at 10.047 RPKM.

Already we can see that the DDX11 gene expression levels dwarf those of DDX11L2, but don’t forget we have 40 other DDX genes being transcribed as well! Let’s look at some other DDX genes:

                  DDX1    DDX3X    DDX5    DDX6    DDX11  DDX11L2
Testes           33.980  50.016  138.808   8.816   8.477    3.905
Pituitary Gland  35.394  34.213  321.470   7.100   8.068    0.936
Prostate         24.214  36.621  240.220   9.998   9.212    0.893
Spleen           23.889  50.052  347.019  10.811  10.047    0.882

Wow, DDX5 is expressed almost 400 times more often in the spleen than DDX11L2!
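If you want to check that ratio yourself, here is a minimal Python sketch that recomputes it from the table. The RPKM values are copied straight from the table above; the dictionary layout and the function name `fold_over_ddx11l2` are just my own illustration, not anything provided by GTEx.

```python
# RPKM values copied from the table in this post (GTEx figures as quoted).
RPKM = {
    "Testes":          {"DDX1": 33.980, "DDX3X": 50.016, "DDX5": 138.808,
                        "DDX6": 8.816, "DDX11": 8.477, "DDX11L2": 3.905},
    "Pituitary Gland": {"DDX1": 35.394, "DDX3X": 34.213, "DDX5": 321.470,
                        "DDX6": 7.100, "DDX11": 8.068, "DDX11L2": 0.936},
    "Prostate":        {"DDX1": 24.214, "DDX3X": 36.621, "DDX5": 240.220,
                        "DDX6": 9.998, "DDX11": 9.212, "DDX11L2": 0.893},
    "Spleen":          {"DDX1": 23.889, "DDX3X": 50.052, "DDX5": 347.019,
                        "DDX6": 10.811, "DDX11": 10.047, "DDX11L2": 0.882},
}


def fold_over_ddx11l2(tissue, gene):
    """How many times higher `gene` is expressed than DDX11L2 in `tissue`."""
    row = RPKM[tissue]
    return row[gene] / row["DDX11L2"]


# Print the fold difference for every gene in every tissue.
for tissue, row in RPKM.items():
    folds = {gene: round(fold_over_ddx11l2(tissue, gene), 1)
             for gene in row if gene != "DDX11L2"}
    print(tissue, folds)
```

For the spleen, DDX5 comes out at roughly 393 times the DDX11L2 figure, which is where “almost 400 times” comes from.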

But remember how I said there were two transcripts? Well, the expression data above are from GTEx, and according to the locus given for DDX11L2, it only gives data for the shorter transcript – the one that does not overlap the fusion site.

If we look at AceView we can actually get a breakdown of how frequently the introns are sequenced in RNA-seq studies. The intron that corresponds to the fusion site was sequenced 682 times, while the intron common to both transcripts was sequenced a total of 3,186 times, implying that the longer transcripts make up only around 21.4% of the total DDX11L2 transcripts.

So, what are we to make of these numbers?

When Tomkins claims that DDX11L2 is a “highly expressed gene”, we have to ask “highly expressed relative to what exactly?” According to the AceView link above, this gene is expressed at “only 26.8% of the average gene”. If you then take into account the fact that the transcript that spans the fusion site makes up only 21.4% of transcripts for this gene, then it is expressed only 5.7% as frequently as an average gene.
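The percentages above are easy to reproduce. This is a rough sketch using only the numbers quoted in this post (the AceView intron counts and the 26.8% figure); the variable and function names are mine, and it assumes, as I did above, that intron read counts are a fair proxy for transcript abundance.

```python
# Intron read counts as quoted from AceView in this post.
FUSION_INTRON_READS = 682    # intron corresponding to the fusion site (long transcript)
SHARED_INTRON_READS = 3186   # intron common to both transcripts

# AceView's figure for DDX11L2 expression relative to the average gene.
GENE_VS_AVERAGE = 0.268


def long_transcript_share(fusion_reads, shared_reads):
    """Fraction of DDX11L2 transcripts that include the fusion-site intron."""
    return fusion_reads / shared_reads


def long_transcript_vs_average(gene_vs_average, share):
    """Expression of the long transcript relative to an average gene."""
    return gene_vs_average * share


share = long_transcript_share(FUSION_INTRON_READS, SHARED_INTRON_READS)
relative = long_transcript_vs_average(GENE_VS_AVERAGE, share)

print(f"Long transcript share: {share:.1%}")           # 21.4%
print(f"Relative to an average gene: {relative:.1%}")  # 5.7%
```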

If you were to hypothetically split this chromosome in half at the fusion site, you wouldn’t be “dead in a day” as Ian Juby likes to say, you would just lose a very lowly expressed transcript of a pseudogene. That pseudogene is part of a family of 17 other similar pseudogenes (DDX11L), which as a group, bear some resemblance to an actual protein-coding gene (DDX11). That protein-coding gene is then part of a much larger group of protein-coding genes (DDX).

Chromosome 2 Fusion – The Low Hanging Fruit

Jeffrey Tomkins has written a number of articles in previous years that attempt to cast doubt on the claim that human chromosome 2 is the result of a head-to-head fusion of two ancestral chromosomes.

For the sake of brevity, I will address only a few of the more egregious errors that Dr Tomkins made in his articles; I will address the others when I have the time and the inclination.

Comparative Scale, huh.

In this article Dr Tomkins posts a diagram of the fusion supposedly drawn to scale. The desired effect here is obviously to have people believe that the human chromosome 2 doesn’t align to its chimpanzee counterparts. Here is Dr Tomkins’ diagram:

[Image: TomkinsToScale]

And here is my diagram:

[Image: FusionToScale]

The difference is that the PostScript code used to produce my diagram is freely available, and from that you are able to verify the genome coordinates I have used.

Dr Tomkins also claims here that the combined chimpanzee chromosomes are some 10% larger than human chromosome 2. However, according to the most recent chimpanzee assembly – named “panTro4” – the combined length of chimpanzee chromosomes 2A and 2B is 247.5Mbp while human chromosome 2 is 243.2Mbp. This is a difference of only 4.3Mbp, or 1.8%. It should be noted that the centromeres in the chimpanzee assembly are manually placed on the chromosome and are of an arbitrary fixed length of 3Mbp. This introduces some uncertainty in the true length of the combined chimpanzee chromosomes. Human centromeres are known to range in length from 0.3Mbp to 5.0Mbp, and if the centromere on chimpanzee chromosome 2B is (in reality) at the lower end of that range, then the size difference would easily be less than 1%.
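Here is the same arithmetic as a small Python sketch. The lengths are the figures quoted above; the function names are my own, and the adjustment assumes that only the one 3Mbp placeholder centromere (on 2B) is swapped for a hypothetical “true” length, which is a simplification.

```python
# Lengths in Mbp, as quoted above.
CHIMP_2A_PLUS_2B_MBP = 247.5
HUMAN_CHR2_MBP = 243.2
PLACEHOLDER_CENTROMERE_MBP = 3.0  # arbitrary fixed length used in the assembly


def percent_difference(chimp_mbp, human_mbp):
    """Size excess of the chimp chromosomes as a percentage of human chr2."""
    return (chimp_mbp - human_mbp) / human_mbp * 100


def adjusted_percent_difference(true_centromere_mbp):
    """Difference if one placeholder centromere had `true_centromere_mbp` instead.

    Assumption: only a single placeholder (the 2B centromere) is replaced.
    """
    adjusted_chimp = CHIMP_2A_PLUS_2B_MBP - (PLACEHOLDER_CENTROMERE_MBP - true_centromere_mbp)
    return percent_difference(adjusted_chimp, HUMAN_CHR2_MBP)


print(f"As assembled: {percent_difference(CHIMP_2A_PLUS_2B_MBP, HUMAN_CHR2_MBP):.2f}%")  # about 1.77%
print(f"With a 0.3Mbp centromere: {adjusted_percent_difference(0.3):.2f}%")              # about 0.66%
```

Either way, the result is nowhere near 10%.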

What should we see at the fusion site?

Tomkins predicts that under the fusion model “thousands of intact TTAGGG motifs in tandem should exist” yet he is fully aware that the function of telomeres is to prevent such fusions in the first place. To expect thousands of intact telomere motifs at the fusion site is to expect that intact telomeres somehow failed.

The most parsimonious explanation is that the telomeres were already missing (or severely shortened) and this allowed the fusion to occur. Lab experiments where components of the telomere nucleoprotein complex have been disabled demonstrate the ease with which head-to-head fusions occur when the telomeres are depleted.

Moteefs y’all

Jeffrey Tomkins says explicitly that forward telomere motifs (‘TTAGGG’) should only be found on the left side of the fusion site, and reverse telomere motifs (‘CCCTAA’) should only be found on the right of the fusion site.

Yet by my calculations, any given 6 base pair sequence should occur entirely by chance approximately once every 4,096 base pairs (4 ^ 6 = 4,096), assuming all four bases are equally likely at each position. Dr Tomkins produced a table showing the breakdown of forward and reverse motifs found both to the left and to the right of the fusion site.

On the RP11-395L14 BAC used by Tomkins, there are 108,569 base pairs to the left of the fusion site. Mathematically, I would expect about 26 reverse telomere motifs to be present while Dr Tomkins expects zero – there are 18. To the right of the fusion site are 68,167 base pairs. I would expect about 17 forward telomere motifs, Dr Tomkins expects zero – there are 18.
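If you want to reproduce those expected counts, the sketch below does the division. It assumes a simple model in which every base is independent and equally likely, which real genomes only approximate. The `count_motif` helper is my own illustration of how you might tally motifs in a sequence you have downloaded; it is not the method Dr Tomkins used.

```python
MOTIF_LENGTH = 6  # length of TTAGGG / CCCTAA


def expected_chance_hits(region_length_bp, motif_length=MOTIF_LENGTH):
    """Expected count of one specific motif, assuming uniform random bases."""
    return region_length_bp / 4 ** motif_length


def count_motif(sequence, motif):
    """Count occurrences of `motif` in `sequence` (case-insensitive, overlaps allowed)."""
    sequence = sequence.upper()
    motif = motif.upper()
    count = 0
    start = sequence.find(motif)
    while start != -1:
        count += 1
        start = sequence.find(motif, start + 1)
    return count


left_expected = expected_chance_hits(108_569)   # left of the fusion site
right_expected = expected_chance_hits(68_167)   # right of the fusion site
print(f"Expected reverse motifs on the left: {left_expected:.1f}")
print(f"Expected forward motifs on the right: {right_expected:.1f}")

# Tiny made-up sequence, purely to show the counter working.
toy = "TTAGGGTTAGGGCCCTAA"
print(count_motif(toy, "TTAGGG"), count_motif(toy, "CCCTAA"))
```

The expected values come out at roughly 26.5 and 16.6, so finding 18 on each side is entirely unremarkable.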

In your hands

As indicated by the title, these are just the low-hanging fruit; claims made by Tomkins that can be addressed quite easily. Some of the other claims are a little more complex and/or take more time to address. If there is a particular claim that anyone would like me to address, please let me know in the comments.