Both Jeff Tomkins:
“Clearly, the putative 800 base fusion site is not a degenerate fusion sequence, but a transcriptionally functional and active DNA binding motif read on the minus strand inside the DDX11L2 gene.”
… and Cornelius Hunter over at Darwin’s God:
“Genes shouldn’t be there, regardless of expression level, and TFs shouldn’t be binding there.”
… seem to make much of the fact that the fusion site on human chromosome 2 contains a transcription factor binding site.
In Tomkins’ paper, he posts an image of the UCSC Genome Browser that shows the evidence for transcription factor binding activity at the fusion site. I’ve reproduced the image below, but zoomed in on the relevant sections:
The first section (in red) shows where the 798 base pair fusion site sits on the chromosome. The next section shows the two DDX11L2 transcripts, transcribed from right to left. As you can see, the longer transcript completely encompasses the fusion site. So far, so good.
But what about the green bumps and the grey bars?
The Green Bumps
Without getting into the details of ChIP-seq, the bumps seen here are a good proxy for the relative strength of the binding site. A high peak signifies strong and/or frequent binding, while a low peak signifies weak and/or infrequent binding. The particular transcription factor here is named CTCF, and the highest peak we can see in the image above is 0.0222.
But how big is 0.0222? What do we have to compare it to? Back on the Wikipedia page for CTCF, it says that there are “anywhere between 15,000 – 40,000 CTCF binding sites” in the human genome. Chromosome 2 makes up roughly 8% of the genome, so that would imply that there are somewhere between 1,200 and 3,100 CTCF binding sites on chromosome 2 alone. Maybe the binding site in the fusion sequence is counted among them?
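If you want to check that back-of-the-envelope estimate, here is a minimal Python sketch. The chromosome and genome sizes are round figures I'm assuming for illustration (they are not taken from the Wikipedia page), and it assumes binding sites are spread evenly across the genome, which is only a rough approximation.

```python
# Approximate sizes in base pairs. These are assumed round figures,
# good enough for an order-of-magnitude estimate.
GENOME_BP = 3_100_000_000
CHR2_BP = 243_000_000


def expected_sites_on_chr2(genome_wide_sites):
    """Scale a genome-wide site count by chromosome 2's share of the genome."""
    return genome_wide_sites * CHR2_BP / GENOME_BP


low = expected_sites_on_chr2(15_000)
high = expected_sites_on_chr2(40_000)
print(f"chr2 share of genome: {CHR2_BP / GENOME_BP:.1%}")
print(f"expected CTCF sites on chr2: {low:.0f} to {high:.0f}")
```

Rounded to the nearest hundred, the two ends of that range land on the 1,200 and 3,100 quoted above.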
No. Not even close. Fortunately the ENCODE data behind those green bumps is freely available. If you’re really keen, you can download the file here:
(Cell Line = H1-hESC; Antibody Target = CTCF (07-729); View = Peaks)
A cursory glance will tell you that a value of 0.0222 doesn’t even make it into the published data – the minimum value to be counted as a peak by ENCODE is 0.1000. If you’re not good at math, that’s 4.5 times taller than Dr Tomkins’ biggest green bump.
And how many peaks are there on chromosome 2 taller than 0.1000, you ask? Almost 6,000 of them. And how tall are they? Well, if I were to take the 1,000 tallest peaks on chromosome 2, the average height would be 2.1188. Yup, that’s almost one hundred times higher than the tallest peak in the fusion site. Can you see that little red pixel? That’s the binding site.
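If you'd like to repeat the comparison yourself, here is a rough Python sketch. It assumes the downloaded peaks file is tab-separated with the chromosome in the first column and the peak height in the seventh (the layout of ENCODE's narrowPeak format); if your file is laid out differently, change `value_col`. The function name `summarize_peaks` is my own and purely illustrative.

```python
import heapq


def summarize_peaks(lines, chrom="chr2", value_col=6, top_n=1000):
    """Return (number of peaks on chrom, mean height of the top_n peaks).

    value_col is a 0-based column index; 6 is an assumption based on
    the narrowPeak layout, not something the post specifies.
    """
    values = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip header lines, short lines, and other chromosomes.
        if len(fields) <= value_col or fields[0] != chrom:
            continue
        values.append(float(fields[value_col]))
    top = heapq.nlargest(top_n, values)
    mean_top = sum(top) / len(top) if top else float("nan")
    return len(values), mean_top


# Figures quoted in the post.
FUSION_PEAK = 0.0222
MIN_PUBLISHED = 0.1000
TOP_1000_MEAN = 2.1188

print(f"cutoff vs fusion peak:   {MIN_PUBLISHED / FUSION_PEAK:.1f}x")
print(f"top-1000 vs fusion peak: {TOP_1000_MEAN / FUSION_PEAK:.1f}x")
```

To run it on the real data you would pass an open file handle, for example `summarize_peaks(open("peaks.txt"))`, where `peaks.txt` is a placeholder for whatever you saved the download as.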
Kinda puts things into perspective, doesn’t it?
The Grey Bars
This is where things get a lot more interesting. The grey bars at the bottom of the image represent the collated signals for all the different transcription factors. In a similar fashion to the CTCF data above, a black bar represents strong and/or frequent binding, while a grey bar represents weak and/or infrequent binding.
If you dig down into the data, you’ll see that the transcription factor that causes the bar to be black is RNA Polymerase II. Great. But what is such a strong signal doing in a region that was supposedly caused by a fusion of sub-telomeric DNA? Both Dr Tomkins and Dr Hunter seem to think that transcription factor binding sites and sub-telomeric DNA are mutually exclusive.
No. No they are not.
Let’s move away from DDX11L2 for a minute and have a look at DDX11L1. It’s at the beginning of chromosome 1.
Would you like to see the DNA from the binding site immediately upstream? Here it is:
>hg19_wgEncodeRegTfbsClusteredV2_Pol2-4H8 range=chr1:10134-10362
ACCCTAACCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCTAAC
CCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC
CTAACCCTAACCCTAAACCCTAAACCCTAACCCTAACCCTAACCCTAACC
CTAACCCCAACCCCAACCCCAACCCCAACCCCAACCCCAACCCTAACCCC
TAACCCTAACCCTAACCCTACCCTAACCC
Does it remind you of anything? Telomere repeats, perhaps?
Let’s look at DDX11L5 on chromosome 9. I wonder what the binding site sequence looks like there.
>hg19_wgEncodeRegTfbsClusteredV2_Pol2-4H8 range=chr9:9965-10327
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTAACCCTAACCCTA
ACCCTAACCCAACCCCACCCCAACCCCAACCCCAACCCAACCCTAACCCT
AACCCTAACCCAACCCTAACCCTAACCCTAACCCAACCCTCACCCTCACC
CTCACCCTCACCCTCACCCTCACCCTCACCCTAACCCTACCCTAACCCCT
AACCCCTAACCCCTAACCCCTAACCCTTAACCCTAACCCTAACCCTACCC
TAACCCTAACCCTAACCCCTAACCCCTAACCCCTAACCCTAACCCTAACC
CTAACCCTAACCCCTAACCCCTAACCTCTAACCCTAAACCCTAAACCCTA
AACCCTAAACCCT
Yup. Looks awfully like telomere repeats. The transcription factor binding site for DDX11L9 – right at the end of chromosome 15 – looks like this:
>hg19_wgEncodeRegTfbsClusteredV2_Pol2-4H8 range=chr15:102521116-102521200
TAGGGTTAGGGTTAGGGTTGTTAGGGTTAGGGTTAGGGTTAGGGTTGTTA
GGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGGT
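If you'd rather count than squint, here is a small Python sketch that tallies the canonical human telomere repeat, TTAGGG, and its reverse complement, CCCTAA, in a sequence. The function names are my own, and a plain exact-match count is a deliberately crude measure: degenerate copies such as TTGGGG or CCCCAA are simply not counted.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T, N)."""
    pairs = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(pairs[base] for base in reversed(seq.upper()))


def telomere_repeat_stats(seq, unit="TTAGGG"):
    """Count exact copies of unit on both strands.

    Returns (forward copies, reverse-complement copies, fraction of
    the sequence covered by those copies).
    """
    seq = "".join(seq.split()).upper()  # drop spaces and line breaks
    forward = seq.count(unit)
    reverse = seq.count(reverse_complement(unit))
    covered = (forward + reverse) * len(unit)
    fraction = covered / len(seq) if seq else 0.0
    return forward, reverse, fraction


# The chr15 (DDX11L9) binding site sequence quoted above.
chr15_site = (
    "TAGGGTTAGGGTTAGGGTTGTTAGGGTTAGGGTTAGGGTTAGGGTTGTTA"
    "GGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGGT"
)
fwd, rev, frac = telomere_repeat_stats(chr15_site)
print(f"TTAGGG: {fwd}, CCCTAA: {rev}, covered: {frac:.0%}")
```

The same function works on the chromosome 1 and chromosome 9 sequences, where the repeats show up as CCCTAA because those sites sit at the other end of their chromosomes.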
Should I go on, or are Dr Tomkins and Dr Hunter willing to concede the point?