Month: June 2016

Chromosome 2 Fusion – It’s a Binding Site. Whoopty-frikkin-do.

Both Jeff Tomkins:

“Clearly, the putative 800 base fusion site is not a degenerate fusion sequence, but a transcriptionally functional and active DNA binding motif read on the minus strand inside the DDX11L2 gene.”

Alleged Human Chromosome 2 “Fusion Site” Encodes an Active DNA Binding Domain Inside a Complex and Highly Expressed Gene—Negating Fusion

… and Cornelius Hunter over at Darwin’s God:

“Genes shouldn’t be there, regardless of expression level, and TFs shouldn’t be binding there.”

The Naked Ape: BioLogos on Human Chromosome Two

… seem to make much of the fact that the fusion site on Human chromosome 2 contains a transcription factor binding site.

The Evidence

In Tomkins’ paper, he posts an image of the UCSC Genome Browser that shows the evidence for transcription factor binding activity at the fusion site. I’ve reproduced the image below, but zoomed in on the relevant sections:

[Image: genome_browser_fusion]

The first section (in red) shows where the 798 base pair fusion site sits on the chromosome. The next section shows the two DDX11L2 transcripts, transcribed from right to left. As you can see, the longer transcript completely encompasses the fusion site. So far, so good.

But what about the green bumps and the grey bars?

The Green Bumps

Without getting into the details of ChIP-seq, the bumps seen here are a good proxy for the relative strength of the binding site. A high peak signifies a strong and/or frequent bond, while a low peak signifies a weak and/or infrequent bond. The particular transcription factor here is named CTCF, and the highest peak we can see in the image above is 0.0222.

But how big is 0.0222? What do we have to compare it to? The Wikipedia page for CTCF says that there are “anywhere between 15,000 – 40,000 CTCF binding sites” in the human genome. Chromosome 2 makes up roughly 8% of the genome, so that would imply somewhere between 1,200 and 3,100 CTCF binding sites on chromosome 2 alone. Maybe the binding site in the fusion sequence is counted among them?
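The 1,200 to 3,100 range is a back-of-the-envelope scaling by chromosome size. Here is a minimal sketch of that arithmetic. It assumes chromosome 2 is about 243 Mb out of a genome of about 3,100 Mb, and that CTCF sites are spread evenly along the genome. Both round figures and the even-spread assumption are mine, not numbers from the Wikipedia page.

```python
# Rough estimate of CTCF sites on chromosome 2 by scaling with chromosome size.
# Assumed round figures (not from the post): chr2 ~243 Mb, genome ~3,100 Mb.
CHR2_MB = 243
GENOME_MB = 3100


def expected_sites_on_chr2(genome_wide_sites):
    """Scale a genome-wide site count by chr2's share of the genome."""
    return genome_wide_sites * CHR2_MB / GENOME_MB


low = expected_sites_on_chr2(15_000)
high = expected_sites_on_chr2(40_000)
print(f"{low:.0f} to {high:.0f} sites")
```

Rounded to the nearest hundred, that lands on the 1,200 and 3,100 quoted above.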

No. Not even close. Fortunately the ENCODE data behind those green bumps is freely available. If you’re really keen, you can download the file here:

https://genome.ucsc.edu/cgi-bin/hgFileUi?db=hg19&g=wgEncodeOpenChromChip

(Cell Line = H1-hESC; Antibody Target = CTCF (07-729); View = Peaks)

A cursory glance will tell you that a value of 0.0222 doesn’t even make it into the published data – the minimum value to be counted as a peak by ENCODE is 0.1000. If you’re not good at math, that’s 4.5 times taller than Dr Tomkins’ biggest green bump.

And how many peaks are there on chromosome 2 taller than 0.1000, you ask? Almost 6,000 of them. And how tall are they? Well, if I were to take the 1,000 tallest peaks on chromosome 2, the average height would be 2.1188. Yup, that’s almost one hundred times higher than the tallest peak in the fusion site. Can you see that little red pixel? That’s the binding site.

[Image: BindingSitesOnChr2]

Kinda puts things into perspective, doesn’t it?
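If you want to check those numbers against the downloaded peaks file yourself, something like the sketch below would do it. It assumes the file is tab-separated in ENCODE narrowPeak layout, with the chromosome in the first column and the signal value in the seventh. I have not confirmed that layout here, so check a few lines of the file before trusting the column index. The function name is made up for illustration.

```python
import heapq


def chr2_peak_stats(lines, chrom="chr2", top_n=1000):
    """Return (number of peaks on chrom, mean signal of the top_n tallest)."""
    signals = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 7 or fields[0] != chrom:
            continue
        signals.append(float(fields[6]))  # assumed signalValue column
    tallest = heapq.nlargest(top_n, signals)
    mean_top = sum(tallest) / len(tallest) if tallest else 0.0
    return len(signals), mean_top


# Ratios quoted in the text, using the values given above.
FUSION_PEAK = 0.0222
print(round(0.1000 / FUSION_PEAK, 1))  # ENCODE minimum vs. fusion-site peak
print(round(2.1188 / FUSION_PEAK, 1))  # top-1,000 mean vs. fusion-site peak
```

To run it on the real data, pass the function an open file handle. If the download is gzipped, open it with gzip.open in text mode.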

The Grey Bars

This is where things get a lot more interesting. The grey bars at the bottom of the image represent the collated signals for all the different transcription factors. In a similar fashion to the CTCF data above, a black bar represents a strong and/or frequent bond, while a grey bar represents a weak and/or infrequent bond.

If you dig down into the data, you’ll see that the transcription factor that causes the bar to be black is RNA Polymerase II. Great. But what is such a strong signal doing in a region that was supposedly caused by a fusion of sub-telomeric DNA? Both Dr Tomkins and Dr Hunter seem to think that transcription factor binding sites and sub-telomeric DNA are mutually exclusive.

No. No they are not.

Let’s move away from DDX11L2 for a minute and have a look at DDX11L1. It’s at the beginning of chromosome 1.

[Image: DDX11L1]

Would you like to see the DNA from the binding site immediately upstream? Here it is:

>hg19_wgEncodeRegTfbsClusteredV2_Pol2-4H8 range=chr1:10134-10362
ACCCTAACCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCTAAC
CCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC
CTAACCCTAACCCTAAACCCTAAACCCTAACCCTAACCCTAACCCTAACC
CTAACCCCAACCCCAACCCCAACCCCAACCCCAACCCCAACCCTAACCCC
TAACCCTAACCCTAACCCTACCCTAACCC

Does it remind you of anything? Telomere repeats, perhaps?

Let’s look at DDX11L5 on chromosome 9. I wonder what the binding site sequence looks like there.

>hg19_wgEncodeRegTfbsClusteredV2_Pol2-4H8 range=chr9:9965-10327
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTAACCCTAACCCTA
ACCCTAACCCAACCCCACCCCAACCCCAACCCCAACCCAACCCTAACCCT
AACCCTAACCCAACCCTAACCCTAACCCTAACCCAACCCTCACCCTCACC
CTCACCCTCACCCTCACCCTCACCCTCACCCTAACCCTACCCTAACCCCT
AACCCCTAACCCCTAACCCCTAACCCTTAACCCTAACCCTAACCCTACCC
TAACCCTAACCCTAACCCCTAACCCCTAACCCCTAACCCTAACCCTAACC
CTAACCCTAACCCCTAACCCCTAACCTCTAACCCTAAACCCTAAACCCTA
AACCCTAAACCCT

Yup. Looks awfully like telomere repeats. The transcription factor binding site for DDX11L9 – right at the end of chromosome 15 – looks like this:

>hg19_wgEncodeRegTfbsClusteredV2_Pol2-4H8 range=chr15:102521116-102521200
TAGGGTTAGGGTTAGGGTTGTTAGGGTTAGGGTTAGGGTTAGGGTTGTTA
GGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGGT
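If eyeballing the repeats isn't enough, you can count them. This is a minimal sketch that measures how much of a sequence is covered by exact copies of the telomere hexamer TTAGGG or its reverse complement CCCTAA. The function names are mine, and degenerate copies such as TTGGGG are deliberately not counted, so the result is a conservative figure.

```python
def revcomp(seq):
    """Reverse complement of an upper-case DNA string (N stays N)."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def telomere_repeat_fraction(seq, motif="TTAGGG"):
    """Fraction of bases covered by exact copies of the motif on either strand."""
    seq = seq.upper().replace("\n", "")
    hits = seq.count(motif) + seq.count(revcomp(motif))
    return hits * len(motif) / len(seq) if seq else 0.0


# The chr15 binding-site sequence shown above.
chr15 = (
    "TAGGGTTAGGGTTAGGGTTGTTAGGGTTAGGGTTAGGGTTAGGGTTGTTA"
    "GGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGGT"
)
print(f"{telomere_repeat_fraction(chr15):.0%} exact TTAGGG/CCCTAA")
```

The same function works on the chr1 and chr9 sequences, where the repeats appear as CCCTAA because those sites sit at the other end of the chromosome.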

Should I go on, or are Dr Tomkins and Dr Hunter willing to concede the point?


Jeff Tomkins’ Orphan Genes – Mind The Gap!

In January 2016, Dr Jeffrey Tomkins posted an article on ICR’s website claiming that the genetic gap between humans and chimpanzees is getting wider.

This time he cites a PLoS paper titled “Origins of De Novo Genes in Human and Chimpanzee”, and makes the following comment:

In yet another recent research report, scientists describe 634 orphan genes in humans and 780 in chimpanzees. In other words, we now have a new set of 1,307 genes that are completely different between humans and chimpanzees.

Now, it’s not the elementary math error that bothers me here (634 plus 780 is 1,414, not 1,307, and I can’t say I’m surprised any more by how sloppy Tomkins can be); it is the claim that these genes are “completely different” and that they “are found in no other type of creature and therefore have no evolutionary history”.

The authors have kindly made their data available, so I thought I might test Tomkins’ claim, and – as a few people have suggested – I’ll describe the process so that any of you can replicate the results if you are so inclined.

If you’re not so inclined you can probably skip to the end.

Step 1 – Where Are These Genes?

If you scroll down about halfway in the Ruiz-Orera paper, you’ll see links to two GTF files, one for the human genes and one for the chimpanzee genes.

For now we’ll just take the human genes and try to find a homolog for each one in the chimpanzee genome. So click on the human link and download the file, hsa_denovo.gtf.

If you have a look inside this file you’ll see the coordinates for each coding sequence / exon … but no DNA sequence. Sad Panda.

Step 2 – Get The DNA Sequence. Obviously.

Now this part isn’t trivial. I wrote some code in C++ that took this GTF file as input, parsed out the coordinates for each sequence, carved that sequence from the human genome, and then created one FASTA file per chromosome.

If you’re interested in looking at the code, you will find it here:

https://github.com/NambyPamby/MindTheGap

I decided to exclude any sequences shorter than 30 base pairs because short sequences will almost certainly find a match somewhere on the corresponding chimpanzee chromosome, and that would only serve to inflate the final result.
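For readers who would rather not wade through the C++, the idea behind this step fits in a few lines of Python. This is a sketch of the technique only, not a port of the code in the repository. It assumes standard nine-column GTF with 1-based inclusive coordinates, and it holds the genome in a dict of chromosome name to sequence, which a real run would fill from the genome FASTA files. The function name and header format are made up for illustration.

```python
def carve_sequences(gtf_lines, genome, min_len=30):
    """Yield (header, sequence) for each GTF feature of at least min_len bases.

    genome maps chromosome name -> sequence string (an assumption made for
    this sketch; a real genome would be read from FASTA files).
    """
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        chrom, start, end = fields[0], int(fields[3]), int(fields[4])
        if chrom not in genome or end - start + 1 < min_len:
            continue  # unknown chromosome or shorter than the cutoff
        # GTF is 1-based and inclusive; Python slices are 0-based, end-exclusive.
        yield f"{chrom}:{start}-{end}", genome[chrom][start - 1:end]
```

The sketch ignores the strand column. That should not matter for the search that follows, since blastn checks both strands of the query by default.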

Step 3 – BLAST Away!

For each FASTA file we run a BLAST against the corresponding chimpanzee chromosome; ‘run.sh’ takes care of this by calling ‘blast_chrome.sh’ for each FASTA file. Depending on how powerful your machine is, this part might take a few hours. At the end you should have 24 CSV files with the results: one for each of the 22 autosomes, and two more for the X and Y chromosomes.
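The per-chromosome loop can be pictured roughly as below. This is a sketch only. The options the actual shell scripts pass to BLAST are not shown in this post, and the file names here are hypothetical. The flags themselves (-query, -subject, -outfmt, -out) are standard blastn options, and output format 10 is comma-separated values.

```python
import shlex


def blastn_command(query_fasta, subject_fasta, out_csv):
    """Build a blastn command line that writes comma-separated tabular hits."""
    return [
        "blastn",
        "-query", query_fasta,
        "-subject", subject_fasta,
        "-outfmt", "10",  # 10 = comma-separated values
        "-out", out_csv,
    ]


# 22 autosomes plus X and Y; file names are hypothetical.
chromosomes = [str(n) for n in range(1, 23)] + ["X", "Y"]
for name in chromosomes:
    cmd = blastn_command(f"human_chr{name}.fa", f"chimp_chr{name}.fa", f"chr{name}.csv")
    print(shlex.join(cmd))
```

The loop only prints the commands. To execute one, hand the list to subprocess.run with check=True. One wrinkle the sketch glosses over: chimpanzee assemblies split the counterpart of human chromosome 2 into 2A and 2B, so that pairing needs both subject files.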

Step 4 – Make Sense Of The Results

The ‘analyse.sh’ script reads in these CSV files and prints out some statistics for each chromosome. The two most important columns are the number of nucleotides I queried and the number of identical nucleotides I actually found. I also count the number of queries I submitted and how many results I got.
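To make those four numbers concrete, here is one way such a tally could be computed. This is a sketch under assumptions, not a description of what the script does internally. It assumes blastn's default twelve tabular columns (qseqid, sseqid, pident, length, and so on through bitscore), keeps only the highest-scoring hit per query, and takes the query lengths from a separate dict. The function name is made up.

```python
import csv


def summarise_hits(csv_lines, query_lengths):
    """Return (queries, queries_with_hits, nucleotides_queried, identical, percent).

    Assumes blastn's default 12 tabular columns and keeps only the
    highest-bitscore hit for each query.
    """
    best = {}  # qseqid -> (bitscore, identical bases in that hit)
    for row in csv.reader(csv_lines):
        if len(row) < 12:
            continue
        qseqid, pident, length = row[0], float(row[2]), int(row[3])
        bitscore = float(row[11])
        identical = round(pident * length / 100)
        if qseqid not in best or bitscore > best[qseqid][0]:
            best[qseqid] = (bitscore, identical)
    queried = sum(query_lengths.values())
    identical_total = sum(ident for _, ident in best.values())
    percent = 100 * identical_total / queried if queried else 0.0
    return len(query_lengths), len(best), queried, identical_total, percent
```

Queries with no hit at all still count toward the nucleotides queried, so they pull the percentage down rather than being silently dropped.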

Drum Roll, Please.

Who would have thunk it? These human genes which supposedly had no evolutionary history have corresponding sequence in the chimpanzee genome which is – on average – about 95.41% identical.

[Image: MindTheGap]

But why is this post critical of Tomkins and not of the PLoS paper? I suggest reading the PLoS paper first, then reading Tomkins’ article, and then letting me know in the comments just how sloppy Tomkins is in his interpretations and conclusions.

Where did that guy get his data from?

Over at Evolution News and Views, I got a shout out from Ann Gauger with a tone that somewhat suggests that my data might be wrong. This post is for her.

Dear Ann,

BLASTN 2.2.30+


Query= 8 dna:chromosome chromosome:Galgal4:8:17537065:17580564:1

Length=43500

Subject= 1 dna:chromosome chromosome:GRCh38:1:78700000:78800000:-1

Length=100001


 Score =   291 bits (322),  Expect = 7e-79
 Identities = 361/483 (75%), Gaps = 22/483 (5%)
 Strand=Plus/Plus

Query  620    AATTATGAAAGCATACTTTTC-AGTGGTATTCCAGAGAAAGGACTTGCAAGAACTGGAAT  678
              || ||||||||  ||| |||| ||||   ||||| | | || ||||  |||| |||| ||
Sbjct  10940  AAATATGAAAG--TACATTTCTAGTGTATTTCCACA-ACAGTACTTAGAAGACCTGGGAT  10996

Query  679    AAGGATAAGAAGTGAAGTGGAAATTAGTGGTATTGGACCAAAACTTTGTCTTATTAGGGT  738
                  | || |||| |||| ||| | | ||| |  |  |  ||||   || |||||| |||
Sbjct  10997  GTAAACAAAAAGTAAAGTAGAAGTCACTGGCACAGATCTGAAACCAAGTTTTATTAAGGT  11056

Query  739    AAGAAAATTTATTCAATTTGAAAGGTAAAATTCTCTTGATACCAGTTTGTTGGGTTTTTT  798
              |||||||| ||| ||||||||||| |  ||||  |||        |||| |    |||| 
Sbjct  11057  AAGAAAATATATCCAATTTGAAAGCTGGAATTACCTTC-------TTTGATAACGTTTTC  11109

Query  799    TTTTTAAGCTTTTGGGAAGTAATTAAGTTTCATCATATGTTGTGCTTACTCAGGCAGAAT  858
                   ||||||| |  || |||||||||||||||||||||| |||||||||| |||||||
Sbjct  11110  -----AAGCTTTGGACAAATAATTAAGTTTCATCATATGTTTTGCTTACTCATGCAGAAT  11164

Query  859    GTAACTAACACTACTGTTTTTTTATT-CAGTGCTCTAAATTCTATTTG-CACTTT-GCCA  915
              ||||||||  ||  | |||||||| | |||    |||||||| ||||  |||| | ||||
Sbjct  11165  GTAACTAAGTCTTTTTTTTTTTTAATGCAGAAGCCTAAATTCCATTTCACACTGTAGCCA  11224

Query  916    GGTAATTCTCAGCTCAAGCCAACCTTGGGCTTGAAGGATTTCTTCTGCTTTGTGGCCAGG  975
              |  |||||||||||||||| |||||||||||||| |||||||||||||||||||  ||||
Sbjct  11225  GACAATTCTCAGCTCAAGCTAACCTTGGGCTTGAGGGATTTCTTCTGCTTTGTGCTCAGG  11284

Query  976    GAGACAATGGAATGTAATTTGAAATGCACAGTAATTTGTTATTGGATCAATCCAATTGTT  1035
              |||||||| ||   ||||||  ||||||||| | ||| ||    ||||||| ||||||| 
Sbjct  11285  GAGACAATAGAGCATAATTTTGAATGCACAGCAGTTTATTCCAAGATCAATTCAATTGTA  11344

Query  1036   CC-AAACTGTACAACCTAGGATTATTTAATCAACTGATTTCGTAGCCAGCAAACGAAAGG  1094
              || ||||  || ||| |  |||||||||||||||| ||||| ||  ||| |||| ||| |
Sbjct  11345  CCAAAACCATATAACTTTAGATTATTTAATCAACTTATTTCATAAGCAG-AAAC-AAATG  11402

Query  1095   CAA  1097
              |||
Sbjct  11403  CAA  11405


 Score =   223 bits (246),  Expect = 3e-58
 Identities = 287/390 (74%), Gaps = 31/390 (8%)
 Strand=Plus/Plus

Query  24111  TTTTTCTTGTTAAAGGACCAGAGCTGGCTCTTGCAGCCTATTTTTACAGTACCGTGTGAT  24170
              |||||  |||||||  ||||| ||   |||| || |||||||||||   | | || | ||
Sbjct  41719  TTTTTAATGTTAAAACACCAGCGCCAACTCTGGCTGCCTATTTTTATTATGCTGTATAAT  41778

Query  24171  TCTGCAGACATTGACATGTGTCACCTGTGATGCAGCTACATTTGTCG-GCTCTCTGTGCT  24229
              ||  |||  || ||||||||||||||||| |||||  ||||||| |  ||||| ||||||
Sbjct  41779  TCCACAGGTATCGACATGTGTCACCTGTGTTGCAGTCACATTTGGCCAGCTCTTTGTGCT  41838

Query  24230  CAACAGGGAGGAATCGATCTTCTACTTTCATTAGGTGGCAGGAGTAGACTATTGGCATAA  24289
              ||  ||||| | ||| ||||||  ||||||||| |||| || | ||||| ||||||||||
Sbjct  41839  CACTAGGGAAGCATCAATCTTCAGCTTTCATTAAGTGGTAGAAATAGACCATTGGCATAA  41898

Query  24290  AAAAT--ACTAAAAAAAAAAATGGAAAAGAAACCCCAGGGCTTTCTGCTTGGAGAC--CC  24345
              |||||    |  ||||| ||||||   ||||          ||||||| ||| |||  | 
Sbjct  41899  AAAATTATTTTTAAAAATAAATGG--GAGAA---------TTTTCTGCCTGGGGACTACA  41947

Query  24346  AGACTGCTGTTCAGTGGTCATTTGAATTATTGAATGGGATTAAAATAAAAGCCATTTC--  24403
                ||| ||||||  | || ||||||||||||||||||  |  ||||||||||||||||  
Sbjct  41948  CCACTACTGTTCTCTAGTAATTTGAATTATTGAATGGACT--AAATAAAAGCCATTTCTA  42005

Query  24404  --CTTTTT----TTTGTCCCTTTACTCAGATGAGCCATCTGAAATGCAAGTTGATTTGT-  24456
                ||||||    ||| |   ||| | ||||  |   | ||||||||||||||||||| | 
Sbjct  42006  TTCTTTTTATTCTTTTTTTTTTTGCCCAGACAA---ACCTGAAATGCAAGTTGATTTTTT  42062

Query  24457  -ATTTTCTTTTATCCCCTCAGCTTGTTAGG  24485
               |   |||||||||||||||||||||||||
Sbjct  42063  AAAAATCTTTTATCCCCTCAGCTTGTTAGG  42092


 Score =   219 bits (242),  Expect = 4e-57
 Identities = 269/362 (74%), Gaps = 19/362 (5%)
 Strand=Plus/Plus

Query  41677  AAAATTATGAAAGCTTTACCCATTCTCATTATGCAAACTTAATCTAAATGGATGTCCTAA  41736
              ||||| | | ||||||| | ||||||  ||| |||||| ||||||||| | | |  ||||
Sbjct  84967  AAAATCAGGGAAGCTTTGCACATTCTAGTTACGCAAACGTAATCTAAACGAAGGCTCTAA  85026

Query  41737  TATTCTTCCAGATGCAACGATAAACCTCCAACTATCTAAGAATATTTATTGGGAGGATGA  41796
              || |||||   || |   |||||||||||||  |  | ||||| |||||||| |||||||
Sbjct  85027  TACTCTTCTCTATACCTTGATAAACCTCCAAACAGTTGAGAATTTTTATTGGAAGGATGA  85086

Query  41797  GTATTATTATTCCATTTAGATTATTTCAGCATTAAGGGATATGGCTTATTCAAGCTGCTG  41856
               |  |||||||||||||||           ||||| |||||||||||||||| ||||| |
Sbjct  85087  ATCATATTATTCCATTTAG-----------ATTAAAGGATATGGCTTATTCAGGCTGCAG  85135

Query  41857  TTGATAAAGCAATGTGGTAAGGTTAATACTGCAACTGAC-AATGCTCTGCCAGATTTCAC  41915
               | |||||||   || || || ||||| ||  ||||||| |||||  |||||||||| | 
Sbjct  85136  ATAATAAAGCCGCGTAGTCAGCTTAATGCTAAAACTGACAAATGCGGTGCCAGATTTGAA  85195

Query  41916  AATATATGGCAAACTTTAATTAGAAGTTTATGAACCTCTGAAAATTCTCCGAAGGGCTTA  41975
              || |||  | ||| ||||||| ||||||||||||  ||||||||  |||| ||||| |||
Sbjct  85196  AACATACAGGAAATTTTAATTGGAAGTTTATGAAAGTCTGAAAA-ACTCCAAAGGGTTTA  85254

Query  41976  TCCTTCAGGATGAACTT-CGACAAA--TAGTCAGCTGAAATATGCAGTGATATGCA-GCA  42031
              |||||    || |||||   |||||  |||  |||| | ||||    ||||||||| |||
Sbjct  85255  TCCTTAGAAATAAACTTAAAACAAAATTAG--AGCTAATATATTAGCTGATATGCAGGCA  85312

Query  42032  AT  42033
              ||
Sbjct  85313  AT  85314


 Score =   102 bits (112),  Expect = 7e-22
 Identities = 219/321 (68%), Gaps = 40/321 (12%)
 Strand=Plus/Plus

Query  42618  AATGCATTATGTACAGTCTGCACTGCTTAATAAATATGTTGTTCATTAAATAAGTATTCA  42677
              ||||||||||||||||||||   ||||||||||||   ||  |||| |||||   |||| 
Sbjct  85882  AATGCATTATGTACAGTCTG---TGCTTAATAAATGCATTACTCATAAAATATACATTCG  85938

Query  42678  TGTGGTCTCCCTCTTTATTTTTCCATATCAACAAAACAATCAGACCAACTATAATATTAT  42737
                 |  ||||          || | |||||| | | ||||||  |  |  |||| |||||
Sbjct  85939  CAAGTCCTCC----------TTTCTTATCAATAGAGCAATCAAGCTGAGCATAAGATTAT  85988

Query  42738  CCAGAAATTCTGCTTCTTTTTAT-CTGAAATATAATTATAGCAGTCCTCTCTTAAAATTA  42796
                ||||||||  || |||  ||| |  ||||||||| | |  | |||||| ||||||| |
Sbjct  85989  TGAGAAATTCAACTCCTTCCTATACAAAAATATAATCACATTA-TCCTCTTTTAAAATCA  86047

Query  42797  TGTTCACTAGGTGATGAAAGGAAA----ACATGATTACAGCTACTGCTAACATTCCATTG  42852
              || |||      ||||||| ||||    | ||| ||  ||||     ||| ||| |||||
Sbjct  86048  TGCTCA------GATGAAAAGAAACAAGAGATGGTT--AGCT-----TAATATTTCATTG  86094

Query  42853  TGTAAAGAATTTCATATTTAGCATACTAAAGACACAGTAAGCATTTGTTTTCTTTTATGT  42912
              |  |||| |||| ||||||| | |||| || | ||  ||  ||||| |||||||| ||||
Sbjct  86095  TAAAAAGCATTTTATATTTAACCTACTCAAAATAC--TA--CATTT-TTTTCTTTCATGT  86149

Query  42913  TTTCTGGTA---AAGAGAAAA  42930
              || |  | |   |||||||||
Sbjct  86150  TTCCCAGCATAGAAGAGAAAA  86170


 Score = 57.2 bits (62),  Expect = 3e-08
 Identities = 45/54 (83%), Gaps = 0/54 (0%)
 Strand=Plus/Plus

Query  21281  AATACACTCTGGAAAACTATTGTAGCCAGATGCCAAACAAATGAAAACAGATGG  21334
              |||| || |||  |||| |||||| | | || ||||||||||||||||||||||
Sbjct  40402  AATATACCCTGTGAAACAATTGTAACTAAATACCAAACAAATGAAAACAGATGG  40455


 Score = 51.8 bits (56),  Expect = 1e-06
 Identities = 173/264 (66%), Gaps = 14/264 (5%)
 Strand=Plus/Plus

Query  19077  CATTTCATAAGTTGGTATTATAGAATATGACTG-AATGTAAATGATATAATAGTAGCAAT  19135
              ||||  || || || || || ||||| | |  | |||| ||||||| |    |  |||||
Sbjct  36099  CATTAGATGAGCTGATACTACAGAATGTTAAAGGAATGCAAATGATCTGGCTGGTGCAAT  36158

Query  19136  AAGAAAAATAGCATAGTCACTGCGCATTGAGCA--GGTACTTATAATTTTGCCAATTAAT  19193
                ||||||||  ||| |  |   | | ||| ||  || |||  ||||| ||    |||||
Sbjct  36159  TGGAAAAATATTATAATTTCCATGAAATGAACAAAGGGACTC-TAATTATGTATGTTAAT  36217

Query  19194  AGTTCAAAAAGGCAGAATATTCTGTTTATGGCTGTTTTATAATATTGGTTTTGTAGTTGA  19253
              ||  |||| | ||| | | ||||| |||||   |||| || | ||  ||||| | |||||
Sbjct  36218  AGGACAAAGATGCACAGTGTTCTGCTTATGATGGTTTAATGACATCAGTTTTATGGTTGA  36277

Query  19254  TTTTTCATATA-ATCTTGTATCAG--------TTGTGTTATGCAATTAGTAAACTGATAC  19304
                ||||  | |  |||  ||||||        |  | ||| |||||||||||  |||| |
Sbjct  36278  GATTTCCAACATTTCTCATATCAGACTTTATATCATATTAGGCAATTAGTAACTTGATTC  36337

Query  19305  AATCTGCA-AATGTGGCTTTAAAA  19327
              |  ||||| |  |||| |||||||
Sbjct  36338  ACACTGCACAGGGTGGTTTTAAAA  36361

You’re welcome.