Identification of potential transcription factor binding sites (TFBS) across species

A colleague asked me for help identifying targets of some transcription factors (TFs). The complication is that the target motif for these TFs is known in human, (A/G)GGTGT(C/G/T)(A/G), but the exact binding motif is not known in the species of interest. Nevertheless, we decided to scan the genome for matches to this motif. To facilitate that, I’ve written a small program, regex2bed.py, that finds sequence motifs in a genome. The program uses a regular expression to find matches on the forward strand and on its reverse complement, and reports BED-formatted output.
[bash]
regex2bed.py -vcs -i DANRE.fa -r "[AG]GGTGT[CGT][AG]" > tf.bed 2> tf.log
[/bash]
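
To illustrate the idea of matching on both strands, here is a minimal sketch in Python. It is not the code of regex2bed.py: the names `reverse_complement` and `find_motif` are hypothetical, the score column is simply set to 0, and the sequence is assumed to be already loaded as a string.

```python
import re

# Complement table for upper- and lower-case (soft-masked) bases.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def find_motif(chrom, seq, pattern):
    """Yield BED-style tuples (chrom, start, end, match, score, strand)."""
    regex = re.compile(pattern, re.IGNORECASE)
    # Forward strand: match positions are already 0-based, half-open.
    for m in regex.finditer(seq):
        yield chrom, m.start(), m.end(), m.group(), 0, "+"
    # Reverse strand: search the reverse complement, then map the
    # coordinates back onto the forward strand.
    n = len(seq)
    for m in regex.finditer(reverse_complement(seq)):
        yield chrom, n - m.end(), n - m.start(), m.group(), 0, "-"

if __name__ == "__main__":
    for hit in find_motif("1", "TTAGGTGTGGTT", "[AG]GGTGT[CGT][AG]"):
        print("\t".join(map(str, hit)))
```

Note that `re.finditer` reports non-overlapping matches only, so overlapping occurrences of a motif would need a lookahead pattern.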

regex2bed.py is quite fast, scanning a 1.5G genome in ~1 minute on a modern desktop. The program reports some basic stats, i.e. the number of matches on the + and - strands for each chromosome, to stderr.
Most likely, you will find hundreds of thousands of putative TFBS. Therefore, it’s good to filter them, e.g. by focusing on those in proximity to genes of interest. This can be accomplished using a combination of awk, bedtools and two other scripts: bed2region.py and intersect2bed.py.
[bash]
# cross-link TFBS located within 100 kb upstream of coding genes
awk '$3=="gene"' genome.gtf > gene.gtf
cat tf.bed | bed2region.py 100000 | bedtools intersect -s -loj -a - -b gene.gtf | intersect2bed.py > tf.genes100k.bed
[/bash]

This is what example output looks like:

1       68669   68677   GGGTGTGG        0       +       ENSDARG00000034862; ENSDARG00000088581; ENSDARG00000100782; ENSDARG00000076900; ENSDARG00000075827; ENSDARG00000096578  f7; f10; F7 (4 of 4); PROZ (2 of 2); f7i; cul4a
1       71354   71362   aggtgtgg        0       +       ENSDARG00000034862; ENSDARG00000088581; ENSDARG00000100181; ENSDARG00000100782; ENSDARG00000076900; ENSDARG00000075827; ENSDARG00000096578      f7; f10; LAMP1 (2 of 2); F7 (4 of 4); PROZ (2 of 2); f7i; cul4a
1       76322   76330   AGGTGTGG        0       +       ENSDARG00000034862; ENSDARG00000088581; ENSDARG00000100181; ENSDARG00000100782; ENSDARG00000076900; ENSDARG00000075827; ENSDARG00000096578      f7; f10; LAMP1 (2 of 2); F7 (4 of 4); PROZ (2 of 2); f7i; cul4a
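
The last two columns pack all gene IDs and gene names linked to a site into "; "-separated lists. A minimal sketch of that kind of collapsing is shown below. It is not the code of intersect2bed.py: `collapse_hits` is a hypothetical name, and the input is assumed to be already parsed into (site, gene_id, gene_name) tuples, one per site-gene overlap.

```python
from collections import OrderedDict

def collapse_hits(hits):
    """Collapse (site, gene_id, gene_name) tuples into one row per site.

    `site` is a tuple of BED fields; gene IDs and names are joined
    with "; " in the order in which they were first seen.
    """
    grouped = OrderedDict()
    for site, gene_id, gene_name in hits:
        ids, names = grouped.setdefault(site, ([], []))
        if gene_id not in ids:  # skip duplicate overlaps
            ids.append(gene_id)
            names.append(gene_name)
    rows = []
    for site, (ids, names) in grouped.items():
        fields = [str(f) for f in site] + ["; ".join(ids), "; ".join(names)]
        rows.append("\t".join(fields))
    return rows

if __name__ == "__main__":
    site = ("1", 68669, 68677, "GGGTGTGG", 0, "+")
    # ID1/ID2 and g1/g2 are placeholder identifiers, not real genes.
    for row in collapse_hits([(site, "ID1", "g1"), (site, "ID2", "g2")]):
        print(row)
```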

All of the above-mentioned programs can be found on GitHub.
If you want to learn more about regular expressions, have a look at the documentation of the Python re module.
