Identification of potential transcription factor binding sites (TFBS) across species

My colleague asked me for help with identification of targets for some transcription factors (TFs). The complication is that target motifs for these TFs are known in human, (A/G)GGTGT(C/G/T)(A/G), but exact binding motif is not known in the species of interest. Nevertheless, we decided to scan the genome for matches of this motifs. To facilitate that, I’ve written small program, regex2bed.py, finding sequence motifs in the genome. The program employs regex to find matches in forward and reverse complement and reports bed-formatted output.
[bash]
regex2bed.py -vcs -i DANRE.fa -r "[AG]GGTGT[CGT][AG]" > tf.bed 2> tf.log
[/bash]

regex2bed.py is quite fast, scanning 1.5G genome in ~1 minute on modern desktop. The program reports some basic stats ie. number of matches in +/- strand for each chromosome to stderr.
Most likely, you will find hundred thousands of putative TFBS. Therefore, it’s good to filter some of them ie. focusing on these in proximity of some genes of interest. This can be accomplished using combination of awk, bedtools and two other scripts: bed2region.py and intersect2bed.py.
[bash]
# crosslink with genes within 100 kb upstream of coding genes
awk ‘$3=="gene"’ genome.gtf > gene.gtf
cat tf.bed | bed2region.py 100000 | bedtools intersect -s -loj -a – -b gene.gtf | intersect2bed.py > tf.genes100k.bed
[/bash]

And this is how example output will look like:

1       68669   68677   GGGTGTGG        0       +       ENSDARG00000034862; ENSDARG00000088581; ENSDARG00000100782; ENSDARG00000076900; ENSDARG00000075827; ENSDARG00000096578  f7; f10; F7 (4 of 4); PROZ (2 of 2); f7i; cul4a
1       71354   71362   aggtgtgg        0       +       ENSDARG00000034862; ENSDARG00000088581; ENSDARG00000100181; ENSDARG00000100782; ENSDARG00000076900; ENSDARG00000075827; ENSDARG00000096578      f7; f10; LAMP1 (2 of 2); F7 (4 of 4); PROZ (2 of 2); f7i; cul4a
1       76322   76330   AGGTGTGG        0       +       ENSDARG00000034862; ENSDARG00000088581; ENSDARG00000100181; ENSDARG00000100782; ENSDARG00000076900; ENSDARG00000075827; ENSDARG00000096578      f7; f10; LAMP1 (2 of 2); F7 (4 of 4); PROZ (2 of 2); f7i; cul4a

All above mentioned programs can be found in github.
If you want to learn more about regular expression, have a look at Python re module.