Tracing exceptions in multiprocessing in Python

I had problems debugging a program that uses multiprocessing.Pool: the traceback pointed into pool.py rather than at the failing line in my worker.
[python]
Traceback (most recent call last):
  File "src/homologies2mysql_multi.py", line 294, in <module>
    main()
  File "src/homologies2mysql_multi.py", line 289, in main
    o.noupload, o.verbose)
  File "src/homologies2mysql_multi.py", line 242, in homologies2mysql
    for i, data in enumerate(p.imap_unordered(worker, pairs), 1):
  File "/usr/lib64/python2.6/multiprocessing/pool.py", line 520, in next
    raise value
ValueError: need more than 1 value to unpack
[/python]

I could run it without multiprocessing, but then I’d have to wait several days for the program to reach the point where it crashes.
Luckily, Python comes with the traceback module, which makes tracing exceptions easy.
You can add a decorator to the problematic function so that it reports a useful error message:
[python]
import functools
import multiprocessing
import traceback


# print the full traceback of any exception raised inside the wrapped function
def trace_unhandled_exceptions(func):
    @functools.wraps(func)
    def wrapped_func(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception:
            print('Exception in ' + func.__name__)
            traceback.print_exc()
    return wrapped_func


@trace_unhandled_exceptions
def go():
    print(1)
    raise Exception()
    print(2)  # never reached


if __name__ == '__main__':
    p = multiprocessing.Pool(1)
    p.apply_async(go)
    p.close()
    p.join()
[/python]

The error message will look like:
[python]
1
Exception in go
Traceback (most recent call last):
  File "<stdin>", line 5, in wrapped_func
  File "<stdin>", line 4, in go
Exception
[/python]

Solution found on StackOverflow.
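
A variation on the same idea, in case printing from the worker is not enough: the decorator can return the formatted traceback to the parent process, which can then collect all failures in one place. This is only a minimal sketch assuming Python 3; `capture_traceback`, `parse_pair` and `run_all` are hypothetical names made up for the illustration, not part of my original script.

```python
import functools
import multiprocessing
import traceback


def capture_traceback(func):
    """Return (result, None) on success or (None, traceback_text) on failure."""
    @functools.wraps(func)
    def wrapped_func(*args, **kwargs):
        try:
            return func(*args, **kwargs), None
        except Exception:
            # format_exc() returns the traceback as a string instead of printing it
            return None, traceback.format_exc()
    return wrapped_func


@capture_traceback
def parse_pair(line):
    # hypothetical worker: raises ValueError when the separator is missing
    key, value = line.split("=")
    return key, value


def run_all(lines, processes=2):
    """Run the worker over all lines and split the outcomes into results and failures."""
    results, failures = [], []
    with multiprocessing.Pool(processes) as pool:
        for result, error in pool.imap_unordered(parse_pair, lines):
            if error is None:
                results.append(result)
            else:
                failures.append(error)
    return results, failures
```

Calling `run_all(["a=1", "broken", "b=2"])` from under an `if __name__ == "__main__":` guard should give two results and one failure whose text includes the line of the worker that raised the error.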

Batch convert of .xlsx (Microsoft Office) to .tsv (tab-delimited) files

I had to retrieve data from multiple .xlsx files with multiple sheets. This can be done manually, but it is a rather time-consuming task, and Office quotes text fields, which is not very convenient for downstream analysis…
I found a handy script, xlsx2tsv.py, that does the job, but it reports only one sheet at a time. Thus, I have rewritten xlsx2tsv.py a little to save all sheets from a given .xlsx file into a separate folder. In addition, multiple .xlsx files can be processed at once. My version can be found on GitHub.
[bash]
xlsx2tsv.py *.xlsx
[/bash]
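
For illustration, the "one folder per workbook, one unquoted .tsv per sheet" idea can be sketched as below. This is not the code of xlsx2tsv.py: it assumes Python 3 and the third-party openpyxl package for reading the workbook, and the function names are hypothetical.

```python
import os


def load_sheets(xlsx_path):
    """Yield (sheet_name, rows) for every sheet; needs the third-party openpyxl."""
    import openpyxl  # imported lazily so the TSV writer below works without it
    workbook = openpyxl.load_workbook(xlsx_path, read_only=True, data_only=True)
    for sheet in workbook.worksheets:
        yield sheet.title, sheet.iter_rows(values_only=True)


def clean(cell):
    """Render a cell as text; whitespace runs (tabs, newlines) collapse to one space."""
    if cell is None:
        return ""
    return " ".join(str(cell).split())


def write_sheets(sheets, outdir):
    """Write each (name, rows) pair to outdir/name.tsv without quoting; return the paths."""
    os.makedirs(outdir, exist_ok=True)
    paths = []
    for name, rows in sheets:
        path = os.path.join(outdir, name + ".tsv")
        with open(path, "w") as handle:
            for row in rows:
                handle.write("\t".join(clean(cell) for cell in row) + "\n")
        paths.append(path)
    return paths
```

With a hypothetical file data.xlsx, `write_sheets(load_sheets("data.xlsx"), "data")` would create data/Sheet1.tsv and so on. Since tabs inside cells are replaced by spaces, no quoting is needed. Sheet names containing characters that are invalid in file names are not handled here.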

Installing new version of Python without root

Some time ago I recommended using a Python virtual environment to install Python packages locally. However, this will not solve the problem of an outdated Python version on the server you are working on. Here, pythonbrew may help.

[bash]
# install pythonbrew to ~/.pythonbrew
curl -kL http://xrl.us/pythonbrewinstall | bash

# add to ~/.bashrc to automatically activate pythonbrew
[[ -s "$HOME/.pythonbrew/etc/bashrc" ]] && source "$HOME/.pythonbrew/etc/bashrc"

# open new terminal tab (Ctrl+Shift+T) or window (Ctrl+Shift+N)

# install python 2.7.10
pythonbrew install 2.7.10

# and enable the new version
pythonbrew switch 2.7.10

# from now on, you can enjoy the version of your choice and install dependencies
which python
#/home/…/.pythonbrew/pythons/Python-2.7.10/bin/python
python --version
#Python 2.7.10
which pip
#/home/…/.pythonbrew/pythons/Python-2.7.10/bin/pip
[/bash]
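
If a script depends on the newer interpreter, it may be worth failing early when it is started with the old system Python by accident. A minimal sketch follows; `require_python` is a hypothetical helper, not something pythonbrew provides.

```python
import sys


def require_python(minimum=(2, 7)):
    """Return the running version, or raise RuntimeError if it is older than `minimum`."""
    current = tuple(sys.version_info[:3])
    if current < tuple(minimum):
        raise RuntimeError("Python %s or newer is required, found %s at %s" % (
            ".".join(map(str, minimum)),
            ".".join(map(str, current)),
            sys.executable))
    return current
```

Calling `require_python((2, 7))` at the top of a script reports which interpreter was picked up whenever it is too old.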

Serving IPython notebook on public domain

I’ve been involved in teaching basic programming in Python. There are several good tutorials and online courses (Python@CodeCademy, to mention just one), but I’ve realised that the students need an interactive workplace. I came up with the idea of setting up a publicly accessible IPython notebook, as many of the students don’t have Python installed locally or are missing certain dependencies…
Installing IPython and serving it publicly seems very easy… But I encountered numerous difficulties on the way, caused by different versions of IPython (i.e. the split into Jupyter in v4), Apache configuration and firewall setup, just to mention a few. Anyway, I succeeded, and I’ve decided to share my experience here 🙂
First of all, I strongly recommend setting up a separate user for serving IPython, as this is the only way to keep your personal files safe 😉

  1. Install IPython notebook and prepare new user
    [bash]
    # install python-dev and build essentials
    sudo apt-get install build-essential python-dev

    # install ipython; v3 is recommended
    sudo pip install "ipython[all]==3.2.1"

    # create new user
    sudo adduser ipython

    # login as new user
    su ipython
    [/bash]

  2. Configure IPython notebook
    [bash]
    # create new profile
    ipython profile create nbserver

    # generate password hash
    ipython -c "from IPython.lib import passwd; passwd()"
    # enter your password twice, save it and copy password hash
    ## Out[1]: 'sha1:[your hashed password here]'

    # add to ~/.ipython/profile_nbserver/ipython_notebook_config.py after `c = get_config()`
    c.NotebookApp.ip = 'localhost'
    c.NotebookApp.open_browser = False
    c.NotebookApp.port = 8889
    c.NotebookApp.base_url = '/ipython'
    c.NotebookApp.password = u'sha1:[your hashed password here]'
    c.NotebookApp.allow_origin = '*'

    # create some directory for notebook files e.g. ~/Public/ipython
    mkdir -p ~/Public/ipython
    cd ~/Public/ipython

    # start notebook server
    ipython notebook --profile=nbserver
    [/bash]

  3. Configure Apache2
    [bash]
    # enable mods
    sudo a2enmod proxy proxy_http proxy_wstunnel
    sudo service apache2 restart

    # add ipython proxy config to your enabled site e.g. /etc/apache2/sites-available/000-default.conf
    # IPython
    <Location "/ipython" >
        ProxyPass http://localhost:8889/ipython
        ProxyPassReverse http://localhost:8889/ipython
    </Location>

    <Location "/ipython/api/kernels/" >
        ProxyPass ws://localhost:8889/ipython/api/kernels/
        ProxyPassReverse ws://localhost:8889/ipython/api/kernels/
    </Location>
    #END

    # restart apache2
    sudo service apache2 restart
    [/bash]

Your public IPython will be accessible at http://yourdomain.com/ipython .
It took me the longest to realise that the c.NotebookApp.allow_origin='*' line is crucial in the IPython notebook configuration; otherwise the kernel keeps losing its connection with a ‘Connection failed‘ or ‘WebSocket error‘ message. Additionally, on one of the servers I tried, a proxy blocks some high ports, so it was impossible to connect to the WebSocket even with the Apache proxy set up…
If you want to read more, especially about setting up an SSL-enabled notebook, have a look at the Jupyter documentation.
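
To give an idea of what the `sha1:…` string in the config stands for, here is a sketch of a salted hash in that general shape. It assumes the stored value has the form algorithm:salt:digest, which is my understanding of what IPython's passwd() produces. The functions below are illustrative only, so always generate the real value with passwd().

```python
import binascii
import hashlib
import hmac
import os


def hash_password(passphrase, salt=None, algorithm="sha1"):
    """Return 'algorithm:salt:digest' for the passphrase (illustrative scheme)."""
    if salt is None:
        # 6 random bytes rendered as 12 hex characters
        salt = binascii.hexlify(os.urandom(6)).decode("ascii")
    digest = hashlib.new(algorithm, (passphrase + salt).encode("utf-8")).hexdigest()
    return ":".join((algorithm, salt, digest))


def check_password(hashed, passphrase):
    """Recompute the digest with the stored salt and compare in constant time."""
    try:
        algorithm, salt, digest = hashed.split(":", 2)
    except ValueError:
        return False
    expected = hashlib.new(algorithm, (passphrase + salt).encode("utf-8")).hexdigest()
    return hmac.compare_digest(expected, digest)
```

The point of the salt is that the same password gives a different string every time it is hashed, so the value in the config file cannot simply be looked up in a table of known hashes.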

Identification of potential transcription factor binding sites (TFBS) across species

My colleague asked me for help with identifying targets of some transcription factors (TFs). The complication is that the target motif for these TFs is known in human, (A/G)GGTGT(C/G/T)(A/G), but the exact binding motif is not known in the species of interest. Nevertheless, we decided to scan the genome for matches of this motif. To facilitate that, I’ve written a small program, regex2bed.py, that finds sequence motifs in a genome. The program uses a regular expression to find matches on the forward and reverse-complement strands and reports BED-formatted output.
[bash]
regex2bed.py -vcs -i DANRE.fa -r "[AG]GGTGT[CGT][AG]" > tf.bed 2> tf.log
[/bash]
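
The core idea (search the sequence, then search its reverse complement and map the coordinates back) can be sketched in a few lines. This is a simplified illustration assuming Python 3, not the actual code of regex2bed.py; the function names are hypothetical and case-insensitive matching is my assumption.

```python
import re

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (other symbols are kept as-is)."""
    return seq.translate(COMPLEMENT)[::-1]


def motif_hits(chrom, seq, pattern):
    """Yield BED6-like tuples (chrom, start, end, name, score, strand) for both strands."""
    regex = re.compile(pattern, re.IGNORECASE)
    length = len(seq)
    # forward strand: match coordinates are already 0-based, half-open
    for match in regex.finditer(seq):
        yield chrom, match.start(), match.end(), match.group(), 0, "+"
    # reverse strand: scan the reverse complement, then flip the coordinates
    for match in regex.finditer(reverse_complement(seq)):
        yield chrom, length - match.end(), length - match.start(), match.group(), 0, "-"
```

For example, `list(motif_hits("chr1", "TTAGGTGTCATT", "[AG]GGTGT[CGT][AG]"))` gives a single hit at positions 2-10 on the + strand. Note that `finditer` reports non-overlapping matches only.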

regex2bed.py is quite fast, scanning a 1.5 Gb genome in ~1 minute on a modern desktop. The program reports some basic stats, i.e. the number of matches on the +/- strand for each chromosome, to stderr.
Most likely, you will find hundreds of thousands of putative TFBS. Therefore, it’s good to filter them, e.g. by focusing on those in proximity of some genes of interest. This can be accomplished using a combination of awk, bedtools and two other scripts: bed2region.py and intersect2bed.py.
[bash]
# crosslink with genes within 100 kb upstream of coding genes
awk '$3=="gene"' genome.gtf > gene.gtf
cat tf.bed | bed2region.py 100000 | bedtools intersect -s -loj -a - -b gene.gtf | intersect2bed.py > tf.genes100k.bed
[/bash]

This is what example output looks like:

1       68669   68677   GGGTGTGG        0       +       ENSDARG00000034862; ENSDARG00000088581; ENSDARG00000100782; ENSDARG00000076900; ENSDARG00000075827; ENSDARG00000096578  f7; f10; F7 (4 of 4); PROZ (2 of 2); f7i; cul4a
1       71354   71362   aggtgtgg        0       +       ENSDARG00000034862; ENSDARG00000088581; ENSDARG00000100181; ENSDARG00000100782; ENSDARG00000076900; ENSDARG00000075827; ENSDARG00000096578      f7; f10; LAMP1 (2 of 2); F7 (4 of 4); PROZ (2 of 2); f7i; cul4a
1       76322   76330   AGGTGTGG        0       +       ENSDARG00000034862; ENSDARG00000088581; ENSDARG00000100181; ENSDARG00000100782; ENSDARG00000076900; ENSDARG00000075827; ENSDARG00000096578      f7; f10; LAMP1 (2 of 2); F7 (4 of 4); PROZ (2 of 2); f7i; cul4a

All the programs mentioned above can be found on GitHub.
If you want to learn more about regular expressions, have a look at the Python re module.

Virtual environment (venv) in Python

Working on machines with no root access is sometimes annoying, especially if you depend on multiple Python packages and your distro is somewhat outdated… You may find a Python virtual environment (venv) very useful.
First, you need to create venv directory structure (this is done only once):
[bash]
mkdir -p ~/src/venv
cd ~/src/venv
virtualenv py27
[/bash]

Then you can open a new Bash terminal and activate your venv with:
[bash]source ~/src/venv/py27/bin/activate[/bash]

After that, you can install / upgrade any packages using pip / easy_install (including pip itself), e.g.
[bash]
pip install --upgrade pip
pip install --upgrade scipy
[/bash]
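
To check from within Python whether the venv is really active, you can compare the interpreter prefixes. A small sketch follows; `in_virtualenv` is a hypothetical helper name. Older virtualenv releases set `sys.real_prefix`, while Python 3 venv and recent virtualenv make `sys.base_prefix` differ from `sys.prefix`.

```python
import sys


def in_virtualenv():
    """Return True if the running interpreter belongs to a virtual environment."""
    # Python 2 outside a virtualenv has no base_prefix, so fall back to sys.prefix
    base_prefix = getattr(sys, "base_prefix", sys.prefix)
    return hasattr(sys, "real_prefix") or base_prefix != sys.prefix
```

Running `print(in_virtualenv(), sys.prefix)` after `source .../bin/activate` should print True and the path of the py27 directory.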

Inspired by python-guide.

Generating word clouds in Python

I often need to visualise functions associated with a set of genes. word_cloud is a very handy word cloud generator written in Python.
Install it
[bash]sudo pip install git+https://github.com/amueller/word_cloud.git[/bash]

Incorporate it into your Python code

[python]
import matplotlib.pyplot as plt
from wordcloud import WordCloud

text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."

wordcloud = WordCloud().generate(text)
plt.imshow(wordcloud)
plt.axis("off")
plt.show()

# or save as PNG
wordcloud.to_file("wordcloud.png")
[/python]

Word clouds generated this way are used in metaPhOrs.
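
For gene functions, it can be handy to count the terms yourself and pass the frequencies to word_cloud instead of raw text. Below is a minimal sketch assuming Python 3; the stop-word list and both function names are hypothetical choices for the example and do not come from metaPhOrs.

```python
import re
from collections import Counter

# tiny illustrative stop-word list; extend it for real annotations
STOPWORDS = {"of", "the", "and", "in", "to", "a"}


def term_frequencies(descriptions, stopwords=STOPWORDS):
    """Count lower-cased words across functional descriptions, skipping stop words."""
    counts = Counter()
    for description in descriptions:
        for word in re.findall(r"[a-z0-9-]+", description.lower()):
            if word not in stopwords:
                counts[word] += 1
    return dict(counts)


def frequencies_to_png(frequencies, path):
    """Render the frequencies as a word cloud image; needs the word_cloud package."""
    from wordcloud import WordCloud  # imported lazily, only needed for drawing
    WordCloud().generate_from_frequencies(frequencies).to_file(path)
```

For example, `term_frequencies(["regulation of transcription", "DNA-binding transcription factor"])` counts "transcription" twice and every other term once, so it ends up largest in the cloud.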