Naive python implementation of a de Bruijn Graph

A neat implementation of a simple de Bruijn assembler in Python

Bits of Bioinformatics

De Bruijn graphs are widely used in assembly algorithms and I guess writing your own assembler is a rite of passage. I got this question from Mick Watson, who was hoping to get some answers. Since I had a script lying around that I used for validation, I thought I would share it.

So let’s see how this is done. First we’ll need to work with k-mers, which are substrings of a fixed length k. Since we’re keeping this as simple as possible, we’ll use Python strings for k-mers, along with a few helper functions to work with them.

The yield statement in Python gives us an iterator object, so we can simply loop over fw("ATAT"), which will print TATA, TATC, TATG and TATT, i.e. all forward neighbors of the k-mer “ATAT”. If we need to convert the iterator to a list, the easiest thing to do is list(fw("ATAT")).
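The code itself is not included in this excerpt, so here is a minimal sketch of what such a forward-neighbor generator could look like (the fw name comes from the excerpt above; the body is my assumption, see the original post for the actual implementation):
[python]
def fw(km):
    """Yield the 4 forward neighbors of k-mer km: drop the first base, append A/C/G/T.
    NOTE: sketch only; the original post's implementation may differ."""
    for x in "ACGT":
        yield km[1:] + x

# iterate over the generator...
for neighbor in fw("ATAT"):
    print(neighbor)          # TATA, TATC, TATG, TATT

# ...or materialize it as a list
print(list(fw("ATAT")))      # ['TATA', 'TATC', 'TATG', 'TATT']
[/python]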

To keep track of all the k-mers…

View original post 657 more words


MinION fast5 to fastq

Finally, I got my first data from MinION. The first obstacle I ran into is the native MinION format, Fast5, while most programs require FastQ. Luckily there is poretools, which makes the Fast5 -> FastQ conversion very easy.
[bash]
# install poretools
sudo pip install poretools

# convert fast5 to fastq
poretools fastq fast5/ > out.fastq
[/bash]

If you want better basecalling accuracy, have a look at DeepNano, an alternative basecaller for MinION reads. It improves on the native MinION basecaller for both 1D (~7%) and 2D (~2%) reads.

Using docker for application development

I found Docker super useful, but going through the full manual is quite time-consuming. Here is a very stripped-down guide to creating your first image and pushing it online 🙂

[bash]
# install docker
wget -qO- https://get.docker.com/ | sh

# add your user to docker group
sudo usermod -aG docker $USER

# check if it’s working
docker run docker/whalesay cowsay "hello world!"

# create an account on https://hub.docker.com
# and login
docker login -u $USER --email=EMAIL

# run image
docker run -it ubuntu

# make some changes, e.g. create a user, install needed software etc.

# finally, open a new terminal & commit the changes (SESSIONID = the container's HOSTNAME)
docker commit SESSIONID $USER/image:version

# mount local directory `pwd`/test as /test in read/write mode
docker run -it -v `pwd`/test:/test:rw $USER/image:version some command with arguments

# push image
docker push $USER/image:version
[/bash]

From now on, you can get your image on any other machine connected to the Internet by executing:
[bash]
docker run -it $USER/image:version
# e.g. the redundans image
docker run -it -w /root/src/redundans lpryszcz/redundans:v0.11b ./redundans.py -v -i test/{600,5000}_{1,2}.fq.gz -f test/contigs.fa -o test/run1

# you can create the alias latest, so the version can be skipped when running
docker tag lpryszcz/redundans:v0.11b lpryszcz/redundans:latest
docker push lpryszcz/redundans:latest

docker run -it lpryszcz/redundans
[/bash]

You can add info about your repository at https://hub.docker.com/r/$USER/image/

Working efficiently with millions of files

Working with millions of intermediate files can be very challenging, especially if you need to store them on a distributed / network file system (NFS). Listing / navigating the directories takes ages… and removing these files is very time-consuming.
While building the metaPhOrs DB, I needed to store some ~7.5 million intermediate files that were subsequently processed on an HPC cluster. Saving that many files on the NFS would seriously affect not only my own jobs, but also overall system performance.
One could store the files in an archive, but then retrieving the data would require parsing rather huge archives (tens to hundreds of GB) just to pull out rather small portions of data.
I realised that TAR archives are natively supported in Python and can be indexed (see `tar_indexer`), which provides easy integration into existing code and random access. If you work with text data, you can even zlib.compress the data stored inside your archives!
Below, I’m providing relevant parts of my code:
BASH
[bash]
# index content of multiple tar archives
tar2index.py -v -i db_*/*.tar -d archives.db3

# search for some_file in multiple archives
tar2index.py -v -f some_file -d archives.db3
[/bash]

Python
[python]
import sqlite3, time
import tarfile, zlib, cStringIO

###
# lookup function
def tar_lookup(dbpath, file_name):
    """Return file name inside tar, tar file name, offset and file size."""
    cur = sqlite3.connect(dbpath).cursor()
    cur.execute("""SELECT o.file_name, f.file_name, offset, file_size
                   FROM offset_data as o JOIN file_data as f ON o.file_id=f.file_id
                   WHERE o.file_name like ?""", (file_name,))
    return cur.fetchall()

###
# saving to archive
# open tarfile
tar = tarfile.open(tarpath, "w")
# save files to tar
for fname, txt in files_generator:
    # compress file content (optionally)
    gztxt = zlib.compress(txt)
    # get tarinfo
    ti = tarfile.TarInfo(fname)
    ti.size = len(gztxt)
    ti.mtime = time.time()
    # add to tar
    tar.addfile(ti, cStringIO.StringIO(gztxt))

###
# reading from indexed archive(s)
# NOTE: you need to run tar2index.py on your archives first
tarfnames = tar_lookup(index_path, file_name)
for i, (name, tarfn, offset, file_size) in enumerate(tarfnames, 1):
    # open the raw tar file in binary mode
    tarf = open(tarfn, "rb")
    # move the pointer to the right place in the archive
    tarf.seek(offset)
    # read the tar fragment & uncompress
    txt = zlib.decompress(tarf.read(file_size))
[/python]

Tracing exceptions in multiprocessing in Python

I had problems debugging my programme that uses multiprocessing.Pool.
[python]
Traceback (most recent call last):
  File "src/homologies2mysql_multi.py", line 294, in <module>
    main()
  File "src/homologies2mysql_multi.py", line 289, in main
    o.noupload, o.verbose)
  File "src/homologies2mysql_multi.py", line 242, in homologies2mysql
    for i, data in enumerate(p.imap_unordered(worker, pairs), 1):
  File "/usr/lib64/python2.6/multiprocessing/pool.py", line 520, in next
    raise value
ValueError: need more than 1 value to unpack
[/python]

I could run it without multiprocessing, but then I’d have to wait some days for the program to reach the point where it crashes.
Luckily, Python is equipped with the traceback module, which allows handy tracing of exceptions.
You can add a decorator to the problematic function that will report a nice error message:
[python]
import traceback, functools, multiprocessing

# decorator that prints the full traceback of any exception raised inside func
def trace_unhandled_exceptions(func):
    @functools.wraps(func)
    def wrapped_func(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except:
            print('Exception in ' + func.__name__)
            traceback.print_exc()
    return wrapped_func

@trace_unhandled_exceptions
def go():
    print(1)
    raise Exception()
    print(2)

p = multiprocessing.Pool(1)

p.apply_async(go)
p.close()
p.join()
[/python]

The error message will look like:
[python]
1
Exception in go
Traceback (most recent call last):
File "<stdin>", line 5, in wrapped_func
File "<stdin>", line 4, in go
Exception
[/python]

Solution found on StackOverflow.

Conflicting config for htop on machines sharing the same /home directory

A friend of mine spotted a problem with the htop configuration: whenever htop was executed on two different Ubuntu releases (10.04 and 14.04) sharing the same home directory, the config was reset.
After some investigation, we found that 10.04 stores the htop config in ~/.htoprc, while 14.04 uses ~/.config/htop/htoprc. It was enough to remove one of them and symlink the other, as below:
[bash]
rm .htoprc
ln -s .config/htop/htoprc .htoprc
[/bash]

Connecting to MySQL without a password prompt

If you are (like me) annoyed by typing the password at every mysql login, you can skip it. It also makes programmatic access to any MySQL db easier, as no password prompt is necessary 🙂
Create a `~/.my.cnf` file:

[bash]
[client]
user=username
password="pass"

[mysql]
user=username
password="pass"
[/bash]

And log in without the `-p` parameter:
[bash]
mysql -h host -u username dbname
[/bash]

If you want to use the `~/.my.cnf` file in MySQLdb, just connect like this:
[python]
import MySQLdb
cnx = MySQLdb.connect(host=host, port=port, read_default_file="~/.my.cnf")
[/python]
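
Once connected, the usual MySQLdb calls work as normal; a quick sanity check might look like the sketch below (host, port and dbname are placeholders here, not values from the post):
[python]
import MySQLdb
# credentials are picked up from ~/.my.cnf, so no password appears in the code
cnx = MySQLdb.connect(host=host, port=port, db=dbname, read_default_file="~/.my.cnf")
cur = cnx.cursor()
cur.execute("SELECT VERSION()")
print(cur.fetchone())
cnx.close()
[/python]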