MinION fast5 to fastq

Finally, I got my first data from MinION. The first obstacle I found is the native MinION format, Fast5. Most programs require FastQ. Luckily there are poretools, which makes conversion Fast5 –> FastQ very easy.
# install poretools
sudo pip install poretools

# convert fast5 to fastq
poretools fastq fast5/ > out.fastq

If you want to achieve better basecalling accuracy, have a look at DeepNano: alternative basecaller for MinION reads. It achieves better basecalling accuracy than native MinION basecaller for both 1D (~7%) and 2D (~2%) reads.

Connecting to MySQL without passwd prompt

If you are (like me) annoyed by providing password at every mysql login, you can skip it. Also it makes easier programmatic access to any MySQL db, as not passwd prompting is necessary 🙂
Create `~/.my.cnf` file:



And login without `-p` parameter:
mysql -h host -u username dbname

If you want to use `~/.my.cnf` file in MySQLdb, just connect using this:
import MySQLdb
cnx = MySQLdb.connect(host=host, port=port, read_default_file="~/.my.cnf")

Get gene names for set of RefSeq IDs

Today I needed to annotate set of RefSeq IDs in .bed file with gene names.
Firstly, I was looking for a way to get gene names for RefSeq IDs. I have found simple solution on BioStars.

mysql –user=genome -N – -A -D danRer10
-e "select name,name2 from refGene" > refSeq2gene.txt

Secondly, I’ve written simple Python script to add the gene name to .bed file in place of score which I don’t need at all.

#!/usr/bin/env python
# Add gene name instead of score to BED file

# USAGE: cat bed | refSeq2gene.txt > with_gene_names.bed

import sys

fn = sys.argv[1]
refSeq2gene = {}
for l in open(fn):
refSeq, gene = l[:-1].split(‘t’)
refSeq2gene[refSeq] = gene

sys.stderr.write(" %s accesssions loaded!n"%len(refSeq2gene))

for l in sys.stdin:
ldata = l[:-1].split(‘t’)
chrom, s, e, refSeq = ldata[:4]
if refSeq in refSeq2gene:
ldata[4] = refSeq2gene[refSeq]
ldata[4] = "-"

Hope, some will find it useful.

PhylomeDB: database for collections of gene phylogenies

PhylomeDB is a public database for complete collections of gene phylogenies (phylomes).
Users can interactively explore the evolutionary history of genes through the visualization of phylogenetic trees and multiple sequence alignments. Moreover, phylomeDB provides genome-wide orthology and paralogy predictions which are based on the analysis of the phylogenetic trees.

The automated pipeline used to reconstruct trees aims at providing a high-quality phylogenetic analysis of different genomes, including Maximum Likelihood inference, alignment trimming and evolutionary model testing. PhylomeDB includes also a public download section with the complete set of trees, alignments and orthology predictions.

PhylomeDB uses metaPhOrs (meta-Phylogeny based Orthologs) as a method for predicting orthologs and paralogs. metaPhOrs combines resources of several databases (PhylomeDB, EnsemblCompara, EggNOG, OrthoMCL, COG, Fungal Orthogroups, and TreeFam) to test robustness of each prediction.
All PhylomeDB resources are also accessible through ETE2: a Python Environment for phylogenetic Tree Exploration.

Cross-posted from BioStars.

Generating word clouds in Python

Often I needed to visualise functions associated with set of genes. word_cloud is very handy word cloud generator written in Python.
Install it
[bash]sudo pip install git+git://[/bash]

Incorporate it into your Python code

import matplotlib.pyplot as plt
from wordcloud import WordCloud

text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."

wordcloud = WordCloud().generate(text)

#or save as png

It’s implemented in metaPhOrs.

Selecting stable genes from RNA-Seq experiment

A friend of mine was interested in selecting stable genes from RNA-Seq experiment, stable meaning genes with expression not affected by changing conditions.
I think the easiest solution is using partitioning function implemented in cummeRbund (K-means clustering):


Of course, it’s good to play with `k` value, so you indeed get some cluster with genes that are of interest.
Then, once you have some candidates, you may want to look for genes having similar expression patterns.