Fast interval selection in Python

Lately, I’ve been working on a program that computes depth of coverage from BAM file given some genome intervals i.e. genes or exons (bam2cov.py). The performance of this task is heavily affected by the time needed for selecting interesting intervals.

At first, I have stored intervals as a list of tuples and I’ve used filter function to select intervals of interest, but the performance of this implementation wasn’t satisfactory (1m29s for test set).

[python]
# populate intervals
c2i[chrom] = [(start, end, strand, name), ]

# select intervals
ivals = filter(lambda x: s<x[0]<e or s<x[1]<e, c2i[chrom])
[/python]

Later, I’ve used numpy.array, which turned out to be much faster (0m07s). Noteworthy, numpy implementation also uses less memory as object are store more efficiently in array (13 bytes per interval; 3x 4b for ‘uint32′ + 1b for ‘bool_’) than in list of tuples (88 bytes per interval; sys.getsizeof((100,1000,1,100))).

[python]
import numpy as np
dtype = np.dtype({‘names’: [‘start’, ‘end’, ‘strand’, ‘entry_id’],
‘formats’: [‘uint32’, ‘uint32’, ‘bool_’, ‘uint32’]})
# populate intervals
chr2intervals[chrom] = np.array(data, dtype=dtype)

# select intervals
ivals = c2i[chrom][np.any([np.all([ c2i[chrom][‘start’]>=s, c2i[chrom][‘start’]<=e ], axis=0),
np.all([ c2i[chrom][‘end’] >=s, c2i[chrom][‘end’] <=e ], axis=0),
np.all([ c2i[chrom][‘start’]< s, c2i[chrom][‘end’] > e ], axis=0)], axis=0)]
[/python]

Benchmarking code:
[python]
import numpy as np
from datetime import datetime
from bam2cov import load_intervals

# load data
chr2intervals, entries = load_intervals(‘stranded/DANRE.gtf’, 0)

# define chromosome name and length
chrom, chrlen = ‘1’, 60348388

# test
sample = random.sample(xrange(0, chrlen), 10000)
counts = np.zeros(entries, dtype=’uint’)
t0 = datetime.now()
for ii,s in enumerate(sample, 1):
if not ii%1000:
print ii
e = s+100
# filter from list – uncomment if needed
#selected = filter(lambda x: s<x[0]<e or s<x[1]<e, c2i[chrom])
# select from numpy.array
selected = c2i[chrom][/c][np.any([np.all([s<c2i[chrom][/c][‘start’], c2i[chrom][/c][‘start’]<e], axis=0), np.all([s<c2i[chrom][/c][‘stop’], c2i[chrom][/c][‘stop’]<e], axis=0)], axis=0)]
for s, e, strand, i in selected:
counts[i] += 1

dt=datetime.now()-t0
print dt, counts.sum()
[/python]

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s