Working efficiently with millions of files

Working with millions of intermediate files can be very challenging, especially if you need to store them on a distributed / network file system (NFS). Listing or navigating the directories takes ages, and removing the files is very time-consuming.
While building the metaPhOrs DB, I needed to store ~7.5 million intermediate files that were subsequently processed on HPC. Saving this many files on the NFS would seriously affect not only my own work, but also overall system performance.
One could store the files in an archive, but then retrieving a rather small portion of the data would mean parsing rather huge archives (tens to hundreds of GB).
I realised that TAR archives are natively supported in Python and can be indexed (see `tar_indexer`), which provides easy integration into existing code and random access. If you work with text data, you can even zlib.compress the data stored inside your archives!
Below, I’m providing relevant parts of my code:
# index content of multiple tar archives
tar_indexer -v -i db_*/*.tar -d archives.db3

# search for some_file in multiple archives
tar_indexer -v -f some_file -d archives.db3

import sqlite3, time
import tarfile, zlib, io

# lookup function
def tar_lookup(dbpath, file_name):
    """Return file name inside tar, tar file name, offset and file size."""
    cur = sqlite3.connect(dbpath).cursor()
    cur.execute("""SELECT o.file_name, f.file_name, offset, file_size
                FROM offset_data as o JOIN file_data as f ON o.file_id=f.file_id
                WHERE o.file_name like ?""", (file_name,))
    return cur.fetchall()

# saving to archive
# open tarfile (tar_path is a placeholder for the output archive path)
tar = tarfile.open(tar_path, "w")
# save files to tar
for fname, txt in files_generator:
    # compress file content (optionally); txt has to be bytes on Python 3
    gztxt = zlib.compress(txt)
    # get tarinfo
    ti = tarfile.TarInfo(fname)
    ti.size = len(gztxt)
    ti.mtime = time.time()
    # add to tar
    tar.addfile(ti, io.BytesIO(gztxt))
# close the archive so that the end-of-archive blocks get written
tar.close()

# reading from indexed archive(s)
# NOTE: you need to run tar_indexer on your archives first
tarfnames = tar_lookup(index_path, file_name)
for i, (name, tarfn, offset, file_size) in enumerate(tarfnames, 1):
    # open in binary mode, as the offsets are byte positions
    tarf = open(tarfn, "rb")
    # move pointer to right archive place
    tarf.seek(offset)
    # read tar fragment & uncompress
    txt = zlib.decompress(tarf.read(file_size))
    tarf.close()

TAR random access

I was often challenged with accessing thousands or millions of files on a network file system (NFS). As I update some of the stored files once in a while, I decided to store these files in multiple TAR archives. This reduced the data complexity, but random access to the files within each archive was still an issue.

First, I had a look at tar indexer. Its simplicity is brilliant. Yet, it stores the index in a raw text file and can handle only a single tar file. Therefore, I ended up writing my own tar_indexer tool, which stores the index in sqlite3 and can index multiple tar archives. It can be easily incorporated into any Python project.
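To give an idea of what such an index involves, here is a minimal sketch of indexing several tar archives into one SQLite file. It is not the actual tar_indexer code: the function name `build_index` is hypothetical, and the schema is only an assumption guessed from the table and column names used in the `tar_lookup` query (`file_data`, `offset_data`, `file_id`, `file_name`, `offset`, `file_size`).

```python
import sqlite3
import tarfile


def build_index(tar_paths, db_path):
    """Store the data offset and size of every tar member in SQLite.

    Hypothetical helper: the schema below is an assumption that merely
    matches the names used in the tar_lookup query.
    """
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS file_data ("
                "file_id INTEGER PRIMARY KEY, file_name TEXT)")
    con.execute("CREATE TABLE IF NOT EXISTS offset_data ("
                "file_id INTEGER, file_name TEXT, "
                "offset INTEGER, file_size INTEGER)")
    for tar_path in tar_paths:
        # one row per archive; its id links the members to the archive
        cur = con.execute("INSERT INTO file_data (file_name) VALUES (?)",
                          (tar_path,))
        file_id = cur.lastrowid
        # mode "r:" opens uncompressed archives only
        with tarfile.open(tar_path, "r:") as tar:
            # offset_data is the byte position where the member's data starts
            rows = [(file_id, ti.name, ti.offset_data, ti.size)
                    for ti in tar if ti.isfile()]
        con.executemany("INSERT INTO offset_data VALUES (?, ?, ?, ?)", rows)
    con.commit()
    con.close()
```

With an index like this, fetching a file is one SQL query plus a `seek()` and a `read()` on the right archive, so no archive has to be scanned from the beginning.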

Note that only raw (uncompressed) tar files are accepted, as native tar.gz cannot be randomly accessed. But you can compress each file using zlib before adding it to the tar. At least, this is what I do.
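The per-file compression can be tried out with a small self-contained round trip. This is only a sketch under my own assumptions: the helper names (`write_compressed_members`, `member_offsets`, `read_member`) are made up for illustration, and the offsets are kept in a plain dict instead of an index database.

```python
import io
import tarfile
import time
import zlib


def write_compressed_members(tar_path, files):
    """Write (name, bytes) pairs to a raw tar, zlib-compressing each payload."""
    with tarfile.open(tar_path, "w") as tar:
        for name, data in files:
            packed = zlib.compress(data)
            ti = tarfile.TarInfo(name)
            ti.size = len(packed)  # the tar stores the compressed size
            ti.mtime = time.time()
            tar.addfile(ti, io.BytesIO(packed))


def member_offsets(tar_path):
    """Return {member name: (data offset, stored size)} for a raw tar."""
    with tarfile.open(tar_path, "r:") as tar:
        return {ti.name: (ti.offset_data, ti.size) for ti in tar}


def read_member(tar_path, offset, size):
    """Read one member straight from the file and decompress only that part."""
    with open(tar_path, "rb") as fh:
        fh.seek(offset)
        return zlib.decompress(fh.read(size))
```

Because every member is compressed on its own, a single file can be decompressed without touching the rest of the archive, which is not possible once the whole tar stream has been gzipped.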

I hope someone will find it useful :)