
I'm working in a memory-constrained environment and use a Python script with the tarfile library (http://docs.python.org/2/library/tarfile.html) to continuously make backups of log files.

As the number of log files has grown (~74,000), I noticed that the system effectively kills this backup process when it runs now. It consumes an awful lot of memory (~192 MB before it gets killed by the OS).

I can make a gzipped tar archive ($ tar -czf) of the same log files without any problem or high memory usage.

Code:

import tarfile
t = tarfile.open('asdf.tar.gz', 'w:gz')
t.add('asdf')
t.close()

The dir "asdf" consists of 74407 files with filenames of length 73. Is it not recommended to use Python's tarfile when you have a huge amount of files ?

I'm running Ubuntu 12.04.3 LTS and Python 2.7.3 (tarfile version seems to be "$Revision: 85213 $").

  • We have no clue how you're using it. Commented Jan 10, 2014 at 9:05
  • AFAIK tarfile is a pure-python module, so there's no surprise that it might consume quite a bit more memory than the tar command. Commented Jan 10, 2014 at 9:06
  • Could you show us your code? There may be a number of reasons why this is happening, as according to the documentation the TarFile class processes its data in blocks of ~(20 * 512) bytes when opened in stream mode. Do you have yours open for random access instead? (docs.python.org/2/library/tarfile.html). Commented Jan 10, 2014 at 10:00
  • You might indeed fare better by using the binary tar instead of the Python tarfile module in your case. Commented Jan 10, 2014 at 10:47
  • @IgnacioVazquez-Abrams I updated the question with some code, but it's really just basic, standard code. I also added some specs on the number of files and the filename length, in case that matters. Commented Jan 10, 2014 at 15:18

2 Answers


I did some digging in the source code, and it seems that tarfile stores all files in a list of TarInfo objects (http://docs.python.org/2/library/tarfile.html#tarfile.TarFile.getmembers), which causes the ever-increasing memory footprint when there are many files with long names.

The caching of these TarInfo objects was optimized significantly in a commit from 2008 (http://bugs.python.org/issue2058), but from what I can see it was only merged into the py3k branch, i.e. Python 3.

One could reset the members list again and again, as described in http://blogs.it.ox.ac.uk/inapickle/2011/06/20/high-memory-usage-when-using-pythons-tarfile-module/, but since I'm not sure what internal tarfile functionality that would break, I went with a system-level call instead: os.system('tar -czf asdf.tar asdf/').
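A minimal sketch of that members-reset idea on the write side might look like the following (this assumes a flat-ish directory of log files, as in the question, and is not the blog post's exact code). Clearing t.members keeps memory flat, at the cost that getmembers()/getnames() on the open handle will no longer list what was already added:

import os
import tarfile

t = tarfile.open('asdf.tar.gz', 'w:gz')
for root, dirs, files in os.walk('asdf'):
    for name in files:
        t.add(os.path.join(root, name))
        # Drop the cached TarInfo objects so they don't accumulate
        # (one is appended to t.members for every file added).
        t.members = []
t.close()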


1 Comment

web.archive.org/web/20160714075947/http://blogs.it.ox.ac.uk/… In short: tar = tarfile.open(filename, 'r:gz'); [... code ...]; tar.members = [];

Two ways to solve this: if your VM does not have swap, add some and try again. I had 13 GB of files to tar into one big bundle and it was consistently failing (the OS killed the process); adding 4 GB of swap helped.

If you are using a Kubernetes pod or a Docker container, one quick workaround is to add swap on the host; with the SYS_ADMIN capability or privileged mode the container will use the host's swap.

If you need tarfile with streaming to avoid the memory usage, check out: https://gist.github.com/leth/6adb9d30f2fdcb8802532a87dfbeff77
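A minimal sketch of stream mode (my assumption of the general idea, not the gist's code): opening with 'w|gz' writes the archive strictly sequentially in fixed-size blocks, so the output never needs to be seekable. Note that stream mode alone does not shrink the per-file TarInfo cache, so the members-reset trick from the other answer still applies if that is the bottleneck:

import tarfile

# '|' selects stream mode: blocks are flushed sequentially, and the target
# could just as well be a pipe or socket passed via the fileobj argument.
tar = tarfile.open('asdf.tar.gz', mode='w|gz')
tar.add('asdf')
tar.close()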

