
I'm working in a memory-constrained environment and use a Python script with the tarfile library (http://docs.python.org/2/library/tarfile.html) to continuously make backups of log files.

As the number of log files has grown (~74,000), I noticed that the system effectively kills this backup process when it runs now. It consumes an awful lot of memory (~192 MB before it gets killed by the OS).

I can make a gzipped tar archive ($ tar -czf) of the same log files without any problem or high memory usage.

Code:

import tarfile
t = tarfile.open('asdf.tar.gz', 'w:gz')
t.add('asdf')
t.close()

The dir "asdf" consists of 74407 files with filenames of length 73. Is it not recommended to use Python's tarfile when you have a huge amount of files ?

I'm running Ubuntu 12.04.3 LTS and Python 2.7.3 (tarfile version seems to be "$Revision: 85213 $").

  • We have no clue how you're using it. Commented Jan 10, 2014 at 9:05
  • AFAIK tarfile is a pure-python module, so there's no surprise that it might consume quite a bit more memory than the tar command. Commented Jan 10, 2014 at 9:06
  • Could you show us your code? There may be a number of reasons why this is happening, as according to the documentation the TarFile class processes its data in blocks of ~(20 * 512) bytes when opened in stream mode. Do you have yours open for random access instead? (docs.python.org/2/library/tarfile.html). Commented Jan 10, 2014 at 10:00
  • You might indeed fare better by using the binary tar instead of the Python tarfile module in your case. Commented Jan 10, 2014 at 10:47
  • @IgnacioVazquez-Abrams I updated the question with some code, but it's really just basic, standard code. I also added some specs on the number of files and the filename length, in case that matters. Commented Jan 10, 2014 at 15:18

2 Answers


I did some digging in the source code, and it seems that tarfile stores all files in a list of TarInfo objects (http://docs.python.org/2/library/tarfile.html#tarfile.TarFile.getmembers), which causes the ever-increasing memory footprint when there are many files with long names.

The caching of these TarInfo objects was optimized significantly in a commit from 2008 (http://bugs.python.org/issue2058), but from what I can see it was only merged into the py3k branch, i.e. Python 3.

One could reset the members list again and again, as described in http://blogs.it.ox.ac.uk/inapickle/2011/06/20/high-memory-usage-when-using-pythons-tarfile-module/, but since I'm not sure what internal tarfile functionality that would break, I went with a system-level call instead: os.system('tar -czf asdf.tar asdf/').
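A minimal sketch of that members-reset idea on the write side might look like the following (this assumes a flat-ish directory of log files, as in the question, and is not the blog post's exact code). Clearing t.members keeps memory flat, at the cost that getmembers()/getnames() on the open handle will no longer list what was already added:

import os
import tarfile

t = tarfile.open('asdf.tar.gz', 'w:gz')
for root, dirs, files in os.walk('asdf'):
    for name in files:
        t.add(os.path.join(root, name))
        # Drop the cached TarInfo objects so they don't accumulate
        # (one is appended to t.members for every file added).
        t.members = []
t.close()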


1 Comment

web.archive.org/web/20160714075947/http://blogs.it.ox.ac.uk/… In short: tar = tarfile.open(filename, 'r:gz'); [... code ...]; tar.members = [];

Two ways to solve this: if your VM does not have swap, add some and try again. I had 13 GB of files to tar into one big bundle and it was consistently failing (the OS killed the process); adding 4 GB of swap helped.

If you are using a Kubernetes pod or a Docker container, one quick workaround is to add swap on the host; with the SYS_ADMIN capability or privileged mode the container will use the host's swap.

If you need tarfile with streaming to avoid the memory usage, check out: https://gist.github.com/leth/6adb9d30f2fdcb8802532a87dfbeff77
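A minimal sketch of stream mode (my assumption of the general idea, not the gist's code): opening with 'w|gz' writes the archive strictly sequentially in fixed-size blocks, so the output never needs to be seekable. Note that stream mode alone does not shrink the per-file TarInfo cache, so the members-reset trick from the other answer still applies if that is the bottleneck:

import tarfile

# '|' selects stream mode: blocks are flushed sequentially, and the target
# could just as well be a pipe or socket passed via the fileobj argument.
tar = tarfile.open('asdf.tar.gz', mode='w|gz')
tar.add('asdf')
tar.close()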

