-
-
Notifications
You must be signed in to change notification settings - Fork 33.7k
bpo-34043: Optimize tarfile uncompress performance #8089
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
tarfile._Stream has two buffer for compressed and uncompressed data. Those buffers are not aligned so unnecessary bytes slicing happens for every reading chunks. This commit bypass compressed buffering. In this benchmark [1], user time become 250ms from 300ms. [1]: https://bugs.python.org/msg320763
vstinner
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, just a minor nitpick.
Lib/tarfile.py
Outdated
| # Skip underlaying buffer to avoid unaligned double | ||
| # buffering. | ||
| if self.buf: | ||
| buf, self.buf = self.buf, b"" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick: I would prefer to do that on two lines, but it's just a matter of taste.
vstinner
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. I just saw a typo in a comment.
Lib/tarfile.py
Outdated
| buf = self.__read(self.bufsize) | ||
| if not buf: | ||
| break | ||
| # Skip underlaying buffer to avoid unaligned double |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo? underlying?
| If size is not defined, return all bytes of the stream | ||
| up to EOF. | ||
| """ | ||
| if size is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have only one question. Why this branch was removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This issue is follow up of bpo-34010, (GH-8020).
See also, https://bugs.python.org/issue34010#msg321040
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you. Instances of _Stream are never leaked to the end user? Then this change LGTM.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It can be accessed via TarFile.fileobj.
But using it directly is not pragramatic, and TarFile.fileobj.read() caused TypeError because of "".join() bug.
So it must not used for previous versions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I saw that the class is private, but I didn't notice that TarFile.fileobj is public. Maybe just replace the assertion with a regular if/raise, just in case? Maybe raise a NotImplementedError?
tarfile._Stream has two buffer for compressed and uncompressed data.
Those buffers are not aligned so unnecessary bytes slicing happens
for every reading chunks.
This commit bypass compressed buffering.
In this benchmark 1, user time become 250ms from 300ms.
https://bugs.python.org/issue34043