🐛 Describe the bug
Now that AWSSDK is integrated with TorchData, we have two categories of `DataPipe`s to access and load data from an AWS S3 bucket:
- `DataPipe` using `fsspec`: it relies on the `s3fs` module to list/load data from an S3 bucket.
- `DataPipe` using `AWSSDK`: it relies on pybind bindings from the `AWSSDK_CPP` module.
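For context, here is a minimal sketch of the two equivalent pipelines. The constructors and arguments are the same ones used in the benchmark scripts below; the `region` and `anon` settings are what I use for this public bucket.

```python
from torchdata.datapipes.iter import (
    IterableWrapper,
    S3FileLister, S3FileLoader,          # AWSSDK-based (pybind over AWSSDK_CPP)
    FSSpecFileLister, FSSpecFileOpener,  # fsspec-based (delegates to s3fs for s3:// URLs)
)

s3_path = "s3://ai2-public-datasets/charades"

# AWSSDK path: list keys, then load each object through the native extension.
sdk_dp = S3FileLoader(
    S3FileLister(IterableWrapper([s3_path]), region="us-west-2"), region="us-west-2"
)

# fsspec path: same shape, but listing/opening go through s3fs.
fsspec_dp = FSSpecFileOpener(
    FSSpecFileLister(IterableWrapper([s3_path]), anon=True), mode="rb", anon=True
)
```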
I want to carry out a performance comparison of the Lister and Opener/Loader between these two approaches.
- For Listers, I used the same root path `"s3://ai2-public-datasets/charades"` and validated that they returned the same values during iteration (a sketch of that check follows the results below).
Testing script
```python
import numpy as np
import timeit

s3_path = "s3://ai2-public-datasets/charades"


def s3_fl_time():
    SETUP_CODE = """
from torchdata.datapipes.iter import IterableWrapper, S3FileLister
from __main__ import s3_path
dp = S3FileLister(IterableWrapper([s3_path]), region="us-west-2")
"""
    TEST_CODE = """
_ = list(dp)
"""
    times = timeit.repeat(setup=SETUP_CODE, stmt=TEST_CODE, repeat=10, number=5)
    print(f"S3FileLister: Mean({np.average(times)}), STD({np.std(times)})")


def fsspec_fl_time():
    SETUP_CODE = """
from torchdata.datapipes.iter import IterableWrapper, FSSpecFileLister
from __main__ import s3_path
dp = FSSpecFileLister(IterableWrapper([s3_path]), anon=True)
"""
    TEST_CODE = """
_ = list(dp)
"""
    times = timeit.repeat(setup=SETUP_CODE, stmt=TEST_CODE, repeat=10, number=5)
    print(f"FSSpecFileLister: Mean({np.average(times)}), STD({np.std(times)})")


if __name__ == "__main__":
    s3_fl_time()
    fsspec_fl_time()
```

And the result is:
```
S3FileLister: Mean(1.7595681754999994), STD(0.20364943594288445)
FSSpecFileLister: Mean(0.19180457339999962), STD(0.5630912985701465)
```
The FSSpecFileLister is roughly 9x faster than S3FileLister.
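For reference, a minimal sketch of how the "same values during iteration" check can be done (my own check, not part of the benchmark; the prefix normalization is only a precaution in case one lister omits the `s3://` scheme):

```python
from torchdata.datapipes.iter import IterableWrapper, S3FileLister, FSSpecFileLister

s3_path = "s3://ai2-public-datasets/charades"

s3_urls = set(S3FileLister(IterableWrapper([s3_path]), region="us-west-2"))
fsspec_urls = set(FSSpecFileLister(IterableWrapper([s3_path]), anon=True))


def normalize(url):
    # Normalize in case one lister returns bucket/key without the scheme prefix.
    return url if url.startswith("s3://") else f"s3://{url}"


assert {normalize(u) for u in s3_urls} == {normalize(u) for u in fsspec_urls}
print(f"Both listers returned the same {len(s3_urls)} entries.")
```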
- Due to the different behaviors of `S3FileLoader` and `FSSpecFileOpener`, besides simply iterating over these two `DataPipe`s, I also carried out an extra experiment that adds a `read` call on the file object returned by each `DataPipe`. I only used two datasets hosted on the S3 bucket for testing, simply to save time running the tests.
Testing script
```python
import numpy as np
import timeit

s3_file_path = ["s3://ai2-public-datasets/charades/Charades.zip", "s3://ai2-public-datasets/charades/CharadesEgo.zip"]


def s3_fo_time():
    SETUP_CODE = """
from torchdata.datapipes.iter import IterableWrapper, S3FileLister, S3FileLoader
from __main__ import s3_file_path
dp = S3FileLoader(S3FileLister(IterableWrapper(s3_file_path), region="us-west-2"), region="us-west-2")
"""
    TEST_CODE = """
_ = list(dp)
"""
    times = timeit.repeat(setup=SETUP_CODE, stmt=TEST_CODE, repeat=10, number=5)
    print(f"S3FileLoader: Mean({np.average(times)}), STD({np.std(times)})")


def fsspec_fo_time():
    SETUP_CODE = """
from torchdata.datapipes.iter import IterableWrapper, FSSpecFileLister, FSSpecFileOpener
from __main__ import s3_file_path
dp = FSSpecFileOpener(FSSpecFileLister(IterableWrapper(s3_file_path), anon=True), mode="rb", anon=True)
"""
    TEST_CODE = """
_ = list(dp)
"""
    times = timeit.repeat(setup=SETUP_CODE, stmt=TEST_CODE, repeat=10, number=5)
    print(f"FSSpecFileOpener: Mean({np.average(times)}), STD({np.std(times)})")


def s3_fo_read_time():
    SETUP_CODE = """
from torchdata.datapipes.iter import IterableWrapper, S3FileLister, S3FileLoader
from __main__ import s3_file_path
dp = S3FileLoader(S3FileLister(IterableWrapper(s3_file_path), region="us-west-2"), region="us-west-2").map(lambda x: x.read(), input_col=1)
"""
    TEST_CODE = """
_ = list(dp)
"""
    times = timeit.repeat(setup=SETUP_CODE, stmt=TEST_CODE, repeat=10, number=5)
    print(f"S3FileLoader: Mean({np.average(times)}), STD({np.std(times)})")


def fsspec_fo_read_time():
    SETUP_CODE = """
from torchdata.datapipes.iter import IterableWrapper, FSSpecFileLister, FSSpecFileOpener
from __main__ import s3_file_path
dp = FSSpecFileOpener(FSSpecFileLister(IterableWrapper(s3_file_path), anon=True), mode="rb", anon=True).map(lambda x: x.read(), input_col=1)
"""
    TEST_CODE = """
_ = list(dp)
"""
    times = timeit.repeat(setup=SETUP_CODE, stmt=TEST_CODE, repeat=10, number=5)
    print(f"FSSpecFileOpener: Mean({np.average(times)}), STD({np.std(times)})")


if __name__ == "__main__":
    s3_fo_time()
    fsspec_fo_time()
    s3_fo_read_time()
    fsspec_fo_read_time()
```

And the result is:
```
# Without `read`
S3FileLoader: Mean(23.793047750200007), STD(5.782844565863793)
FSSpecFileOpener: Mean(2.461926894699997), STD(0.34594020726696345)

# With `read`
S3FileLoader: Mean(31.570115949799998), STD(5.767492995195747)
FSSpecFileOpener: Mean(25.325279079399998), STD(5.052614560529884)
```
Comparing the results without `read`, I believe `S3FileLoader` eagerly downloads the data during iteration, while `FSSpecFileOpener` does not read data from the remote store until `read` is called. So it makes more sense to compare the two `DataPipe`s with the `read` operation attached. Even then, `FSSpecFileOpener` still beats `S3FileLoader` by about 25% (25.3 s vs. 31.6 s).
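To sanity-check that interpretation, here is a small probe I would use (a sketch, not part of the benchmark above): it times how long each pipeline takes to hand back the first `(url, stream)` pair versus how long the subsequent `read()` takes. If `S3FileLoader` downloads eagerly, most of its time should show up in the first step; for `FSSpecFileOpener` it should show up in `read()`.

```python
import timeit

from torchdata.datapipes.iter import (
    IterableWrapper, S3FileLister, S3FileLoader, FSSpecFileLister, FSSpecFileOpener,
)

s3_file_path = ["s3://ai2-public-datasets/charades/Charades.zip"]

sdk_dp = S3FileLoader(
    S3FileLister(IterableWrapper(s3_file_path), region="us-west-2"), region="us-west-2"
)
fsspec_dp = FSSpecFileOpener(
    FSSpecFileLister(IterableWrapper(s3_file_path), anon=True), mode="rb", anon=True
)

for name, dp in [("S3FileLoader", sdk_dp), ("FSSpecFileOpener", fsspec_dp)]:
    t0 = timeit.default_timer()
    url, stream = next(iter(dp))   # time to produce the first (url, stream) pair
    t1 = timeit.default_timer()
    data = stream.read()           # time to actually pull the bytes
    t2 = timeit.default_timer()
    print(f"{name}: open={t1 - t0:.2f}s, read={t2 - t1:.2f}s, bytes={len(data)}")
```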
Given this performance gap with AWSSDK, it is hard for me to recommend that users use the native S3FileLister or S3FileLoader.
cc: @ydaiming
Versions
main branch
I only executed these scripts on my Mac, as our AWS cluster doesn't allow me to access S3.