Performance Comparison between native AWSSDK and FSSpec (boto3) based DataPipes #500

@ejguan

Description

🐛 Describe the bug

Now that AWSSDK has been integrated into TorchData, we have two categories of DataPipes to access and load data from an AWS S3 bucket:

  1. DataPipe using fsspec: It relies on the s3fs module to list/load data from an S3 bucket.
  2. DataPipe using AWSSDK: It relies on pybind11 bindings over the aws-sdk-cpp library.

I want to carry out a performance comparison of the Listers and the Openers/Loaders between these two approaches.

  • For the Listers, I used the same root path of "s3://ai2-public-datasets/charades" for both and validated that they returned the same values during iteration.
Testing script
import numpy as np
import timeit

s3_path = "s3://ai2-public-datasets/charades"

def s3_fl_time():
    SETUP_CODE = """
from torchdata.datapipes.iter import IterableWrapper, S3FileLister
from __main__ import s3_path
dp = S3FileLister(IterableWrapper([s3_path]), region="us-west-2")
"""
    TEST_CODE = """
_ = list(dp)
"""
    times = timeit.repeat(setup=SETUP_CODE, stmt=TEST_CODE, repeat=10, number=5)
    print(f"S3FileLister: Mean({np.average(times)}), STD({np.std(times)})")

def fsspec_fl_time():
    SETUP_CODE = """
from torchdata.datapipes.iter import IterableWrapper, FSSpecFileLister
from __main__ import s3_path
dp = FSSpecFileLister(IterableWrapper([s3_path]), anon=True)
"""
    TEST_CODE = """
_ = list(dp)
"""
    times = timeit.repeat(setup=SETUP_CODE, stmt=TEST_CODE, repeat=10, number=5)
    print(f"FSSpecFileLister: Mean({np.average(times)}), STD({np.std(times)})")

if __name__ == "__main__":
    s3_fl_time()
    fsspec_fl_time()
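The "returned the same values during iteration" validation isn't shown in the script above. A minimal sketch of how such a check could be done (with plain lists standing in for the two Lister pipes, since the backends may differ in URL scheme prefix and iteration order):

```python
# Hypothetical helper: compare two listings ignoring ordering and scheme prefix.
# In the real experiment, the inputs would be the S3FileLister and
# FSSpecFileLister DataPipes over "s3://ai2-public-datasets/charades".
def same_listing(lister_a, lister_b):
    normalize = lambda url: url.split("://", 1)[-1]  # strip "s3://" if present
    return sorted(normalize(u) for u in lister_a) == sorted(
        normalize(u) for u in lister_b
    )

lister_a = ["s3://bucket/a.zip", "s3://bucket/b.zip"]
lister_b = ["bucket/b.zip", "bucket/a.zip"]
print(same_listing(lister_a, lister_b))  # True
```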

And the result is:

S3FileLister: Mean(1.7595681754999994), STD(0.20364943594288445)
FSSpecFileLister: Mean(0.19180457339999962), STD(0.5630912985701465)

The FSSpecFileLister is roughly 9x faster than S3FileLister.

  • Because S3FileLoader and FSSpecFileOpener behave differently, in addition to simply iterating over the two DataPipes, I carried out an extra experiment that also reads from the files they return. To keep the test runs short, I used only two datasets hosted on the S3 bucket.
Testing script
import numpy as np
import timeit

s3_file_path = ["s3://ai2-public-datasets/charades/Charades.zip", "s3://ai2-public-datasets/charades/CharadesEgo.zip"]

def s3_fo_time():
    SETUP_CODE = """
from torchdata.datapipes.iter import IterableWrapper, S3FileLister, S3FileLoader
from __main__ import s3_file_path
dp = S3FileLoader(S3FileLister(IterableWrapper(s3_file_path), region="us-west-2"), region="us-west-2")
"""
    TEST_CODE = """
_ = list(dp)
"""
    times = timeit.repeat(setup=SETUP_CODE, stmt=TEST_CODE, repeat=10, number=5)
    print(f"S3FileLoader: Mean({np.average(times)}), STD({np.std(times)})")

def fsspec_fo_time():
    SETUP_CODE = """
from torchdata.datapipes.iter import IterableWrapper, FSSpecFileLister, FSSpecFileOpener
from __main__ import s3_file_path
dp = FSSpecFileOpener(FSSpecFileLister(IterableWrapper(s3_file_path), anon=True), mode="rb", anon=True)
"""
    TEST_CODE = """
_ = list(dp)
"""
    times = timeit.repeat(setup=SETUP_CODE, stmt=TEST_CODE, repeat=10, number=5)
    print(f"FSSpecFileOpener: Mean({np.average(times)}), STD({np.std(times)})")

def s3_fo_read_time():
    SETUP_CODE = """
from torchdata.datapipes.iter import IterableWrapper, S3FileLister, S3FileLoader
from __main__ import s3_file_path
dp = S3FileLoader(S3FileLister(IterableWrapper(s3_file_path), region="us-west-2"), region="us-west-2").map(lambda x: x.read(), input_col=1)
"""
    TEST_CODE = """
_ = list(dp)
"""
    times = timeit.repeat(setup=SETUP_CODE, stmt=TEST_CODE, repeat=10, number=5)
    print(f"S3FileLoader: Mean({np.average(times)}), STD({np.std(times)})")

def fsspec_fo_read_time():
    SETUP_CODE = """
from torchdata.datapipes.iter import IterableWrapper, FSSpecFileLister, FSSpecFileOpener
from __main__ import s3_file_path
dp = FSSpecFileOpener(FSSpecFileLister(IterableWrapper(s3_file_path), anon=True), mode="rb", anon=True).map(lambda x: x.read(), input_col=1)
"""
    TEST_CODE = """
_ = list(dp)
"""
    times = timeit.repeat(setup=SETUP_CODE, stmt=TEST_CODE, repeat=10, number=5)
    print(f"FSSpecFileOpener: Mean({np.average(times)}), STD({np.std(times)})")

if __name__ == "__main__":
    s3_fo_time()
    fsspec_fo_time()
    s3_fo_read_time()
    fsspec_fo_read_time()

And the result is:

# Without `read`
S3FileLoader: Mean(23.793047750200007), STD(5.782844565863793)
FSSpecFileOpener: Mean(2.461926894699997), STD(0.34594020726696345)
# With `read`
S3FileLoader: Mean(31.570115949799998), STD(5.767492995195747)
FSSpecFileOpener: Mean(25.325279079399998), STD(5.052614560529884)

Comparing the results without read, I believe S3FileLoader eagerly downloads the data, while FSSpecFileOpener doesn't read from the remote store until read is called. So it makes more sense to compare the two DataPipes with the read operation attached. Even then, FSSpecFileOpener still beats S3FileLoader by about 25%.
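One caveat about the numbers: timeit.repeat(..., repeat=10, number=5) returns ten totals, each covering five executions of the statement, so every mean reported above is the time for five iterations rather than one. The relative comparisons still hold because both pipelines use identical settings, but per-iteration figures require dividing by number. A self-contained illustration (with a local statement standing in for the S3 pipelines):

```python
import timeit

number, repeat = 5, 10
# Each entry in `times` is the TOTAL time for `number` executions.
times = timeit.repeat(stmt="sum(range(1000))", repeat=repeat, number=number)
# The timeit docs suggest min() over the repeats as the least noisy figure;
# dividing by `number` yields the per-iteration time.
per_iteration = min(times) / number
print(f"per-iteration: {per_iteration:.2e}s")
```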

Given this performance regression with AWSSDK, it is hard for me to recommend the native S3FileLister or S3FileLoader to users.

cc: @ydaiming

Versions

main branch

I only executed these scripts on my Mac, as our AWS cluster doesn't allow me to access S3.
