Performance Comparison between native AWSSDK and FSSpec (boto3) based DataPipes #500

@ejguan

Description

🐛 Describe the bug

Now that AWSSDK has been integrated into TorchData, we have two categories of DataPipes to access and load data from an AWS S3 bucket:

  1. DataPipe using fsspec: It relies on the s3fs module to list/load data from an S3 bucket.
  2. DataPipe using AWSSDK: It relies on pybind11 bindings over the aws-sdk-cpp library.

I want to carry out a performance comparison of the Listers and the Openers/Loaders between these two approaches.

  • For the Listers, I used the same root path of "s3://ai2-public-datasets/charades" for both and validated that they returned the same values during iteration.
Testing script
import numpy as np
import timeit

s3_path = "s3://ai2-public-datasets/charades"

def s3_fl_time():
    SETUP_CODE = """
from torchdata.datapipes.iter import IterableWrapper, S3FileLister
from __main__ import s3_path
dp = S3FileLister(IterableWrapper([s3_path]), region="us-west-2")
"""
    TEST_CODE = """
_ = list(dp)
"""
    times = timeit.repeat(setup=SETUP_CODE, stmt=TEST_CODE, repeat=10, number=5)
    print(f"S3FileLister: Mean({np.average(times)}), STD({np.std(times)})")

def fsspec_fl_time():
    SETUP_CODE = """
from torchdata.datapipes.iter import IterableWrapper, FSSpecFileLister
from __main__ import s3_path
dp = FSSpecFileLister(IterableWrapper([s3_path]), anon=True)
"""
    TEST_CODE = """
_ = list(dp)
"""
    times = timeit.repeat(setup=SETUP_CODE, stmt=TEST_CODE, repeat=10, number=5)
    print(f"FSSpecFileLister: Mean({np.average(times)}), STD({np.std(times)})")

if __name__ == "__main__":
    s3_fl_time()
    fsspec_fl_time()
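The "returned the same values during iteration" validation isn't shown in the script above. A minimal sketch of how such a check could be done (with plain lists standing in for the two Lister pipes, since the backends may differ in URL scheme prefix and iteration order):

```python
# Hypothetical helper: compare two listings ignoring ordering and scheme prefix.
# In the real experiment, the inputs would be the S3FileLister and
# FSSpecFileLister DataPipes over "s3://ai2-public-datasets/charades".
def same_listing(lister_a, lister_b):
    normalize = lambda url: url.split("://", 1)[-1]  # strip "s3://" if present
    return sorted(normalize(u) for u in lister_a) == sorted(
        normalize(u) for u in lister_b
    )

lister_a = ["s3://bucket/a.zip", "s3://bucket/b.zip"]
lister_b = ["bucket/b.zip", "bucket/a.zip"]
print(same_listing(lister_a, lister_b))  # True
```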

And the result is:

S3FileLister: Mean(1.7595681754999994), STD(0.20364943594288445)
FSSpecFileLister: Mean(0.19180457339999962), STD(0.5630912985701465)

The FSSpecFileLister is roughly 9x faster than S3FileLister.

  • Because S3FileLoader and FSSpecFileOpener behave differently, in addition to simply iterating over the two DataPipes, I carried out an extra experiment that also reads from the files they return. To keep the test runs short, I used only two datasets hosted on the S3 bucket.
Testing script
import numpy as np
import timeit

s3_file_path = ["s3://ai2-public-datasets/charades/Charades.zip", "s3://ai2-public-datasets/charades/CharadesEgo.zip"]

def s3_fo_time():
    SETUP_CODE = """
from torchdata.datapipes.iter import IterableWrapper, S3FileLister, S3FileLoader
from __main__ import s3_file_path
dp = S3FileLoader(S3FileLister(IterableWrapper(s3_file_path), region="us-west-2"), region="us-west-2")
"""
    TEST_CODE = """
_ = list(dp)
"""
    times = timeit.repeat(setup=SETUP_CODE, stmt=TEST_CODE, repeat=10, number=5)
    print(f"S3FileLoader: Mean({np.average(times)}), STD({np.std(times)})")

def fsspec_fo_time():
    SETUP_CODE = """
from torchdata.datapipes.iter import IterableWrapper, FSSpecFileLister, FSSpecFileOpener
from __main__ import s3_file_path
dp = FSSpecFileOpener(FSSpecFileLister(IterableWrapper(s3_file_path), anon=True), mode="rb", anon=True)
"""
    TEST_CODE = """
_ = list(dp)
"""
    times = timeit.repeat(setup=SETUP_CODE, stmt=TEST_CODE, repeat=10, number=5)
    print(f"FSSpecFileOpener: Mean({np.average(times)}), STD({np.std(times)})")

def s3_fo_read_time():
    SETUP_CODE = """
from torchdata.datapipes.iter import IterableWrapper, S3FileLister, S3FileLoader
from __main__ import s3_file_path
dp = S3FileLoader(S3FileLister(IterableWrapper(s3_file_path), region="us-west-2"), region="us-west-2").map(lambda x: x.read(), input_col=1)
"""
    TEST_CODE = """
_ = list(dp)
"""
    times = timeit.repeat(setup=SETUP_CODE, stmt=TEST_CODE, repeat=10, number=5)
    print(f"S3FileLoader: Mean({np.average(times)}), STD({np.std(times)})")

def fsspec_fo_read_time():
    SETUP_CODE = """
from torchdata.datapipes.iter import IterableWrapper, FSSpecFileLister, FSSpecFileOpener
from __main__ import s3_file_path
dp = FSSpecFileOpener(FSSpecFileLister(IterableWrapper(s3_file_path), anon=True), mode="rb", anon=True).map(lambda x: x.read(), input_col=1)
"""
    TEST_CODE = """
_ = list(dp)
"""
    times = timeit.repeat(setup=SETUP_CODE, stmt=TEST_CODE, repeat=10, number=5)
    print(f"FSSpecFileOpener: Mean({np.average(times)}), STD({np.std(times)})")

if __name__ == "__main__":
    s3_fo_time()
    fsspec_fo_time()
    s3_fo_read_time()
    fsspec_fo_read_time()

And the result is:

# Without `read`
S3FileLoader: Mean(23.793047750200007), STD(5.782844565863793)
FSSpecFileOpener: Mean(2.461926894699997), STD(0.34594020726696345)
# With `read`
S3FileLoader: Mean(31.570115949799998), STD(5.767492995195747)
FSSpecFileOpener: Mean(25.325279079399998), STD(5.052614560529884)

Comparing the results without read, I believe S3FileLoader eagerly downloads the data, while FSSpecFileOpener doesn't read from the remote store until read is called. So it makes more sense to compare the two DataPipes with the read operation attached. Even then, FSSpecFileOpener still beats S3FileLoader by about 25%.
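One caveat about the numbers: timeit.repeat(..., repeat=10, number=5) returns ten totals, each covering five executions of the statement, so every mean reported above is the time for five iterations rather than one. The relative comparisons still hold because both pipelines use identical settings, but per-iteration figures require dividing by number. A self-contained illustration (with a local statement standing in for the S3 pipelines):

```python
import timeit

number, repeat = 5, 10
# Each entry in `times` is the TOTAL time for `number` executions.
times = timeit.repeat(stmt="sum(range(1000))", repeat=repeat, number=number)
# The timeit docs suggest min() over the repeats as the least noisy figure;
# dividing by `number` yields the per-iteration time.
per_iteration = min(times) / number
print(f"per-iteration: {per_iteration:.2e}s")
```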

Given this performance regression with AWSSDK, it is hard for me to recommend the native S3FileLister or S3FileLoader to users.

cc: @ydaiming

Versions

main branch

I only executed these scripts on my Mac, as our AWS cluster doesn't allow me to access S3.
