-
-
Notifications
You must be signed in to change notification settings - Fork 8.6k
16612 download large files #16627
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
16612 download large files #16627
Conversation
PR Compliance Guide 🔍Below is a summary of compliance checks for this PR:
Compliance status legend🟢 - Fully Compliant🟡 - Partial Compliant 🔴 - Not Compliant ⚪ - Requires Further Human Verification 🏷️ - Compliance label |
|||||||||||||||||||||||||||
PR Code Suggestions ✨Latest suggestions up to 9faf649
Previous suggestionsSuggestions up to commit e993ed2
|
|||||||||||||||||||||||||||||||||
f42d85e to
2c190cc
Compare
|
Please ping when this is ready for review. |
Sure. I think that the PR is already finished, but there are some tests failing. |
1a93cbc to
9faf649
Compare
PR Compliance Guide 🔍Below is a summary of compliance checks for this PR:
Compliance status legend🟢 - Fully Compliant🟡 - Partial Compliant 🔴 - Not Compliant ⚪ - Requires Further Human Verification 🏷️ - Compliance label |
||||||||||||||||||||||||
PR Code Suggestions ✨Explore these optional code suggestions:
|
||||||||||||||
@diemol Ping. This PR is ready for review, as well as few others. |
diemol
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the 2nd and 3rd code review suggestions are valid. What do you think?
I was thinking about it.
The whole idea is to allow quickly downloading large files from Grid without wasting CPU, I/O or memory. If we apply the suggested solution, we would also save the same file to a temporary location, thus wasting disk space and slowing down the download process. I don't see the need for it.
In theory, we could do it. But again, I don't see the need for it because error responses are always written in JSON format.
It's already done so: the binary body is not written in
Maybe it's reasonable, but this is the code I didn't touch. I prefer not to change it in this PR to avoid too big changes. |
38cbd2e to
1b5d56f
Compare
37fe5e3 to
9ad8f5d
Compare
|
Can you please resolve the conflict? |
I added a new Grid endpoint "/se/files/:name" which allows downloading the file directly, without encoding it to Base64 and adding to Json. This transformation kills the performance and causes OutOfMemory errors for large files (e.g. 256+ MB). NB! Be sure that `toString()` method of objects (HttpRequest, HttpResponse, Contents.Supplier) never returns too long string - it spam debug logs and can cause OOM during debugging.
…ier` to separate classes It makes debugging easier. You can easily see what instances they are and where they come from.
Instead of reading the whole file to a byte array, just save given InputStream directly to the file. Now it can download large files (I tried 4GB) while consuming very low memory.
… downloading files. For json responses, still return `Contents.bytes` which allows re-reading its content multiple times. Just in case.
… deleted After stopping a Grid node, the folder is deleted asynchronously (by cache removal listener). So we need to wait for it in test.
…opped At least on my machine, stopping the node takes some time, and any checks right after `node.stop(sessionId)` often can fail.
Gr... This is extremely hard to debug test. After hours of debugging, I came to a conclusion that we just need to increase the timeout. On my machine, `latch` gets decreased after ~1.2 seconds. So 1 second was not enough.
I don't know why, but sometimes we receive `HttpTimeoutException` instead of `InterruptedException`. Seems reasonable to consider execution as interrupted in both cases. (?)
None of `is.readNBytes` implementations returns -1. It always returns 0 or positive number.
…code` Don't log the entire response body - just content type and length.
9ad8f5d to
58140f5
Compare
|
Great work, thanks for everyone! |
Use file modification time for detecting if the downloading is completed. * File size + time was added to Grid API in Selenium 4.39.0: SeleniumHQ/selenium#16589 * Also, downloading large files was significantly optimized in Selenium 4.39.0: SeleniumHQ/selenium#16627
Use file modification time for detecting if the downloading is completed. * File size + time was added to Grid API in Selenium 4.39.0: SeleniumHQ/selenium#16589 * Also, downloading large files was significantly optimized in Selenium 4.39.0: SeleniumHQ/selenium#16627
User description
🔗 Related Issues
Fixes #16612
💥 What does this PR do?
This PR allows downloading very large files (I tried up to 4GB) from Selenium Grid, while consuming very low memory.
🔄 Types of changes
PR Type
Enhancement, Performance
Description
Add new Grid endpoint
/se/files/:namefor direct file downloads without Base64 encodingOptimize
RemoteWebDriver.downloadFile()to stream files directly instead of loading into memoryExtract anonymous
Contents.Supplierimplementations into separate named classes for better debuggingImprove HTTP response handling to support binary streams and large file transfers
Fix flaky tests by adding proper wait conditions and increased timeouts
Diagram Walkthrough
File Walkthrough
20 files
Add new GET endpoint for direct file downloadsImplement direct file streaming and URL parsing logicOptimize file download to stream directly to diskAdd file and stream content suppliers, deprecate memory-intensivemethodsNew supplier for streaming file content without loading to memoryNew supplier for binary stream content with length trackingExtract bytes supplier to separate class for clarityNew supplier for file-backed output stream contentSwitch to InputStream response handler for large file supportHandle binary streams separately from JSON responsesAdd binary stream detection and avoid loading large files to memoryAdd new GET_DOWNLOADED_FILE command mappingAdd GET_DOWNLOADED_FILE command constantAdd toString method to prevent debug log spamAdd toString method to prevent debug log spamAdd URL decoding utility methodUse atomic long for thread-safe length trackingAdd toString and contentAsString methodsImprove toString to include content informationImprove toString to avoid loading large content5 files
Add wait conditions for async directory deletionIncrease timeout to fix flaky testAdd wait conditions and test for filename extractionHandle HttpTimeoutException in interrupt detectionRemove debug logging statement3 files
Add jspecify dependencyAdd guava dependency for media type handlingAdd test logs upload on failure2 files