clear memory usages of the processed files in occ files:checksums:verify #36787

Merged
karakayasemi merged 1 commit into master from optimize-verify-checksums
Jan 24, 2020

Conversation

@karakayasemi
Contributor

@karakayasemi karakayasemi commented Jan 20, 2020

Description

We are calling the recursive walkNodes function for each user when calculating file checksums.

private function walkNodes(array $nodes, \Closure $callBack) {

In recursive calls, memory is not freed until the root call finishes. Because of that, the worst-case memory footprint of the current approach is the combined size of all of a user's node objects. If the average node object size is 1 KB, a user with 1 million files exceeds a 1 GB PHP memory limit.

As an easy optimization, this PR first calculates the checksums of the files in a directory and then unsets the processed file and folder objects. The remaining limitation is the getDirectoryListing() method, which returns the node object list of a given directory. This means that if a user has 1 million files in a single folder (without nesting), the 1 GB PHP memory limit is still exceeded.

However, since the getDirectoryListing() method already has other callers, I don't think we need to optimize it here. Reducing the worst-case memory complexity to the size of a single getDirectoryListing() result should be sufficient.
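The per-directory idea can be sketched roughly like this (a simplified illustration only, not the actual ownCloud code: plain arrays stand in for folder nodes and strings for file nodes):

```php
<?php
// Simplified sketch of the per-directory approach. Arrays stand in for
// folder nodes and strings for file nodes; the real code works on the
// node objects returned by getDirectoryListing().
function walkNodes(array $nodes, \Closure $callBack) {
    foreach ($nodes as $key => $node) {
        if (is_array($node)) {
            // "Folder": recurse first; its listing becomes unreachable
            // once the recursive call returns.
            walkNodes($node, $callBack);
        } else {
            // "File": verify its checksum (here: invoke the callback).
            $callBack($node);
        }
        // Drop the reference to the processed node so PHP can reclaim it,
        // instead of keeping every node alive until the root call ends.
        unset($nodes[$key]);
    }
}
```

With this shape, at most one directory listing per tree level is held in memory at a time, which is the worst case the description above reduces to.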

The PR also improves the command's informational messages by showing the user currently being processed and the result of the command run.

Related Issue

Motivation and Context

How Has This Been Tested?

  • Create 10000 random files in the root of a user's home directory
  • Create a folder and another 10000 random files inside the new folder
  • Create 30000 randomly located folders in the user's home directory
  • Run occ files:checksums:verify and measure peak memory usage with PHP's memory_get_peak_usage() function

I applied the above scenario on the current master and on the PR's branch. The results confirm the improvement: memory usage is much lower on the PR's branch.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Database schema changes (next release will require increase of minor version instead of patch)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Technical debt
  • Tests only (no source changes)

Checklist:

  • Code changes
  • Unit tests added
  • Acceptance tests added
  • Documentation ticket raised:
  • Changelog item, see TEMPLATE

@update-docs

update-docs bot commented Jan 20, 2020

Thanks for opening this pull request! The maintainers of this repository would appreciate it if you would create a changelog item based on your changes.

@karakayasemi karakayasemi self-assigned this Jan 20, 2020
@karakayasemi karakayasemi added this to the development milestone Jan 20, 2020
@karakayasemi karakayasemi requested a review from cdamken January 20, 2020 15:08
@codecov

codecov bot commented Jan 20, 2020

Codecov Report

Merging #36787 into master will increase coverage by 0.54%.
The diff coverage is 86.66%.

@@             Coverage Diff              @@
##             master   #36787      +/-   ##
============================================
+ Coverage     64.14%   64.68%   +0.54%     
- Complexity    19121    19126       +5     
============================================
  Files          1269     1269              
  Lines         74743    74789      +46     
  Branches       1320     1320              
============================================
+ Hits          47946    48380     +434     
+ Misses        26409    26021     -388     
  Partials        388      388
Flag Coverage Δ Complexity Δ
#javascript 54.12% <ø> (ø) 0 <ø> (ø) ⬇️
#phpunit 65.86% <86.66%> (+0.6%) 19126 <13> (+5) ⬆️
Impacted Files Coverage Δ Complexity Δ
apps/files/lib/Command/VerifyChecksums.php 89.24% <86.66%> (-2.32%) 27 <13> (+4)
apps/files/lib/Command/Scan.php 71.92% <0%> (-9.75%) 61% <0%> (ø)
lib/private/Files/Cache/HomePropagator.php 77.77% <0%> (-9.73%) 3% <0%> (ø)
apps/files_sharing/lib/External/Manager.php 90.05% <0%> (+0.05%) 33% <0%> (+1%) ⬆️
lib/private/Files/View.php 84.83% <0%> (+0.29%) 389% <0%> (ø) ⬇️
lib/private/DB/QueryBuilder/QueryBuilder.php 91.34% <0%> (+0.48%) 68% <0%> (ø) ⬇️
apps/dav/lib/Connector/Sabre/File.php 84.19% <0%> (+0.64%) 115% <0%> (ø) ⬇️
lib/private/DB/Connection.php 67.64% <0%> (+0.73%) 49% <0%> (ø) ⬇️
lib/private/DB/MDB2SchemaWriter.php 94.79% <0%> (+1.04%) 34% <0%> (ø) ⬇️
lib/private/DB/MDB2SchemaManager.php 91.22% <0%> (+1.75%) 17% <0%> (ø) ⬇️
... and 24 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c0da7c6...f4ea8df. Read the comment docs.

@jvillafanez
Member

It seems there is a return statement missing in the walkNodes function...
I don't really see any advantage in that code. I mean, you free the memory for the files but not for the directories. If instead of 10k files you have 10k folders, the memory consumption will remain the same.

@karakayasemi karakayasemi force-pushed the optimize-verify-checksums branch from f29f4e0 to 6de348d Compare January 21, 2020 16:20
@karakayasemi
Contributor Author

It seems there is a return statement missing in the walkNodes function...

There was a copy-paste mistake in the code; I have fixed it now. As before, walkNodes is a void function.

I don't really see any advantage in that code. I mean, you free the memory for the files but not for the directories. If instead of 10k files you have 10k folders, the memory consumption will remain the same.

If all the files are in the same directory at the same level, there will be no advantage, or, as you said, if the whole tree consists of folders, there will be no advantage.
However, I expect the number of files to be much greater than the number of folders. In my approach, when the recursive function branches into a new folder, it first completes the operations on that folder's child file nodes and then clears them from memory. This way, it holds at most one folder's child file nodes in memory. Currently, we never clear processed files from memory until all of the user's files have been processed.

@karakayasemi karakayasemi marked this pull request as ready for review January 22, 2020 07:38
@jvillafanez
Member

Any easy chance to get rid of the recursion and use a queue (or similar) instead? The idea would be:

  1. Start with a queue containing some items (this would be the $nodes). Probably we'd just need the folders; the files could have been processed before.
  2. For the current item (assuming it's a folder), list its contents.
  3. For each of those contents, if it's a file, process it; if not, add it to the queue.
  4. Remove the current item from the queue, move to the next item, and go to step 2.

The problem is that I haven't found a "valid" implementation for the queue, mainly because I had trouble freeing the memory of the iterator at step 4. Worst case, we might need to implement our own queue, which doesn't seem worthwhile.

If we need many changes for the iterator approach, I think we can use the solution with the recursion.
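The proposal above could look roughly like this (an illustrative sketch only: plain arrays stand in for folder nodes, strings for file nodes, and processTree() is a made-up name, not ownCloud API):

```php
<?php
// Sketch of the iterative, queue-based traversal. Arrays stand in for
// folder nodes and strings for file nodes; processTree() is a made-up name.
function processTree(array $rootNodes, \Closure $processFile) {
    $pending = [$rootNodes];               // step 1: start with the root listing
    while (!empty($pending)) {
        $folder = array_pop($pending);     // take the current item (LIFO here)
        foreach ($folder as $node) {       // step 2: list its contents
            if (is_array($node)) {
                $pending[] = $node;        // step 3: remember sub-folders
            } else {
                $processFile($node);       // step 3: process files directly
            }
        }
        // step 4: $folder is overwritten on the next iteration, so its
        // listing can be freed; only pending folder listings stay in memory.
    }
}
```

Note this sketch uses array_pop(), i.e. LIFO order; array_shift() would give the FIFO order described in the steps, at the cost of re-indexing the array on each call.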

@karakayasemi karakayasemi force-pushed the optimize-verify-checksums branch 3 times, most recently from 1bd6785 to 5bf5b0d Compare January 23, 2020 07:11
@karakayasemi
Contributor Author

karakayasemi commented Jan 23, 2020

I re-implemented the recursive method iteratively. In addition, folders are now stored in an array-based queue, and processed folders are popped from the queue. In this way, we optimize memory usage for folders as well. I added 30000 folders in random locations to the test scenario. The peak memory usage is better than before, as expected.
@jvillafanez please take a look at the code one more time. Thanks.

@jvillafanez
Member

Could you double-check that the traversal order makes sense? I initially expected to use a FIFO queue instead of a LIFO one, since that would traverse the tree level by level (if I haven't made a mistake). With the current code, I think you'll check the first level, then check the last folder's contents, and from there go deeper with the "next" last folder's contents... which sounds odd.

I don't really think this is important, so if it isn't easy to fix, there is no need to change anything. Lower memory usage takes precedence over this.
Note that array_shift will re-index the array, which could take a lot of time, so that isn't a good solution.

@karakayasemi
Contributor Author

Note that array_shift will re-index the array, which could take a lot of time, so that isn't a good solution.

That's why I used a stack (array_pop()) instead of a queue (array_shift()). I initially intended to implement it as you said, but array_pop() is much more efficient in terms of time complexity (O(1) vs. O(N)).

The currently implemented traversal is depth-first search. If we use a queue instead of a stack, it turns into breadth-first search. However, if you visit all nodes, the time complexity is the same for both; there is almost no difference between them for our case.
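The cost difference can be illustrated with plain PHP arrays (an illustration only): array_pop() removes the last element in O(1), while array_shift() removes the first element and then re-indexes everything that remains, making each call O(N).

```php
<?php
// array_pop() vs array_shift(): both remove one element, but array_shift()
// re-indexes the whole remaining array. Draining an N-element array is
// therefore O(N) total with array_pop() but O(N^2) total with array_shift().
$stack = range(1, 5);          // [1, 2, 3, 4, 5]
$last  = array_pop($stack);    // O(1): removes and returns 5
$queue = range(1, 5);
$first = array_shift($queue);  // O(N): removes and returns 1, re-indexes the rest
```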

@jvillafanez
Member

Currently implemented searching algorithm is the depth-first search algorithm

Not entirely true. We'd need to add the files to the queue too for the algorithm to work as expected. Right now, files are being processed while we're still further down the tree.

In any case, we need to process all the files, so the order doesn't really matter. It's more important to keep the memory usage as low as possible, and the current algorithm does a good job in that regard.

@jvillafanez
Member

Taking into account that @karakayasemi says that the memory usage has gone down, I assume this has been tested (dev-wise) and it still works as expected.

@karakayasemi karakayasemi force-pushed the optimize-verify-checksums branch from 5bf5b0d to f4ea8df Compare January 23, 2020 14:39
@karakayasemi karakayasemi merged commit 5812ff6 into master Jan 24, 2020
@delete-merged-branch delete-merged-branch bot deleted the optimize-verify-checksums branch January 24, 2020 10:31


Development

Successfully merging this pull request may close these issues.

Command files:checksums:verify doesn't print that the verification went well when showing output.