clear memory usages of the processed files in occ files:checksums:verify #36787

Merged
karakayasemi merged 1 commit into master from optimize-verify-checksums
Jan 24, 2020

Conversation

@karakayasemi
Contributor

@karakayasemi karakayasemi commented Jan 20, 2020

Description

We are calling the recursive walkNodes function for each user when calculating file checksums.

private function walkNodes(array $nodes, \Closure $callBack) {

In recursive calls, memory is not freed until the root call finishes. Because of that, the worst-case memory footprint of the current approach is the combined size of all of a user's node objects. If the average node object size is 1 KB, a user with 1 million files exceeds a 1 GB PHP memory limit.

As an easy optimization, this PR first calculates the checksums of the files in a directory and then unsets the processed file and folder objects. The remaining limitation is the getDirectoryListing() method, which returns the node object list of a given directory. This means that if a user has 1 million files in a single folder (without nesting), the 1 GB PHP memory limit is still exceeded.

However, since the getDirectoryListing() method already has other callers, I don't think we need to optimize it here. Reducing the worst-case memory complexity to the size of a single getDirectoryListing() result should be sufficient.
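The per-directory idea can be sketched roughly like this (a simplified illustration only, not the actual ownCloud code: plain arrays stand in for folder nodes and strings for file nodes):

```php
<?php
// Simplified sketch of the per-directory approach. Arrays stand in for
// folder nodes and strings for file nodes; the real code works on the
// node objects returned by getDirectoryListing().
function walkNodes(array $nodes, \Closure $callBack) {
    foreach ($nodes as $key => $node) {
        if (is_array($node)) {
            // "Folder": recurse first; its listing becomes unreachable
            // once the recursive call returns.
            walkNodes($node, $callBack);
        } else {
            // "File": verify its checksum (here: invoke the callback).
            $callBack($node);
        }
        // Drop the reference to the processed node so PHP can reclaim it,
        // instead of keeping every node alive until the root call ends.
        unset($nodes[$key]);
    }
}
```

With this shape, at most one directory listing per tree level is held in memory at a time, which is the worst case the description above reduces to.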

The PR also improves the command's informational messages by showing the user currently being processed and the result of the command run.

Related Issue

Motivation and Context

How Has This Been Tested?

  • Create 10000 random files in the root of a user's home directory
  • Create a folder and another 10000 random files inside the new folder
  • Create 30000 randomly located folders in the user's home directory
  • Run occ files:checksums:verify and measure peak memory usage with PHP's memory_get_peak_usage() function

I applied the above scenario on the current master and on the PR's branch. The results confirm the improvement: memory usage is much lower on the PR's branch.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Database schema changes (next release will require increase of minor version instead of patch)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Technical debt
  • Tests only (no source changes)

Checklist:

  • Code changes
  • Unit tests added
  • Acceptance tests added
  • Documentation ticket raised:
  • Changelog item, see TEMPLATE

@update-docs

update-docs bot commented Jan 20, 2020

Thanks for opening this pull request! The maintainers of this repository would appreciate it if you would create a changelog item based on your changes.

@karakayasemi karakayasemi self-assigned this Jan 20, 2020
@karakayasemi karakayasemi added this to the development milestone Jan 20, 2020
@karakayasemi karakayasemi requested a review from cdamken January 20, 2020 15:08
@codecov

codecov bot commented Jan 20, 2020

Codecov Report

Merging #36787 into master will increase coverage by 0.54%.
The diff coverage is 86.66%.

@@             Coverage Diff              @@
##             master   #36787      +/-   ##
============================================
+ Coverage     64.14%   64.68%   +0.54%     
- Complexity    19121    19126       +5     
============================================
  Files          1269     1269              
  Lines         74743    74789      +46     
  Branches       1320     1320              
============================================
+ Hits          47946    48380     +434     
+ Misses        26409    26021     -388     
  Partials        388      388
Flag Coverage Δ Complexity Δ
#javascript 54.12% <ø> (ø) 0 <ø> (ø) ⬇️
#phpunit 65.86% <86.66%> (+0.6%) 19126 <13> (+5) ⬆️
Impacted Files Coverage Δ Complexity Δ
apps/files/lib/Command/VerifyChecksums.php 89.24% <86.66%> (-2.32%) 27 <13> (+4)
apps/files/lib/Command/Scan.php 71.92% <0%> (-9.75%) 61% <0%> (ø)
lib/private/Files/Cache/HomePropagator.php 77.77% <0%> (-9.73%) 3% <0%> (ø)
apps/files_sharing/lib/External/Manager.php 90.05% <0%> (+0.05%) 33% <0%> (+1%) ⬆️
lib/private/Files/View.php 84.83% <0%> (+0.29%) 389% <0%> (ø) ⬇️
lib/private/DB/QueryBuilder/QueryBuilder.php 91.34% <0%> (+0.48%) 68% <0%> (ø) ⬇️
apps/dav/lib/Connector/Sabre/File.php 84.19% <0%> (+0.64%) 115% <0%> (ø) ⬇️
lib/private/DB/Connection.php 67.64% <0%> (+0.73%) 49% <0%> (ø) ⬇️
lib/private/DB/MDB2SchemaWriter.php 94.79% <0%> (+1.04%) 34% <0%> (ø) ⬇️
lib/private/DB/MDB2SchemaManager.php 91.22% <0%> (+1.75%) 17% <0%> (ø) ⬇️
... and 24 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c0da7c6...f4ea8df. Read the comment docs.

@jvillafanez
Member

It seems there is a return statement missing in the walkNodes function...
I don't really see any advantage in that code. I mean, you free the memory for the files but not for the directories. If instead of 10k files you have 10k folders, the memory consumption will remain the same.

@karakayasemi karakayasemi force-pushed the optimize-verify-checksums branch from f29f4e0 to 6de348d Compare January 21, 2020 16:20
@karakayasemi
Contributor Author

It seems there is a return statement missing in the walkNodes function...

There was a copy-paste mistake in the code; I have fixed it now. As before, walkNodes is a void function.

I don't really see any advantage in that code. I mean, you free the memory for the files but not for the directories. If instead of 10k files you have 10k folders, the memory consumption will remain the same.

If all the files are in the same directory at the same level, there will be no advantage, or, as you said, if the whole tree consists of folders, there will be no advantage.
However, I expect the number of files to be much greater than the number of folders. In my approach, when the recursive function branches into a new folder, it first completes the operations on that folder's child file nodes and then clears them from memory. This way, it holds at most one folder's child file nodes in memory. Currently, we never clear processed files from memory until all of the user's files have been processed.

@karakayasemi karakayasemi marked this pull request as ready for review January 22, 2020 07:38
@jvillafanez
Member

Any easy chance to get rid of the recursion and use a queue (or similar) instead? The idea would be:

  1. Start with a queue containing some items (this would be the $nodes). Probably we'd just need the folders; the files could have been processed before.
  2. For the current item (assuming it's a folder), list its contents.
  3. For each of those contents, if it's a file, process it; if not, add it to the queue.
  4. Remove the current item from the queue, move to the next item, and go to step 2.

The problem is that I haven't found a "valid" implementation for the queue, mainly because I had trouble freeing the memory of the iterator at step 4. Worst case, we might need to implement our own queue, which doesn't seem worthwhile.

If we need many changes for the iterator approach, I think we can use the solution with the recursion.
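The proposal above could look roughly like this (an illustrative sketch only: plain arrays stand in for folder nodes, strings for file nodes, and processTree() is a made-up name, not ownCloud API):

```php
<?php
// Sketch of the iterative, queue-based traversal. Arrays stand in for
// folder nodes and strings for file nodes; processTree() is a made-up name.
function processTree(array $rootNodes, \Closure $processFile) {
    $pending = [$rootNodes];               // step 1: start with the root listing
    while (!empty($pending)) {
        $folder = array_pop($pending);     // take the current item (LIFO here)
        foreach ($folder as $node) {       // step 2: list its contents
            if (is_array($node)) {
                $pending[] = $node;        // step 3: remember sub-folders
            } else {
                $processFile($node);       // step 3: process files directly
            }
        }
        // step 4: $folder is overwritten on the next iteration, so its
        // listing can be freed; only pending folder listings stay in memory.
    }
}
```

Note this sketch uses array_pop(), i.e. LIFO order; array_shift() would give the FIFO order described in the steps, at the cost of re-indexing the array on each call.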

@karakayasemi karakayasemi force-pushed the optimize-verify-checksums branch 3 times, most recently from 1bd6785 to 5bf5b0d Compare January 23, 2020 07:11
@karakayasemi
Contributor Author

karakayasemi commented Jan 23, 2020

I re-implemented the recursive method iteratively. In addition, folders are now stored in an array-based queue, and processed folders are popped from the queue. In this way, we optimize memory usage for folders as well. I added 30000 folders in random locations to the test scenario. The peak memory usage is better than before, as expected.
@jvillafanez please take a look at the code one more time. Thanks.

@jvillafanez
Member

Could you double-check that the traversal order makes sense? I initially expected to use a FIFO queue instead of a LIFO one, since that would traverse the tree level by level (if I haven't made a mistake). With the current code, I think you'll check the first level, then check the last folder's contents, and from there go deeper with the "next" last folder's contents... which sounds odd.

I don't really think this is important, so if it isn't easy to fix, there is no need to change anything. Lower memory usage takes precedence over this.
Note that array_shift will re-index the array, which could take a lot of time, so that isn't a good solution.

@karakayasemi
Contributor Author

Note that array_shift will re-index the array, which could take a lot of time, so that isn't a good solution.

That's why I used a stack (array_pop()) instead of a queue (array_shift()). I initially intended to implement it as you said, but array_pop() is much more efficient in terms of time complexity (O(1) vs. O(N)).

The currently implemented traversal is depth-first search. If we use a queue instead of a stack, it turns into breadth-first search. However, if you visit all nodes, the time complexity is the same for both; there is almost no difference between them for our case.
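The cost difference can be illustrated with plain PHP arrays (an illustration only): array_pop() removes the last element in O(1), while array_shift() removes the first element and then re-indexes everything that remains, making each call O(N).

```php
<?php
// array_pop() vs array_shift(): both remove one element, but array_shift()
// re-indexes the whole remaining array. Draining an N-element array is
// therefore O(N) total with array_pop() but O(N^2) total with array_shift().
$stack = range(1, 5);          // [1, 2, 3, 4, 5]
$last  = array_pop($stack);    // O(1): removes and returns 5
$queue = range(1, 5);
$first = array_shift($queue);  // O(N): removes and returns 1, re-indexes the rest
```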

@jvillafanez
Member

Currently implemented searching algorithm is the depth-first search algorithm

Not entirely true. We'd need to add the files to the queue too for the algorithm to work as expected. Right now, files are being processed while we're still further down the tree.

In any case, we need to process all the files, so the order doesn't really matter. It's more important to keep the memory usage as low as possible, and the current algorithm does a good job in that regard.

@jvillafanez
Member

Taking into account that @karakayasemi says that the memory usage has gone down, I assume this has been tested (dev-wise) and it still works as expected.

@karakayasemi karakayasemi force-pushed the optimize-verify-checksums branch from 5bf5b0d to f4ea8df Compare January 23, 2020 14:39
@karakayasemi karakayasemi merged commit 5812ff6 into master Jan 24, 2020
@delete-merged-branch delete-merged-branch bot deleted the optimize-verify-checksums branch January 24, 2020 10:31


Development

Successfully merging this pull request may close these issues.

Command files:checksums:verify doesn't print that the verification went well when showing output.