Reduce memory footprint of trash expiry jobs#36602
Conversation
@jvillafanez @DeepDiver1975 @pako81 will 500 users per session be a good value? The same as we have for FileScan - https://github.com/owncloud/core/blob/master/apps/files/lib/BackgroundJob/ScanFiles.php#L98
Codecov Report
```
@@            Coverage Diff            @@
##             master   #36602   +/-  ##
=========================================
  Coverage     64.69%   64.69%
- Complexity    19114    19127   +13
=========================================
  Files          1269     1270   +1
  Lines         74808    74843   +35
  Branches       1327     1327
=========================================
+ Hits          48395    48423   +28
- Misses        26022    26029   +7
  Partials        391      391
=========================================
```
@DeepDiver1975 @jvillafanez MR is ready.
```php
}

if ($limit !== null || $offset !== null) {
    $qb->orderBy('user_id'); // needed for predictable limit & offset
```
I think it's better to switch to the (account) "id" instead of the "user_id". If a user is added or removed between runs, a user could be checked twice or more while another is neglected for a long time. Because the id is an autoincrement column, new users are expected to always be appended in the last position, never in between.
In addition, we can change the "offset" to an "initialId". Instead of `SELECT * FROM table WHERE ..... LIMIT limit OFFSET offset` we can use `SELECT * FROM table WHERE id > initialId ...... LIMIT limit` and store the last id for the next run. This way, if a user is removed we won't rescan a previously-scanned user, and we can take advantage of DB indexes.
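The keyset-pagination pattern suggested above can be sketched in a few lines. This is an illustrative demo against an in-memory SQLite table, not ownCloud code; the `accounts` schema and column names are assumptions standing in for the real table.

```python
import sqlite3

# Hypothetical accounts table mirroring the discussion above;
# schema and names are illustrative, not ownCloud's actual schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, user_id TEXT)")
conn.executemany("INSERT INTO accounts (user_id) VALUES (?)",
                 [(f"user{i:03d}",) for i in range(10)])

def next_chunk(last_id, limit):
    # Keyset pagination: WHERE id > last_id skips rows via the primary-key
    # index instead of scanning them like OFFSET does, and stays stable
    # even if earlier rows are deleted between runs.
    return conn.execute(
        "SELECT id, user_id FROM accounts WHERE id > ? ORDER BY id LIMIT ?",
        (last_id, limit)).fetchall()

chunk = next_chunk(0, 4)          # first run starts from id 0
last_id = chunk[-1][0]            # persist this id for the next run
chunk2 = next_chunk(last_id, 4)   # next run resumes after the stored id
```

Each run only needs the single stored `last_id`, which is what makes the approach crash-tolerant across background-job invocations.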
I think it is better to order by user_id, at least in the scope of this PR. Please have a look at other places in the class: there we also order by user_id (it was done for some reason, and I would not change it without some consultation). For the second comment, I replied in #36602 (comment).
As long as the sorting column is indexed, I don't think it's relevant whether you sort by id or user_id. My point is that the "offset" clause is bad and we should get rid of it (https://www.eversql.com/faster-pagination-in-mysql-why-order-by-with-limit-and-offset-is-slow/).
Since this is a new function, it could be a good starting point. Maybe the change is bigger than I think, so if you feel it is too big for this PR, we can go with the current code for now.
@jvillafanez I am not sure it is a good idea in this case to expose rowids/account ids for the limit, as we use the public IUserManager->callForUsers interface; limit and offset are already used in three other functions in this class. If this were some internal implementation to page data, I would be more open to this solution.
I know it is super tempting in this case, but other use-cases might be interested in non-arbitrary pages (e.g. you have a UI with 100 users per page and 1000 pages, and you want to calculate something for page no. 4 - how do you do that when the offset is some magic number?).
I'm willing to discuss this offline. Anyway, we can keep the limit + offset approach in this case.
```php
    $this->config->setAppValue('files_trashbin', 'cronjob_trash_expiry_offset', 0);
} else {
    $offset += self::USERS_PER_SESSION;
    $this->config->setAppValue('files_trashbin', 'cronjob_trash_expiry_offset', $offset);
```
As said in the other thread, it might be better to store the last account id checked so we can resume from there on the next run, in order to have better consistency if users are added or removed.
I do not think that is a good idea. This class already uses offset and limit, and FileScan uses a similar mechanism - https://github.com/owncloud/core/blob/master/apps/files/lib/BackgroundJob/ScanFiles.php#L98. I would not over-engineer it here if it is not needed.
I refactored this part of the code to always remember the last scanned user. If users are added, no expiration needs to be done for them yet; if a user is removed, there probably won't be anything to expire.
```php
    return $this->expiration->isEnabled();
}

public function expiryByRetentionEnabled() {
```
This might need some docs. The method name is confusing, and I'm not sure when I should use this method versus the expiryEnabled one.
```php
if ($this->expiration->isExpired($trashbinEntry->getMtime())) {
    Trashbin::delete($trashbinEntry->getName(), $uid, $trashbinEntry->getMtime());
    $this->logger->info(
        'Remove "' . $trashbinEntry->getName() . '" from "' . $uid . '" trashbin because it exceeds max retention obligation term.',
```
String interpolation is preferred over concatenation (https://blog.blackfire.io/php-7-performance-improvements-encapsed-strings-optimization.html). It also might be easier to read.
Refactored into a class.
```php
    );

// unset the deleted file as it was already processed
unset($remainingUserTrashbinContent[$key]);
```
Confirm this works as expected. Modifying the list you're looping over causes problems in most languages.
I think it works here because foreach loops over a copy of the array, not the array itself, so modifying the array won't affect the loop.
I refactored it so it is clearer that this is an associative array (unset just removes the key from the array, which can still be iterated with foreach afterwards):
before: [0 => a, 1 => b, 2 => c]
after: [0 => a, 2 => c]
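The before/after example above hinges on PHP's foreach iterating over a snapshot of the array, which makes unset() inside the loop safe. As a hedged illustration in Python (where mutating a dict during iteration raises an error), the equivalent pattern needs an explicit copy; the variable names are made up for the demo:

```python
# Python analogue of the unset-inside-foreach pattern discussed above.
# PHP's foreach iterates over a copy of the array, so unset() is safe;
# in Python the snapshot must be taken explicitly with list(...).
trashbin = {0: "a", 1: "b", 2: "c"}

for key, entry in list(trashbin.items()):  # explicit snapshot
    if entry == "b":                       # pretend "b" is the expired file
        del trashbin[key]                  # safe: the loop uses the snapshot

# remaining entries keep their original keys, as in the before/after above
```

Without the `list(...)` copy, `del` inside the loop would raise `RuntimeError: dictionary changed size during iteration`, which is exactly the class of bug the reviewer was asking about.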
apps/files_trashbin/lib/Trashbin.php
Outdated
```php
/**
 * @return Quota
 */
protected static function getQuota() {
```
I don't think this method is useful. Either move the code to where it's being used or clarify in the phpdocs that the method will always return a new Quota instance.
It is being used just one function above - https://github.com/owncloud/core/blob/v10.3.2/apps/files_trashbin/lib/Trashbin.php#L778. Let's keep it in the class where it is used.
The function is being used, but what's the point of a function that just creates a new object? Instead of doing self::getQuota() you can do new Quota(.....).
Right now I don't see any benefit: the function is only used in one place (it isn't reused) and you can't mock it in the unit tests. Note that this might be confusing in the long run, because 95% of the "get" functions will always return the same instance; they won't create a new instance each time they're called.
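The mockability point above can be made concrete with a small sketch. This is not ownCloud code: the class names and methods are invented for the example. Injecting the collaborator (or a factory for it) through the constructor lets a test substitute a double, which a hard-coded `new Quota(...)` or static getter does not allow:

```python
# Illustrative sketch, not ownCloud code: constructor injection makes the
# quota collaborator replaceable in tests, unlike an inline `new Quota()`.
class Quota:
    def limit_for(self, uid):
        return 1024  # stand-in for a real quota lookup

class TrashbinExpiry:
    def __init__(self, quota=None):
        # default keeps production call sites simple; tests inject a double
        self.quota = quota if quota is not None else Quota()

    def over_quota(self, uid, used_bytes):
        return used_bytes > self.quota.limit_for(uid)

class FakeQuota:
    def limit_for(self, uid):
        return 10  # deterministic value chosen by the test

# production path uses the real Quota, tests swap in the fake
assert TrashbinExpiry().over_quota("alice", 11) is False
assert TrashbinExpiry(FakeQuota()).over_quota("alice", 11) is True
```

The same trade-off applies in PHP: a static factory that only wraps `new` gives neither reuse nor a seam for test doubles.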
micbar
left a comment
Please add changelog item
@jvillafanez @micbar @phil-davis while reviewing, please note that the refactor of the trashbin manager was also motivated by the need to test it properly with unit tests and cover the whole code there. I also manually tested expiration by background jobs (both space and retention).
jvillafanez
left a comment
Just a little detail, but I can live with it.
```php
/**
 * Tear down owner fs
 */
protected function tearDownFS() {
```
this isn't really needed. Just move the \OC_Util::tearDownFS() call inside the run method
```php
// increase the offset to know where to start next
// e.g. in case of crash of this job due to memory
$this->config->setAppValue('files_trashbin', 'cronjob_trash_expiry_offset', $offset + $usersScanned);
```
Just saying, you can sell this better as a transactional mechanism to deal with a memory crash.
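The "transactional" framing above can be sketched as a resumable chunked loop: persist the offset after each chunk so a crash (e.g. OOM) loses at most one chunk of work. This is an illustrative Python sketch, not the ownCloud implementation; the `config` dict stands in for the app-config store and the names mirror the code only loosely:

```python
# Hypothetical sketch of checkpointed chunk processing. A crash between
# runs only loses the current chunk; the next run resumes at the stored
# offset instead of starting over.
USERS_PER_SESSION = 3
users = [f"user{i}" for i in range(8)]
config = {"cronjob_trash_expiry_offset": 0}  # stand-in for app config
expired = []

def run_once():
    # one background-job invocation: process one chunk, then checkpoint
    offset = config["cronjob_trash_expiry_offset"]
    chunk = users[offset:offset + USERS_PER_SESSION]
    for uid in chunk:
        expired.append(uid)  # stand-in for expiring this user's trash
    if offset + USERS_PER_SESSION >= len(users):
        config["cronjob_trash_expiry_offset"] = 0  # all done, wrap around
    else:
        config["cronjob_trash_expiry_offset"] = offset + len(chunk)

for _ in range(3):  # three cron runs cover all 8 users
    run_once()
```

Because each run is a fresh short-lived process, this also sidesteps the long-running-job memory accumulation described in the PR summary below.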
Related issue: https://github.com/owncloud/enterprise/issues/2925
Main problem: the PHP PDO driver leaks memory when used in long-running jobs (https://www.dragonbe.com/2015/07/speeding-up-database-calls-with-pdo-and.html); PDO works well for online requests where scripts are short-lived. We cannot use a scrollable PDO cursor (and it is not always a good idea anyway), so the whole result set is loaded into memory on execute, causing a memory allocation peak. Memory accumulates and is not released in long-running jobs like trash expiry.
We also steadily leak memory in the code below when it is run 1000 times for 1000 users.
Solution:
- reduce the memory footprint where possible for a single user
- since memory is only reliably released when the PHP script terminates, do not process all users within one script; chunk expiry across users (chunk according to the max number of files processed per user?)
- do not fetch trashbin locations for background jobs (they are not used there)
- split the trashbin manager (mostly online queries) and the trash expiry manager (background jobs only) to optimize each case
- do not keep results of functions that can return large result sets alive for the duration of the function, causing memory to grow further
- callForUsers with limit and offset to allow paging users
- add user/account unit tests
- do not allow cron to process all users at once (run many php crons instead, chunking users)
- add trashbin expiry unit tests
@DeepDiver1975 @jvillafanez @PVince81