-
-
Notifications
You must be signed in to change notification settings - Fork 221
optimize Dumper::encodeString for long strings #223
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
How will EDIT: I tried it myself and it seems to work fine. |
| $shortened = TRUE; | ||
| break; | ||
| } | ||
| } while (isset($s[++$i])); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code block should be only done when the string was not already shortened by mb_substr, i.e.
} elseif ($maxLength && $s !== '') {should be changed to
} elseif ($maxLength && $s !== '' && !function_exists('mb_substr')) {
src/Tracy/Dumper.php
Outdated
| if (preg_match('#[^\x09\x0A\x0D\x20-\x7E\xA0-\x{10FFFF}]#u', $s) || preg_last_error()) { | ||
| $s = strtr($s, $table); | ||
| } | ||
| } elseif (preg_match('#[^\x09\x0A\x0D\x20-\x7E\xA0-\x{10FFFF}]#u', $s) || preg_last_error()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The speed of the preg_match call be made faster by inverting the regexp, i.e. changing
preg_match('#[^\x09\x0A\x0D\x20-\x7E\xA0-\x{10FFFF}]#u', $s) || preg_last_error()to
!preg_match('#^[\x09\x0A\x0D\x20-\x7E\xA0-\x{10FFFF}]*+\z#u', $s) || preg_last_error()There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is virtually the same for all test cases in my benchmark except for the test case where the whole string is 1 multibyte character repeated N times which is measurably faster with your regexp. I don't really understand why that is, but thanks.
src/Tracy/Dumper.php
Outdated
| } | ||
| if (preg_match('#[^\x09\x0A\x0D\x20-\x7E\xA0-\x{10FFFF}]#u', $s) || preg_last_error()) { | ||
| $s = strtr($s, $table); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code duplication can be avoided if you change the following elseif back to if
| if ($shortened = ($maxLength && strlen($s) > $maxLength)) { | ||
| $s = substr($s, 0, $maxLength); | ||
| } | ||
| $s = strtr($s, $table); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After changing the elseif statement back to if, you wil need to change the $shortened variable only when the string is futher shortend, i.e.
if ($maxLength && strlen($s) > $maxLength) {
$s = substr($s, 0, $maxLength);
$shortened = TRUE;
}
$s = strtr($s, $table);
src/Tracy/Dumper.php
Outdated
| $shortened = $s !== $tmp; | ||
| } else { | ||
| $shortened = false; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be simplified by initializing $shortend variable first to FALSE (note the uppercase letters)
$shortened = FALSE;
if ($maxLength && strlen($s) > $maxLength && function_exists('mb_substr')) {
$s = mb_substr($tmp = $s, 0, $maxLength, 'UTF-8');
$shortened = $s !== $tmp;
}|
@JanTvrdik Thanks for your comments. It didn't think of checking $lengths = [];
for ($i < 0; $i < 1000; ++$i) {
$s = openssl_random_pseudo_bytes(1000);
@$lengths[strlen(mb_substr($s, 0, 150, 'UTF-8'))]++;
}
ksort($lengths);
foreach ($lengths as $k => $val) {
echo $k . " " . str_repeat('|', $val) . "\n";
}It makes the substring mostly around 200 bytes long, which is fine I guess. |
|
I updated the benchmark with latest changes to this PR. |
src/Tracy/Dumper.php
Outdated
| } | ||
| } while (isset($s[++$i])); | ||
| } | ||
| } elseif ($maxLength && $s !== '' && !function_exists('mb_substr')) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
$s !== '' part of the condition can be changed to strlen($s) > $maxLength
|
@dg Your implementation is a lot slower for long UTF-8 strings. |
|
As I understand, „the performance problem is in |
|
@dg I don't have time to test it right now, maybe in the evening, but I think the problem is that you still call |
|
I see, your intention is to remove utf-8 checking at all before shortening. |
|
@dg Yes. |
…d string [Closes #223] thanks to @schlndh and @JanTvrdik
|
Repushed, now it should be faster when even mb_substr is disabled. |
…d string [Closes #223] thanks to @schlndh and @JanTvrdik
Usually, but if you try string |
|
I know, it is acceptable. |
|
Accepting issue which can be fixed with single line (e.g. |
…d string [Closes #223] thanks to @schlndh and @JanTvrdik
The purpose of this PR is to speedup dumping of large strings for reasonable maxLength. The performance problem is in
preg_match('#[^\x09\x0A\x0D\x20-\x7E\xA0-\x{10FFFF}]#u', $s)which scans the whole string. By preferringmb_substrit is possible to truncate the string before running it through the regex and thus potentially saving a lot of time for large strings. However this is a slight BC break, because now it won't recognize some binary strings it did recognize before, but since the binary part will be outside the truncated string I think it is worth it.As you can see from the benchmark I made the performance impact for short strings or large strings with unlimited maxLength is minimal or none, however the performance gain for long strings with small maxLength is massive.
The benchmark compares current
encodeStringimplementation (encodeStringOrig) with implementation from this PR (encodeStringChanged) and just running thepreg_matchon the whole string. The first column (N) is the number of characters and the values of other columns are average times for one call on given string in nanoseconds.Benchmarking was done on Linux with PHP 7.0.10