optimize Dumper::encodeString for long strings #223

schlndh · 2016-09-18T08:54:54Z

feature
issues - none
documentation - not needed
BC break - yes

The purpose of this PR is to speed up dumping of large strings when maxLength is reasonably small. The performance problem is in preg_match('#[^\x09\x0A\x0D\x20-\x7E\xA0-\x{10FFFF}]#u', $s), which scans the whole string. By preferring mb_substr, it is possible to truncate the string before running it through the regex, potentially saving a lot of time for large strings. However, this is a slight BC break: some binary strings that were recognized before will no longer be recognized. Since the binary part will be outside the truncated string, I think it is worth it.
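
A minimal sketch of the truncate-then-check idea described above, assuming the mbstring extension is available. The function name truncateThenCheck and its return shape are hypothetical and are not taken from Tracy; only the regex and the mb_substr call come from this thread.

```php
<?php
// Hypothetical helper, not Tracy's actual code: truncate first, then validate.
function truncateThenCheck(string $s, int $maxLength): array
{
    $shortened = false;
    if ($maxLength && strlen($s) > $maxLength && function_exists('mb_substr')) {
        $tmp = $s;
        $s = mb_substr($s, 0, $maxLength, 'UTF-8');
        $shortened = $s !== $tmp;
    }
    // The regex now scans at most $maxLength characters when mb_substr is available.
    $binary = preg_match('#[^\x09\x0A\x0D\x20-\x7E\xA0-\x{10FFFF}]#u', $s) || preg_last_error();
    return [$s, $shortened, $binary];
}
```

Calling truncateThenCheck(str_repeat('a', 1000) . "\x00", 10) illustrates the BC break: the NUL byte lies beyond the first 10 characters, so the result is no longer flagged as binary.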

As you can see from the benchmark I made, the performance impact for short strings, or for large strings with unlimited maxLength, is minimal or none, whereas the performance gain for long strings with a small maxLength is massive.

The benchmark compares the current encodeString implementation (encodeStringOrig) with the implementation from this PR (encodeStringChanged) and with just running preg_match on the whole string. The first column (N) is the number of characters; the other columns are average times, in nanoseconds, for one call on the given string.

Benchmarking was done on Linux with PHP 7.0.10
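
For readers who want to reproduce this kind of measurement, here is a rough timing harness. It is not the benchmark script used in this PR; the function name averageNanoseconds and the chosen input sizes are assumptions made for illustration, and microtime() only gives coarse resolution.

```php
<?php
// Hypothetical timing harness, not the benchmark script used in this PR.
function averageNanoseconds(callable $fn, string $input, int $iterations = 1000): float
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $fn($input);
    }
    // Average wall-clock time per call, converted from seconds to nanoseconds.
    return (microtime(true) - $start) / $iterations * 1e9;
}

$regex = function (string $s) {
    return preg_match('#[^\x09\x0A\x0D\x20-\x7E\xA0-\x{10FFFF}]#u', $s);
};

// Print N and the average time per call for a few input sizes.
foreach ([10, 1000, 100000] as $n) {
    printf("%8d %12.0f\n", $n, averageNanoseconds($regex, str_repeat('a', $n), 200));
}
```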

JanTvrdik · 2016-09-18T11:45:17Z

How will mb_substr handle binary (not UTF-8) strings?

EDIT: I tried it myself and it seems to work fine.

JanTvrdik · 2016-09-18T12:44:46Z

src/Tracy/Dumper.php

+					$shortened = TRUE;
+					break;
+				}
+			} while (isset($s[++$i]));


This code block should run only when the string was not already shortened by mb_substr, i.e.

} elseif ($maxLength && $s !== '') {

should be changed to

} elseif ($maxLength && $s !== '' && !function_exists('mb_substr')) {

JanTvrdik · 2016-09-18T12:47:41Z

src/Tracy/Dumper.php

+			if (preg_match('#[^\x09\x0A\x0D\x20-\x7E\xA0-\x{10FFFF}]#u', $s) || preg_last_error()) {
+				$s = strtr($s, $table);
+			}
+		} elseif (preg_match('#[^\x09\x0A\x0D\x20-\x7E\xA0-\x{10FFFF}]#u', $s) || preg_last_error()) {


The preg_match call can be made faster by inverting the regexp, i.e. changing

preg_match('#[^\x09\x0A\x0D\x20-\x7E\xA0-\x{10FFFF}]#u', $s) || preg_last_error()

to

!preg_match('#^[\x09\x0A\x0D\x20-\x7E\xA0-\x{10FFFF}]*+\z#u', $s) || preg_last_error()

It is virtually the same for all test cases in my benchmark except for the test case where the whole string is 1 multibyte character repeated N times which is measurably faster with your regexp. I don't really understand why that is, but thanks.
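
A small sketch to check that the two forms of the test agree. The two patterns are copied from the comment above; the function names isBinaryNegated and isBinaryAnchored and the sample strings are hypothetical.

```php
<?php
// Hypothetical helper names; the two patterns are copied from the review comment.
function isBinaryNegated(string $s): bool
{
    // Searches for the first character outside the allowed set.
    return preg_match('#[^\x09\x0A\x0D\x20-\x7E\xA0-\x{10FFFF}]#u', $s) || preg_last_error();
}

function isBinaryAnchored(string $s): bool
{
    // Requires the whole string to consist of allowed characters (possessive quantifier).
    return !preg_match('#^[\x09\x0A\x0D\x20-\x7E\xA0-\x{10FFFF}]*+\z#u', $s) || preg_last_error();
}

$samples = ['', 'plain ASCII', "tab\tand\nnewline", "caf\u{E9}", "nul\x00byte", "bad\x80utf8"];
foreach ($samples as $s) {
    echo bin2hex($s), ' ', var_export(isBinaryNegated($s) === isBinaryAnchored($s), true), "\n";
}
```

Invalid UTF-8 makes preg_match return false and sets preg_last_error(), so both variants report such input as binary.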

JanTvrdik · 2016-09-18T12:53:10Z

src/Tracy/Dumper.php

+			}
+			if (preg_match('#[^\x09\x0A\x0D\x20-\x7E\xA0-\x{10FFFF}]#u', $s) || preg_last_error()) {
+				$s = strtr($s, $table);
+			}


This code duplication can be avoided if you change the following elseif back to if.

JanTvrdik · 2016-09-18T12:54:42Z

src/Tracy/Dumper.php

 			if ($shortened = ($maxLength && strlen($s) > $maxLength)) {
 				$s = substr($s, 0, $maxLength);
 			}
 			$s = strtr($s, $table);


After changing the elseif statement back to if, you will need to set the $shortened variable only when the string is further shortened, i.e.

if ($maxLength && strlen($s) > $maxLength) {
	$s = substr($s, 0, $maxLength);
	$shortened = TRUE;
}
$s = strtr($s, $table);

JanTvrdik · 2016-09-18T12:56:26Z

src/Tracy/Dumper.php

+				$shortened = $s !== $tmp;
+			} else {
+				$shortened = false;
+			}


This can be simplified by first initializing the $shortened variable to FALSE (note the uppercase letters)

$shortened = FALSE;
if ($maxLength && strlen($s) > $maxLength && function_exists('mb_substr')) {
	$s = mb_substr($tmp = $s, 0, $maxLength, 'UTF-8');
	$shortened = $s !== $tmp;
}

schlndh · 2016-09-18T13:12:25Z

@JanTvrdik Thanks for your comments. I didn't think of checking mb_substr with binary strings before, but it sort of works:

$lengths = [];
for ($i = 0; $i < 1000; ++$i) {
    $s = openssl_random_pseudo_bytes(1000);
    @$lengths[strlen(mb_substr($s, 0, 150, 'UTF-8'))]++;
}
ksort($lengths);
foreach ($lengths as $k => $val) {
    echo $k . " " . str_repeat('|', $val) . "\n";
}

It makes the substring mostly around 200 bytes long, which is fine I guess.

schlndh · 2016-09-18T14:18:49Z

I updated the benchmark with latest changes to this PR.

JanTvrdik · 2016-09-18T15:03:07Z

src/Tracy/Dumper.php

-					}
-				} while (isset($s[++$i]));
-			}
+		} elseif ($maxLength && $s !== '' && !function_exists('mb_substr')) {


$s !== '' part of the condition can be changed to strlen($s) > $maxLength

JanTvrdik · 2016-09-27T08:15:49Z

@dg Your implementation is a lot slower for long UTF-8 strings.

dg · 2016-09-27T08:34:52Z

As I understand, „the performance problem is in preg_match('#[^\x09\x0A\x0D\x20-\x7E\xA0-\x{10FFFF}]#u', $s) which scans the whole string“, my implementation scans only shortened strings. So it should be faster, or not?

JanTvrdik · 2016-09-27T08:53:35Z

See https://gist.github.com/JanTvrdik/5b8f1e22f71bac00301e74a6ce46d387

schlndh · 2016-09-27T09:09:58Z

@dg I don't have time to test it right now, maybe in the evening, but I think the problem is that you still call preg_match('##u', $s) on the whole string, which can be very long. That's why I truncated it first, so that I can be sure it is reasonably short before calling preg_match on it.

dg · 2016-09-27T10:37:41Z

I see, your intention is to skip UTF-8 checking entirely before shortening.

schlndh · 2016-09-27T10:41:32Z

@dg Yes.

dg · 2016-09-27T11:26:03Z

Repushed, now it should be faster even when mb_substr is disabled.

JanTvrdik · 2016-09-27T13:51:58Z

Repushed, now it should be faster even when mb_substr is disabled.

Usually, but if you try the string str_repeat("\x80", 1e5) with mb_substr disabled, it is still slow. A simple fix is to not allow $i > $maxLength * 4.

dg · 2016-09-27T13:54:08Z

I know, it is acceptable.

JanTvrdik · 2016-09-27T14:02:04Z

Accepting an issue that can be fixed with a single line of code (e.g. $s = substr($s, 0, $maxLength << 2);) is unnecessary.
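
A sketch of how such a cap might look in a byte-counting fallback for when mb_substr is unavailable. This is not the code that was merged; the function name truncateUtf8Fallback is hypothetical, and the loop is only modeled on the diff fragments quoted earlier in this thread. The cap relies on the fact that a UTF-8 character occupies at most 4 bytes.

```php
<?php
// Hypothetical fallback for when mb_substr is unavailable; names are illustrative.
function truncateUtf8Fallback(string $s, int $maxLength): array
{
    $shortened = false;
    if ($maxLength && strlen($s) > $maxLength) {
        $origLen = strlen($s);
        // $maxLength characters need at most 4 * $maxLength bytes,
        // so the loop below never has to look past this window.
        $s = substr($s, 0, $maxLength << 2);
        $i = $len = 0;
        do {
            // Any byte that is not a continuation byte (10xxxxxx) starts a new character.
            if ((ord($s[$i]) & 0xC0) !== 0x80 && ++$len > $maxLength) {
                $s = substr($s, 0, $i);
                break;
            }
        } while (isset($s[++$i]));
        $shortened = strlen($s) < $origLen;
    }
    return [$s, $shortened];
}
```

With the cap in place, str_repeat("\x80", 1e5) and a $maxLength of 10 cost at most 40 loop iterations instead of 100000, because the input is cut to 40 bytes before the loop starts.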

optimize Dumper::encodeString for long strings

7a1c52b

JanTvrdik suggested changes Sep 18, 2016

JanTvrdik reviewed Sep 18, 2016

schlndh added 2 commits September 18, 2016 15:54

code polish

5ebce76

use faster regexp

305a931

JanTvrdik approved these changes Sep 18, 2016

JanTvrdik reviewed Sep 18, 2016

optimize fallback condition

b236c38

dg closed this in 5b83fa4 Sep 26, 2016

schlndh deleted the speedup-Dumper-encodeString branch September 27, 2016 04:39

dg added a commit that referenced this pull request Sep 27, 2016

Dumper::encodeString() optimization, slow regexp is used for shortened string [Closes #223] thanks to @schlndh and @JanTvrdik

4c800f2

dg added a commit that referenced this pull request Sep 27, 2016

Dumper::encodeString() optimization, slow regexp is used for shortened string [Closes #223] thanks to @schlndh and @JanTvrdik

2654def

dg added a commit that referenced this pull request Sep 27, 2016

Dumper::encodeString() optimization, slow regexp is used for shortened string [Closes #223] thanks to @schlndh and @JanTvrdik

3dee30f
