Skip to content

Conversation

@schlndh
Copy link
Contributor

@schlndh schlndh commented Sep 18, 2016

  • feature
  • issues - none
  • documentation - not needed
  • BC break - yes

The purpose of this PR is to speedup dumping of large strings for reasonable maxLength. The performance problem is in preg_match('#[^\x09\x0A\x0D\x20-\x7E\xA0-\x{10FFFF}]#u', $s) which scans the whole string. By preferring mb_substr it is possible to truncate the string before running it through the regex and thus potentially saving a lot of time for large strings. However this is a slight BC break, because now it won't recognize some binary strings it did recognize before, but since the binary part will be outside the truncated string I think it is worth it.

As you can see from the benchmark I made the performance impact for short strings or large strings with unlimited maxLength is minimal or none, however the performance gain for long strings with small maxLength is massive.

The benchmark compares current encodeString implementation (encodeStringOrig) with implementation from this PR (encodeStringChanged) and just running the preg_match on the whole string. The first column (N) is the number of characters and the values of other columns are average times for one call on given string in nanoseconds.

Benchmarking was done on Linux with PHP 7.0.10

@JanTvrdik
Copy link
Contributor

JanTvrdik commented Sep 18, 2016

How will mb_substr handle binary (not UTF-8) strings?


EDIT: I tried it myself and it seems to work fine.

$shortened = TRUE;
break;
}
} while (isset($s[++$i]));
Copy link
Contributor

@JanTvrdik JanTvrdik Sep 18, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code block should be only done when the string was not already shortened by mb_substr, i.e.

} elseif ($maxLength && $s !== '') {

should be changed to

} elseif ($maxLength && $s !== '' && !function_exists('mb_substr')) {

if (preg_match('#[^\x09\x0A\x0D\x20-\x7E\xA0-\x{10FFFF}]#u', $s) || preg_last_error()) {
$s = strtr($s, $table);
}
} elseif (preg_match('#[^\x09\x0A\x0D\x20-\x7E\xA0-\x{10FFFF}]#u', $s) || preg_last_error()) {
Copy link
Contributor

@JanTvrdik JanTvrdik Sep 18, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The speed of the preg_match call be made faster by inverting the regexp, i.e. changing

preg_match('#[^\x09\x0A\x0D\x20-\x7E\xA0-\x{10FFFF}]#u', $s) || preg_last_error()

to

!preg_match('#^[\x09\x0A\x0D\x20-\x7E\xA0-\x{10FFFF}]*+\z#u', $s) || preg_last_error()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is virtually the same for all test cases in my benchmark except for the test case where the whole string is 1 multibyte character repeated N times which is measurably faster with your regexp. I don't really understand why that is, but thanks.

}
if (preg_match('#[^\x09\x0A\x0D\x20-\x7E\xA0-\x{10FFFF}]#u', $s) || preg_last_error()) {
$s = strtr($s, $table);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code duplication can be avoided if you change the following elseif back to if

if ($shortened = ($maxLength && strlen($s) > $maxLength)) {
$s = substr($s, 0, $maxLength);
}
$s = strtr($s, $table);
Copy link
Contributor

@JanTvrdik JanTvrdik Sep 18, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After changing the elseif statement back to if, you wil need to change the $shortened variable only when the string is futher shortend, i.e.

if ($maxLength && strlen($s) > $maxLength) {
    $s = substr($s, 0, $maxLength);
    $shortened = TRUE;
}
$s = strtr($s, $table);

$shortened = $s !== $tmp;
} else {
$shortened = false;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be simplified by initializing $shortend variable first to FALSE (note the uppercase letters)

$shortened = FALSE;
if ($maxLength && strlen($s) > $maxLength && function_exists('mb_substr')) {
    $s = mb_substr($tmp = $s, 0, $maxLength, 'UTF-8');
    $shortened = $s !== $tmp;
}

@schlndh
Copy link
Contributor Author

schlndh commented Sep 18, 2016

@JanTvrdik Thanks for your comments. It didn't think of checking mb_substr with binary strings before, but it sort of works:

$lengths = [];
for ($i < 0; $i < 1000; ++$i) {
    $s = openssl_random_pseudo_bytes(1000);
    @$lengths[strlen(mb_substr($s, 0, 150, 'UTF-8'))]++;
}
ksort($lengths);
foreach ($lengths as $k => $val) {
    echo $k . " " . str_repeat('|', $val) . "\n";
}

It makes the substring mostly around 200 bytes long, which is fine I guess.

@schlndh
Copy link
Contributor Author

schlndh commented Sep 18, 2016

I updated the benchmark with latest changes to this PR.

}
} while (isset($s[++$i]));
}
} elseif ($maxLength && $s !== '' && !function_exists('mb_substr')) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

$s !== '' part of the condition can be changed to strlen($s) > $maxLength

@dg dg closed this in 5b83fa4 Sep 26, 2016
@schlndh schlndh deleted the speedup-Dumper-encodeString branch September 27, 2016 04:39
@JanTvrdik
Copy link
Contributor

@dg Your implementation is a lot slower for long UTF-8 strings.

@dg
Copy link
Member

dg commented Sep 27, 2016

As I understand, „the performance problem is in preg_match('#[^\x09\x0A\x0D\x20-\x7E\xA0-\x{10FFFF}]#u', $s) which scans the whole string“, my implementation scans only shortened strings. So it should be faster, or not?

@JanTvrdik
Copy link
Contributor

@schlndh
Copy link
Contributor Author

schlndh commented Sep 27, 2016

@dg I don't have time to test it right now, maybe in the evening, but I think the problem is that you still call preg_match('##u', $s) on the whole string which can be very long, that's why I truncated it first so that I can be sure that it is reasonably short before calling preg_match on it.

@dg
Copy link
Member

dg commented Sep 27, 2016

I see, your intention is to remove utf-8 checking at all before shortening.

@schlndh
Copy link
Contributor Author

schlndh commented Sep 27, 2016

@dg Yes.

dg added a commit that referenced this pull request Sep 27, 2016
@dg
Copy link
Member

dg commented Sep 27, 2016

Repushed, now it should be faster when even mb_substr is disabled.

dg added a commit that referenced this pull request Sep 27, 2016
@JanTvrdik
Copy link
Contributor

JanTvrdik commented Sep 27, 2016

Repushed, now it should be faster when even mb_substr is disabled.

Usually, but if you try string str_repeat("\x80", 1e5) with mb_substr disabled it is still slow. Simple fix is to not allow $i > $maxLength * 4

@dg
Copy link
Member

dg commented Sep 27, 2016

I know, it is acceptable.

@JanTvrdik
Copy link
Contributor

Accepting issue which can be fixed with single line (e.g. $s = substr($s, 0, $maxLength << 2);) of code is unnecessary

dg added a commit that referenced this pull request Sep 27, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants