@mscastanho

Hello again,

This optimization uses VSX vector (SIMD) instructions to match multiple bytes at once while searching for the longest match. A 16-byte vector load and comparison adds only a small overhead compared with the scalar equivalents, so the optimized longest_match tries to match as many bytes as possible on every comparison.
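
To illustrate the idea without VSX, here is a minimal portable sketch that compares 16 bytes per step and only falls back to byte-by-byte comparison once a chunk differs. It is not the code from this PR: `match_length_chunked` and `CHUNK` are hypothetical names, and `memcmp` stands in for the vector load and compare.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK 16  /* bytes compared per step, mirroring one 16-byte vector */

/* Hypothetical helper: length of the common prefix of a and b, capped at
 * max_len. memcmp stands in for the vector load + compare. */
size_t match_length_chunked(const unsigned char *a, const unsigned char *b,
                            size_t max_len)
{
    size_t len = 0;

    assert(a != NULL && b != NULL);

    /* Fast path: skip whole 16-byte chunks while they are identical. */
    while (max_len - len >= CHUNK && memcmp(a + len, b + len, CHUNK) == 0)
        len += CHUNK;

    /* Slow path: locate the first mismatch one byte at a time. */
    while (len < max_len && a[len] == b[len])
        len++;

    return len;
}
```

With long matches most of the work happens in the fast path, which is where a real vector implementation would save time.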

This PR shares one commit with #457 and #458; that commit can be dropped if either of them is merged first. It also uses GNU indirect functions to select, at run time on the first call to longest_match, which version of the function (optimized or default) to run.
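
As a rough sketch of how a GNU indirect function is declared, assuming GCC or Clang on Linux with glibc: the `ifunc` attribute names a resolver that returns the implementation to bind. All names here (`match_len`, `resolve_match_len`, `cpu_has_fast_path`) are hypothetical and this is not the Z_IFUNC macro from the PR; the Power capability check is likewise an assumption, and other platforms fall back to a plain function pointer.

```c
#include <assert.h>   /* on glibc this also makes __GLIBC__ visible */
#include <stddef.h>

typedef size_t (*match_fn)(const unsigned char *, const unsigned char *, size_t);

/* Baseline: compare one byte at a time. */
static size_t match_len_default(const unsigned char *a, const unsigned char *b,
                                size_t n)
{
    size_t i = 0;
    while (i < n && a[i] == b[i])
        i++;
    return i;
}

/* Stand-in for a CPU-specific version; here it simply reuses the baseline. */
static size_t match_len_optimized(const unsigned char *a, const unsigned char *b,
                                  size_t n)
{
    return match_len_default(a, b, n);
}

/* Assumed capability check: Power ISA 3.0 when built for 64-bit Power,
 * otherwise always "no". */
static int cpu_has_fast_path(void)
{
#if defined(__GNUC__) && defined(__powerpc64__) && defined(__GLIBC__)
    return __builtin_cpu_supports("arch_3_00");
#else
    return 0;
#endif
}

/* Resolver: returns the implementation that match_len should bind to. */
static match_fn resolve_match_len(void)
{
    return cpu_has_fast_path() ? match_len_optimized : match_len_default;
}

#if defined(__GNUC__) && defined(__linux__) && defined(__GLIBC__)
/* GNU indirect function: the dynamic loader calls the resolver and binds
 * the symbol to whatever it returns. */
size_t match_len(const unsigned char *a, const unsigned char *b, size_t n)
    __attribute__((ifunc("resolve_match_len")));
#else
/* Portable fallback: resolve lazily through a function pointer. */
size_t match_len(const unsigned char *a, const unsigned char *b, size_t n)
{
    static match_fn impl;
    if (!impl)
        impl = resolve_match_len();
    return impl(a, b, n);
}
#endif
```

Callers just call `match_len(...)`; the selection is invisible to them, which is the appeal of the mechanism over an explicit dispatch table.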

To test the performance improvement, we used Chromium's zlib_bench.cc with input files from jsnell/zlib-bench.

The results below show compression throughput in MB/s using RAW deflate, for all compression levels:

  • pngpixels

    comp lvl   default (MB/s)   optimized (MB/s)   gain
    1          67.5             73.0               +8.15%
    2          59.0             65.3               +10.68%
    3          38.8             45.2               +16.49%
    4          42.0             46.0               +9.52%
    5          26.7             31.6               +18.35%
    6          13.8             16.5               +19.57%
    7          8.9              10.6               +19.10%
    8          2.8              3.4                +21.43%
    9          1.3              1.5                +15.38%

  • jpeg

    comp lvl   default (MB/s)   optimized (MB/s)   gain
    1          20.0             20.5               +2.50%
    2          20.2             20.3               +0.50%
    3          20.2             20.3               +0.50%
    4          20.3             20.4               +0.49%
    5          20.3             20.4               +0.49%
    6          20.3             20.4               +0.49%
    7          20.3             20.4               +0.49%
    8          19.9             20.4               +2.51%
    9          20.3             20.4               +0.49%

  • executable

    comp lvl   default (MB/s)   optimized (MB/s)   gain
    1          41.2             43.1               +4.61%
    2          37.8             39.2               +3.70%
    3          28.9             29.9               +3.46%
    4          28.3             28.9               +2.12%
    5          20.2             21.4               +5.94%
    6          12.5             13.1               +4.80%
    7          9.5              9.9                +4.21%
    8          5.4              5.6                +3.70%
    9          4.1              4.2                +2.44%

  • html

    comp lvl   default (MB/s)   optimized (MB/s)   gain
    1          43.0             46.2               +7.44%
    2          38.5             42.2               +9.61%
    3          27.8             30.8               +10.79%
    4          28.3             30.8               +8.83%
    5          18.1             20.1               +11.05%
    6          12.2             13.2               +8.20%
    7          10.6             11.4               +7.55%
    8          8.0              8.7                +8.75%
    9          7.9              8.6                +8.86%
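
The gain column is consistent with the relative throughput change, (optimized - default) / default. A small sketch for recomputing it, where `gain_percent` is a hypothetical helper and not part of zlib_bench:

```c
#include <assert.h>

/* Relative throughput gain in percent:
 * (optimized - baseline) / baseline * 100. */
double gain_percent(double baseline_mbps, double optimized_mbps)
{
    assert(baseline_mbps > 0.0);
    return (optimized_mbps - baseline_mbps) / baseline_mbps * 100.0;
}
```

For the pngpixels level 1 row, `gain_percent(67.5, 73.0)` is about 8.15, matching the reported +8.15%.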

@mscastanho
Author

Force-pushed to add changes to feature detection in configure.

Optimized functions for Power will make use of GNU indirect functions,
an extension to support different implementations of the same function,
which can be selected during runtime. This will be used to provide
optimized functions for different processor versions.

Since this is a GNU extension, we placed the definition of the Z_IFUNC
macro under `contrib/gcc`. This can be reused by other archs as well.

Author: Matheus Castanho <[email protected]>
Author: Rogerio Alves <[email protected]>
@mscastanho mscastanho force-pushed the longest-match-power branch from 290ce53 to 5490ed4 Compare April 6, 2022 13:08
* bytes where LSB == 0 is the same as counting the length of the match.
*/
#ifdef __LITTLE_ENDIAN__
asm volatile("vctzlsbb %0, %1\n\t" : "=r" (len) : "v" (vc));
Contributor


The assembly in both versions is identical. Is this intended?

Contributor


Actually I was wrong. One letter off.
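
For context on the snippet under review: Power ISA 3.0 has two near-identical mnemonics, vctzlsbb (count trailing bytes whose least-significant bit is zero) and vclzlsbb (count leading ones), which is likely the one-letter difference noted above. Which end of the register holds the first byte in memory depends on endianness, hence the `#ifdef`. The portable model below sketches the counting step in memory order. It assumes a compare result of 0xFF for differing bytes and 0x00 for equal bytes, which the snippet does not show, and `mismatch_mask` and `count_zero_lsb_bytes` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define VEC_BYTES 16

/* Models a byte-wise "not equal" vector compare (assumed convention):
 * 0xFF where the bytes differ, 0x00 where they are equal. */
void mismatch_mask(const unsigned char *a, const unsigned char *b,
                   unsigned char mask[VEC_BYTES])
{
    assert(a != NULL && b != NULL);
    for (size_t i = 0; i < VEC_BYTES; i++)
        mask[i] = (a[i] != b[i]) ? 0xFF : 0x00;
}

/* Counts consecutive bytes, starting at index 0 (memory order), whose
 * least-significant bit is 0. With the mask above, that count is the
 * number of matching bytes before the first mismatch. */
unsigned count_zero_lsb_bytes(const unsigned char mask[VEC_BYTES])
{
    unsigned n = 0;
    while (n < VEC_BYTES && (mask[n] & 1u) == 0)
        n++;
    return n;
}
```

A result of 16 means the whole vector matched and the search can move on to the next 16 bytes.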

This commit introduces an optimized version of the longest_match
function for Power processors. It uses VSX instructions to match
16 bytes at a time on each comparison, instead of one by one.

Author: Matheus Castanho <[email protected]>
@mscastanho mscastanho force-pushed the longest-match-power branch from 5490ed4 to 44d19e3 Compare June 13, 2022 17:10
@Neustradamus Neustradamus mentioned this pull request Jan 1, 2025
@fneddy fneddy mentioned this pull request Feb 25, 2025
@Neustradamus

A long time ago, I opened this ticket:
