Add optimized longest_match for Power processors #459

mscastanho · 2019-12-12T13:50:54Z

Hello again,

This optimization uses VSX vector (SIMD) instructions to match multiple bytes at a time during the search for the longest match. A 16-byte vector load and comparison adds only a small overhead compared to the scalar equivalents, so the optimized longest_match tries to match as many bytes as possible on every comparison.
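As a rough illustration of the idea (not the code in this PR), the sketch below compares two buffers in 16-byte chunks and only falls back to byte-by-byte comparison to pin down the first mismatch. The name `match_length` and the `CHUNK` constant are hypothetical, and `memcmp` stands in for the VSX load and compare so the example stays portable.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK 16 /* one VSX register holds 16 bytes */

/* Hypothetical helper: length of the common prefix of a and b,
 * capped at max_len. Not the function added by this PR. */
static size_t match_length(const unsigned char *a, const unsigned char *b,
                           size_t max_len)
{
    size_t len = 0;

    assert(a != NULL && b != NULL);

    /* Fast path: skip whole 16-byte chunks while they are identical.
     * memcmp is a portable stand-in for the vector load + compare. */
    while (len + CHUNK <= max_len && memcmp(a + len, b + len, CHUNK) == 0)
        len += CHUNK;

    /* Slow path: find the first differing byte in the last chunk or tail. */
    while (len < max_len && a[len] == b[len])
        len++;

    return len;
}
```

With two 40-byte buffers that first differ at index 21, the fast path accepts one full chunk (16 bytes) and the slow path walks the remaining 5 bytes, so the function returns 21.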

This PR shares one commit with #457 and #458; that commit can be dropped from this PR if either of them is merged first. It also uses GNU indirect functions to select at run time, on the first call to longest_match, which version of the function (optimized or default) to run.
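For readers unfamiliar with GNU indirect functions, here is a minimal sketch of the mechanism using GCC's `ifunc` attribute. All names (`demo_match`, `resolve_match`, `cpu_has_fast_path`) are hypothetical and this is not the `Z_IFUNC` macro from the PR. The feature check is a placeholder that always returns 0, and the sketch assumes an ELF target with ifunc support in the C library (for example glibc); elsewhere it falls back to a plain wrapper.

```c
#include <stddef.h>

typedef size_t (*match_fn)(const unsigned char *, const unsigned char *, size_t);

/* Baseline implementation: byte-by-byte comparison. */
static size_t match_default(const unsigned char *a, const unsigned char *b,
                            size_t n)
{
    size_t i = 0;
    while (i < n && a[i] == b[i])
        i++;
    return i;
}

/* Placeholder for an optimized implementation; it simply reuses the
 * baseline so that both versions return identical results. */
static size_t match_optimized(const unsigned char *a, const unsigned char *b,
                              size_t n)
{
    return match_default(a, b, n);
}

/* Placeholder feature check (assumption): real code would query the CPU,
 * for example through hardware capability bits. */
static int cpu_has_fast_path(void)
{
    return 0;
}

#if defined(__GNUC__) && defined(__ELF__)
/* Resolver: called by the dynamic loader when the symbol is bound.
 * It returns the address of the implementation to use from then on. */
static match_fn resolve_match(void)
{
    return cpu_has_fast_path() ? match_optimized : match_default;
}

size_t demo_match(const unsigned char *a, const unsigned char *b, size_t n)
    __attribute__((ifunc("resolve_match")));
#else
/* Fallback for toolchains without ifunc support. */
size_t demo_match(const unsigned char *a, const unsigned char *b, size_t n)
{
    return cpu_has_fast_path() ? match_optimized(a, b, n)
                               : match_default(a, b, n);
}
#endif
```

Callers just call `demo_match(...)`; the choice between implementations is made once, when the symbol is resolved, rather than on every call.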

To test the performance improvement, we used Chromium's zlib_bench.cc with input files from jsnell/zlib-bench.

The results below show compression throughput in MB/s using RAW deflate, for all compression levels:

pngpixels

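| Compression level | Default (MB/s) | Optimized (MB/s) | Gain |
|---|---|---|---|
| 1 | 67.5 | 73.0 | +8.15% |
| 2 | 59.0 | 65.3 | +10.68% |
| 3 | 38.8 | 45.2 | +16.49% |
| 4 | 42.0 | 46.0 | +9.52% |
| 5 | 26.7 | 31.6 | +18.35% |
| 6 | 13.8 | 16.5 | +19.57% |
| 7 | 8.9 | 10.6 | +19.10% |
| 8 | 2.8 | 3.4 | +21.43% |
| 9 | 1.3 | 1.5 | +15.38% |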

jpeg

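| Compression level | Default (MB/s) | Optimized (MB/s) | Gain |
|---|---|---|---|
| 1 | 20.0 | 20.5 | +2.50% |
| 2 | 20.2 | 20.3 | +0.50% |
| 3 | 20.2 | 20.3 | +0.50% |
| 4 | 20.3 | 20.4 | +0.49% |
| 5 | 20.3 | 20.4 | +0.49% |
| 6 | 20.3 | 20.4 | +0.49% |
| 7 | 20.3 | 20.4 | +0.49% |
| 8 | 19.9 | 20.4 | +2.51% |
| 9 | 20.3 | 20.4 | +0.49% |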

executable

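| Compression level | Default (MB/s) | Optimized (MB/s) | Gain |
|---|---|---|---|
| 1 | 41.2 | 43.1 | +4.61% |
| 2 | 37.8 | 39.2 | +3.70% |
| 3 | 28.9 | 29.9 | +3.46% |
| 4 | 28.3 | 28.9 | +2.12% |
| 5 | 20.2 | 21.4 | +5.94% |
| 6 | 12.5 | 13.1 | +4.80% |
| 7 | 9.5 | 9.9 | +4.21% |
| 8 | 5.4 | 5.6 | +3.70% |
| 9 | 4.1 | 4.2 | +2.44% |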

html

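| Compression level | Default (MB/s) | Optimized (MB/s) | Gain |
|---|---|---|---|
| 1 | 43.0 | 46.2 | +7.44% |
| 2 | 38.5 | 42.2 | +9.61% |
| 3 | 27.8 | 30.8 | +10.79% |
| 4 | 28.3 | 30.8 | +8.83% |
| 5 | 18.1 | 20.1 | +11.05% |
| 6 | 12.2 | 13.2 | +8.20% |
| 7 | 10.6 | 11.4 | +7.55% |
| 8 | 8.0 | 8.7 | +8.75% |
| 9 | 7.9 | 8.6 | +8.86% |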
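
The gain column in the tables above is consistent with the relative throughput increase over the default build. A small sketch of that arithmetic, with the hypothetical helper name `gain_percent`:

```c
/* Relative throughput gain in percent: (optimized - default) / default.
 * Hypothetical helper; the PR does not include the script that produced
 * the tables. */
static double gain_percent(double default_mbps, double optimized_mbps)
{
    return (optimized_mbps - default_mbps) / default_mbps * 100.0;
}
```

For the pngpixels level 1 row, (73.0 - 67.5) / 67.5 * 100 is roughly 8.15, matching the reported +8.15%.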

mscastanho · 2020-03-10T20:57:14Z

Force push to add changes to feature detection on configure.

Optimized functions for Power will make use of GNU indirect functions, an extension to support different implementations of the same function, which can be selected during runtime. This will be used to provide optimized functions for different processor versions.

Since this is a GNU extension, we placed the definition of the Z_IFUNC macro under `contrib/gcc`. This can be reused by other archs as well.

Author: Matheus Castanho <[email protected]>
Author: Rogerio Alves <[email protected]>

nmoinvaz · 2022-04-19T03:32:41Z

contrib/power/longest_match_power9.c

+     * bytes where LSB == 0 is the same as counting the length of the match.
+     */
+    #ifdef __LITTLE_ENDIAN__
+    asm volatile("vctzlsbb %0, %1\n\t" : "=r" (len) : "v" (vc));


The assembly in both versions is identical. Is this intended?

Actually I was wrong. One letter off.
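
To make the quoted comment concrete, here is a scalar sketch of why counting bytes whose least significant bit is 0 gives the match length. It assumes `vc` holds a per-byte comparison mask that is zero where the two inputs match and nonzero where they differ; the excerpt does not show how `vc` is produced, so that is an assumption. The helper names are hypothetical, and the count runs in memory order from index 0 rather than modeling register byte order, which is presumably why the two endian branches use instructions that differ by one letter.

```c
#include <stddef.h>

#define VEC_BYTES 16

/* Hypothetical stand-in for the vector compare (assumed behavior):
 * 0x00 where the bytes match, 0xFF where they differ. */
static void compare_mask(const unsigned char *a, const unsigned char *b,
                         unsigned char mask[VEC_BYTES])
{
    for (size_t i = 0; i < VEC_BYTES; i++)
        mask[i] = (a[i] == b[i]) ? 0x00 : 0xFF;
}

/* Count consecutive bytes, starting at index 0, whose least significant
 * bit is 0. With the mask above, this equals the number of leading
 * bytes that matched. */
static unsigned count_zero_lsb_bytes(const unsigned char mask[VEC_BYTES])
{
    unsigned n = 0;
    while (n < VEC_BYTES && (mask[n] & 1u) == 0)
        n++;
    return n;
}
```

If two 16-byte blocks first differ at index 5, the mask is zero in bytes 0 to 4 and the count is 5; if they are identical, the count is 16.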

This commit introduces an optimized version of the longest_match function for Power processors. It uses VSX instructions to match 16 bytes at a time on each comparison, instead of one by one.

Author: Matheus Castanho <[email protected]>

Neustradamus · 2025-05-23T00:40:50Z

A long time ago, I opened this ticket:

IBM Power Processors and Zlib #847

nmoinvaz mentioned this pull request Jan 17, 2020

Add AltiVec-optimized adler32 and slide_hash for PowerPC zlib-ng/zlib-ng#109

Merged

mscastanho mentioned this pull request Feb 3, 2020

Adding CPU features detection code #468

Open

mscastanho force-pushed the longest-match-power branch from 57b7495 to 290ce53 Compare March 10, 2020 20:55

mscastanho mentioned this pull request May 28, 2020

Improve arch detection and add optimized slide_hash for POWER processors zlib-ng/zlib-ng#603

Merged

mscastanho force-pushed the longest-match-power branch from 290ce53 to 5490ed4 Compare April 6, 2022 13:08

nmoinvaz reviewed Apr 19, 2022


Add vectorized longest_match for Power

44d19e3


mscastanho force-pushed the longest-match-power branch from 5490ed4 to 44d19e3 Compare June 13, 2022 17:10

Neustradamus mentioned this pull request Aug 23, 2023

IBM Power Processors and Zlib #847

Open

Neustradamus mentioned this pull request Jan 1, 2025

CMake and Zlib #831

Closed

fneddy mentioned this pull request Feb 25, 2025

IBM S390X contrib cleanup #1050

Open
