crypto.sha3: rewrite and optimize kaccak_p_1600_24() engine, update tests #26524

tankf33der · 2026-02-05T15:14:24Z

I finally want to show my patch for speeding up sha3.
This is roughly the 4th generation of the patch, the result of several weeks of development and fun.
It started as a patch that gave a 10% speedup and ended up as a multi-fold speedup with both tcc and gcc.

If you profile my standard file for sha3 performance testing, you can see multiple function calls inside the rounds. Once I conquered that, the rest was just a matter of technique.

import crypto.sha3
import time

fn main() {
	a := []u8{len: 10_000_000}
	t1 := time.now()
	_ := sha3.sum512(a)
	println(time.since(t1))
}

        138889         93.624ms         46.706ms            674ns crypto__sha3__State_xor_bytes 
       1250001         46.917ms         46.917ms             38ns encoding__binary__little_endian_u64_at 
       3333336         83.607ms         83.607ms             25ns crypto__sha3__State_iota 
        138889       8219.910ms        101.634ms          59183ns crypto__sha3__State_kaccak_p_1600_24 
       3333336        522.927ms        522.927ms            157ns crypto__sha3__State_pi 
       3333336       8118.276ms        556.868ms           2435ns crypto__sha3__State_rnd 
       3333336        684.678ms        684.678ms            205ns crypto__sha3__State_chi 
       3333336       1454.097ms       1026.246ms            436ns crypto__sha3__State_theta 
     100000080       2475.980ms       2475.980ms             25ns math__bits__rotate_left_64 
       3333336       4816.100ms       2767.971ms           1445ns crypto__sha3__State_rho

Even if you check and find that the compiler inlined them, they still turn out to be costly.
Besides, the official site suggests merging several of these functions into one, after which the separate functions are not needed at all.
The latest generation of the patch simply unrolls the loops and makes them less costly.
It took some tinkering.
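
To illustrate the unrolling idea, here is a minimal sketch of the theta step alone, written once with loops and once unrolled over the five columns. It is not the code from the patch: the flat lane layout (index = x + 5 * y) and the names theta_looped and theta_unrolled are assumptions made for this example. It only shows how the modulo arithmetic and the per-column loop can be written out by hand.

```v
import math.bits

// Loop-based theta over 25 lanes stored flat, index = x + 5 * y (assumed layout).
fn theta_looped(mut s []u64) {
	mut c := [5]u64{}
	for x in 0 .. 5 {
		c[x] = s[x] ^ s[x + 5] ^ s[x + 10] ^ s[x + 15] ^ s[x + 20]
	}
	for x in 0 .. 5 {
		d := c[(x + 4) % 5] ^ bits.rotate_left_64(c[(x + 1) % 5], 1)
		for y in 0 .. 5 {
			s[x + 5 * y] ^= d
		}
	}
}

// Same step with the column loops written out: no modulo, no temporary array.
fn theta_unrolled(mut s []u64) {
	c0 := s[0] ^ s[5] ^ s[10] ^ s[15] ^ s[20]
	c1 := s[1] ^ s[6] ^ s[11] ^ s[16] ^ s[21]
	c2 := s[2] ^ s[7] ^ s[12] ^ s[17] ^ s[22]
	c3 := s[3] ^ s[8] ^ s[13] ^ s[18] ^ s[23]
	c4 := s[4] ^ s[9] ^ s[14] ^ s[19] ^ s[24]
	d0 := c4 ^ bits.rotate_left_64(c1, 1)
	d1 := c0 ^ bits.rotate_left_64(c2, 1)
	d2 := c1 ^ bits.rotate_left_64(c3, 1)
	d3 := c2 ^ bits.rotate_left_64(c4, 1)
	d4 := c3 ^ bits.rotate_left_64(c0, 1)
	// One pass per row; each row touches its five lanes directly.
	for i := 0; i < 25; i += 5 {
		s[i] ^= d0
		s[i + 1] ^= d1
		s[i + 2] ^= d2
		s[i + 3] ^= d3
		s[i + 4] ^= d4
	}
}

fn main() {
	// Arbitrary non-zero test state; the multiplier is small enough not to overflow.
	mut a := []u64{len: 25}
	for i in 0 .. 25 {
		a[i] = u64(i + 1) * 0x0102030405060708
	}
	mut b := a.clone()
	theta_looped(mut a)
	theta_unrolled(mut b)
	assert a == b
	println('theta variants agree')
}
```

Comparing both variants on the same input, as main() does, is a cheap way to catch an index slip while unrolling by hand.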
I have my own tests with full coverage for files with test vectors and openssl calls so I'm not worried.

Now the profiler shows normal metrics:

             2          0.010ms          0.010ms           5018ns builtin___write_buf_to_fd 
             2          0.010ms          0.010ms           5174ns builtin___v_malloc 
             2          0.019ms          0.017ms           9376ns time__linux_now 
             6          1.239ms          1.239ms         206538ns builtin__vcalloc_noscan 
        277779         10.739ms         10.739ms             39ns builtin__array_slice 
             1       5363.798ms         18.982ms     5363798118ns crypto__sha3__Digest_write 
        138889         91.799ms         45.508ms            661ns crypto__sha3__State_xor_bytes 
       1250001         46.292ms         46.292ms             37ns encoding__binary__little_endian_u64_at 
      96666744       2336.159ms       2336.159ms             24ns math__bits__rotate_left_64 
        138889       5242.316ms       2906.158ms          37745ns crypto__sha3__State_kaccak_p_1600_24

I had to sacrifice some tests because they became impossible: the code they relied on simply no longer exists.

Speed up: tcc ~4.5+ times, gcc ~3+ times

tankf33der · 2026-02-05T15:19:56Z

@blackshirt take a look. Of course I've tested it with your pslhdsa implementation.

tankf33der · 2026-02-05T15:21:52Z

@kimshrier - take a look. What do you think?

spytheman · 2026-02-06T05:13:43Z

On my m1:

using a variation of this (if someone needs to re-check on another machine):

branch=$(git rev-parse --abbrev-ref HEAD); for compiler in tcc clang gcc-15; do bname=sha_${compiler}_${branch}; v -cc $compiler -o $bname sha.v && ll $bname && xtime ./$bname; done

spytheman

Excellent work.
Thank you @tankf33der 🙇🏻 .

spytheman · 2026-02-06T05:19:49Z

> I have my own tests with full coverage for files with test vectors and openssl calls so I'm not worried.

Can you please submit some of them to https://github.com/vlang/slower_tests (it is a separate repo, but it is also tested by the main CI)?
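
As a rough idea of what such a test could look like, here is a minimal known-answer check. It is only a sketch, not one of the author's tests: it assumes sha3.sum512 returns the digest as a byte array, and the function and constant names are made up for this example. The expected value is the published SHA3-512 digest of the empty message.

```v
import crypto.sha3

// Published SHA3-512 digest of the empty message, as a lowercase hex string.
const empty_sha3_512 = 'a69f73cca23a9ac5c8b567dc185a756e97c982164fe25859e0d1dcc1475c80a615b2123af1f5f94c11e3e9402c3ac558f500199d95b6d3e301758586281dcd26'

// Hypothetical test name; V collects functions named test_* in *_test.v files.
fn test_sum512_empty_input() {
	digest := sha3.sum512([]u8{})
	// SHA3-512 always yields 64 bytes.
	assert digest.len == 64
	assert digest.hex() == empty_sha3_512
}
```

Saved in a file ending in _test.v, this should run with `v test`. The same digest can be cross-checked against `openssl dgst -sha3-512` on an empty input, assuming an OpenSSL build with SHA-3 support.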

kimshrier · 2026-02-06T07:29:00Z

Thanks for improving the performance. I did a very straightforward implementation and did not have time to optimize it. I was more concerned with having it be correct.

I have been preoccupied with other, personal, stuff and this will continue to be the case for several more months. I am glad that you took the time to make it better.

medvednikov · 2026-02-06T09:40:25Z

Amazing work!

crypto.sha3: rewrite and optimize kaccak_p_1600_24() engine, update t…

375e3fe

…ests

spytheman approved these changes Feb 6, 2026

spytheman merged commit 65cf633 into vlang:master Feb 6, 2026
77 of 80 checks passed

cestef pushed a commit to cestef/v that referenced this pull request Mar 9, 2026

crypto.sha3: rewrite and optimize kaccak_p_1600_24() engine, update t…

116131b

…ests (vlang#26524)
