
support passing i128 to assembly on aarch64 #154342

Open

folkertdev wants to merge 2 commits into rust-lang:main from folkertdev:aarch64-asm-i128

Conversation

@folkertdev (Contributor)

tracking issue: #133416

Like #151059 but for aarch64. LLVM supports this, so I think we should too. I've put this under asm_experimental_reg.

cc @taiki-e
r? @Amanieu

@rustbot added the S-waiting-on-review (Status: Awaiting review from the assignee but also interested parties.) and T-compiler (Relevant to the compiler team, which will review and decide on the PR/issue.) labels on Mar 24, 2026
@Amanieu (Member) commented Mar 25, 2026

I'm a bit concerned about the behavior on big-endian, but I would expect it to work as if the value was loaded with a single 128-bit LDR Qx. Notably, this is the opposite (byte-swapped) order that you would get if you transmuted it to an i8x16 and then passed it to inline asm.

It would be nice to have tests for this, but run-tests on big-endian aarch64 are a pain to set up. Maybe an assembly test?

Comment on lines +48 to +52
// aarch64_be: rev64 v0.16b, v0.16b
// CHECK: //APP
// CHECK: fmov s{{[0-9]+}}, s{{[0-9]+}}
// CHECK: //NO_APP
// aarch64_be: rev64 v0.16b, v1.16b
@folkertdev (Contributor, Author)

Your intuition is right: on aarch64_be the endianness is swapped for i128, but not for vector types.

@Amanieu (Member)

I think the generated code is still incorrect: on big-endian x0 will contain the most significant bits of the i128 and x1 will contain the least significant bits. What should happen is this:

fmov d0, x1
mov v0.d[1], x0

This will produce the same result as if the i128 was loaded with ldr q0, [x0].

However here we can see a rev64 instruction that shouldn't be there. Not only are the 64-bit words loaded into the vector register in the wrong order, the bytes inside them are also swapped. This is effectively performing the same operation as the i8x16 example below, which is wrong.
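For illustration, the expected lowering can be modeled in plain Rust (a sketch, not compiler output), under the assumption stated above that on big-endian x0 holds the most significant 64 bits and x1 the least significant:

```rust
// Sketch of what `fmov d0, x1` + `mov v0.d[1], x0` should compute,
// modeling q0 as a u128 whose low 64 bits are lane d[0].
// Assumption (from the comment above): on big-endian, x0 holds the
// most significant half of the i128 and x1 the least significant half.
fn expected_q0(x0_msb: u64, x1_lsb: u64) -> u128 {
    // fmov d0, x1    -> low 64 bits of the vector register
    // mov v0.d[1], x0 -> high 64 bits of the vector register
    ((x0_msb as u128) << 64) | (x1_lsb as u128)
}

fn main() {
    let value: u128 = 0x0011_2233_4455_6677_8899_aabb_ccdd_eeff;
    let msb = (value >> 64) as u64;
    let lsb = value as u64;
    // Reassembling the halves reproduces the original value, i.e. the
    // same register image as loading the i128 with `ldr q0`.
    assert_eq!(expected_q0(msb, lsb), value);
    println!("ok");
}
```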

@folkertdev (Contributor, Author)

I'm actually confused now about what this assembly even does. I had naively assumed that fmov moves between vector registers, but that is not all it does: it can also move between general-purpose and vector registers. That clearly does not really work for u128. Furthermore, the s ("single") form only considers the lowest 32 bits. So using fmov to move a 128-bit value is just nonsensical, and we should update the aarch64 tests.

When I try something like

#[unsafe(no_mangle)]
fn helper_u128(x: u128) -> u128 {
    unsafe {
        let y;
        std::arch::asm!(
            "mov {:v}.16b, {:v}.16b",
            out(vreg) y,
            in(vreg) x
        );
        y
    }
}

that works fine for both LE and BE.

@folkertdev (Contributor, Author)

> This is effectively performing the same operation as the i8x16 example below, which is wrong.

I have confirmed this locally now. With this program

#![feature(asm_experimental_reg)]
#![feature(portable_simd)]

use std::hint::black_box;
use std::simd::Simd;

const X: u128 = u128::from_le_bytes([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]);
const Y: Simd<u8, 16> = unsafe { std::mem::transmute(X) };

fn main() {
    let simd_bytes = |v| unsafe { std::mem::transmute::<_, u128>(v) }.to_le_bytes();

    println!("{:?}", simd_bytes(helper_vec(black_box(Y))));
    println!("{:?}", helper_u128(black_box(X)).to_le_bytes());

    dbg!("done");
}

#[unsafe(no_mangle)]
fn helper_u128(x: u128) -> u128 {
    let y;
    unsafe { std::arch::asm!( "fmov {:s}, {:s}", out(vreg) y, in(vreg) x) }
    y
}

#[unsafe(no_mangle)]
fn helper_vec(x: Simd<u8, 16>) -> Simd<u8, 16> {
    let y;
    unsafe { std::arch::asm!( "fmov {:s}, {:s}", out(vreg) y, in(vreg) x) }
    y
}

I get on aarch64:

[0, 1, 2, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 1, 2, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

and on aarch64_be

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 12, 13, 14, 15]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 12, 13, 14, 15]

So, that seems consistent to me?
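A small endianness model reproduces both outputs (a sketch, under the assumption that on big-endian the i128 is byte-swapped on its way into and out of the vector register, while the s-form fmov keeps only the low 32 bits of the register):

```rust
const X: u128 = u128::from_le_bytes([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]);

// Register image of an i128: identity on little-endian, byte-swapped on
// big-endian (assumption: the value enters the vector register as if by
// a byte-reversing 128-bit load, and leaves it the same way).
fn to_reg(x: u128, big_endian: bool) -> u128 {
    if big_endian { x.swap_bytes() } else { x }
}

// `fmov s, s` copies only the low 32 bits of the register, zeroing the rest.
fn fmov_s(reg: u128) -> u128 {
    reg & 0xFFFF_FFFF
}

fn main() {
    let le = to_reg(fmov_s(to_reg(X, false)), false).to_le_bytes();
    let be = to_reg(fmov_s(to_reg(X, true)), true).to_le_bytes();
    println!("{le:?}"); // little-endian keeps the low four bytes
    println!("{be:?}"); // big-endian keeps the *high* four bytes
    assert_eq!(le, [0, 1, 2, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]);
    assert_eq!(be, [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 12, 13, 14, 15]);
}
```

The same model also explains why the i8x16 case prints identical bytes: the vector value takes the same byte-swapped path through the register in this lowering.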

@Amanieu (Member)

Can you try with this assembly? The correct behavior would be if these passed the value through unmodified.

#[unsafe(no_mangle)]
fn load_u128(x: u128) -> u128 {
    let y;
    unsafe { std::arch::asm!( "ldr {:q}, [{}]", out(vreg) y, in(reg) &x) }
    y
}

#[unsafe(no_mangle)]
fn store_u128(x: u128) -> u128 {
    let mut y = 0;
    unsafe { std::arch::asm!( "str {:q}, [{}]", in(vreg) x, in(reg) &mut y) }
    y
}

@folkertdev (Contributor, Author)

For aarch64 it does indeed round-trip.

This is the input, the load result, and the store result, first for u128 and then for Simd<u8, 16>:

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
---
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]

For aarch64_be it doesn't round-trip, but the vector and u128 cases behave the same:

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
[15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
[15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
---
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
[15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
[15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0]

Presumably this is related to https://llvm.org/docs/BigEndianNEON.html, specifically figure 1 of that document.
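Under one reading of that document (a sketch of the model, not a statement of the spec): on big-endian, `ldr q` performs a single 128-bit integer load, which byte-reverses memory order relative to lane order, while `ld1 {v.16b}` loads lane-by-lane in memory order. A byte-level model:

```rust
// Byte-level model of the two big-endian 128-bit load flavors, assuming
// (per llvm.org/docs/BigEndianNEON.html) that `ldr q` performs one
// 128-bit integer load while `ld1 {v.16b}` loads lane-by-lane.
// Index 0 of the returned array is lane 0 of the vector register.

// `ldr q` on big-endian: the integer load reverses bytes relative to lanes.
fn ldr_q_be(mem: [u8; 16]) -> [u8; 16] {
    let mut lanes = mem;
    lanes.reverse();
    lanes
}

// `ld1 {v.16b}` on big-endian: lanes follow memory order.
fn ld1_16b_be(mem: [u8; 16]) -> [u8; 16] {
    mem
}

fn main() {
    let mem: [u8; 16] = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15];
    // The two loads disagree on lane order; mixing conventions (compiler
    // uses one, inline asm the other) is what flips the bytes above.
    assert_eq!(ldr_q_be(mem)[0], 15);
    assert_eq!(ld1_16b_be(mem)[0], 0);
    println!("ok");
}
```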

// CHECK: //NO_APP
// aarch64: str q1, [x8]
// aarch64_be: st1 { v1.16b }, [x8]
check!(vreg_i8x16 i8x16 vreg "fmov" "s");
@Amanieu (Member)

Can you also add tests for i16x8, i32x4, i64x2, i16x4, i32x2? These should all be loaded using the appropriate ld1/st1 instruction with the correct vector element size (.16b/.8b/.8h/.4h/.4s/.2s/.2d).
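For reference, the arrangement suffix for each of those types follows the standard AArch64 naming (lane count followed by an element-size letter); a quick sketch of that mapping:

```rust
// Vector arrangement specifier for ld1/st1, derived from element size in
// bits and lane count (assumption: standard AArch64 arrangement naming).
fn arrangement(elem_bits: u32, lanes: u32) -> String {
    let letter = match elem_bits {
        8 => "b",  // byte
        16 => "h", // halfword
        32 => "s", // single word
        64 => "d", // doubleword
        _ => panic!("unsupported element size"),
    };
    format!(".{lanes}{letter}")
}

fn main() {
    assert_eq!(arrangement(8, 16), ".16b"); // i8x16
    assert_eq!(arrangement(16, 8), ".8h");  // i16x8
    assert_eq!(arrangement(16, 4), ".4h");  // i16x4
    assert_eq!(arrangement(32, 4), ".4s");  // i32x4
    assert_eq!(arrangement(32, 2), ".2s");  // i32x2
    assert_eq!(arrangement(64, 2), ".2d");  // i64x2
    println!("ok");
}
```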

