
support passing i128 to assembly on aarch64 #154342

Open

folkertdev wants to merge 2 commits into rust-lang:main from folkertdev:aarch64-asm-i128

Conversation

@folkertdev (Contributor)

tracking issue: #133416

Like #151059 but for aarch64. LLVM supports this, so I think we should too. I've put this under asm_experimental_reg.

cc @taiki-e
r? @Amanieu

@rustbot added the S-waiting-on-review (Status: Awaiting review from the assignee but also interested parties.) and T-compiler (Relevant to the compiler team, which will review and decide on the PR/issue.) labels on Mar 24, 2026
@Amanieu (Member) commented Mar 25, 2026

I'm a bit concerned about the behavior on big-endian, but I would expect it to work as if the value was loaded with a single 128-bit LDR Qx. Notably, this is the opposite (byte-swapped) order that you would get if you transmuted it to an i8x16 and then passed it to inline asm.

It would be nice to have tests for this, but run-tests on big-endian aarch64 are a pain to set up. Maybe an assembly test?

Comment on lines +48 to +52
// aarch64_be: rev64 v0.16b, v0.16b
// CHECK: //APP
// CHECK: fmov s{{[0-9]+}}, s{{[0-9]+}}
// CHECK: //NO_APP
// aarch64_be: rev64 v0.16b, v1.16b
@folkertdev (Contributor, Author)

Your intuition is right: on aarch64_be the endianness is swapped for i128, but not for vector types.

@Amanieu (Member)

I think the generated code is still incorrect: on big-endian x0 will contain the most significant bits of the i128 and x1 will contain the least significant bits. What should happen is this:

fmov d0, x1
mov v0.d[1], x0

This will produce the same result as if the i128 was loaded with ldr q0, [x0].

However here we can see a rev64 instruction that shouldn't be there. Not only are the 64-bit words loaded into the vector register in the wrong order, the bytes inside them are also swapped. This is effectively performing the same operation as the i8x16 example below, which is wrong.
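For illustration, the expected lowering can be modeled in plain Rust (a sketch, not compiler output), under the assumption stated above that on big-endian x0 holds the most significant 64 bits and x1 the least significant:

```rust
// Sketch of what `fmov d0, x1` + `mov v0.d[1], x0` should compute,
// modeling q0 as a u128 whose low 64 bits are lane d[0].
// Assumption (from the comment above): on big-endian, x0 holds the
// most significant half of the i128 and x1 the least significant half.
fn expected_q0(x0_msb: u64, x1_lsb: u64) -> u128 {
    // fmov d0, x1    -> low 64 bits of the vector register
    // mov v0.d[1], x0 -> high 64 bits of the vector register
    ((x0_msb as u128) << 64) | (x1_lsb as u128)
}

fn main() {
    let value: u128 = 0x0011_2233_4455_6677_8899_aabb_ccdd_eeff;
    let msb = (value >> 64) as u64;
    let lsb = value as u64;
    // Reassembling the halves reproduces the original value, i.e. the
    // same register image as loading the i128 with `ldr q0`.
    assert_eq!(expected_q0(msb, lsb), value);
    println!("ok");
}
```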

@folkertdev (Contributor, Author)

I'm actually confused now about what this assembly even does. I had naively assumed that fmov moves between vector registers, but that is not all it does: it can also move between general-purpose and vector registers. That clearly does not really work for u128. Furthermore, the s ("single") form only considers the lowest 32 bits. So using fmov to move a 128-bit value is just nonsensical, and we should update the aarch64 tests.

When I try something like

#[unsafe(no_mangle)]
fn helper_u128(x: u128) -> u128 {
    unsafe {
        let y;
        std::arch::asm!(
            "mov {:v}.16b, {:v}.16b",
            out(vreg) y,
            in(vreg) x
        );
        y
    }
}

that works fine for both LE and BE.

@folkertdev (Contributor, Author)

> This is effectively performing the same operation as the i8x16 example below, which is wrong.

I have confirmed this locally now. With this program

#![feature(asm_experimental_reg)]
#![feature(portable_simd)]

use std::hint::black_box;
use std::simd::Simd;

const X: u128 = u128::from_le_bytes([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]);
const Y: Simd<u8, 16> = unsafe { std::mem::transmute(X) };

fn main() {
    let simd_bytes = |v| unsafe { std::mem::transmute::<_, u128>(v) }.to_le_bytes();

    println!("{:?}", simd_bytes(helper_vec(black_box(Y))));
    println!("{:?}", helper_u128(black_box(X)).to_le_bytes());

    dbg!("done");
}

#[unsafe(no_mangle)]
fn helper_u128(x: u128) -> u128 {
    let y;
    unsafe { std::arch::asm!( "fmov {:s}, {:s}", out(vreg) y, in(vreg) x) }
    y
}

#[unsafe(no_mangle)]
fn helper_vec(x: Simd<u8, 16>) -> Simd<u8, 16> {
    let y;
    unsafe { std::arch::asm!( "fmov {:s}, {:s}", out(vreg) y, in(vreg) x) }
    y
}

I get on aarch64:

[0, 1, 2, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 1, 2, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

and on aarch64_be

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 12, 13, 14, 15]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 12, 13, 14, 15]

So, that seems consistent to me?
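A small endianness model reproduces both outputs (a sketch, under the assumption that on big-endian the i128 is byte-swapped on its way into and out of the vector register, while the s-form fmov keeps only the low 32 bits of the register):

```rust
const X: u128 = u128::from_le_bytes([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]);

// Register image of an i128: identity on little-endian, byte-swapped on
// big-endian (assumption: the value enters the vector register as if by
// a byte-reversing 128-bit load, and leaves it the same way).
fn to_reg(x: u128, big_endian: bool) -> u128 {
    if big_endian { x.swap_bytes() } else { x }
}

// `fmov s, s` copies only the low 32 bits of the register, zeroing the rest.
fn fmov_s(reg: u128) -> u128 {
    reg & 0xFFFF_FFFF
}

fn main() {
    let le = to_reg(fmov_s(to_reg(X, false)), false).to_le_bytes();
    let be = to_reg(fmov_s(to_reg(X, true)), true).to_le_bytes();
    println!("{le:?}"); // little-endian keeps the low four bytes
    println!("{be:?}"); // big-endian keeps the *high* four bytes
    assert_eq!(le, [0, 1, 2, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]);
    assert_eq!(be, [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 12, 13, 14, 15]);
}
```

The same model also explains why the i8x16 case prints identical bytes: the vector value takes the same byte-swapped path through the register in this lowering.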

@Amanieu (Member)

Can you try with this assembly? The correct behavior would be if these passed the value through unmodified.

#[unsafe(no_mangle)]
fn load_u128(x: u128) -> u128 {
    let y;
    unsafe { std::arch::asm!( "ldr {:q}, [{}]", out(vreg) y, in(reg) &x) }
    y
}

#[unsafe(no_mangle)]
fn store_u128(x: u128) -> u128 {
    let mut y = 0;
    unsafe { std::arch::asm!( "str {:q}, [{}]", in(vreg) x, in(reg) &mut y) }
    y
}

@folkertdev (Contributor, Author)

For aarch64 it does indeed round-trip.

This is the input, the load result, and the store result, first for u128 and then for Simd<u8, 16>:

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
---
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]

For aarch64_be it doesn't round-trip, but the vector and u128 cases behave the same:

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
[15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
[15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
---
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
[15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
[15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0]

Presumably this is related to https://llvm.org/docs/BigEndianNEON.html, specifically figure 1 of that document.
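Under one reading of that document (a sketch of the model, not a statement of the spec): on big-endian, `ldr q` performs a single 128-bit integer load, which byte-reverses memory order relative to lane order, while `ld1 {v.16b}` loads lane-by-lane in memory order. A byte-level model:

```rust
// Byte-level model of the two big-endian 128-bit load flavors, assuming
// (per llvm.org/docs/BigEndianNEON.html) that `ldr q` performs one
// 128-bit integer load while `ld1 {v.16b}` loads lane-by-lane.
// Index 0 of the returned array is lane 0 of the vector register.

// `ldr q` on big-endian: the integer load reverses bytes relative to lanes.
fn ldr_q_be(mem: [u8; 16]) -> [u8; 16] {
    let mut lanes = mem;
    lanes.reverse();
    lanes
}

// `ld1 {v.16b}` on big-endian: lanes follow memory order.
fn ld1_16b_be(mem: [u8; 16]) -> [u8; 16] {
    mem
}

fn main() {
    let mem: [u8; 16] = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15];
    // The two loads disagree on lane order; mixing conventions (compiler
    // uses one, inline asm the other) is what flips the bytes above.
    assert_eq!(ldr_q_be(mem)[0], 15);
    assert_eq!(ld1_16b_be(mem)[0], 0);
    println!("ok");
}
```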

// CHECK: //NO_APP
// aarch64: str q1, [x8]
// aarch64_be: st1 { v1.16b }, [x8]
check!(vreg_i8x16 i8x16 vreg "fmov" "s");
@Amanieu (Member)

Can you also add tests for i16x8, i32x4, i64x2, i16x4, i32x2? These should all be loaded using the appropriate ld1/st1 instruction with the correct vector element size (.16b/.8b/.8h/.4h/.4s/.2s/.2d).
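For reference, the arrangement suffix for each of those types follows the standard AArch64 naming (lane count followed by an element-size letter); a quick sketch of that mapping:

```rust
// Vector arrangement specifier for ld1/st1, derived from element size in
// bits and lane count (assumption: standard AArch64 arrangement naming).
fn arrangement(elem_bits: u32, lanes: u32) -> String {
    let letter = match elem_bits {
        8 => "b",  // byte
        16 => "h", // halfword
        32 => "s", // single word
        64 => "d", // doubleword
        _ => panic!("unsupported element size"),
    };
    format!(".{lanes}{letter}")
}

fn main() {
    assert_eq!(arrangement(8, 16), ".16b"); // i8x16
    assert_eq!(arrangement(16, 8), ".8h");  // i16x8
    assert_eq!(arrangement(16, 4), ".4h");  // i16x4
    assert_eq!(arrangement(32, 4), ".4s");  // i32x4
    assert_eq!(arrangement(32, 2), ".2s");  // i32x2
    assert_eq!(arrangement(64, 2), ".2d");  // i64x2
    println!("ok");
}
```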

