support passing i128 to assembly on aarch64#154342
support passing i128 to assembly on aarch64#154342folkertdev wants to merge 2 commits intorust-lang:mainfrom
i128 to assembly on aarch64#154342Conversation
|
I'm a bit concerned about the behavior on big-endian, but I would expect it to work as if the value was loaded with a single 128-bit LDR Qx. Notably, this is the opposite (byte-swapped) order that you would get if you transmuted it to an It would be nice to have tests for this, but run-tests on big-endian aarch64 are a pain to set up. Maybe an assembly test? |
| // aarch64_be: rev64 v0.16b, v0.16b | ||
| // CHECK: //APP | ||
| // CHECK: fmov s{{[0-9]+}}, s{{[0-9]+}} | ||
| // CHECK: //NO_APP | ||
| // aarch64_be: rev64 v0.16b, v1.16b |
There was a problem hiding this comment.
your intuition is right, on aarch64_be for i128 the endianness is swapped, for vector types it is not.
There was a problem hiding this comment.
I think the generated code is still incorrect: on big-endian x0 will contain the most significant bits of the i128 and x1 will contain the least significant bits. What should happen is this:
fmov d0, x1
mov v0.d[1], x0This will produce the same result as if the i128 was loaded with ldr q0, [x0].
However here we can see a rev64 instruction that shouldn't be there. Not only are the 64-bit words loaded into the vector register in the wrong order, the bytes inside them are also swapped. This is effectively performing the same operation as the i8x16 example below, which is wrong.
There was a problem hiding this comment.
I'm actually confused now what the assembly even does. I had naively assumed that fmov would move between vector registers, but that is not actually what it does: it moves between GPR and vector registers. That clearly does not really work for u128. Furthermore the s, i.e. "single" only considers the lowest 32 bits. So I think the assembly using the fmov to move a 128-bit value is just non-sensical, and we should update the aarch64 tests.
When I try something like
#[unsafe(no_mangle)]
fn helper_u128(x: u128) -> u128 {
unsafe {
let y;
std::arch::asm!(
concat!("mov {:v}.16b, {:v}.16b"),
out(vreg) y,
in(vreg) x
);
y
}
}that works fine for both LE and BE.
There was a problem hiding this comment.
This is effectively performing the same operation as the i8x16 example below, which is wrong.
I have confirmed this locally now. With this program
#![feature(asm_experimental_reg)]
#![feature(portable_simd)]
use std::hint::black_box;
use std::simd::Simd;
const X: u128 = u128::from_le_bytes([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]);
const Y: Simd<u8, 16> = unsafe { std::mem::transmute(X) };
fn main() {
let simd_bytes = |v| unsafe { std::mem::transmute::<_, u128>(v) }.to_le_bytes();
println!("{:?}", simd_bytes(helper_vec(black_box(Y))));
println!("{:?}", helper_u128(black_box(X)).to_le_bytes());
dbg!("done");
}
#[unsafe(no_mangle)]
fn helper_u128(x: u128) -> u128 {
let y;
unsafe { std::arch::asm!( "fmov {:s}, {:s}", out(vreg) y, in(vreg) x) }
y
}
#[unsafe(no_mangle)]
fn helper_vec(x: Simd<u8, 16>) -> Simd<u8, 16> {
let y;
unsafe { std::arch::asm!( "fmov {:s}, {:s}", out(vreg) y, in(vreg) x) }
y
}I get on aarch64:
[0, 1, 2, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 1, 2, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
and on aarch64_be
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 12, 13, 14, 15]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 12, 13, 14, 15]
So, that seems consistent to me?
There was a problem hiding this comment.
Can you try with this assembly? The correct behavior would be if these passed the value through unmodified.
#[unsafe(no_mangle)]
fn load_u128(x: u128) -> u128 {
let y;
unsafe { std::arch::asm!( "ldr {:q}, [{}]", out(vreg) y, in(reg) &x) }
y
}
#[unsafe(no_mangle)]
fn store_u128(x: u128) -> u128 {
let mut y = 0;
unsafe { std::arch::asm!( "str {:q}, [{}]", in(vreg) x, in(reg) &mut y) }
y
}There was a problem hiding this comment.
For aarch64 it does indeed roundtrip
this is input, load and store for u128 and Simd<u8, 16>
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
---
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
For aarch64_be it doesn't roundtrip but the vector and u128 behave the same:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
[15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
[15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
---
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
[15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
[15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
Presumably this is related to https://llvm.org/docs/BigEndianNEON.html, specifically figure 1 of that document.
| // CHECK: //NO_APP | ||
| // aarch64: str q1, [x8] | ||
| // aarch64_be: st1 { v1.16b }, [x8] | ||
| check!(vreg_i8x16 i8x16 vreg "fmov" "s"); |
There was a problem hiding this comment.
Can you also add tests for i16x8, i32x4, i64x2, i16x4, i32x4? These should all be loaded using the appropriate ld1/st1 instruction with the correct vector element size (.16b/.8b/.8h/.4h/.4s/.2s/.2d).
| // aarch64_be: rev64 v0.16b, v0.16b | ||
| // CHECK: //APP | ||
| // CHECK: fmov s{{[0-9]+}}, s{{[0-9]+}} | ||
| // CHECK: //NO_APP | ||
| // aarch64_be: rev64 v0.16b, v1.16b |
There was a problem hiding this comment.
I think the generated code is still incorrect: on big-endian x0 will contain the most significant bits of the i128 and x1 will contain the least significant bits. What should happen is this:
fmov d0, x1
mov v0.d[1], x0This will produce the same result as if the i128 was loaded with ldr q0, [x0].
However here we can see a rev64 instruction that shouldn't be there. Not only are the 64-bit words loaded into the vector register in the wrong order, the bytes inside them are also swapped. This is effectively performing the same operation as the i8x16 example below, which is wrong.
tracking issue: #133416
Like #151059 but for
aarch64. LLVM supports this, so I think we should too. I've put this underasm_experimental_reg.cc @taiki-e
r? @Amanieu