Implement LoadPairVector64 and LoadPairVector128 #64864
echesakov merged 30 commits into dotnet:main
Conversation
Note: This serves as a reminder that when your PR modifies a ref *.cs file and adds or modifies public APIs, please make sure the API implementation in the src *.cs file is documented with triple-slash comments, so the PR reviewers can sign off on that change.
Tagging subscribers to this area: @JulieLeeMSFT
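As a rough illustration of how the new APIs are meant to be consumed — a minimal sketch, assuming the tuple-returning signature from the API proposal (#39243); the helper name and fallback path are hypothetical, not taken from this PR:

    using System.Runtime.Intrinsics;
    using System.Runtime.Intrinsics.Arm;

    internal static class LoadPairSketch
    {
        // XORs two adjacent 16-byte blocks. On Arm64 the single LoadPairVector128
        // call is expected to compile down to one ldp instruction instead of the
        // two separate loads used in the fallback path.
        public static unsafe Vector128<byte> XorAdjacentBlocks(byte* address)
        {
            if (AdvSimd.Arm64.IsSupported)
            {
                // Assumes the (Vector128, Vector128) tuple return shape from the API proposal.
                (Vector128<byte> lo, Vector128<byte> hi) = AdvSimd.Arm64.LoadPairVector128(address);
                return AdvSimd.Xor(lo, hi);
            }

            // Portable fallback using the cross-platform Vector128 helpers.
            return Vector128.Load(address) ^ Vector128.Load(address + 16);
        }
    }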
…formNotSupported.cs
…dvSimd.cs AdvSimd.PlatformNotSupported.cs
…elpers.cs src/tests/JIT/HardwareIntrinsics/Arm/Shared/Helpers.tt
…nsic() in hwintrinsic.cpp
…src/coreclr/jit/hwintrinsicarm64.cpp
…insiccodegenarm64.cpp
…mStructVal() to allow intrinsics returning a struct in importer.cpp
…alues in multiple registers in lsra.h lsraarm64.cpp lsraxarch.cpp
…-reg intrinsic in src/coreclr/jit/morph.cpp
…eclr/jit/morphblock.cpp
@dotnet/jit-contrib @tannergooding PTAL
src/coreclr/jit/gentree.cpp
Outdated
        unreached();
    }
#elif defined(TARGET_XARCH)
    return 2;
nit: it would be good to explicitly cover the intrinsic IDs for xarch as well (this could definitely be a separate PR).
(that is, switch (intrinsicId) with a default: unreached())
Right, given that none are currently supported on X86, I can replace this with unreached() and update it with the switch when working on MultiplyNoFlags2 (or whatever name we decide on).
src/coreclr/jit/gentree.cpp
Outdated
    case NI_AdvSimd_Arm64_LoadPairVector64:
    case NI_AdvSimd_Arm64_LoadPairVector64NonTemporal:
    case NI_AdvSimd_Arm64_LoadPairVector128:
    case NI_AdvSimd_Arm64_LoadPairVector128NonTemporal:
I'll probably see it later in the review, but under what conditions are these containable?
I didn't think Arm64 really had ins reg, [mem] operations outside of atomic instructions, particularly for non-temporal operations...
Actually, I don't see where the containment handling is happening for this. I don't see any changes to lowering.
I think I marked them by mistake when experimenting with some ideas I had. Let me update this.
src/coreclr/jit/gentree.h
Outdated
    if (OperIsHWIntrinsic())
    {
-       return (TypeGet() == TYP_STRUCT);
+       return TypeIs(TYP_STRUCT);
I wonder how well this will hold up in the future? That is, are we expecting things that return TYP_STRUCT to always be multi-reg?
There are cases like, say, System.Half where we'll eventually need to add support, and we'll need to treat it as TYP_HALF or recognize it some other way if we don't want it to be an issue.
I wonder if we should track this as a flag rather than by type, just to be safe?
Ah, looks like we have a flag. Maybe we should assert it or use it here instead?
Sure, I can use a flag here.
src/coreclr/jit/gentree.h
Outdated
    if (OperIsHWIntrinsic())
    {
        assert(TypeGet() == TYP_STRUCT);
#ifdef TARGET_ARM64
        const GenTreeHWIntrinsic* intrinsic   = AsHWIntrinsic();
        const NamedIntrinsic      intrinsicId = intrinsic->GetHWIntrinsicId();

        switch (intrinsicId)
        {
            // TODO-ARM64-NYI: Support hardware intrinsics operating on multiple contiguous registers.
            case NI_AdvSimd_Arm64_LoadPairScalarVector64:
            case NI_AdvSimd_Arm64_LoadPairScalarVector64NonTemporal:
            case NI_AdvSimd_Arm64_LoadPairVector64:
            case NI_AdvSimd_Arm64_LoadPairVector64NonTemporal:
            case NI_AdvSimd_Arm64_LoadPairVector128:
            case NI_AdvSimd_Arm64_LoadPairVector128NonTemporal:
                return 2;

            default:
                unreached();
        }
#elif defined(TARGET_XARCH)
        return 2;
    }
#endif
    }
This looks to be the same logic as in gentree.cpp above. Should it be factored out into a shared helper or are there conditions where they won't/shouldn't be in sync?
    else
    {
        assert(AsHWIntrinsic()->GetSimdSize() == 8);
        return TYP_SIMD8;
Do we have any cases of "arg size is 8" but "return size is 16", or vice versa?
I know some instructions fit that bill; I'm not sure if any of the multi-reg cases will.
I don't see an example on Arm64 where this wouldn't hold.
ld[1-4] should be similar to ldp.
As for tbl and tbx:
TBX <Vd>.<Ta>, { <Vn>.16B, <Vn+1>.16B, <Vn+2>.16B }, <Vm>.<Ta>
the return value is going to be single-reg, but the first source operand is multi-reg and composed of Vector128<byte> values.
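To illustrate the distinction, here is a hypothetical managed signature for such a tbx-style operation (this overload is not part of this PR; the shape is an assumption) where the multi-reg tuple appears as a source parameter while the return stays single-reg:

    using System.Runtime.Intrinsics;

    internal static class MultiRegSourceSketch
    {
        // Hypothetical signature only (not an API added by this PR): a tbx-style
        // lookup whose three-register table { Vn.16B, Vn+1.16B, Vn+2.16B } is a
        // multi-reg source operand, while the result fits in a single register.
        public static Vector128<byte> VectorTableLookupExtension(
            Vector128<byte> defaultValues,
            (Vector128<byte>, Vector128<byte>, Vector128<byte>) table,
            Vector128<byte> byteIndexes)
        {
            // Placeholder body so the sketch compiles; a real intrinsic would map
            // directly to TBX Vd.16B, { Vn.16B, Vn+1.16B, Vn+2.16B }, Vm.16B.
            throw new System.NotSupportedException("illustrative signature only");
        }
    }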
Changes generally LGTM. Left some comments/questions on a couple of bits about how we expect certain checks to work long term. Would be great to see a couple of diffs/codegen examples, particularly for the library changes.
…tree.h src/coreclr/jit/gentree.cpp
@tannergooding There are suboptimalities in the code using these intrinsics. To fix that I would need to implement #64863 and #64857. If I apply those changes on top of this PR, the code diffs would look as expected:

GetIndexOfFirstCharToEncodeAdvSimd64:
@@ -69,10 +73,8 @@ G_M52035_IG03:
;; bbWeight=0.50 PerfScore 0.25
G_M52035_IG04:
add x6, x1, x4, LSL #1
- ld1 {v20.8h}, [x6]
+ ldp q20, q21, [x6]
sqxtun v20.8b, v20.8h
- add x6, x6, #16
- ld1 {v21.8h}, [x6]
sqxtun2 v20.16b, v21.8h
and v21.16b, v20.16b, v16.16b
tbl v21.16b, {v19.16b}, v21.16b
@@ -87,7 +89,7 @@ G_M52035_IG04:
add x4, x4, #16
cmp x4, x5
blo G_M52035_IG04
- ;; bbWeight=4 PerfScore 100.00
+ ;; bbWeight=4 PerfScore 86.00

GetIndexOfFirstNonAsciiByte_Intrinsified:
@@ -208,9 +211,7 @@ G_M18966_IG11:
sub x22, x0, #32
;; bbWeight=0.50 PerfScore 1.75
G_M18966_IG12:
- ld1 {v16.16b}, [x19]
- add x0, x19, #16
- ld1 {v10.16b}, [x0]
+ ldp q16, q9, [x19]
sshr v16.16b, v16.16b, #7
and v16.16b, v16.16b, v8.16b
addp v16.16b, v16.16b, v16.16b

Here is my plan:
I will keep the changes to BitArray:CopyTo though:

@@ -413,18 +414,16 @@ G_M40488_IG18:
zip1 v19.16b, v18.16b, v18.16b
and v19.16b, v19.16b, v17.16b
umin v19.16b, v19.16b, v16.16b
- st1 {v19.16b}, [x3]
zip2 v18.16b, v18.16b, v18.16b
and v18.16b, v18.16b, v17.16b
umin v18.16b, v18.16b, v16.16b
- add x3, x3, #16
- st1 {v18.16b}, [x3]
+ stp q19, q18, [x3]
add w1, w1, #32
add w3, w1, #32
ldr w4, [x19,#16]
cmp w3, w4
bls G_M40488_IG18
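For reference, a hedged sketch of the managed-side pattern behind the first diff above (the GetIndexOfFirstCharToEncode shape): the helper name is hypothetical, and a ushort* overload of LoadPairVector128 is assumed.

    using System.Runtime.Intrinsics;
    using System.Runtime.Intrinsics.Arm;

    internal static class NarrowingSketch
    {
        // Loads 16 UTF-16 code units with a single LoadPairVector128 (one ldp)
        // and narrows them to 16 bytes with sqxtun/sqxtun2, mirroring the
        // ld1/ld1 -> ldp change shown in the diff above.
        public static unsafe Vector128<byte> NarrowSixteenChars(char* pChars)
        {
            (Vector128<ushort> lo, Vector128<ushort> hi) =
                AdvSimd.Arm64.LoadPairVector128((ushort*)pChars);

            Vector64<byte> lower = AdvSimd.ExtractNarrowingSaturateUnsignedLower(lo.AsInt16());
            return AdvSimd.ExtractNarrowingSaturateUnsignedUpper(lower, hi.AsInt16());
        }
    }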
@dotnet/jit-contrib PTAL
I'm 23 hours late to this, but the changes still look good to me (although I don't know if me signing off on my own implementation is kosher).
Resolves #39243