JIT: Unblock Vector###<long> intrinsics on x86#112728

saucecontrol · 2025-02-20T05:25:43Z

Resolves #11626

This resolves a large number of TODOs around HWIntrinsic expansion involving scalar longs on x86.

The most significant change here is in promoting CreateScalar and ToScalar to be code generating intrinsics instead of converting them to other intrinsics at lowering. This was necessary in order to handle emitting movq for scalar long loads/stores but also unlocks several other optimizations since we can now allow CreateScalar and ToScalar to be contained and can specialize codegen depending on whether they end up loading/storing from/to memory or not. Some example improvements on x64:

Vector128.CreateScalar(ref float):

-       vinsertps xmm0, xmm0, dword ptr [rbp+0x10], 14
+       vmovss   xmm0, dword ptr [rbp+0x10]

Vector128.CreateScalar(ref double):

-       vxorps   xmm0, xmm0, xmm0
-       vmovsd   xmm1, qword ptr [rbp-0x08]
-       vmovsd   xmm0, xmm0, xmm1
+       vmovsd   xmm0, qword ptr [rbp-0x08]

ref byte = Vector128<byte>.ToScalar():

-       vmovd    r9d, xmm3
-       mov      byte  ptr [r10], r9b
+       vpextrb  byte  ptr [r10], xmm3, 0

Vector<byte>.ToScalar()

-       vmovups  ymm0, ymmword ptr [esp+0x04]
-       vmovd    eax, xmm0
-       movzx    eax, al
+       movzx    eax, byte  ptr [esp+0x04]

And the less realistic, but still interesting
Sse.AddScalar(Vector128.CreateScalar(ref float), Vector128.CreateScalar(ref float)).ToScalar():

-       xorps    xmm0, xmm0
-       movss    xmm1, dword ptr [rcx]
-       movss    xmm0, xmm1
-       xorps    xmm1, xmm1
-       movss    xmm2, dword ptr [rdx]
-       movss    xmm1, xmm2
-       addss    xmm0, xmm1
+       movss    xmm0, dword ptr [rcx]
+       addss    xmm0, dword ptr [rdx]

This also removes some redundant casts for CreateScalar of small types. Previously, a zero-extending cast was inserted unconditionally and was sometimes removed by peephole opt on x64 but often wasn't.

Vector128.CreateScalar(short):

-       movsx    rax, dx
-       movzx    rax, ax
-       movd     xmm0, rax
+       movzx    rax, dx
+       movd     xmm0, eax

Vector128.CreateScalar(checked((byte)val)):

        cmp      edx, 255
        ja       SHORT G_M000_IG04
        mov      eax, edx
-       movzx    rax, al
-       vmovd    xmm0, rax
+       vmovd    xmm0, eax

Vector128.CreateScalar(ref sbyte):

-       movsx    rax, byte  ptr [rdx]
-       movzx    rax, al
-       vmovd    xmm0, rax
+       movzx    rax, byte  ptr [rdx]
+       vmovd    xmm0, eax

x86 diffs are much more significant, because of the newly-enabled intrinsic expansion:

Collection	Base size (bytes)	Diff size (bytes)	PerfScore in Diffs
benchmarks.run.windows.x86.checked.mch	7,149,204	-1,892	-2.17%
benchmarks.run_pgo.windows.x86.checked.mch	46,986,713	-738	+0.03%
benchmarks.run_tiered.windows.x86.checked.mch	9,470,045	-976	+0.11%
coreclr_tests.run.windows.x86.checked.mch	320,065,247	-205,564	-6.41%
libraries.crossgen2.windows.x86.checked.mch	31,314,339	-15,854	-4.11%
libraries.pmi.windows.x86.checked.mch	34,326,245	-14,416	-2.19%
libraries_tests.run.windows.x86.Release.mch	215,517,600	-55,366	-2.41%
libraries_tests_no_tiered_compilation.run.windows.x86.Release.mch	115,783,488	-80,576	-3.65%
realworld.run.windows.x86.checked.mch	9,587,950	-467	-0.45%

saucecontrol

This is ready for review.
cc @tannergooding

saucecontrol · 2025-02-22T00:07:49Z

src/coreclr/jit/lower.h

-        // Keep casts with operands usable from memory.
-        if (castOp->isContained() || castOp->IsRegOptional())
-        {
-            return op;
-        }


This condition, added in #72719, made this method effectively useless. Removing it was a zero-diff change. I can look in future at containing the casts rather than removing them.

saucecontrol · 2025-02-22T00:11:13Z

src/coreclr/jit/lowerxarch.cpp


-    GenTree* op2 = node->Op(2);
-
-    // TODO-XArch-AVX512 : Merge the NI_Vector512_Create and NI_Vector256_Create paths below.


The churn in this section is just taking care of this TODO

saucecontrol · 2025-02-22T00:13:48Z

src/coreclr/jit/lowerxarch.cpp

            tmp2 = InsertNewSimdCreateScalarUnsafeNode(TYP_SIMD16, op2, simdBaseJitType, 16);
            LowerNode(tmp2);

-            node->ResetHWIntrinsicId(NI_SSE_MoveLowToHigh, tmp1, tmp2);


Changing this to UnpackLow shows up as a regression in a few places, because movlhps is one byte smaller, but it enables other optimizations since unpcklpd takes a memory operand plus mask and embedded broadcast.

Vector128.Create(double, 1.0):

- vmovups xmm0, xmmword ptr [reloc @RWD00] - vmovlhps xmm0, xmm1, xmm0 + vunpcklpd xmm0, xmm1, qword ptr [reloc @RWD00] {1to2}

This should probably be peepholed back to vmovlhps if both are from register.

I was thinking the same but would rather save that for a followup. llvm has a replacement list of equivalent instructions that have different sizes, and unpcklpd is on it, as are things like vpermilps, which is replaced by pshufd.

It's worth having a discussion about whether we'd also want to do replacements that switch between float and integer domains. I'll open an issue.

I looked at this again, and there's actually only a size difference for legacy SSE encoding, so it's probably not worth special casing.

saucecontrol · 2025-02-22T00:17:06Z

src/coreclr/jit/decomposelongs.cpp

+    if (varDsc->lvIsParam)
+    {
+        // Promotion blocks combined read optimizations for SIMD loads of long params
+        return;
+    }


In isolation, this change produced a small number of diffs and was mostly an improvement. A few regressions show up in the SPMI reports, but the overall impact is good, especially considering the places we can load a long to vector with movq

saucecontrol · 2025-02-22T18:37:17Z

It occurred to me the optimization to emit pinsrb/w for CreateScalarUnsafe was a bad idea because it creates a false dependency on the upper bits of the target reg. Removed that.

src/coreclr/jit/decomposelongs.cpp

src/coreclr/jit/lowerxarch.cpp

src/coreclr/jit/decomposelongs.cpp

tannergooding

CC. @dotnet/jit-contrib, @EgorBo, @BruceForstall for secondary review

* fix cast asserts * fix containment of CreateScalar * add tests

ghost added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Feb 20, 2025

dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Feb 20, 2025

build-analysis bot mentioned this pull request Feb 20, 2025

LibraryImportGenerator.Unit.Tests crashing on linux-x64 mono interpreter #100800

Open

saucecontrol force-pushed the createscalar64 branch from 197fac5 to 628d4f8 Compare February 20, 2025 18:52

unblock long xplat intrinsics on x86

3a130c8

saucecontrol force-pushed the createscalar64 branch from 628d4f8 to 3a130c8 Compare February 21, 2025 06:31

This was referenced Feb 21, 2025

System.Numerics.Tensors.Tests.ConvertTests.ConvertChecked failing with System.OverflowException #112286

Closed

System.Numerics.Tensors.Tests.ConvertTests.ConvertChecked test failure #112755

Closed

saucecontrol added 2 commits February 21, 2025 11:51

tidying

7f220c2

tidying2

78dc31d

saucecontrol commented Feb 22, 2025

View reviewed changes

saucecontrol marked this pull request as ready for review February 22, 2025 00:18

saucecontrol added 3 commits February 22, 2025 10:13

Merge remote-tracking branch 'upstream/main' into createscalar64

7330c3e

remove CreateScalarUnsafe opt for small loads

69065ee

skip more redundant casts for CreateScalar of small types

86ebdae

saucecontrol added 5 commits February 22, 2025 11:53

use temp reg for CreateScalar float SSE fallback

cdb0910

formatting patch

5d6fb3f

simplify storeind containment of ToScalar

bb03516

don't use temp reg for CreateScalar float SSE fallback

cba4ab0

Merge remote-tracking branch 'upstream/main' into createscalar64

1f97bd9

build-analysis bot mentioned this pull request Feb 24, 2025

Android emulator not booting completely on Helix queue dotnet/dnceng#1448

Closed

3 tasks

tannergooding reviewed Feb 24, 2025

View reviewed changes

src/coreclr/jit/decomposelongs.cpp Show resolved Hide resolved

tannergooding reviewed Feb 24, 2025

View reviewed changes

src/coreclr/jit/lowerxarch.cpp Outdated Show resolved Hide resolved

jakobbotsch reviewed Feb 24, 2025

View reviewed changes

src/coreclr/jit/decomposelongs.cpp Outdated Show resolved Hide resolved

saucecontrol added 2 commits February 24, 2025 09:17

Merge remote-tracking branch 'upstream/main' into createscalar64

71145ab

skip cast on other memory loads

42a6ab8

use proper containment check

1c98e23

saucecontrol mentioned this pull request Feb 24, 2025

Add peephole optimization to replace x86 SIMD instructions with smaller equivalents #112880

Open

build-analysis bot mentioned this pull request Feb 25, 2025

SslStreamNetworkStreamTest failures "The remote certificate is invalid because of errors in the certificate chain: PartialChain" #112856

Closed

saucecontrol added 4 commits February 25, 2025 12:39

Merge remote-tracking branch 'upstream/main' into createscalar64

c80c566

Merge remote-tracking branch 'upstream/main' into createscalar64

fb2cf30

Merge remote-tracking branch 'upstream/main' into createscalar64

3a76030

add more validation, remove CreateSequence restriction

af01862

tannergooding approved these changes Mar 19, 2025

View reviewed changes

saucecontrol added 2 commits March 19, 2025 13:31

Merge remote-tracking branch 'upstream/main' into createscalar64

811e16e

use appropriate helpers for decomposing ToScalar

aebbf68

BruceForstall approved these changes Mar 20, 2025

View reviewed changes

BruceForstall merged commit 16236fd into dotnet:main Mar 20, 2025
110 checks passed

saucecontrol deleted the createscalar64 branch March 20, 2025 01:00

This was referenced Mar 24, 2025

JIT: Assertion failed genTypeSize(op1) == genTypeSize(simdBaseType) during 'Lowering nodeinfo' #113829

Closed

JIT: Assertion failed (consume == 0) || (ComputeAvailableSrcCount(tree) == consume) during 'Linear scan register alloc' #113832

Closed

jakobbotsch pushed a commit that referenced this pull request Mar 25, 2025

JIT: Fix asserts from #112728 (#113845)

122e701

* fix cast asserts * fix containment of CreateScalar * add tests

saucecontrol mentioned this pull request Mar 27, 2025

JIT: Optimize ConditionalSelect with const zero when condition is not TYP_MASK #113864

Merged

github-actions bot locked and limited conversation to collaborators Apr 19, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Comments

JIT: Unblock Vector###<long> intrinsics on x86#112728

JIT: Unblock Vector###<long> intrinsics on x86#112728
BruceForstall merged 20 commits intodotnet:mainfrom
saucecontrol:createscalar64

saucecontrol commented Feb 20, 2025 •

edited

Loading

Uh oh!

saucecontrol left a comment •

edited

Loading

Uh oh!

saucecontrol Feb 22, 2025

Uh oh!

saucecontrol Feb 22, 2025

Uh oh!

saucecontrol Feb 22, 2025 •

edited

Loading

Uh oh!

tannergooding Feb 24, 2025

Uh oh!

saucecontrol Feb 24, 2025

Uh oh!

saucecontrol Mar 18, 2025

Uh oh!

saucecontrol Feb 22, 2025

Uh oh!

saucecontrol commented Feb 22, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

tannergooding left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants


		GenTree* op2 = node->Op(2);

		// TODO-XArch-AVX512 : Merge the NI_Vector512_Create and NI_Vector256_Create paths below.

Comments

Conversation

saucecontrol commented Feb 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

saucecontrol left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

saucecontrol Feb 22, 2025

Choose a reason for hiding this comment

Uh oh!

saucecontrol Feb 22, 2025

Choose a reason for hiding this comment

Uh oh!

saucecontrol Feb 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

tannergooding Feb 24, 2025

Choose a reason for hiding this comment

Uh oh!

saucecontrol Feb 24, 2025

Choose a reason for hiding this comment

Uh oh!

saucecontrol Mar 18, 2025

Choose a reason for hiding this comment

Uh oh!

saucecontrol Feb 22, 2025

Choose a reason for hiding this comment

Uh oh!

saucecontrol commented Feb 22, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

tannergooding left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

saucecontrol commented Feb 20, 2025 •

edited

Loading

saucecontrol left a comment •

edited

Loading

saucecontrol Feb 22, 2025 •

edited

Loading