Conversation
Implement localloc when the size is known at JIT time. The runtime (variable-size) case requires internal registers, so it can't be implemented yet. Remove the NYI around GS checks; we will rely on the TODO and issue tracking to decide if we want to do something here later on.
Pull request overview
This PR implements fixed-size localloc support for WASM RyuJIT by adding the `memory.fill` instruction and the `genLclHeap` function to generate code for `GT_LCLHEAP` nodes when the size is known at JIT time.
Changes:
- Added `memory.fill` WASM instruction definition for zeroing allocated memory
- Implemented `genLclHeap()` to handle compile-time constant localloc operations
- Added a `GT_LCLHEAP` case to the main codegen switch statement
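Because this PR only handles sizes known at JIT time, the stack adjustment can be computed entirely at compile time. Below is a minimal sketch of the rounding such a path would need, assuming 16-byte stack alignment; the helper name and the alignment value are illustrative and not taken from the PR.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not from the PR): pad a compile-time-constant
// localloc size up to the stack alignment, assumed here to be 16 bytes.
uint64_t AlignLclHeapSize(uint64_t size, uint64_t alignment = 16)
{
    // Round up to the next multiple of 'alignment'; the bit-trick is only
    // valid when 'alignment' is a power of two.
    assert((alignment & (alignment - 1)) == 0);
    return (size + alignment - 1) & ~(alignment - 1);
}
```

With a constant size, the JIT can fold this rounding into a single immediate and emit one stack-pointer adjustment, with no runtime loop or internal register needed.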
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| src/coreclr/jit/instrswasm.h | Adds memory.fill instruction definition for zeroing memory blocks |
| src/coreclr/jit/codegenwasm.cpp | Implements GT_LCLHEAP codegen with stack adjustment, optional zeroing, and frame pointer saving; variable-sized case marked NYI |
This PR could (should?) also set up the For frames where
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Tagging subscribers to 'arch-wasm': @lewing, @pavelsavara
Per ecma,
So we'll need to do an explicit overflow check too. On native platforms we tend to get this "for free" via stack probing. It seems we thus need a runtime helper to access the stack limit? Or else maybe
Do you mean that we should also insert such checks in the prolog? Otherwise it's not a complete solution.
I am not sure... how do other languages handle stack overflow on Wasm?
The C/C++ approach is to "let things be". The default memory layout for debug code (you are presumed to test things with debug code, like we do with checked JITs) is to place the linear memory stack just after

In optimized code, the stack is placed after static data (so that relocations to that static data can be made smaller), so an overflow would still eventually reach the bottom of the memory range, but only after potentially overwriting that static data. It is a somewhat frequent point of contention.

Finally, there is language in the spec itself that limits the number of WASM frames (not linear memory, proper function call frames) that can exist, effectively meaning that you can choose a size for the linear memory stack large enough for this engine-determined limit to run out first (this is what we've observed to be empirically the case in NAOT-LLVM with 2 stacks totaling ~2 MB between them).
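Since the WASM linear-memory stack has no guard pages, the explicit overflow check discussed above would have to compare the adjusted stack pointer against a known limit before committing it. A sketch of that check follows; the function name, the `stackLimit` parameter, and the downward-growing layout are all assumptions for illustration, not the runtime's actual API.

```cpp
#include <cstdint>

// Illustrative overflow check for a downward-growing shadow stack: refuse
// the adjustment if it would wrap around zero or cross the stack limit.
bool TryAdjustShadowStack(uint64_t sp, uint64_t stackLimit, uint64_t allocSize,
                          uint64_t* newSp)
{
    if (allocSize > sp || sp - allocSize < stackLimit)
    {
        return false; // would overflow; real code would call a throw helper
    }
    *newSp = sp - allocSize;
    return true;
}
```

The open question in the thread is where the limit comes from: a runtime helper, a global, or prolog-inserted checks.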
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
PTAL @dotnet/jit-contrib |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
    // * whether the size operand is a constant or not
    // * whether the allocated memory needs to be zeroed or not (info.compInitMem)
    //
    // If the sp is changed, the value at sp[0] must be the frame pointer
I think we should add the exact unwinding scheme for variable-sized frames we settled on here into the ABI document.
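The invariant quoted above (after any stack-pointer change, `sp[0]` holds the frame pointer) can be sketched as follows. All names here are illustrative, not the JIT's actual ones, and the model is slot-granular for simplicity.

```cpp
#include <cstdint>

// Sketch of the unwinding invariant for variable-sized frames: whenever the
// shadow stack pointer moves, the frame pointer is stored at the new sp[0],
// so an unwinder that knows sp can always recover the fixed-size portion of
// the frame.
void AdjustSpAndSaveFp(uint64_t* shadowStack, uint64_t* sp, uint64_t fp,
                       uint64_t slots)
{
    *sp -= slots;              // stack grows down
    shadowStack[*sp] = fp;     // sp[0] must now hold the frame pointer
}
```

Documenting exactly this contract in the ABI document, as suggested, would pin down what unwinders may rely on.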
    bool const needsZeroing = m_compiler->info.compInitMem;
    GenTree* const size = tree->AsOp()->gtOp1;

    assert(genActualType(size->gtType) == TYP_I_IMPL);
Suggested change:

    - assert(genActualType(size->gtType) == TYP_I_IMPL);
    + assert(genActualType(size) == TYP_I_IMPL);
    case GT_LCLHEAP:
        LowerLclHeap(node);
        AfterLowerLclHeap(node);
Call this in LowerLclHeap instead?
    public:
        void emitIns(instruction ins);
        void emitIns_B(instruction ins, WasmValueType valType = WasmValueType::Invalid);
Nit: `emitIns_BlockTy` would be more immediately obvious (we already use `Ty` to mean 'value type').
    if (!size->isContainedIntOrIImmed())
    {
        size = m_compiler->gtNewCastNode(TYP_I_IMPL, size, false, TYP_I_IMPL);
For 32-bit WASM, TYP_I_IMPL == TYP_INT, so this is a no-op.
Also, should this be an unsigned cast?
Also, this would deserve a comment as to why we're deciding to add a cast here (and why it is better than the alternatives).
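To illustrate why the signedness question matters: on a 64-bit target, widening a negative 32-bit size with a signed versus an unsigned cast produces very different values, and any downstream size/overflow check would behave differently for each. The helpers below are hypothetical, not JIT code.

```cpp
#include <cstdint>

// Sign-extension keeps -1 as an all-ones 64-bit value; zero-extension turns
// it into 0xFFFFFFFF. This is the behavioral difference the cast's
// signedness flag selects between.
uint64_t WidenSigned(int32_t size)   { return (uint64_t)(int64_t)size; }
uint64_t WidenUnsigned(int32_t size) { return (uint64_t)(uint32_t)size; }
```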
    instrDesc* id = emitNewInstrSC(EA_4BYTE, (cnsval_ssize_t)valType);
    insFormat fmt = emitInsFormat(ins);

    id->idIns(ins);
    id->idInsFmt(fmt);

    dispIns(id);
    appendToCurIG(id);
Just:

    emitIns_I(ins, EA_4BYTE, (cnsval_ssize_t)valType);

?
    }

    - if (anyFrameLocals)
    + if (anyFrameLocals || m_codeGen->isFramePointerRequired())
I think it would be better to use the narrower compLocallocUsed here instead of isFramePointerRequired, with a comment to the effect of "LCLHEAP codegen uses FP for unwind info establishment..." (we wouldn't need the FP with a different scheme).
isFramePointerRequired is set with lots of conditions that are deeply irrelevant for WASM and probably should be ifdef-ed out. There are no FP chains on WASM.
    //
    void WasmRegAlloc::CollectReferencesForLclHeap(GenTreeOp* lclHeapNode)
    {
        ConsumeTemporaryRegForOperand(lclHeapNode->gtGetOp1() DEBUGARG("lcl heap"));
Suggested change:

    - ConsumeTemporaryRegForOperand(lclHeapNode->gtGetOp1() DEBUGARG("lcl heap"));
    + ConsumeTemporaryRegForOperand(lclHeapNode->gtGetOp1() DEBUGARG("lcl heap size"));
    // TODO-WASM-CQ: possibly do small fills directly
    //
    GetEmitter()->emitIns_I(INS_memory_fill, EA_4BYTE, 0);
(This should rendezvous with the block store PR on using a named constant for the memory index)
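For readers unfamiliar with the instruction: `memory.fill(dest, value, len)` sets every byte of `[dest, dest+len)` in linear memory to `value`, much like C's `memset`, and its encoding carries a memory-index immediate (0 for the default memory), which is the operand the named constant would document. A simple model of its semantics, ignoring the out-of-bounds trap a real engine raises:

```cpp
#include <cstring>
#include <cstdint>

// Model of WASM memory.fill over a host-side buffer standing in for linear
// memory; the zeroing path above emits it with value 0 over the freshly
// allocated localloc block.
void WasmMemoryFill(uint8_t* linearMemory, uint32_t dest, uint8_t value,
                    uint32_t len)
{
    memset(linearMemory + dest, value, len);
}
```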
    // Save the aligned size for the zeroing call below if needed.
    if (needsZeroing)
    {
        GetEmitter()->emitIns_I(INS_local_tee, EA_PTRSIZE, WasmRegToIndex(sizeReg));
GetMultiUseOperandReg regs cannot be modified like this in the general case (they may still be live if they're candidates).
Ah, good point.
I will revise this to use an internal register. Doing that I can also avoid the need for the extra casting logic in lower.
Implement localloc.