This is very performance critical code used for growing the stack, and it currently wastes a lot of instructions on the non-allocating fast path. There are a number of distinct optimizations we can identify.
Here's what happens after calling into __morestack, on the fast path:
Set up the frame pointer
Push all possible argument registers of the calling function in case the call to upcall_new_stack clobbers them
Shuffle the argument registers from the __morestack custom calling convention registers to the C calling convention registers used by upcall_new_stack
Call upcall_new_stack, through the indirection of the dynamic linker
Call get_sp_limit, an entire assembly function consisting of movq %fs:112, %rax
Compare the sp_limit to 0 and don't branch to the rust_get_current_task slow path. This branch always makes the same decision during a __morestack call.
Do some math to find the task pointer from the stack limit
Check the stack canary to make sure we haven't run off the end of the stack
Assert that the task pointer is not null
Get the minimum stack size
Do some simple math and pointer indirections to determine if task->stk->next is a big enough stack segment to use
Assert some invariants
memcpy the arguments from the old stack to the new stack
Align the new stack frame
Call reuse_valgrind_stack to give valgrind hints
Call record_stack_limit to execute another single instruction
Return the stack pointer to __morestack
Pop all the saved argument registers
Finally, call the original function
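The segment-reuse decision at the heart of these steps can be sketched as follows. This is a minimal illustration, not the runtime's actual code: the `stk_seg` and `task` layouts, their field names, and the helpers `align_down` and `reuse_next_segment` are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layouts: field names are illustrative, not the runtime's. */
typedef struct stk_seg {
    struct stk_seg *prev;
    struct stk_seg *next;
    uintptr_t data;   /* lowest usable address of the segment */
    uintptr_t end;    /* one past the highest usable address */
} stk_seg;

typedef struct task {
    stk_seg *stk;           /* segment currently in use */
    size_t min_stack_size;  /* smallest segment the task will accept */
} task;

/* Round an address down to a power-of-two boundary (frame alignment). */
static uintptr_t align_down(uintptr_t addr, uintptr_t align) {
    return addr & ~(align - 1);
}

/* Fast path: return the cached next segment if it can hold `requested`
 * bytes, or NULL when the slow path would have to allocate a new one. */
static stk_seg *reuse_next_segment(const task *t, size_t requested) {
    assert(t != NULL && t->stk != NULL);
    size_t needed = requested > t->min_stack_size ? requested
                                                  : t->min_stack_size;
    stk_seg *next = t->stk->next;
    if (next == NULL || (size_t)(next->end - next->data) < needed)
        return NULL;
    return next;
}
```

Even in this reduced form the check touches the task, the current segment, and the next segment before it can answer, which is the chain of indirections the list above describes.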
And returning from the segment:
Call upcall_del_stack through the dynamic linker
Call get_sp_limit, an entire function consisting of movq %fs:112, %rax
Compare the sp_limit to 0, etc.
Check the stack canary to make sure we haven't run off the end of the stack
Assert that the task pointer is not null
Update the current stack pointer in the task
Call record_stack_limit
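The return path amounts to very little real work once the lookups and checks are set aside. The sketch below assumes a hypothetical `ret_task` with a `stk` pointer and an `sp_limit` field standing in for the thread-local slot read by get_sp_limit; `record_limit` and `del_stack_fast` are illustrative names, not the runtime's functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layouts for illustration only. */
typedef struct ret_seg {
    struct ret_seg *prev;
    struct ret_seg *next;
    uintptr_t limit;      /* stack limit to install while this segment runs */
} ret_seg;

typedef struct ret_task {
    ret_seg *stk;         /* segment currently in use */
    uintptr_t sp_limit;   /* stand-in for the thread-local stack limit slot */
} ret_task;

/* Stand-in for record_stack_limit: a single store. */
static void record_limit(ret_task *t, uintptr_t limit) {
    t->sp_limit = limit;
}

/* Step back to the previous segment. The segment being left stays linked
 * as `next`, so a later stack growth can reuse it without allocating. */
static void del_stack_fast(ret_task *t) {
    assert(t != NULL);
    assert(t->stk != NULL && t->stk->prev != NULL);
    t->stk = t->stk->prev;
    record_limit(t, t->stk->limit);
}
```

Under these assumptions the essential work is one pointer update and one store; the rest of the listed steps are lookups and checks.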
Potential optimizations:
Don't save the frame pointer - This could be tricky to make work with dwarf unwinding, due to the odd frame shapes around __morestack. Will be easier after rolling our own unwinder (Invoke instructions kick us off the FastISel path #3551).
Statically link upcall_new_stack and upcall_del_stack, hitting new dynamically linked upcalls for the slow path
Create a new version of rust_get_current_task that doesn't have a fallback path for the case when the task pointer can't be retrieved from the stack segment. Use it from upcall_new_stack/del_stack.
Consider saving the task pointer between upcall_new_stack/del_stack to avoid calculating it again
Do fewer pointer indirections and calculations to verify the suitability of the stack segment, possibly storing more information directly in the stack segment header, never accessing the task pointer directly. (See also Remove unnecessary logic in new_stack_fast #3566).
Put all asserts under the compile-time debug flag, including the canary check
Put the valgrind hinting under a debug flag too. I believe it does have a runtime penalty.
Inline get_sp_limit and record_stack_limit (Inline get_sp_limit, set_sp_limit, get_sp runtime functions #2521)
Remove the xmm saves and restores in __morestack, provided upcall_new_stack doesn't use xmm registers (Stop saving floating point registers in __morestack #2043)
[…] upcall_del_stack into __morestack
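Two of these ideas combine naturally: compiling the checks out of release builds, and keeping enough information in the segment header that the fit test never touches the task. The following is a rough sketch under stated assumptions; the `MORESTACK_DEBUG` flag, the `DEBUG_CHECK` macro, the `SEG_CANARY` value, and the `seg_hdr` layout are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical compile-time flag: checks vanish unless it is defined. */
#ifdef MORESTACK_DEBUG
#define DEBUG_CHECK(cond) assert(cond)
#else
#define DEBUG_CHECK(cond) ((void)0)
#endif

/* Illustrative canary value, not the runtime's. */
#define SEG_CANARY UINT64_C(0xABCDABCDABCDABCD)

/* Hypothetical header that caches the usable size, so the fit test is a
 * single load and compare with no access to the task. */
typedef struct seg_hdr {
    struct seg_hdr *next;
    size_t usable;     /* precomputed usable bytes in this segment */
    uint64_t canary;
} seg_hdr;

/* Returns 1 if the cached next segment can hold `needed` bytes. */
static int next_segment_fits(const seg_hdr *cur, size_t needed) {
    DEBUG_CHECK(cur != NULL);
    DEBUG_CHECK(cur->canary == SEG_CANARY);  /* canary check, debug only */
    const seg_hdr *next = cur->next;
    return next != NULL && next->usable >= needed;
}
```

With the flag undefined, the function reduces to a null test and one comparison. Whether the canary check is cheap enough to keep in release builds is a separate judgment call.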