This repository was archived by the owner on Dec 22, 2021. It is now read-only.
Looking at https://github.com/WebAssembly/simd/blob/master/proposals/simd/SIMD.md#swizzling-using-variable-indices I discovered that it would take me more than one instruction to implement `v128.swizzle` on x86. I had assumed, like @stoklund in #11, that I would be able to use `PSHUFB` as-is. However, I am now convinced that the assumptions of #11 may be incorrect:

> Lanes with an out-of-range selector become 0 in the output vector.

According to the Intel manual (and some experiments I ran), `PSHUFB` uses the four least significant bits of each index byte to decide which lane to grab from a vector. If the most significant bit is one (e.g. `0b10000000`), then the result lane is zeroed. But index values strictly between `0x0f` and `0x80` (that is, `0x10` through `0x7f`) will use the four least significant bits as an index and will not zero the value. To correctly implement the spec as it currently reads we would need to copy the swizzle mask to another register, do a greater-than comparison to get a bit in the most significant position, and `OR` this with the original swizzle mask before using the `PSHUFB` instruction. That is four instructions instead of one.

Should `v128.swizzle` change to allow more optimal implementations? Are there considerations for other architectures that I am not aware of?
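
To make the mismatch concrete, here is a small scalar model in Python. It is only a sketch, not real SIMD code: the function names (`pshufb`, `pcmpgtb`, `spec_swizzle`, `swizzle_via_pshufb`) are made up for illustration, vectors are modelled as lists of 16 unsigned bytes, and the compare step is assumed to be a signed byte compare against a constant vector of 15s, which is one way to read the "greater-than comparison" described above.

```python
def pshufb(src, mask):
    """Scalar model of PSHUFB on 16-byte vectors (lists of ints 0..255)."""
    assert len(src) == 16 and len(mask) == 16
    # Bit 7 set -> lane is zeroed; otherwise only the low 4 bits select a lane.
    return [0 if m & 0x80 else src[m & 0x0F] for m in mask]


def spec_swizzle(src, mask):
    """Swizzle as the proposal currently reads: any index >= 16 yields 0."""
    return [src[m] if m < 16 else 0 for m in mask]


def pcmpgtb(a, b):
    """Scalar model of a signed byte greater-than compare (0xFF where a > b)."""
    def signed(x):
        return x - 256 if x >= 128 else x
    return [0xFF if signed(x) > signed(y) else 0x00 for x, y in zip(a, b)]


def swizzle_via_pshufb(src, mask):
    """Compare + OR + PSHUFB (the register copy is implicit in this model)."""
    # 0xFF for indices 16..127; indices 128..255 are negative as signed
    # bytes, compare false, and already have bit 7 set.
    out_of_range = pcmpgtb(mask, [15] * 16)
    fixed = [m | o for m, o in zip(mask, out_of_range)]
    return pshufb(src, fixed)


src = list(range(100, 116))
mask = [0, 15, 16, 0x7F, 0x80, 0xFF, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

print(pshufb(src, mask)[:6])              # bare PSHUFB
print(spec_swizzle(src, mask)[:6])        # behaviour the spec text asks for
print(swizzle_via_pshufb(src, mask)[:6])  # PSHUFB with the fix-up
```

In this model the bare `pshufb` wraps indices 16 and 0x7f around to lanes 0 and 15, while the fixed-up sequence agrees with `spec_swizzle` for every possible index byte.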