Inefficient x64 codegen for swizzle #93

Looking at https://github.com/WebAssembly/simd/blob/master/proposals/simd/SIMD.md#swizzling-using-variable-indices I discovered that it would take more than one instruction to implement `v128.swizzle` on x86. I had assumed, like @stoklund in #11, that I would be able to use `PSHUFB` as-is. However, I now think the assumption in #11 is incorrect, because the spec says:

> Lanes with an out-of-range selector become 0 in the output vector.

According to the Intel manual (and some experiments I ran), `PSHUFB` uses the four least significant bits of each index byte to decide which lane to grab from a vector. If the most significant bit is one (e.g. `0b10000000`), then the result lane is zeroed. But index values from `0x10` through `0x7f` are out of range for a 16-lane vector and still do not zero the lane: `PSHUFB` uses their four least significant bits as an index. To correctly implement the spec as it currently reads, we would need to copy the swizzle mask to another register, do a greater-than comparison against 15 to set the most significant bit of every out-of-range index, and `OR` the result with the original swizzle mask before using the `PSHUFB` instruction. That is four instructions instead of one.
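To make the mismatch concrete, here is a minimal scalar sketch in Python that models the behaviour described above. It is not real codegen: the function names (`pshufb`, `spec_swizzle`, `swizzle_via_pshufb`) are hypothetical, and I am assuming the greater-than comparison is a signed byte compare in the style of `PCMPGTB`, which produces `0xFF` in lanes where the comparison is true.

```python
def pshufb(src, mask):
    """Scalar model of PSHUFB on 16 byte lanes, as described above."""
    out = []
    for m in mask:
        if m & 0x80:
            out.append(0)               # most significant bit set: lane is zeroed
        else:
            out.append(src[m & 0x0F])   # otherwise only the low 4 bits are used
    return out


def spec_swizzle(src, mask):
    """Swizzle as the proposal reads: out-of-range selectors become 0."""
    return [src[m] if m < 16 else 0 for m in mask]


def to_signed(byte):
    """Reinterpret an unsigned byte as a signed 8-bit value."""
    return byte - 256 if byte >= 128 else byte


def swizzle_via_pshufb(src, mask):
    """The four-step workaround: copy, compare, OR, PSHUFB."""
    copy = list(mask)                                             # 1. copy the mask
    gt = [0xFF if to_signed(m) > 15 else 0x00 for m in copy]      # 2. signed compare (assumed)
    fixed = [m | g for m, g in zip(mask, gt)]                     # 3. OR into the mask
    return pshufb(src, fixed)                                     # 4. the actual shuffle


src = list(range(100, 116))
mask = [0, 15, 16, 0x7F, 0x80, 0xFF, 3, 31] + [0] * 8

print(pshufb(src, mask))              # raw PSHUFB: indices 16, 0x7F and 31 wrap around
print(spec_swizzle(src, mask))        # what the proposal asks for
print(swizzle_via_pshufb(src, mask))  # workaround, should match the line above
```

In this model, indices with the most significant bit already set compare as negative, so the signed compare leaves them alone and `PSHUFB` zeroes them anyway. Only the `0x10` to `0x7f` range needs the fix-up.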

Should `v128.swizzle` change to allow more optimal implementations? Are there considerations for other architectures that I am not aware of?

