@@ -180,7 +180,7 @@ All the instructions take 3 operands, `a`, `b`, `c`, perform `a * b + c` or `-(a
 where:
 
 - the intermediate `a * b` is rounded first, and the final result is rounded again (for a total of 2 roundings), or
-- the the entire expression evaluated with higher precision and then only rounded once (if supported by hardware).
+- the entire expression evaluated with higher precision and then only rounded once (if supported by hardware).
 
 ### Relaxed laneselect
 
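The two rounding behaviours listed in the hunk above can give different results for the same inputs. The sketch below illustrates this in Python with 64-bit floats, assuming the platform evaluates `a * b + c` in plain double precision. The function names `madd_two_roundings` and `madd_one_rounding` are hypothetical and not part of the proposal; the single-rounding path is emulated with exact rational arithmetic rather than a hardware FMA.

```python
from fractions import Fraction


def madd_two_roundings(a: float, b: float, c: float) -> float:
    """Unfused: a * b is rounded to a double, then the sum is rounded again."""
    return a * b + c


def madd_one_rounding(a: float, b: float, c: float) -> float:
    """Fused (emulated): evaluate a * b + c exactly, then round once."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)  # converting the exact value rounds a single time


# a * b is exactly 1 - 2**-60, which is not representable as a double.
a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

print(madd_two_roundings(a, b, c))  # the product rounds to 1.0 before the add
print(madd_one_rounding(a, b, c))   # the tiny residual survives
```

With these inputs the unfused path returns zero, while the single-rounding path returns `-2**-60`. Both are results the relaxed instructions may produce, depending on the hardware.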
@@ -279,6 +279,32 @@ i16x8_dot_i8x16_i7x16_s(a, b) = dot_product(signed=True, elements=2, a, b
 i32x4.dot_i8x16_i7x16_add_s(a, b, c) = dot_product(signed=False, elements=2, a, b, c)
 ```
 
+### Relaxed BFloat16 dot product
+
+- `f32x4.relaxed_dot_bf16x8_add_f32x4(a: v128, b: v128, c: v128) -> v128`
+
+BFloat16 is a 16-bit floating-point format that represents the IEEE FP32 numbers
+truncated to the high 16 bits. For each FP32 lane, this instruction computes the
+dot product of 2 pairs of BFloat16 elements and accumulates it into an FP32 lane.
+
+```python
+def bfloat16_dot_product(a, b, c):
+    for i in range(4):
+        y.fp32[i] = (
+            c.fp32[i] +
+            cast<fp32>(a.bf16[2*i]) * cast<fp32>(b.bf16[2*i]) +
+            cast<fp32>(a.bf16[2*i+1]) * cast<fp32>(b.bf16[2*i+1]))
+```
+
+This instruction is implementation-defined in the following ways:
+
+- evaluation order
+  - can compute dot product in one step, then accumulation in another, or
+  - accumulate first product in one step, then accumulate second product in
+    another step
+- fusion: the steps described above can be either both fused or both unfused
+- rounding: the intermediate results can use Round-to-Nearest-Even or Round-to-Odd.
+
 
 ## Binary format
 
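The BFloat16 dot product added in the hunk above can be modelled in plain Python to make the lane indexing concrete. This is only a sketch of one permitted behaviour: the dot product is computed and rounded to FP32 first, then the accumulation is rounded to FP32, both with round-to-nearest-even. The helper names (`bf16_to_f32`, `f32_to_bf16`, `round_f32`) and the list-based representation of `v128` are assumptions made for illustration, not part of the proposal.

```python
import struct


def bf16_to_f32(bits: int) -> float:
    """Widen a BFloat16 bit pattern to FP32 by placing it in the high 16 bits."""
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]


def f32_to_bf16(x: float) -> int:
    """Truncate an FP32 value to its high 16 bits (no rounding)."""
    return struct.unpack("<I", struct.pack("<f", x))[0] >> 16


def round_f32(x: float) -> float:
    """Round a Python float (a double) to the nearest FP32 value.

    struct raises OverflowError for values outside the FP32 range, so this
    sketch does not model overflow to infinity.
    """
    return struct.unpack("<f", struct.pack("<f", x))[0]


def relaxed_dot_bf16x8_add_f32x4(a, b, c):
    """a, b: 8 BFloat16 bit patterns each; c: 4 FP32 values. Returns 4 values."""
    y = []
    for i in range(4):
        # Step 1: dot product of the two BFloat16 pairs feeding lane i.
        dot = round_f32(
            bf16_to_f32(a[2 * i]) * bf16_to_f32(b[2 * i])
            + bf16_to_f32(a[2 * i + 1]) * bf16_to_f32(b[2 * i + 1])
        )
        # Step 2: accumulate into the matching FP32 lane of c.
        y.append(round_f32(c[i] + dot))
    return y


a = [f32_to_bf16(v) for v in [1.0, 2.0, 3.0, 4.0, 0.5, 0.25, -1.0, 1.0]]
b = [f32_to_bf16(v) for v in [1.0, 1.0, 2.0, 0.5, 4.0, 8.0, 1.0, 1.0]]
c = [10.0, 0.0, 1.0, 0.0]
print(relaxed_dot_bf16x8_add_f32x4(a, b, c))
```

The inputs here are exactly representable in BFloat16, so every permitted evaluation order and rounding mode gives the same answer. Inputs that are not exactly representable may differ between implementations.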
@@ -290,30 +316,30 @@ where chosen to fit into the holes in the opcode space of SIMD proposal. Going
 forward, the opcodes for relaxed-simd specification will be the ones in the
 "opcode" column, and it will take some time for tools and engines to update.
 
-| instruction                        | opcode         | prototype opcode |
-| ---------------------------------- | -------------- | ---------------- |
-| `i8x16.relaxed_swizzle`            | 0x100          | 0xa2             |
-| `i32x4.relaxed_trunc_f32x4_s`      | 0x101          | 0xa5             |
-| `i32x4.relaxed_trunc_f32x4_u`      | 0x102          | 0xa6             |
-| `i32x4.relaxed_trunc_f64x2_s_zero` | 0x103          | 0xc5             |
-| `i32x4.relaxed_trunc_f64x2_u_zero` | 0x104          | 0xc6             |
-| `f32x4.relaxed_fma`                | 0x105          | 0xaf             |
-| `f32x4.relaxed_fms`                | 0x106          | 0xb0             |
-| `f64x2.relaxed_fma`                | 0x107          | 0xcf             |
-| `f64x2.relaxed_fms`                | 0x108          | 0xd0             |
-| `i8x16.relaxed_laneselect`         | 0x109          | 0xb2             |
-| `i16x8.relaxed_laneselect`         | 0x10a          | 0xb3             |
-| `i32x4.relaxed_laneselect`         | 0x10b          | 0xd2             |
-| `i64x2.relaxed_laneselect`         | 0x10c          | 0xd3             |
-| `f32x4.relaxed_min`                | 0x10d          | 0xb4             |
-| `f32x4.relaxed_max`                | 0x10e          | 0xe2             |
-| `f64x2.relaxed_min`                | 0x10f          | 0xd4             |
-| `f64x2.relaxed_max`                | 0x110          | 0xee             |
-| `i16x8.relaxed_q15mulr_s`          | 0x111          | unimplemented    |
-| `i16x8.dot_i8x16_i7x16_s`          | 0x112          | unimplemented    |
-| `i32x4.dot_i8x16_i7x16_add_s`      | 0x113          | unimplemented    |
-| Reserved for bfloat16              | 0x114          | unimplemented    |
-| Reserved                           | 0x115 - 0x12F  |                  |
+| instruction                          | opcode         | prototype opcode |
+| ------------------------------------ | -------------- | ---------------- |
+| `i8x16.relaxed_swizzle`              | 0x100          | 0xa2             |
+| `i32x4.relaxed_trunc_f32x4_s`        | 0x101          | 0xa5             |
+| `i32x4.relaxed_trunc_f32x4_u`        | 0x102          | 0xa6             |
+| `i32x4.relaxed_trunc_f64x2_s_zero`   | 0x103          | 0xc5             |
+| `i32x4.relaxed_trunc_f64x2_u_zero`   | 0x104          | 0xc6             |
+| `f32x4.relaxed_fma`                  | 0x105          | 0xaf             |
+| `f32x4.relaxed_fms`                  | 0x106          | 0xb0             |
+| `f64x2.relaxed_fma`                  | 0x107          | 0xcf             |
+| `f64x2.relaxed_fms`                  | 0x108          | 0xd0             |
+| `i8x16.relaxed_laneselect`           | 0x109          | 0xb2             |
+| `i16x8.relaxed_laneselect`           | 0x10a          | 0xb3             |
+| `i32x4.relaxed_laneselect`           | 0x10b          | 0xd2             |
+| `i64x2.relaxed_laneselect`           | 0x10c          | 0xd3             |
+| `f32x4.relaxed_min`                  | 0x10d          | 0xb4             |
+| `f32x4.relaxed_max`                  | 0x10e          | 0xe2             |
+| `f64x2.relaxed_min`                  | 0x10f          | 0xd4             |
+| `f64x2.relaxed_max`                  | 0x110          | 0xee             |
+| `i16x8.relaxed_q15mulr_s`            | 0x111          | unimplemented    |
+| `i16x8.dot_i8x16_i7x16_s`            | 0x112          | unimplemented    |
+| `i32x4.dot_i8x16_i7x16_add_s`        | 0x113          | unimplemented    |
+| `f32x4.relaxed_dot_bf16x8_add_f32x4` | 0x114          | unimplemented    |
+| Reserved                             | 0x115 - 0x12F  |                  |
 
 ## References
 
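A note on the opcode table in the hunk above: the new opcodes are all above 0xFF, so they no longer fit in a single byte. The sketch below shows how such an opcode would be serialized, assuming relaxed-simd follows the SIMD proposal's convention of a `0xfd` prefix byte followed by the opcode as an unsigned LEB128 integer. That convention is not stated in this diff, and the names `encode_uleb128`, `encode_simd_opcode`, and `SIMD_PREFIX` are hypothetical.

```python
SIMD_PREFIX = 0xFD  # assumed: prefix byte shared with the SIMD proposal


def encode_uleb128(value: int) -> bytes:
    """Encode a non-negative integer as unsigned LEB128."""
    if value < 0:
        raise ValueError("unsigned LEB128 cannot encode negative values")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)         # final byte, continuation bit clear
            return bytes(out)


def encode_simd_opcode(opcode: int) -> bytes:
    """Prefix byte followed by the LEB128-encoded opcode."""
    return bytes([SIMD_PREFIX]) + encode_uleb128(opcode)


# Opcode assigned to f32x4.relaxed_dot_bf16x8_add_f32x4 in the table.
print(encode_simd_opcode(0x114).hex())
```

Under this assumption, every opcode in the table's "opcode" column (0x100 through 0x12F) takes two LEB128 bytes after the prefix.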