Commit b58bb2e

committed
Add BFloat16 dot product
1 parent 26f1b78 commit b58bb2e

File tree

1 file changed, +51 -25 lines changed

‎proposals/relaxed-simd/Overview.md‎

Lines changed: 51 additions & 25 deletions
@@ -180,7 +180,7 @@ All the instructions take 3 operands, `a`, `b`, `c`, perform `a * b + c` or `-(a
 where:

 - the intermediate `a * b` is rounded first, and the final result rounded again (for a total of 2 roundings), or
-- the the entire expression evaluated with higher precision and then only rounded once (if supported by hardware).
+- the entire expression evaluated with higher precision and then only rounded once (if supported by hardware).

 ### Relaxed laneselect

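The two rounding behaviours described in this hunk can give different results for the same inputs. The sketch below is only an illustration, not part of the proposal: it uses Python floats (IEEE doubles) rather than wasm lanes, and the helper names `madd_two_roundings` and `madd_one_rounding` are hypothetical. The single-rounding path is emulated with exact rational arithmetic.

```python
from fractions import Fraction


def madd_two_roundings(a, b, c):
    """Unfused a * b + c: the product is rounded, then the sum is rounded again."""
    product = a * b          # first rounding
    return product + c       # second rounding


def madd_one_rounding(a, b, c):
    """Fused a * b + c: evaluate exactly, then round once to the nearest double."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)      # the only rounding


# Inputs chosen so that the exact product 1 - 2**-60 rounds to 1.0.
a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

print(madd_two_roundings(a, b, c))  # the small term is lost in the first rounding
print(madd_one_rounding(a, b, c))   # the small term survives
```

With these inputs the unfused path returns `0.0`, while the fused path returns `-2**-60`, which is why the proposal leaves the choice to the implementation.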
@@ -279,6 +279,32 @@ i16x8_dot_i8x16_i7x16_s(a, b) = dot_product(signed=True, elements=2, a, b
 i32x4.dot_i8x16_i7x16_add_s(a, b, c) = dot_product(signed=False, elements=2, a, b, c)
 ```

+### Relaxed BFloat16 dot product
+
+- `f32x4.relaxed_dot_bf16x8_add_f32x4(a: v128, b: v128, c: v128) -> v128`
+
+BFloat16 is a 16-bit floating-point format that represents IEEE FP32 numbers
+truncated to the high 16 bits. For each FP32 lane, this instruction computes the
+dot product of 2 BFloat16 pairs and accumulates it into the matching lane of `c`.
+
+```python
+def bfloat16_dot_product(a, b, c):
+    for i in range(4):
+        y.fp32[i] =
+            c.fp32[i] +
+            cast<fp32>(a.bf16[2*i]) * cast<fp32>(b.bf16[2*i]) +
+            cast<fp32>(a.bf16[2*i+1]) * cast<fp32>(b.bf16[2*i+1])
+```
+
+This instruction is implementation defined in the following ways:
+
+- evaluation order
+  - can compute dot product in one step, then accumulation in another, or
+  - accumulate first product in one step, then accumulate second product in
+    another step
+- fusion: the steps described above can be either both fused or both unfused
+- the intermediate results can be rounded with Round-to-Nearest-Even or Round-to-Odd.
+

 ## Binary format

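As a concrete reading of the BFloat16 dot product added in the hunk above, here is a minimal runnable sketch. It picks just one of the permitted behaviours (dot product first, then accumulation, unfused, rounding to nearest-even FP32 after each step). The list-based lane representation and the helper names `bf16_to_f32`, `f32_to_bf16`, and `round_f32` are assumptions made for illustration and do not come from the proposal.

```python
import struct


def bf16_to_f32(bits):
    """Widen a BFloat16 bit pattern: it forms the high 16 bits of an FP32."""
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]


def f32_to_bf16(x):
    """Truncate an FP32 value to its high 16 bits (used here to build inputs)."""
    return struct.unpack("<I", struct.pack("<f", x))[0] >> 16


def round_f32(x):
    """Round a Python float (a double) to FP32 precision."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def relaxed_dot_bf16x8_add_f32x4(a, b, c):
    """a, b: 8 BFloat16 bit patterns each; c: 4 FP32 values. Returns 4 FP32 values."""
    out = []
    for i in range(4):  # one FP32 output lane per pair of adjacent bf16 lanes
        p0 = round_f32(bf16_to_f32(a[2 * i]) * bf16_to_f32(b[2 * i]))
        p1 = round_f32(bf16_to_f32(a[2 * i + 1]) * bf16_to_f32(b[2 * i + 1]))
        dot = round_f32(p0 + p1)           # step 1: dot product
        out.append(round_f32(c[i] + dot))  # step 2: accumulation
    return out


a = [f32_to_bf16(float(v)) for v in range(1, 9)]  # 1.0 .. 8.0
b = [f32_to_bf16(0.5)] * 8
c = [10.0, 20.0, 30.0, 40.0]
print(relaxed_dot_bf16x8_add_f32x4(a, b, c))
```

All values in this usage example are exactly representable, so every permitted evaluation order and rounding mode would agree on it; inputs that need rounding may differ between implementations.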
@@ -290,30 +316,30 @@ where chosen to fit into the holes in the opcode space of SIMD proposal. Going
 forward, the opcodes for relaxed-simd specification will be the ones in the
 "opcode" column, and it will take some time for tools and engines to update.

-| instruction | opcode | prototype opcode |
-| ---------------------------------- | -------------- | ---------------- |
-| `i8x16.relaxed_swizzle` | 0x100 | 0xa2 |
-| `i32x4.relaxed_trunc_f32x4_s` | 0x101 | 0xa5 |
-| `i32x4.relaxed_trunc_f32x4_u` | 0x102 | 0xa6 |
-| `i32x4.relaxed_trunc_f64x2_s_zero` | 0x103 | 0xc5 |
-| `i32x4.relaxed_trunc_f64x2_u_zero` | 0x104 | 0xc6 |
-| `f32x4.relaxed_fma` | 0x105 | 0xaf |
-| `f32x4.relaxed_fms` | 0x106 | 0xb0 |
-| `f64x2.relaxed_fma` | 0x107 | 0xcf |
-| `f64x2.relaxed_fms` | 0x108 | 0xd0 |
-| `i8x16.relaxed_laneselect` | 0x109 | 0xb2 |
-| `i16x8.relaxed_laneselect` | 0x10a | 0xb3 |
-| `i32x4.relaxed_laneselect` | 0x10b | 0xd2 |
-| `i64x2.relaxed_laneselect` | 0x10c | 0xd3 |
-| `f32x4.relaxed_min` | 0x10d | 0xb4 |
-| `f32x4.relaxed_max` | 0x10e | 0xe2 |
-| `f64x2.relaxed_min` | 0x10f | 0xd4 |
-| `f64x2.relaxed_max` | 0x110 | 0xee |
-| `i16x8.relaxed_q15mulr_s` | 0x111 | unimplemented |
-| `i16x8.dot_i8x16_i7x16_s` | 0x112 | unimplemented |
-| `i32x4.dot_i8x16_i7x16_add_s` | 0x113 | unimplemented |
-| Reserved for bfloat16 | 0x114 | unimplemented |
-| Reserved | 0x115 - 0x12F | |
+| instruction | opcode | prototype opcode |
+| ------------------------------------ | -------------- | ---------------- |
+| `i8x16.relaxed_swizzle` | 0x100 | 0xa2 |
+| `i32x4.relaxed_trunc_f32x4_s` | 0x101 | 0xa5 |
+| `i32x4.relaxed_trunc_f32x4_u` | 0x102 | 0xa6 |
+| `i32x4.relaxed_trunc_f64x2_s_zero` | 0x103 | 0xc5 |
+| `i32x4.relaxed_trunc_f64x2_u_zero` | 0x104 | 0xc6 |
+| `f32x4.relaxed_fma` | 0x105 | 0xaf |
+| `f32x4.relaxed_fms` | 0x106 | 0xb0 |
+| `f64x2.relaxed_fma` | 0x107 | 0xcf |
+| `f64x2.relaxed_fms` | 0x108 | 0xd0 |
+| `i8x16.relaxed_laneselect` | 0x109 | 0xb2 |
+| `i16x8.relaxed_laneselect` | 0x10a | 0xb3 |
+| `i32x4.relaxed_laneselect` | 0x10b | 0xd2 |
+| `i64x2.relaxed_laneselect` | 0x10c | 0xd3 |
+| `f32x4.relaxed_min` | 0x10d | 0xb4 |
+| `f32x4.relaxed_max` | 0x10e | 0xe2 |
+| `f64x2.relaxed_min` | 0x10f | 0xd4 |
+| `f64x2.relaxed_max` | 0x110 | 0xee |
+| `i16x8.relaxed_q15mulr_s` | 0x111 | unimplemented |
+| `i16x8.dot_i8x16_i7x16_s` | 0x112 | unimplemented |
+| `i32x4.dot_i8x16_i7x16_add_s` | 0x113 | unimplemented |
+| `f32x4.relaxed_dot_bf16x8_add_f32x4` | 0x114 | unimplemented |
+| Reserved | 0x115 - 0x12F | |

 ## References

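To show how an opcode from the table above would appear in a module's byte stream, the sketch below encodes one. It assumes the encoding used by the SIMD proposal, which this diff does not restate: a `0xfd` prefix byte followed by the opcode as an unsigned LEB128 integer. The function names are hypothetical.

```python
SIMD_PREFIX = 0xFD  # assumption: prefix byte shared with the SIMD proposal


def encode_u32_leb128(value):
    """Encode a non-negative integer as unsigned LEB128 (7 bits per byte, low first)."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_simd_opcode(opcode):
    """Return the prefix byte followed by the LEB128-encoded opcode."""
    return bytes([SIMD_PREFIX]) + encode_u32_leb128(opcode)


# f32x4.relaxed_dot_bf16x8_add_f32x4 is assigned opcode 0x114 in the table.
print(encode_simd_opcode(0x114).hex(" "))  # fd 94 02
```

Under the same assumption, `i8x16.relaxed_swizzle` encodes as `fd 80 02` with its new opcode `0x100` and as `fd a2 01` with its prototype opcode `0xa2`.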