Question about PTX instruction multimem.ld_reduce precision

From the PTX documentation (1. Introduction — PTX ISA 8.8 documentation) I see that the multimem.ld_reduce instruction has an argument called acc_prec, and for some precisions

I have some questions:

  • If this precision is not set, what is the default accumulation precision?
  • The result of floating-point addition depends on the accumulation order. In which order does multimem.ld_reduce perform the accumulation? Is the result non-deterministic, or does it always follow some pattern, such as the values from GPU 0 + 1 + 2 + …, or some other order?
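To illustrate what I mean by order dependence, here is a minimal Python sketch. It only emulates float32 addition on the CPU (the helper names to_f32 and sum_f32 are made up for this example) and says nothing about the order the hardware actually uses:

```python
import struct

def to_f32(x: float) -> float:
    """Round a Python float (binary64) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def sum_f32(values):
    """Sum left to right, rounding to float32 after every addition."""
    acc = 0.0
    for v in values:
        acc = to_f32(acc + to_f32(v))
    return acc

# The same four values, accumulated in two different orders.
forward = sum_f32([1.0e8, 1.0, -1.0e8, 1.0])
reordered = sum_f32([1.0e8, -1.0e8, 1.0, 1.0])

print(forward, reordered)  # 1.0 2.0
```

In the first order, 1.0e8 + 1.0 rounds back to 1.0e8 in float32, so one of the 1.0 terms is lost; in the second order nothing is lost.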

Another question: since the accumulation is done by the NVSwitch, suppose we have some bf16 data. Then with multimem.ld_reduce the procedure is like this:

x_bf16 = v(0)_bf16 + v(1)_bf16 + v(2)_bf16 + ... + v(8)_bf16

and with acc_prec it is like this:

x_f32 = v(0)_bf16.to(f32) + v(1)_bf16.to(f32) + v(2)_bf16.to(f32) + ... + v(8)_bf16.to(f32)
x_bf16 = x_f32.to(bf16)

The NVSwitch just sends/receives the same amount of data, but accumulates with higher precision.
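The difference between the two procedures can be sketched in plain Python. This is only a CPU emulation of my understanding, not a statement of what the NVSwitch does: the helpers to_bf16, reduce_bf16_acc and reduce_f32_acc are hypothetical names, bf16 rounding is emulated by rounding the float32 bit pattern to its top 16 bits (round to nearest, ties to even), and NaN/inf handling is ignored.

```python
import struct

def to_f32(x: float) -> float:
    """Round a Python float (binary64) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def to_bf16(x: float) -> float:
    """Round to the nearest bfloat16 value (ties to even), returned as a float.

    Simplification: rounds via float32 first and ignores NaN/inf.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFFFFFF
    bits &= 0xFFFF0000  # keep sign, exponent and the top 7 mantissa bits
    return struct.unpack("<I".replace("I", "f"), struct.pack("<I", bits))[0]

def reduce_bf16_acc(values):
    """Accumulate in bf16: round to bf16 after every addition."""
    acc = to_bf16(values[0])
    for v in values[1:]:
        acc = to_bf16(acc + to_bf16(v))
    return acc

def reduce_f32_acc(values):
    """Accumulate in f32, then round the final sum to bf16 once."""
    acc = to_f32(to_bf16(values[0]))
    for v in values[1:]:
        acc = to_f32(acc + to_bf16(v))
    return to_bf16(acc)

# Nine bf16 inputs, standing in for v(0) ... v(8).
values = [256.0] + [1.0] * 8
low = reduce_bf16_acc(values)
high = reduce_f32_acc(values)

print(low, high)  # 256.0 264.0
```

With bf16 accumulation every 256 + 1 rounds back to 256 (the bf16 spacing at 256 is 2), so all eight small terms are lost, while f32 accumulation keeps them and the final 264 is exactly representable in bf16.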

Is this what the acc_prec argument of multimem.ld_reduce does? If so, why is this not the default?