After using the PTX instruction `multimem.ld_reduce.add.v4.f32` on device 0 to load-reduce float arrays (16 MB each) residing on H20 devices 0-3, the report from `nvidia-smi nvlink -gt d` shows that each of the 4 devices sends 16 MB over NVLink, while only device 0 receives 16 MB. My question: to minimize data exchange, device 0 should not need to send its array values (to the other devices or to the NVSwitch) at all, because it can read its own array locally, so only 3 devices actually need to send data.
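For context, the reduce kernel is essentially of the following form (a minimal sketch, not my exact code; `mc_ptr`, `dst`, and the launch configuration are placeholders, and it assumes sm_90 with the multicast address already mapped on device 0):

```cuda
// Load-reduce kernel run on device 0 (sketch, assumes sm_90 / CUDA 12.x).
// mc_ptr points into the multicast mapping; dst is local memory on device 0.
__global__ void multimem_reduce(const float4 *mc_ptr, float4 *dst, size_t n_vec4)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i >= n_vec4) return;

    const float4 *p = mc_ptr + i;
    float4 v;
    // Each ld_reduce fetches the element from every device in the multicast
    // team and returns the sum; the NVLink counters show device 0's copy
    // being sent as well, even though it could be read locally.
    asm volatile(
        "multimem.ld_reduce.relaxed.sys.global.add.v4.f32 {%0,%1,%2,%3}, [%4];"
        : "=f"(v.x), "=f"(v.y), "=f"(v.z), "=f"(v.w)
        : "l"(p)
        : "memory");
    dst[i] = v;
}
```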
The same problem exists for the multimem store operation. After executing the PTX instruction `multimem.st.v4.f32` on device 0 to broadcast 16 MB of data to devices 0-3, `nvidia-smi nvlink -gt d` shows that device 0 also receives 16 MB via NVLink, which is unnecessary, since the buffer in device 0's own memory could be written locally.
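The broadcast kernel is of a similar form (again a sketch with placeholder names, assuming sm_90 / CUDA 12.x):

```cuda
// Broadcast-store kernel run on device 0 (sketch).
// src is local memory on device 0; mc_ptr points into the multicast mapping.
__global__ void multimem_bcast(const float4 *src, float4 *mc_ptr, size_t n_vec4)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i >= n_vec4) return;

    float4 v = src[i];
    float4 *p = mc_ptr + i;
    // Each st writes the element to every device bound to the multicast
    // object, which is why device 0 also shows 16 MB of receive traffic.
    asm volatile(
        "multimem.st.relaxed.sys.global.v4.f32 [%0], {%1,%2,%3,%4};"
        :
        : "l"(p), "f"(v.x), "f"(v.y), "f"(v.z), "f"(v.w)
        : "memory");
}
```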
I tried to work around this by not letting the executing device join the multicast team, and ran into the following limitations:
- The CUDA driver refuses to enable RW access to multicast memory from a device that is not included in the multicast object (`cuMemSetAccess` fails with `CUDA_ERROR_ILLEGAL_STATE`); see the sketch after this list.
- When some of the participating devices have not bound their memory to the multicast object, performing `ld_reduce` on that object results in memory errors.
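Roughly, my setup sequence looks like the sketch below (CUDA 12.1+ driver API; error handling omitted, names are placeholders, and excluding device 0 from `cuMulticastAddDevice` is the experiment described above). The marked `cuMemSetAccess` call is the one that returns `CUDA_ERROR_ILLEGAL_STATE`:

```cuda
#include <cuda.h>

// Sketch: create a multicast object over `ndevs` devices, map it, and try to
// grant device 0 access without having added device 0 to the team.
CUmemGenericAllocationHandle make_mc(const CUdevice *devs, int ndevs, size_t size)
{
    CUmulticastObjectProp prop = {};
    prop.numDevices  = ndevs;            // e.g. only devices 1-3
    prop.size        = size;
    prop.handleTypes = CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR;

    size_t gran = 0;
    cuMulticastGetGranularity(&gran, &prop, CU_MULTICAST_GRANULARITY_RECOMMENDED);
    prop.size = ((size + gran - 1) / gran) * gran;

    CUmemGenericAllocationHandle mc;
    cuMulticastCreate(&mc, &prop);
    for (int i = 0; i < ndevs; ++i)
        cuMulticastAddDevice(mc, devs[i]);

    // Each participating device then binds its local physical allocation:
    // cuMulticastBindMem(mc, 0, localHandle, 0, prop.size, 0);

    // Reserve a VA range, map the multicast handle, and set access per device.
    CUdeviceptr mc_va;
    cuMemAddressReserve(&mc_va, prop.size, gran, 0, 0);
    cuMemMap(mc_va, prop.size, 0, mc, 0);

    CUmemAccessDesc acc = {};
    acc.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    acc.location.id   = 0;               // device 0, NOT added to the team
    acc.flags         = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    // Fails with CUDA_ERROR_ILLEGAL_STATE when the accessing device is not
    // part of the multicast object.
    cuMemSetAccess(mc_va, prop.size, &acc, 1);
    return mc;
}
```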

