After using the PTX instruction `multimem.ld_reduce.add.v4.f32` on device 0 to load-reduce float arrays (16 MB each) residing on H20 devices 0-3, the report from `nvidia-smi nvlink -gt d` shows that each of the 4 devices sends 16 MB over NVLink, while only device 0 receives 16 MB. My question: to minimize data exchange, device 0 should not need to send its array values (to the other devices or to the NVSwitch) at all, because it can read its own array locally, so only 3 devices actually need to send data.
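For context, the reduce kernel is essentially of the following form (a minimal sketch, not my exact code; `mc_ptr`, `dst`, and the launch configuration are placeholders, and it assumes sm_90 with the multicast address already mapped on device 0):

```cuda
// Load-reduce kernel run on device 0 (sketch, assumes sm_90 / CUDA 12.x).
// mc_ptr points into the multicast mapping; dst is local memory on device 0.
__global__ void multimem_reduce(const float4 *mc_ptr, float4 *dst, size_t n_vec4)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i >= n_vec4) return;

    const float4 *p = mc_ptr + i;
    float4 v;
    // Each ld_reduce fetches the element from every device in the multicast
    // team and returns the sum; the NVLink counters show device 0's copy
    // being sent as well, even though it could be read locally.
    asm volatile(
        "multimem.ld_reduce.relaxed.sys.global.add.v4.f32 {%0,%1,%2,%3}, [%4];"
        : "=f"(v.x), "=f"(v.y), "=f"(v.z), "=f"(v.w)
        : "l"(p)
        : "memory");
    dst[i] = v;
}
```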
The same problem exists for the multimem store operation. After executing the PTX instruction `multimem.st.v4.f32` on device 0 to broadcast 16 MB of data to devices 0-3, `nvidia-smi nvlink -gt d` shows that device 0 also receives 16 MB via NVLink, which is unnecessary, since the buffer in device 0's own memory could be written locally.
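The broadcast kernel is of a similar form (again a sketch with placeholder names, assuming sm_90 / CUDA 12.x):

```cuda
// Broadcast-store kernel run on device 0 (sketch).
// src is local memory on device 0; mc_ptr points into the multicast mapping.
__global__ void multimem_bcast(const float4 *src, float4 *mc_ptr, size_t n_vec4)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i >= n_vec4) return;

    float4 v = src[i];
    float4 *p = mc_ptr + i;
    // Each st writes the element to every device bound to the multicast
    // object, which is why device 0 also shows 16 MB of receive traffic.
    asm volatile(
        "multimem.st.relaxed.sys.global.v4.f32 [%0], {%1,%2,%3,%4};"
        :
        : "l"(p), "f"(v.x), "f"(v.y), "f"(v.z), "f"(v.w)
        : "memory");
}
```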
I tried to work around this by not letting the executing device join the multicast team, and ran into the following limitations:
- The CUDA driver refuses to enable RW access to multicast memory from a device that is not included in the multicast object (`cuMemSetAccess` fails with `CUDA_ERROR_ILLEGAL_STATE`); see the sketch after this list.
- When some of the participating devices have not bound their memory to the multicast object, performing `ld_reduce` on that object results in memory errors.
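Roughly, my setup sequence looks like the sketch below (CUDA 12.1+ driver API; error handling omitted, names are placeholders, and excluding device 0 from `cuMulticastAddDevice` is the experiment described above). The marked `cuMemSetAccess` call is the one that returns `CUDA_ERROR_ILLEGAL_STATE`:

```cuda
#include <cuda.h>

// Sketch: create a multicast object over `ndevs` devices, map it, and try to
// grant device 0 access without having added device 0 to the team.
CUmemGenericAllocationHandle make_mc(const CUdevice *devs, int ndevs, size_t size)
{
    CUmulticastObjectProp prop = {};
    prop.numDevices  = ndevs;            // e.g. only devices 1-3
    prop.size        = size;
    prop.handleTypes = CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR;

    size_t gran = 0;
    cuMulticastGetGranularity(&gran, &prop, CU_MULTICAST_GRANULARITY_RECOMMENDED);
    prop.size = ((size + gran - 1) / gran) * gran;

    CUmemGenericAllocationHandle mc;
    cuMulticastCreate(&mc, &prop);
    for (int i = 0; i < ndevs; ++i)
        cuMulticastAddDevice(mc, devs[i]);

    // Each participating device then binds its local physical allocation:
    // cuMulticastBindMem(mc, 0, localHandle, 0, prop.size, 0);

    // Reserve a VA range, map the multicast handle, and set access per device.
    CUdeviceptr mc_va;
    cuMemAddressReserve(&mc_va, prop.size, gran, 0, 0);
    cuMemMap(mc_va, prop.size, 0, mc, 0);

    CUmemAccessDesc acc = {};
    acc.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    acc.location.id   = 0;               // device 0, NOT added to the team
    acc.flags         = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    // Fails with CUDA_ERROR_ILLEGAL_STATE when the accessing device is not
    // part of the multicast object.
    cuMemSetAccess(mc_va, prop.size, &acc, 1);
    return mc;
}
```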

