I am using Unified Memory (UM) on two Volta devices, but the behaviour is not what I expect. I launch with cudaLaunchCooperativeKernelMultiDevice, with one kernel reading from and another writing to the same address, which was allocated with cudaMallocManaged. All kernels call this_multi_grid().sync() after each iteration. However, the writes made by the writing kernel are never seen by the reading kernel. Is there any specific configuration needed to correct this?
As additional information, I used https://github.com/NVIDIA/cuda-samples/blob/master/Samples/conjugateGradientMultiDeviceCG as a reference. I also tried calling this_multi_grid().sync() in an endless loop (in both kernels) until the reading kernel sees the right value, which never happens. I also tried writing the values inside that same loop, to no avail. Am I wrong to expect the values to be visible on all devices?
I don’t think you’re wrong to expect a managed allocation to be visible to multiple devices. However, I can’t provide any assistance in this case based on a general description alone.
Is there anything I can do to debug this on the GPU platform? Or is there some particular configuration I should look for?
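One possible starting point, offered as a sketch and not as a diagnosis: query the device attributes that relate to managed memory and cooperative multi-device launches, and check whether each GPU reports peer access to the other. The helper name printAttr and the output labels below are hypothetical and chosen only for illustration. The attribute cudaDevAttrCooperativeMultiDeviceLaunch is marked deprecated in recent CUDA toolkits and may not be available in the newest ones, so this assumes a toolkit that still provides it.

```cuda
// Assumed build command: nvcc -o um_check um_check.cu
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: query one integer device attribute and print it.
static void printAttr(int dev, cudaDeviceAttr attr, const char *name) {
    int value = 0;
    cudaError_t err = cudaDeviceGetAttribute(&value, attr, dev);
    if (err != cudaSuccess) {
        printf("  %-30s error: %s\n", name, cudaGetErrorString(err));
        return;
    }
    printf("  %-30s %d\n", name, value);
}

int main() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    for (int dev = 0; dev < count; ++dev) {
        printf("device %d\n", dev);
        // Does the device support managed allocations at all?
        printAttr(dev, cudaDevAttrManagedMemory, "managedMemory");
        // Can the device access managed memory concurrently with the CPU?
        printAttr(dev, cudaDevAttrConcurrentManagedAccess, "concurrentManagedAccess");
        // Are single-device and multi-device cooperative launches supported?
        printAttr(dev, cudaDevAttrCooperativeLaunch, "cooperativeLaunch");
        printAttr(dev, cudaDevAttrCooperativeMultiDeviceLaunch,
                  "cooperativeMultiDeviceLaunch");

        // Report whether this device can directly access each other device.
        for (int peer = 0; peer < count; ++peer) {
            if (peer == dev) continue;
            int canAccess = 0;
            err = cudaDeviceCanAccessPeer(&canAccess, dev, peer);
            if (err != cudaSuccess) {
                printf("  peer %d: error: %s\n", peer, cudaGetErrorString(err));
            } else {
                printf("  canAccessPeer(%d -> %d)         %d\n", dev, peer, canAccess);
            }
        }
    }
    return 0;
}
```

If any of these values is 0 on either device, that would be worth mentioning in the thread. Independently of this check, it may also help to confirm that every CUDA runtime call in the application, including the cooperative launch itself, has its return code checked, since a launch that failed would also leave the reader without the expected value.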