I am trying to implement a finite difference method on multiple GPUs, using Cooperative Groups for multi-grid synchronization. I need to transfer halo data between the devices on every iteration of the algorithm. Since the release of CUDA 6, it seems that it should be possible to start a P2P memory transfer between devices from within kernel code.
When I call cudaMemcpyAsync with the cudaMemcpyDeviceToDevice parameter from within kernel code, with a memory region on device 0 as the source and a memory region on device 1 as the destination, I get the error “an illegal memory access was encountered”. The same call works fine if the source and destination are on the same device, but that is not what I need.
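For illustration, a minimal sketch of the kind of setup described above might look like the following. This is not the original code: the kernel name haloExchange, the buffer names, the buffer size, and the decision to enable peer access from device 0 to device 1 before the launch are all assumptions made only to keep the example self-contained.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: one thread issues a device-side copy from src to dst.
// The device runtime only offers the async, device-to-device form of memcpy.
__global__ void haloExchange(float *dst, const float *src, size_t bytes, int *status)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        cudaError_t err = cudaMemcpyAsync(dst, src, bytes,
                                          cudaMemcpyDeviceToDevice, 0);
        *status = static_cast<int>(err);  // error code returned at issue time
    }
}

int main()
{
    const size_t n = 1024;                 // assumed halo size, for illustration
    const size_t bytes = n * sizeof(float);
    float *buf0 = nullptr, *buf1 = nullptr;
    int *status = nullptr;

    // Destination buffer lives on device 1.
    cudaSetDevice(1);
    cudaMalloc(&buf1, bytes);

    // Source buffer and the launching kernel live on device 0.
    cudaSetDevice(0);
    cudaMalloc(&buf0, bytes);
    cudaMalloc(&status, sizeof(int));
    cudaMemset(status, 0, sizeof(int));

    // Assumption: peer access from device 0 to device 1 is enabled if available.
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);
    if (canAccess) cudaDeviceEnablePeerAccess(1, 0);

    // Source on device 0, destination on device 1, copy issued inside the kernel.
    haloExchange<<<1, 32>>>(buf1, buf0, bytes, status);

    // Any asynchronous fault from the kernel is reported here on the host.
    cudaError_t err = cudaDeviceSynchronize();
    printf("host sync: %s\n", cudaGetErrorString(err));
    return 0;
}
```

Swapping buf1 for a second buffer allocated on device 0 gives the same-device case for comparison. The sketch needs relocatable device code and the device runtime library to build, since it calls a CUDA runtime function from inside a kernel.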
Are my assumptions wrong? Is it not possible to start P2P memory transfers from within kernels?
I am compiling with the options -arch=sm_75 -rdc=true --ptxas-options=-v -lcudadevrt.