P2P memory transfer within kernel code


I am trying to implement a finite difference method on multiple GPUs, using Cooperative Groups for multi-grid synchronization. I need to transfer halo data between the devices on every iteration of the algorithm. Following CUDA 6's release, it seems that it should be possible to execute a P2P memory transfer between devices from within kernel code.

When I call cudaMemcpyAsync with the cudaMemcpyDeviceToDevice parameter from within kernel code, with the source pointing at a memory region on device 0 and the destination at a memory region on device 1, I get the error "an illegal memory access was encountered". It works fine when the source and destination are on the same device, but that is not what I need.

Are my assumptions wrong? Is it not possible to start P2P memory transfers from within kernels?

I am compiling with the options -arch=sm_75 -rdc=true --ptxas-options=-v -lcudadevrt.

To transfer data P2P from device code, one possible approach:

  1. Place the devices in question into a P2P relationship (in host code). Refer to any of the p2p sample codes for examples.

  2. Pass pointers and copy data using ordinary load and store instructions in device code:

    // Grid-stride loop: each thread copies a strided subset of the elements
    for (int i = threadIdx.x + blockDim.x * blockIdx.x;
         i < data_size_to_transfer;
         i += gridDim.x * blockDim.x)
      device1data[i] = device0data[i];
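
For completeness, the host-side setup in step 1 might look like the following sketch (error checking omitted; the device ordinals are illustrative):

    // Host code: enable bidirectional peer access between devices 0 and 1
    // before launching any kernel that dereferences the other device's pointers
    int canAccess01 = 0, canAccess10 = 0;
    cudaDeviceCanAccessPeer(&canAccess01, 0, 1);
    cudaDeviceCanAccessPeer(&canAccess10, 1, 0);
    if (canAccess01 && canAccess10) {
      cudaSetDevice(0);
      cudaDeviceEnablePeerAccess(1, 0);  // second argument (flags) must be 0
      cudaSetDevice(1);
      cudaDeviceEnablePeerAccess(0, 0);
    }

Once peer access is enabled, a pointer allocated on one device can be dereferenced directly in a kernel running on the other, which is what makes the load/store copy in step 2 work.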

It seems that the problem was that P2P was not enabled before launching the kernel. I became aware that my GPUs (2x GTX 1660 Ti) do not support P2P, which is why the same code worked when running on a different setup (2x GTX 980) that does support it.
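
For anyone hitting the same issue: peer capability can be queried at runtime before relying on it, e.g. with a sketch like this (device ordinals illustrative):

    // Host code: check whether device 0 can directly access device 1's memory
    int canAccessPeer = 0;
    cudaDeviceCanAccessPeer(&canAccessPeer, 0, 1);
    printf("P2P 0 -> 1: %s\n", canAccessPeer ? "supported" : "not supported");

The simpleP2P sample in the CUDA samples performs a similar check before enabling peer access.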