I am looking to write a Linux application that does a bunch of CUDA calculations on a GPU and then transfers the results to memory on another device.
I want to avoid the copy through host memory and instead transfer the data directly from GPU memory to the other device's memory.
If I used mmap to map the device's memory into the host address space, could I then use cudaMemcpy() with the cudaMemcpyDeviceToHost kind (or something similar) to write the data directly into the device?
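To make it concrete, here is a rough sketch of what I'm imagining. The PCI resource path is a placeholder, and I'm only guessing that cudaHostRegister with cudaHostRegisterIoMemory is the right way to expose the mapped region to the CUDA runtime:

```c
#include <cuda_runtime.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define REGION_SIZE (1 << 20)  /* 1 MiB, just for illustration */

int main(void)
{
    /* Map the other device's memory (its PCI BAR) into this process's
       address space. The sysfs path below is a placeholder. */
    int fd = open("/sys/bus/pci/devices/0000:03:00.0/resource0", O_RDWR | O_SYNC);
    if (fd < 0) { perror("open"); return 1; }

    void *bar = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (bar == MAP_FAILED) { perror("mmap"); return 1; }

    /* Register the mapping with the CUDA runtime so it can copy into it.
       I'm not sure this actually works for MMIO regions; cudaHostRegisterIoMemory
       looks like it's meant for this, but I haven't verified it. */
    cudaError_t err = cudaHostRegister(bar, REGION_SIZE, cudaHostRegisterIoMemory);
    if (err != cudaSuccess)
        fprintf(stderr, "cudaHostRegister: %s\n", cudaGetErrorString(err));

    /* Buffer that my kernels fill on the GPU. */
    void *d_src;
    cudaMalloc(&d_src, REGION_SIZE);
    /* ... launch kernels that write into d_src ... */

    /* The copy I'm asking about: GPU memory -> mmapped device memory,
       hopefully without staging through ordinary host RAM. */
    err = cudaMemcpy(bar, d_src, REGION_SIZE, cudaMemcpyDeviceToHost);
    if (err != cudaSuccess)
        fprintf(stderr, "cudaMemcpy: %s\n", cudaGetErrorString(err));

    cudaHostUnregister(bar);
    cudaFree(d_src);
    munmap(bar, REGION_SIZE);
    close(fd);
    return 0;
}
```

Would a copy like this actually go straight over PCIe to the other device, or does the runtime still bounce it through system RAM?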