In a multi-processor, multi-GPU scenario, how can data be exchanged between GPUs?

There are 2 CPUs and 8 GPUs, where CPU1 is bound to gpu1-4 and CPU2 is bound to gpu5-8. How can I copy data from gpu8 to gpu1?

One possible approach:

$ cat t43.cu
#include <cuda_runtime.h>

int main(){

  float *d1, *d2;
  cudaSetDevice(0);   // source device (or another device)
  cudaMalloc(&d1, 128);
  cudaSetDevice(2);   // destination device (or another device)
  cudaMalloc(&d2, 128);
  // an ordinary cudaMemcpy between allocations that live on two different devices
  cudaMemcpy(d2, d1, 128, cudaMemcpyDeviceToDevice);
  cudaDeviceSynchronize();
}
$ nvcc -o t43 t43.cu
$ cuda-memcheck ./t43
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
$ nvprof ./t43
==14495== NVPROF is profiling process 14495, command: ./t43
==14495== Profiling application: ./t43
==14495== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   53.96%  3.2640us         1  3.2640us  3.2640us  3.2640us  [CUDA memcpy DtoH]
                   46.04%  2.7850us         1  2.7850us  2.7850us  2.7850us  [CUDA memcpy HtoD]
      API calls:   98.48%  542.04ms         2  271.02ms  179.75ms  362.29ms  cudaMalloc
                    0.91%  5.0097ms         4  1.2524ms  590.58us  3.2132ms  cuDeviceTotalMem
                    0.45%  2.5001ms       404  6.1880us     355ns  280.91us  cuDeviceGetAttribute
                    0.07%  391.83us         1  391.83us  391.83us  391.83us  cudaDeviceSynchronize
                    0.06%  332.39us         4  83.098us  59.127us  150.61us  cuDeviceGetName
                    0.01%  68.397us         1  68.397us  68.397us  68.397us  cudaMemcpy
                    0.00%  22.333us         2  11.166us  2.7670us  19.566us  cudaSetDevice
                    0.00%  18.089us         4  4.5220us  3.0330us  7.4510us  cuDeviceGetPCIBusId
                    0.00%  6.5020us         8     812ns     460ns  1.5000us  cuDeviceGet
                    0.00%  5.6310us         3  1.8770us     565ns  3.7430us  cuDeviceGetCount
                    0.00%  3.2840us         4     821ns     597ns  1.1370us  cuDeviceGetUuid
$
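Note the [CUDA memcpy DtoH] followed by [CUDA memcpy HtoD] in the GPU activities: with no peer access enabled between the two devices, the runtime stages the single device-to-device copy through host memory. You can query whether a direct peer path exists before relying on one. A minimal sketch, reusing the device ordinals 0 and 2 from the example above:

#include <cstdio>
#include <cuda_runtime.h>

int main(){
  int canAccess = 0;
  // can device 2 directly access memory allocated on device 0?
  cudaDeviceCanAccessPeer(&canAccess, 2, 0);
  printf("P2P access 2 -> 0: %s\n", canAccess ? "available" : "not available");
}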

If the GPUs have a direct NVLink connection between them, you could improve on this with cudaMemcpyPeerAsync (refer to CUDA sample codes such as simpleP2P). If the GPUs are connected by an NVLink fabric but have no direct connection, another option would be NCCL point-to-point communication.
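A minimal sketch of the peer-to-peer path, reusing the hypothetical device ordinals 0 and 2 from the example above (error checking omitted; cudaDeviceEnablePeerAccess returns an error on devices without a direct P2P connection):

#include <cuda_runtime.h>

int main(){
  float *d1, *d2;
  cudaSetDevice(0);
  cudaMalloc(&d1, 128);
  cudaSetDevice(2);
  cudaMalloc(&d2, 128);
  // enable direct access from device 2 to allocations on device 0
  cudaDeviceEnablePeerAccess(0, 0);
  // copy 128 bytes from d1 (device 0) to d2 (device 2) on the default stream
  cudaMemcpyPeerAsync(d2, 2, d1, 0, 128, 0);
  cudaDeviceSynchronize();
}

For the fabric-but-no-direct-link case, NCCL's point-to-point primitives (ncclSend/ncclRecv) cover the same pattern.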

Thank you for your reply, but in my use case gpu1 and gpu8 are bound to different CPUs. Can I still copy data like this? The GPUs are 3090s, with no NVLink.

Why not give it a quick try? Should be doable in under a minute …

I haven’t bought the new machine yet; my current machine has a single CPU and multiple GPUs.