Why is there a memory transfer between two GPUs?

HOST2GPU.cu is a small test program which

  1. allocates a managed buffer named ‘src’, initializes it on the host, and then prefetches it to the dstDev GPU (GPU 0 or GPU 1), and
  2. copies ‘src’ to another managed buffer, named ‘d’, using cudaMemcpy with the cudaMemcpyDeviceToDevice flag.

When I use GPU 0, there is only a host-to-device data transfer, but if I use GPU 1, then in addition to the host-to-device transfer there is also a transfer from GPU 1 to GPU 0. Why is there such a transfer from GPU 1 to GPU 0, and why is its size different (not 4 KB)?

My environment:

  • Ubuntu 18.04
  • Driver Version: 440.100, CUDA Version: 10.2
  • two GeForce RTX 2080s
#include <cuda_runtime.h>
#include <cstdlib>
#include <cstring>
#include <iostream>

using namespace std;

int main(int argc, char* argv[])
{
    size_t N = (1l << 10);
    cout << "totalBytes:" << N << endl;
    char *src, *d;

    // Allocate managed memory, prefetch it to the host, and initialize it there.
    cudaMallocManaged((void**) &src, N);
    cudaMemPrefetchAsync(src, N, cudaCpuDeviceId);
    memset(src, 0, N);
    cudaDeviceSynchronize();

    int dstDev = atoi(argv[1]);
    cout << "use gpu " << dstDev << endl;

    // Prefetch src to the selected GPU.
    cudaSetDevice(dstDev);
    cudaMemPrefetchAsync(src, N, dstDev);

    // Allocate the destination buffer and prefetch it to the same GPU.
    cudaMallocManaged((void**) &d, N);
    cudaMemPrefetchAsync(d, N, dstDev);
    cudaDeviceSynchronize();

    // Device-to-device copy; both buffers should now reside on dstDev.
    cudaMemcpy(d, src, N, cudaMemcpyDeviceToDevice);
    cudaDeviceSynchronize();

    cudaFree(src);
    cudaFree(d);
    return 0;
}

Results of nvprof:

$ nvcc HOST2GPU.cu && ./a.out 0
...
==19251== Unified Memory profiling result:
Device "GeForce RTX 2080 (0)"
   Count  Avg Size  Min Size  Max Size  Total Size  Total Time  Name
       1  4.0000KB  4.0000KB  4.0000KB  4.000000KB  1.632000us  Host To Device
$ nvcc HOST2GPU.cu && ./a.out 1
...
==19021== Unified Memory profiling result:
Device "GeForce RTX 2080 (0)"
   Count  Avg Size  Min Size  Max Size  Total Size  Total Time  Name
       2         -         -         -           -  630.9440us  Gpu page fault groups
       2  4.0000KB  4.0000KB  4.0000KB  8.000000KB  3.104000us  Transfers to Device
Device "GeForce RTX 2080 (1)"
   Count  Avg Size  Min Size  Max Size  Total Size  Total Time  Name
       1  4.0000KB  4.0000KB  4.0000KB  4.000000KB  1.600000us  Host To Device
       2  4.0000KB  4.0000KB  4.0000KB  8.000000KB  3.104000us  Transfers from Device

Unless connected by NVLink, the 2080s are not P2P-capable: device 0 cannot read device 1's memory directly.

NVIDIA disabled P2P over PCIe on the RTX 2000 series. I think this may be the first time I’ve seen low-level CUDA code actually demonstrate this shortcoming.
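Whether a given pair of GPUs is peerable can be queried directly from the CUDA runtime. A minimal sketch (assumes two visible devices, numbered 0 and 1 as in the posted program):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    int canAccess01 = 0, canAccess10 = 0;
    // Ask the driver whether each device can map the other's memory via P2P.
    cudaDeviceCanAccessPeer(&canAccess01, 0, 1);
    cudaDeviceCanAccessPeer(&canAccess10, 1, 0);
    printf("GPU0 -> GPU1 peer access: %s\n", canAccess01 ? "yes" : "no");
    printf("GPU1 -> GPU0 peer access: %s\n", canAccess10 ? "yes" : "no");
    return 0;
}
```

On PCIe-connected RTX 2080s both queries would be expected to report "no".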


I’m not sure that has any bearing on the issue. In my opinion, the code is demonstrating “unexpected” device-to-device activity, when the code design does not suggest there should be any (and, indeed, when device 0 is specified, there isn’t). I can reproduce the observation of unexpected device-to-device activity (when device 1 is specified) on a system with two K20X GPUs, whether those GPUs are “peerable” or not. So the observation of unexpected device-to-device activity is not dependent, from what I can tell, on whether or not the GPUs in question are 2080s, or whether the GPUs in question are peerable.

I also observe that if I move the cudaSetDevice call to the beginning of the code, before the first call to cudaMallocManaged, then the curious behavior goes away. It appears that the first cudaMallocManaged call is establishing a default association with the device that is selected at that time (which would be GPU 0), even though we subsequently establish residency for that allocation on the CPU (via the prefetch call). This may be by design of the managed memory subsystem, I cannot say.
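The reordering described above amounts to moving one line of the posted program. A sketch of the modified opening of main (same variable names as the original code; the rest of the program is unchanged):

```cpp
// Select the target device *before* the first cudaMallocManaged call,
// so the initial managed allocation is associated with dstDev rather
// than with the default device 0.
cudaSetDevice(dstDev);

cudaMallocManaged((void**) &src, N);
cudaMemPrefetchAsync(src, N, cudaCpuDeviceId);
memset(src, 0, N);
cudaDeviceSynchronize();

cudaMemPrefetchAsync(src, N, dstDev);
// ... remainder of the program as originally posted ...
```

With this ordering, running ./a.out 1 would be expected to show only the host-to-device transfer, matching the device 0 case.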

In any event, I see no data to associate this with the 2080 peer behavior referenced. It seems that the OP is also looking for an explanation of the device-to-device activity, based on the comment on the SO cross-posting:

Peerability, or the lack of it, could not explain the existence of device-to-device activity, and indeed it does not explain it according to my testing.


@Robert_Crovella

Ahhh… correct. I skimmed through and did not notice the “unexpected” part or the output given trivial input. Thanks for the clarification.