HOST2GPU.cu is a small test program which:
- initializes a managed buffer named `src` and prefetches it to the destination GPU `dstDev` (GPU0 or GPU1);
- copies `src` to a second managed buffer named `d` using the `cudaMemcpyDeviceToDevice` flag.
When I use GPU0 there is a single host-to-device transfer, but when I use GPU1 there is, in addition to the host-to-device transfer, a transfer from GPU1 to GPU0. Why does that GPU1-to-GPU0 transfer happen, and why is the size different (not 4 KB)?
My environment:
- ubuntu 18.04
- Driver Version: 440.100 CUDA Version: 10.2
- two GeForce RTX 2080
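(Not part of the original question, but since this involves two GPUs, it may be worth knowing whether they can reach each other over peer-to-peer, as that changes how unified-memory migrations are routed. A minimal check, assuming the two cards are devices 0 and 1, might look like:)

```cpp
#include <cuda_runtime.h>
#include <iostream>

int main()
{
    // Query peer access in both directions between device 0 and device 1.
    int canAccess01 = 0, canAccess10 = 0;
    cudaDeviceCanAccessPeer(&canAccess01, 0, 1);
    cudaDeviceCanAccessPeer(&canAccess10, 1, 0);
    std::cout << "GPU0 -> GPU1 peer access: " << canAccess01 << std::endl;
    std::cout << "GPU1 -> GPU0 peer access: " << canAccess10 << std::endl;
    return 0;
}
```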
#include <cuda_runtime.h>
#include <cstdlib>
#include <cstring>
#include <iostream>
using namespace std;

int main(int argc, char* argv[])
{
    size_t N = (1l << 10);
    cout << "totalBytes:" << N << endl;

    char* src, * d;
    cudaMallocManaged((void**)&src, N);

    // Populate src on the host first.
    cudaMemPrefetchAsync(src, N, cudaCpuDeviceId);
    memset(src, 0, N);
    cudaDeviceSynchronize();

    int dstDev = atoi(argv[1]);
    cout << "use gpu " << dstDev << endl;
    cudaSetDevice(dstDev);

    // Move both buffers to the destination GPU before the copy.
    cudaMemPrefetchAsync(src, N, dstDev);
    cudaMallocManaged((void**)&d, N);
    cudaMemPrefetchAsync(d, N, dstDev);
    cudaDeviceSynchronize();

    cudaMemcpy(d, src, N, cudaMemcpyDeviceToDevice);
    cudaDeviceSynchronize();

    cudaFree(src);
    cudaFree(d);
    return 0;
}
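(One way to narrow this down, not in the original program: the CUDA runtime can report where a managed range was last prefetched via `cudaMemRangeGetAttribute` with `cudaMemRangeAttributeLastPrefetchLocation`. A hedged sketch, where `printLastPrefetchLocation` is a helper name I made up:)

```cpp
#include <cuda_runtime.h>
#include <iostream>

// Print the device a managed range was last prefetched to.
// cudaCpuDeviceId (-1) means the host; cudaInvalidDeviceId means
// the range has never been prefetched.
void printLastPrefetchLocation(const void* ptr, size_t bytes, const char* name)
{
    int loc = 0;
    cudaMemRangeGetAttribute(&loc, sizeof(loc),
                             cudaMemRangeAttributeLastPrefetchLocation,
                             ptr, bytes);
    std::cout << name << " last prefetched to device " << loc << std::endl;
}
```

Calling this on `src` and `d` right before the `cudaMemcpy` would confirm whether both ranges are actually resident on `dstDev` at that point.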
Results of nvprof:
$ nvcc HOST2GPU.cu && ./a.out 0
...
==19251== Unified Memory profiling result:
Device "GeForce RTX 2080 (0)"
Count Avg Size Min Size Max Size Total Size Total Time Name
1 4.0000KB 4.0000KB 4.0000KB 4.000000KB 1.632000us Host To Device
$ nvcc HOST2GPU.cu && ./a.out 1
...
==19021== Unified Memory profiling result:
Device "GeForce RTX 2080 (0)"
Count Avg Size Min Size Max Size Total Size Total Time Name
2 - - - - 630.9440us Gpu page fault groups
2 4.0000KB 4.0000KB 4.0000KB 8.000000KB 3.104000us Transfers to Device
Device "GeForce RTX 2080 (1)"
Count Avg Size Min Size Max Size Total Size Total Time Name
1 4.0000KB 4.0000KB 4.0000KB 4.000000KB 1.600000us Host To Device
2 4.0000KB 4.0000KB 4.0000KB 8.000000KB 3.104000us Transfers from Device