Bad performance when using unified memory

I’m doing experiments using unified memory. The code just performs simple matrix addition.

I record the execution time for:

  1. CPU execution

  2. GPU execution using manual memory copy, measuring (1) data copy time and (2) kernel execution time
    full code here: https://github.com/SaoYan/Learning_CUDA/blob/master/Ch2/sumMatricesOnDevice.cu

start = clock();
cudaMemcpy(d_A, h_A, nBytes, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, nBytes, cudaMemcpyHostToDevice);
end = clock();
double copyTime = ((double) (end - start)) / CLOCKS_PER_SEC;
...
start = clock();
sumArraysOnDevice<<<grid, block>>>(d_A, d_B, d_C, nx, ny);
cudaDeviceSynchronize();
end = clock();
double gpuTime = ((double) (end - start)) / CLOCKS_PER_SEC;
  3. GPU execution using unified memory
    full code here: https://github.com/SaoYan/Learning_CUDA/blob/master/Ch4/sumMatricesUnify.cu
start = clock();
sumArraysOnDevice<<<grid, block>>>(A, B, C_gpu, nx, ny);
cudaDeviceSynchronize();
end = clock();
double gpuTime = ((double) (end - start)) / CLOCKS_PER_SEC;
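(As an aside on the measurement itself: `clock()` measures host CPU time; a common alternative is to time the kernel with CUDA events, which are recorded on the GPU. A minimal sketch, assuming the same kernel and pointers as above:)

```cuda
// Sketch: timing the same kernel launch with CUDA events instead of clock().
cudaEvent_t evStart, evStop;
cudaEventCreate(&evStart);
cudaEventCreate(&evStop);

cudaEventRecord(evStart);
sumArraysOnDevice<<<grid, block>>>(A, B, C_gpu, nx, ny);
cudaEventRecord(evStop);

cudaEventSynchronize(evStop);              // block until the kernel has finished
float ms = 0.0f;
cudaEventElapsedTime(&ms, evStart, evStop); // elapsed GPU time in milliseconds

cudaEventDestroy(evStart);
cudaEventDestroy(evStop);
```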

The results show that using unified memory leads to worse performance than even CPU execution:
CPU: 0.6054 s
GPU: data copy 0.3437 s + kernel execution 0.0094 s
GPU unified memory: 0.6901 s

I also profiled the unified-memory version:

==18742== Profiling application: ./sumMatricesUnify
==18742== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  733.00ms         1  733.00ms  733.00ms  733.00ms  sumArraysOnDevice(float*, float*, float*, int, int)
      API calls:   67.73%  733.06ms         1  733.06ms  733.06ms  733.06ms  cudaDeviceSynchronize
                   15.11%  163.52ms         4  40.880ms  20.078us  163.42ms  cudaMallocManaged
                   12.52%  135.54ms         4  33.884ms  30.281ms  39.130ms  cudaFree
                    4.54%  49.172ms         1  49.172ms  49.172ms  49.172ms  cudaDeviceReset
                    0.04%  439.77us        94  4.6780us      97ns  216.11us  cuDeviceGetAttribute
                    0.04%  393.58us         1  393.58us  393.58us  393.58us  cudaGetDeviceProperties
                    0.01%  90.503us         1  90.503us  90.503us  90.503us  cuDeviceTotalMem
                    0.01%  55.776us         1  55.776us  55.776us  55.776us  cudaLaunch
                    0.00%  40.911us         1  40.911us  40.911us  40.911us  cuDeviceGetName
                    0.00%  10.125us         1  10.125us  10.125us  10.125us  cudaConfigureCall
                    0.00%  3.2900us         1  3.2900us  3.2900us  3.2900us  cudaSetDevice
                    0.00%  1.8260us         5     365ns      75ns     967ns  cudaSetupArgument
                    0.00%  1.1570us         3     385ns     108ns     892ns  cuDeviceGetCount
                    0.00%  1.1310us         1  1.1310us  1.1310us  1.1310us  cudaGetLastError
                    0.00%     532ns         2     266ns     111ns     421ns  cuDeviceGet

==18742== Unified Memory profiling result:
Device "GeForce GTX 1080 Ti (0)"
   Count  Avg Size  Min Size  Max Size  Total Size  Total Time  Name
   49149  60.497KB  4.0000KB  0.9961MB  2.835632GB  503.3335ms  Host To Device
    2893         -         -         -           -  690.5301ms  Gpu page fault groups
Total CPU Page faults: 12289

Environment:
GTX 1080 Ti
CUDA 9.1
Ubuntu 16.04.4 LTS

Relying on the page-faulting mechanism to move data on demand is inefficient and slow compared to a bulk copy. This is why unified memory appears slow in this case.

You may wish to read this:

https://stackoverflow.com/questions/39782746/why-is-nvidia-pascal-gpus-slow-on-running-cuda-kernels-when-using-cudamallocmana/40011988#40011988

Thanks for the reply.

I also found this blog post, which explains the poor performance on Pascal GPUs and the solution.

https://devblogs.nvidia.com/unified-memory-cuda-beginners/
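For anyone finding this later: the fix described in that post is to prefetch the managed allocations to the device before the kernel launch, so the data arrives in a few bulk transfers instead of thousands of page faults. A minimal sketch, assuming the same pointers (`A`, `B`, `C_gpu`) and buffer size (`nBytes`) as in sumMatricesUnify.cu:

```cuda
// Sketch of the prefetching fix for demand-paged unified memory.
// Assumes A, B, C_gpu were allocated with cudaMallocManaged and
// nBytes is the size of each matrix, as in the question's code.
int device = 0;
cudaGetDevice(&device);

// Migrate the managed buffers to the GPU in bulk before the launch,
// instead of letting the kernel fault them in page by page.
cudaMemPrefetchAsync(A, nBytes, device);
cudaMemPrefetchAsync(B, nBytes, device);
cudaMemPrefetchAsync(C_gpu, nBytes, device);

sumArraysOnDevice<<<grid, block>>>(A, B, C_gpu, nx, ny);
cudaDeviceSynchronize();
```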