Bad performance when using unified memory

I’m doing experiments using unified memory. The code just performs simple matrix addition.

I record the execution time for:

  1. CPU execution

  2. GPU execution using manual memory copy, measuring (1) data copy time and (2) kernel execution time
    full code here: https://github.com/SaoYan/Learning_CUDA/blob/master/Ch2/sumMatricesOnDevice.cu

start = clock();
cudaMemcpy(d_A, h_A, nBytes, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, nBytes, cudaMemcpyHostToDevice);
end = clock();
double copyTime = ((double) (end - start)) / CLOCKS_PER_SEC;
...
start = clock();
sumArraysOnDevice<<<grid, block>>>(d_A, d_B, d_C, nx, ny);
cudaDeviceSynchronize();
end = clock();
double gpuTime = ((double) (end - start)) / CLOCKS_PER_SEC;
  3. GPU execution using unified memory
    full code here: https://github.com/SaoYan/Learning_CUDA/blob/master/Ch4/sumMatricesUnify.cu
start = clock();
sumArraysOnDevice<<<grid, block>>>(A, B, C_gpu, nx, ny);
cudaDeviceSynchronize();
end = clock();
double gpuTime = ((double) (end - start)) / CLOCKS_PER_SEC;
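(As an aside on the measurement itself: `clock()` measures host CPU time; a common alternative is to time the kernel with CUDA events, which are recorded on the GPU. A minimal sketch, assuming the same kernel and pointers as above:)

```cuda
// Sketch: timing the same kernel launch with CUDA events instead of clock().
cudaEvent_t evStart, evStop;
cudaEventCreate(&evStart);
cudaEventCreate(&evStop);

cudaEventRecord(evStart);
sumArraysOnDevice<<<grid, block>>>(A, B, C_gpu, nx, ny);
cudaEventRecord(evStop);

cudaEventSynchronize(evStop);              // block until the kernel has finished
float ms = 0.0f;
cudaEventElapsedTime(&ms, evStart, evStop); // elapsed GPU time in milliseconds

cudaEventDestroy(evStart);
cudaEventDestroy(evStop);
```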

The results show that using unified memory leads to worse performance than even CPU execution:
CPU: 0.6054 s
GPU: data copy 0.3437 s + kernel execution 0.0094 s
GPU unified memory: 0.6901 s

I also profiled the unified-memory version:

==18742== Profiling application: ./sumMatricesUnify
==18742== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  733.00ms         1  733.00ms  733.00ms  733.00ms  sumArraysOnDevice(float*, float*, float*, int, int)
      API calls:   67.73%  733.06ms         1  733.06ms  733.06ms  733.06ms  cudaDeviceSynchronize
                   15.11%  163.52ms         4  40.880ms  20.078us  163.42ms  cudaMallocManaged
                   12.52%  135.54ms         4  33.884ms  30.281ms  39.130ms  cudaFree
                    4.54%  49.172ms         1  49.172ms  49.172ms  49.172ms  cudaDeviceReset
                    0.04%  439.77us        94  4.6780us      97ns  216.11us  cuDeviceGetAttribute
                    0.04%  393.58us         1  393.58us  393.58us  393.58us  cudaGetDeviceProperties
                    0.01%  90.503us         1  90.503us  90.503us  90.503us  cuDeviceTotalMem
                    0.01%  55.776us         1  55.776us  55.776us  55.776us  cudaLaunch
                    0.00%  40.911us         1  40.911us  40.911us  40.911us  cuDeviceGetName
                    0.00%  10.125us         1  10.125us  10.125us  10.125us  cudaConfigureCall
                    0.00%  3.2900us         1  3.2900us  3.2900us  3.2900us  cudaSetDevice
                    0.00%  1.8260us         5     365ns      75ns     967ns  cudaSetupArgument
                    0.00%  1.1570us         3     385ns     108ns     892ns  cuDeviceGetCount
                    0.00%  1.1310us         1  1.1310us  1.1310us  1.1310us  cudaGetLastError
                    0.00%     532ns         2     266ns     111ns     421ns  cuDeviceGet

==18742== Unified Memory profiling result:
Device "GeForce GTX 1080 Ti (0)"
   Count  Avg Size  Min Size  Max Size  Total Size  Total Time  Name
   49149  60.497KB  4.0000KB  0.9961MB  2.835632GB  503.3335ms  Host To Device
    2893         -         -         -           -  690.5301ms  Gpu page fault groups
Total CPU Page faults: 12289

Environment:
GTX 1080 Ti
CUDA 9.1
Ubuntu 16.04.4 LTS

Relying on the page-faulting mechanism to move data on demand is inefficient and slow compared to a bulk copy. This is why unified memory appears slow in this case.

You may wish to read this:

https://stackoverflow.com/questions/39782746/why-is-nvidia-pascal-gpus-slow-on-running-cuda-kernels-when-using-cudamallocmana/40011988#40011988

Thanks for the reply.

I also found this blog post, which explains the poor performance on Pascal GPUs and the solution.

https://devblogs.nvidia.com/unified-memory-cuda-beginners/
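For anyone finding this later: the fix described in that post is to prefetch the managed allocations to the device before the kernel launch, so the data arrives in a few bulk transfers instead of thousands of page faults. A minimal sketch, assuming the same pointers (`A`, `B`, `C_gpu`) and buffer size (`nBytes`) as in sumMatricesUnify.cu:

```cuda
// Sketch of the prefetching fix for demand-paged unified memory.
// Assumes A, B, C_gpu were allocated with cudaMallocManaged and
// nBytes is the size of each matrix, as in the question's code.
int device = 0;
cudaGetDevice(&device);

// Migrate the managed buffers to the GPU in bulk before the launch,
// instead of letting the kernel fault them in page by page.
cudaMemPrefetchAsync(A, nBytes, device);
cudaMemPrefetchAsync(B, nBytes, device);
cudaMemPrefetchAsync(C_gpu, nBytes, device);

sumArraysOnDevice<<<grid, block>>>(A, B, C_gpu, nx, ny);
cudaDeviceSynchronize();
```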