I’m doing experiments using unified memory. The code just performs simple matrix addition.
I record the execution time for:
- CPU execution
- GPU execution using manual memory copy: (1) data copy time, (2) kernel execution time
full code here: https://github.com/SaoYan/Learning_CUDA/blob/master/Ch2/sumMatricesOnDevice.cu
start = clock();
cudaMemcpy(d_A, h_A, nBytes, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, nBytes, cudaMemcpyHostToDevice);
end = clock();
double copyTime = ((double) (end - start)) / CLOCKS_PER_SEC;
...
start = clock();
sumArraysOnDevice<<<grid, block>>>(d_A, d_B, d_C, nx, ny);
cudaDeviceSynchronize();
end = clock();
double gpuTime = ((double) (end - start)) / CLOCKS_PER_SEC;
- GPU execution using unified memory
full code here: https://github.com/SaoYan/Learning_CUDA/blob/master/Ch4/sumMatricesUnify.cu
start = clock();
sumArraysOnDevice<<<grid, block>>>(A, B, C_gpu, nx, ny);
cudaDeviceSynchronize();
end = clock();
double gpuTime = ((double) (end - start)) / CLOCKS_PER_SEC;
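In the manual-copy version the transfer and the kernel are timed separately, while in the unified-memory version above any on-demand page migration is included in the kernel timing. A variant that might make the two measurements more comparable is sketched below. It is a minimal sketch, not the code from the repository: the kernel body is a stand-in, the matrix size and the `CHECK` macro are hypothetical, and it assumes the `cudaMemPrefetchAsync(ptr, bytes, device, stream)` signature from CUDA 8 through 12 and a GPU that supports concurrent managed access (Pascal or newer on Linux). It prefetches the managed buffers to the GPU first and times both phases with CUDA events instead of `clock()`.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort on any CUDA runtime error.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n",             \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Stand-in kernel (assumption): element-wise sum of two row-major nx-by-ny matrices.
__global__ void sumArraysOnDevice(float *A, float *B, float *C, int nx, int ny) {
    int ix = blockIdx.x * blockDim.x + threadIdx.x;
    int iy = blockIdx.y * blockDim.y + threadIdx.y;
    if (ix < nx && iy < ny) {
        size_t idx = (size_t)iy * nx + ix;
        C[idx] = A[idx] + B[idx];
    }
}

int main() {
    const int nx = 1 << 12, ny = 1 << 12;  // hypothetical matrix size
    const size_t n = (size_t)nx * ny;
    const size_t nBytes = n * sizeof(float);
    const int device = 0;
    CHECK(cudaSetDevice(device));

    float *A, *B, *C_gpu;
    CHECK(cudaMallocManaged(&A, nBytes));
    CHECK(cudaMallocManaged(&B, nBytes));
    CHECK(cudaMallocManaged(&C_gpu, nBytes));

    // Initializing on the host makes the input pages resident in host memory.
    for (size_t i = 0; i < n; ++i) { A[i] = 1.0f; B[i] = 2.0f; }

    cudaEvent_t t0, t1, t2;
    CHECK(cudaEventCreate(&t0));
    CHECK(cudaEventCreate(&t1));
    CHECK(cudaEventCreate(&t2));

    // Phase 1: migrate all three buffers to the GPU in bulk,
    // rather than page by page through GPU page faults.
    CHECK(cudaEventRecord(t0));
    CHECK(cudaMemPrefetchAsync(A, nBytes, device, 0));
    CHECK(cudaMemPrefetchAsync(B, nBytes, device, 0));
    CHECK(cudaMemPrefetchAsync(C_gpu, nBytes, device, 0));
    CHECK(cudaEventRecord(t1));

    // Phase 2: the kernel, launched in the same (default) stream after the prefetches.
    dim3 block(32, 32);
    dim3 grid((nx + block.x - 1) / block.x, (ny + block.y - 1) / block.y);
    sumArraysOnDevice<<<grid, block>>>(A, B, C_gpu, nx, ny);
    CHECK(cudaGetLastError());
    CHECK(cudaEventRecord(t2));
    CHECK(cudaEventSynchronize(t2));

    float migrateMs = 0.0f, kernelMs = 0.0f;
    CHECK(cudaEventElapsedTime(&migrateMs, t0, t1));
    CHECK(cudaEventElapsedTime(&kernelMs, t1, t2));
    printf("prefetch: %.3f ms, kernel: %.3f ms\n", migrateMs, kernelMs);

    // Reading the result on the host migrates C_gpu back via CPU page faults.
    size_t mismatches = 0;
    for (size_t i = 0; i < n; ++i) {
        if (C_gpu[i] != 3.0f) ++mismatches;
    }
    printf("mismatches: %zu\n", mismatches);

    CHECK(cudaEventDestroy(t0));
    CHECK(cudaEventDestroy(t1));
    CHECK(cudaEventDestroy(t2));
    CHECK(cudaFree(A));
    CHECK(cudaFree(B));
    CHECK(cudaFree(C_gpu));
    return mismatches == 0 ? EXIT_SUCCESS : EXIT_FAILURE;
}
```

If the prefetch behaves as expected, the kernel phase should no longer include page-fault handling, so its time could be compared with the kernel time of the manual-copy version, and the prefetch time with the `cudaMemcpy` time.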
The results show that using unified memory leads to worse performance than even CPU execution:
CPU: 0.6054 s
GPU (manual copy): data copy 0.3437 s + execution 0.0094 s
GPU (unified memory): 0.6901 s
I also profiled the unified-memory version:
==18742== Profiling application: ./sumMatricesUnify
==18742== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  733.00ms         1  733.00ms  733.00ms  733.00ms  sumArraysOnDevice(float*, float*, float*, int, int)
      API calls:   67.73%  733.06ms         1  733.06ms  733.06ms  733.06ms  cudaDeviceSynchronize
                   15.11%  163.52ms         4  40.880ms  20.078us  163.42ms  cudaMallocManaged
                   12.52%  135.54ms         4  33.884ms  30.281ms  39.130ms  cudaFree
                    4.54%  49.172ms         1  49.172ms  49.172ms  49.172ms  cudaDeviceReset
                    0.04%  439.77us        94  4.6780us      97ns  216.11us  cuDeviceGetAttribute
                    0.04%  393.58us         1  393.58us  393.58us  393.58us  cudaGetDeviceProperties
                    0.01%  90.503us         1  90.503us  90.503us  90.503us  cuDeviceTotalMem
                    0.01%  55.776us         1  55.776us  55.776us  55.776us  cudaLaunch
                    0.00%  40.911us         1  40.911us  40.911us  40.911us  cuDeviceGetName
                    0.00%  10.125us         1  10.125us  10.125us  10.125us  cudaConfigureCall
                    0.00%  3.2900us         1  3.2900us  3.2900us  3.2900us  cudaSetDevice
                    0.00%  1.8260us         5     365ns      75ns     967ns  cudaSetupArgument
                    0.00%  1.1570us         3     385ns     108ns     892ns  cuDeviceGetCount
                    0.00%  1.1310us         1  1.1310us  1.1310us  1.1310us  cudaGetLastError
                    0.00%     532ns         2     266ns     111ns     421ns  cuDeviceGet
==18742== Unified Memory profiling result:
Device "GeForce GTX 1080 Ti (0)"
   Count  Avg Size  Min Size  Max Size  Total Size  Total Time  Name
   49149  60.497KB  4.0000KB  0.9961MB  2.835632GB  503.3335ms  Host To Device
    2893         -         -         -           -  690.5301ms  Gpu page fault groups
Total CPU Page faults: 12289
Environment:
- GTX 1080 Ti
- CUDA 9.1
- Ubuntu 16.04.4 LTS