Hello:
I’m just starting with CUDA, so this post may contain several “classic” beginner mistakes; I hope you can help me improve my skills :)
So, I took a C program I had written and ported part of it to CUDA. Since the program works with a struct holding an array of 4096 integers, I thought CUDA could be a good fit for managing it.
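For reference, the struct is essentially just that array (a simplified sketch; the field name here is made up for this post, not my exact code):

    /* Simplified sketch of my number type; field name is illustrative */
    typedef struct {
        int n[4096];   /* the 4096 integers the program operates on */
    } BigInteger;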
After two busy days I had rewritten part of the code into new functions that use CUDA. I tested them, and everything went wonderfully when they were called from my tester program (it calls each function in isolation a hundred times).
Finally, I put it all together, and the final result is way slower than the original C program. A basic profile with nvprof shows that roughly 80% of the program's time is spent in CUDA API calls, causing a huge slowdown:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 25.84% 100.00ms 72890 1.3710us 1.1520us 3.7440us CUAdd(BigInteger*, BigInteger*, int)
20.90% 80.867ms 61337 1.3180us 1.0240us 3.9040us CUNormalize(BigInteger*, int)
16.93% 65.510ms 47582 1.3760us 1.1840us 3.7760us CUMove(BigInteger*, BigInteger*, int, int)
15.24% 58.963ms 37250 1.5820us 1.4400us 3.7120us CUOffset(BigInteger*, BigInteger*, int, int)
10.73% 41.524ms 25023 1.6590us 1.4720us 3.7760us CUSub(BigInteger*, BigInteger*)
10.33% 39.990ms 24087 1.6600us 1.4720us 3.6480us CUMoveOffset(BigInteger*, BigInteger*, int)
0.02% 86.687us 43 2.0150us 1.8240us 3.9040us CUclean(BigInteger*)
0.00% 11.808us 8 1.4760us 1.2480us 2.6880us setItem(BigInteger*, char*, int)
0.00% 8.6400us 5 1.7280us 1.5680us 2.3680us CURev(BigInteger*, int)
API calls: 57.75% 34.9232s 268225 130.20us 6.6000us 10.525ms cudaLaunchKernel
16.26% 9.83558s 159306 61.740us 23.000us 485.00us cudaDeviceSynchronize
15.35% 9.28192s 318556 29.137us 20.400us 526.70us cudaFree
10.55% 6.37982s 318564 20.026us 15.700us 140.56ms cudaMallocManaged
0.08% 50.404ms 1 50.404ms 50.404ms 50.404ms cuDevicePrimaryCtxRelease
0.00% 1.0868ms 2 543.40us 37.400us 1.0494ms cuModuleUnload
0.00% 192.10us 97 1.9800us 100ns 85.400us cuDeviceGetAttribute
0.00% 22.700us 1 22.700us 22.700us 22.700us cuDeviceTotalMem
0.00% 12.200us 1 12.200us 12.200us 12.200us cuDeviceGetPCIBusId
0.00% 1.6000us 3 533ns 200ns 800ns cuDeviceGetCount
0.00% 1.5000us 2 750ns 300ns 1.2000us cuDeviceGet
0.00% 1.0000us 1 1.0000us 1.0000us 1.0000us cuDeviceGetName
0.00% 400ns 1 400ns 400ns 400ns cuDeviceGetUuid
0.00% 400ns 1 400ns 400ns 400ns cuDeviceGetLuid
==4284== Unified Memory profiling result:
Device "GeForce 920M (0)"
Count Avg Size Min Size Max Size Total Size Total Time Name
1472098 4.0000KB 4.0000KB 4.0000KB 5.615608GB 17.529751s Host To Device
455384 19.927KB 4.0000KB 20.000KB 8.654129GB 16.990673s Device To Host
I would like to point out the almost whole minute lost in the first four API calls. I know they are related: the more kernel launches I do, the more memory I have to allocate and clean up (cudaMallocManaged / cudaFree).
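Judging from the call counts (roughly one cudaMallocManaged/cudaFree pair and one cudaDeviceSynchronize per kernel launch), my wrapper functions follow more or less this pattern (a simplified sketch, not my exact code):

    /* Simplified sketch of the current per-call pattern (illustrative names) */
    void add(BigInteger *a, BigInteger *b)
    {
        BigInteger *da, *db;
        cudaMallocManaged(&da, sizeof(BigInteger));  /* fresh allocation on every call */
        cudaMallocManaged(&db, sizeof(BigInteger));
        *da = *a;                                    /* host writes -> Host To Device traffic */
        *db = *b;
        CUAdd<<<4, 1024>>>(da, db, 4096);            /* one tiny ~1.4us kernel */
        cudaDeviceSynchronize();                     /* blocking sync after every launch */
        *a = *da;                                    /* host reads -> Device To Host traffic */
        cudaFree(da);                                /* freed immediately afterwards */
        cudaFree(db);
    }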
So, while I review the program flow to reduce the CPU <> GPU back-and-forth, I’m wondering if there is any advice for minimizing the time spent in these API calls - maybe some parameterization, maybe advice about block configuration?
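One idea I’m already considering is allocating the working buffers once at startup and reusing them across calls, instead of paying for a cudaMallocManaged/cudaFree pair on every operation - again, just a sketch:

    /* Sketch: allocate scratch buffers once, reuse them for every operation */
    static BigInteger *da = NULL, *db = NULL;

    void initBuffers(void)
    {
        cudaMallocManaged(&da, sizeof(BigInteger));  /* paid once, not per call */
        cudaMallocManaged(&db, sizeof(BigInteger));
    }

    void add(BigInteger *a, BigInteger *b)
    {
        *da = *a;
        *db = *b;
        CUAdd<<<4, 1024>>>(da, db, 4096);
        cudaDeviceSynchronize();                     /* could this sync be batched too? */
        *a = *da;
    }

    void freeBuffers(void)
    {
        cudaFree(da);                                /* paid once at shutdown */
        cudaFree(db);
    }

Would reusing buffers like this be the right direction, or is there something more fundamental I should change?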
Thank you.