Performance leakage due excessive API times


I’m starting with CUDA, so this post may contain several “classic” errors, I hope you can help me to improve my skills :)

So, I took a C program I wrote, and made some CUDA improvement. As the C program was working with a struct made of an array of 4096 integers, I thought it may be useful to use CUDA in order to manage it.

After two busy days, I rewrote part of the code, resulting on new functions that uses CUDA; I tested it and went wonderfull when called from tester program (it calls isolated functions a hundred of times to test it).

Finally, I decided to put all together, and I found that the final result is way slower than original C program. After I made a basic profile, I find out that 80% of the calls are done by the API, causing a huge slow-down on the program.

Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   25.84%  100.00ms     72890  1.3710us  1.1520us  3.7440us  CUAdd(BigInteger*, BigInteger*, int)
                   20.90%  80.867ms     61337  1.3180us  1.0240us  3.9040us  CUNormalize(BigInteger*, int)
                   16.93%  65.510ms     47582  1.3760us  1.1840us  3.7760us  CUMove(BigInteger*, BigInteger*, int, int)
                   15.24%  58.963ms     37250  1.5820us  1.4400us  3.7120us  CUOffset(BigInteger*, BigInteger*, int, int)
                   10.73%  41.524ms     25023  1.6590us  1.4720us  3.7760us  CUSub(BigInteger*, BigInteger*)
                   10.33%  39.990ms     24087  1.6600us  1.4720us  3.6480us  CUMoveOffset(BigInteger*, BigInteger*, int)
                    0.02%  86.687us        43  2.0150us  1.8240us  3.9040us  CUclean(BigInteger*)
                    0.00%  11.808us         8  1.4760us  1.2480us  2.6880us  setItem(BigInteger*, char*, int)
                    0.00%  8.6400us         5  1.7280us  1.5680us  2.3680us  CURev(BigInteger*, int)
      API calls:   57.75%  34.9232s    268225  130.20us  6.6000us  10.525ms  cudaLaunchKernel
                   16.26%  9.83558s    159306  61.740us  23.000us  485.00us  cudaDeviceSynchronize
                   15.35%  9.28192s    318556  29.137us  20.400us  526.70us  cudaFree
                   10.55%  6.37982s    318564  20.026us  15.700us  140.56ms  cudaMallocManaged
                    0.08%  50.404ms         1  50.404ms  50.404ms  50.404ms  cuDevicePrimaryCtxRelease
                    0.00%  1.0868ms         2  543.40us  37.400us  1.0494ms  cuModuleUnload
                    0.00%  192.10us        97  1.9800us     100ns  85.400us  cuDeviceGetAttribute
                    0.00%  22.700us         1  22.700us  22.700us  22.700us  cuDeviceTotalMem
                    0.00%  12.200us         1  12.200us  12.200us  12.200us  cuDeviceGetPCIBusId
                    0.00%  1.6000us         3     533ns     200ns     800ns  cuDeviceGetCount
                    0.00%  1.5000us         2     750ns     300ns  1.2000us  cuDeviceGet
                    0.00%  1.0000us         1  1.0000us  1.0000us  1.0000us  cuDeviceGetName
                    0.00%     400ns         1     400ns     400ns     400ns  cuDeviceGetUuid
                    0.00%     400ns         1     400ns     400ns     400ns  cuDeviceGetLuid

==4284== Unified Memory profiling result:
Device "GeForce 920M (0)"
   Count  Avg Size  Min Size  Max Size  Total Size  Total Time  Name
 1472098  4.0000KB  4.0000KB  4.0000KB  5.615608GB  17.529751s  Host To Device
  455384  19.927KB  4.0000KB  20.000KB  8.654129GB  16.990673s  Device To Host

I would like to point that whole minute lost on the four first API calls. I know that they are kinda related - as more I launch a kernel, the most I have to clean memory (cudafree).

So, while I’m going to review the program flow to make that it makes less cpu <> gpu working, I’m wondering if there is any kind of advice I can take to minimize the timings that the API is giving - maybe some parametrization, maybe an advice about blocks?

Thank you.

most of your kernels average 2 microseconds or less. That is a really bad work partition or else an application unfit for GPU acceleration. It costs a few microseconds to launch a kernel. If you kernel runs for 2 microseconds, that is really bad. If your kernel runs for 1000 microseconds, not a big deal.

Are you allocating/deallocating memory in a loop? A really bad idea - a bad structure for program performance.
Another guess is maybe you’ve used managed memory in an object-oriented way (e.g. overloading new and delete for objects, or else via constructor/destructor), and that can be a recipe for inefficiency.

I’m guessing your overall application run time is on the order of 60 seconds, and about 30 seconds of that is spent moving data between host and device. The movement of data is fairly inefficient - e.g. 8 gigabytes in 16 seconds, whereas if you moved it all at once or efficiently it would probably take about a second or 2.


So, to sum up, I understand is better to have “big procedures” that run into CUDA that small “functions” right?

I mean, if I need to do


It can be better if both functions run as one unique kernel, as there will be only one call to API

That means rewrite almost all the code I did on C, but if it suits to make it work, I surely give it a try.


I wouldn’t say that sums it up, no. I pointed out at least 2 or 3 concepts. You picked one. I would not call that one concept a summary.

Furthermore, it’s like a puzzle trying to explain application behavior with no source code and no knowledge of even the basic problem being solved or design approach. So presenting just the summary profiler output (not even an application trace - which I don’t want to see for 300,000 kernel calls anyway) is like trying to guess what book you are reading based on a few words. It seems like to have a sensible dialog, it would be appropriate for you to share more information. That’s your choice, of course, but without it, my observations are pretty much just speculation. I have no idea what is actually going on in your application. Simple things like confirmation/rejection of any of my conjectures might move the conversation along in a more sensible way.

But with respect to the item you picked up on, yes, kernel “fusion” is a real thing in CUDA, or alternatively “fusion” of operations. You can google those concepts easily. And the purpose is to amortize fixed overheads over larger amounts of work, so that the overheads don’t dominate application behavior, as they appear to be doing here.

Anyway, I probably won’t be able to speculate further. Good luck!


As far as I may rewrite my code to solve the point of “fusión” I think the whole situation will change, and probably the other two points you showed may change - even dissappear.

I’m wanted to keep it short as I’m writing from my phone and this way is hard to post large amounts of text or code.

I will try to minimise the kernel calls so, surely I will be OK to share new information and/or code once is done.

Thanks for your time and help, I will write soon with the outcome.



I started rewriting my code, and did a simple test executing 1.000.000 times a simple function

Before I had:

void main(){
  for(i = 0;i < 1000000;i++){

  //making cudaMalloc, etc
  myKernel<<<a, b>>>();
  //making sync, freeing memory

And now:

void main(){
  //makind cudaMalloc, etc

  for(i = 0;i < 1000000;i++){
    myKernel<<<a, b>>>();

  //freeing memory

And the difference is clear enough.

This “small” change, also reduced the API calls, so timing is way better.

In conclussion

  1. You were right on all your points
  2. It was right to wait until doing a test before getting on the other two points

Again, thanks for your help!