Do more parameters passed to kernel make it slower?

As far as I have seen, passing more parameters to a kernel does make it run a LOT slower.
The slowdown is the same whether the parameter resides in host or device memory. Am I right in thinking that those parameters get implicitly copied to the device / the multiprocessors / wherever they are needed? I have even seen a kernel slow down from 0.050 ms to 0.080 ms because of a single additional int parameter, which I find quite strange.
So I'm wondering if there is a good way to optimize in case you need more parameters. Something like coalescing: passing one struct or array containing all the parameters instead of copying them from different "locations".

Any suggestions or ideas on what's going on inside the black box?
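The "one struct instead of many scalars" idea could look like this. All names here are made up for illustration; note that the struct is still passed by value and copied as a kernel argument, so this is mainly a code-organization change, not a guaranteed speedup:

```cuda
// Hypothetical: bundle several scalar parameters into one struct.
struct KernelParams {
    int   width;
    int   height;
    float scale;
    float bias;
};

// The struct is passed by value, just like individual scalars would be.
__global__ void processWithStruct(const float *in, float *out, KernelParams p)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < p.width * p.height)
        out[idx] = in[idx] * p.scale + p.bias;
}
```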

Kernel arguments land in shared memory. If you already use a lot of shared memory, you might run into occupancy issues.

And yes, obviously parameters need to be copied to the device.

Do you consider 0.03 ms "a LOT slower"? How do you measure that time, anyway?

0.03 ms × 400 kernel launches… then it becomes "a lot slower" ;)

Anyway, I removed a redundant parameter from some of my kernel calls. There are about 150 calls of those kernels, with very little host code in between. Nevertheless, I observed no performance gain compared with the calls that still had the redundant parameter.

This 0.03 ms is a significant increase compared to the execution time without that parameter (the time needed goes up by 62.5%)! Don't forget that how crucial this increase is always depends on the problem…

The time was measured with cudaEvents, and there was still plenty of free shared memory available.
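For reference, the cudaEvents measurement usually looks something like this sketch (`myKernel`, `grid`, `block`, and the arguments are placeholders):

```cuda
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
myKernel<<<grid, block>>>(d_data, n);    // the launch being timed
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);              // wait until the kernel is done

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // elapsed GPU time in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);
```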

This was just an example to show what a big difference one extra parameter can make. By contrast, doubling the computation performed by each thread increases the time by only about 10%…

So I think developers should consider reducing the number of parameters passed to kernels, where possible, to improve performance.

Yeah, it looks like it doesn't increase every kernel's execution time.

And hell, you must be in deep love with that cute bunch of kernels you've got there :haha:

It's not that I have 400 different kernels, no :) It is the number of kernel calls!

Each call takes 3 or 4 parameters and does some serious computation; in total it takes about 40 ms on a GTX 285 to get my work done. So even if calling kernels and passing parameters takes some time (well, it has to), it is not as severe and painful as one might think.

Hi,

the extra time is probably a result of the cudaMemcpy of the extra parameter, not of the kernel launch itself.

So you will probably find that:

CUDA execution with 4 params, 1 kernel call: 0.03 ms

CUDA execution with 5 params, 1 kernel call: 0.08 ms

BUT!!!

CUDA execution with 4 params, 400 kernel calls: 40 ms

CUDA execution with 5 params, 400 kernel calls: 41 ms

If you try this out, you may find that it is not that much slower, and that the extra time is a result of the cudaMemcpy.

You can confirm this using the visual profiler.
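A sketch of the experiment suggested above, assuming two hypothetical kernels `kernel4` and `kernel5` that differ only in the extra parameter (all other names are placeholders):

```cuda
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

// Time 400 back-to-back launches with 4 parameters...
cudaEventRecord(start, 0);
for (int i = 0; i < 400; ++i)
    kernel4<<<grid, block>>>(a, b, c, d);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
float ms4;
cudaEventElapsedTime(&ms4, start, stop);

// ...then the same 400 launches with 5 parameters, and compare ms4 vs ms5.
cudaEventRecord(start, 0);
for (int i = 0; i < 400; ++i)
    kernel5<<<grid, block>>>(a, b, c, d, e);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
float ms5;
cudaEventElapsedTime(&ms5, start, stop);
```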

Kind regards,

David Lisin

This confirms what I suspected, because I see that if I launch my app a second time, it is much faster than before. So I think something is kept in memory between multiple program runs and kernel executions, as you said.

I doubt anything is stored, as the GPU has no way of knowing whether the host variable you are passing as an argument has changed or not. However, it could be the case that something is initialized, and that takes some time.

It's not that something is stored as such; it's a result of initialization.

You see, kernel initialization takes extra time, as does the initialization of a cudaMemcpy.

If you think about it, the bandwidth test (in the NVIDIA SDK) shows that host-to-device and device-to-host transfers have a bandwidth of approximately 1.5-2 GB/s. This is true, but what you don't see is that the initialization takes a long time. You can see this when you try to copy 50 KB to the device versus 500 MB. According to the bandwidth test, 50 KB should take 0.025 ms and 500 MB should take 0.25 seconds, but you will find that the 50 KB memcpy actually takes a lot more time than that, and the difference between the copy times of the two sizes is surprisingly small.

So when we perform a cudaMemcpy we pay a large initialization cost, but once we get past the initialization it is actually quite fast.

The same happens with kernel initialization.

Does 1 kernel invocation take 10 ms, while 100 kernel invocations take 17 ms?

Take into account that the kernel code has to be copied to the different cores of the GPU. This takes a long time, but once it is copied, the actual kernel execution is lightning fast.

So in short:

1 kernel invocation:

9.8 ms to initialize the cores with the kernel code (copying to the 240 cores of the GPU) + 0.2 ms execution.

100 kernel invocations:

9.8 ms to initialize + 100 × 0.2 ms execution.

This is why the code available in the NVIDIA SDK uses the term "warming up the GPU". If you look at the source code, they execute the kernel once first (this is the first launch, so it pays the initialization, and notice: they don't count this time), and then they execute the kernel 1000 times.
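The warm-up idiom from the SDK samples looks roughly like this (a sketch with placeholder names):

```cuda
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

// First launch: pays any one-time setup cost; deliberately not timed.
myKernel<<<grid, block>>>(d_data);
cudaThreadSynchronize();   // wait for the warm-up launch to finish

// Now time the "steady state" over many launches.
cudaEventRecord(start, 0);
for (int i = 0; i < 1000; ++i)
    myKernel<<<grid, block>>>(d_data);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

float totalMs;
cudaEventElapsedTime(&totalMs, start, stop);
float msPerLaunch = totalMs / 1000.0f;   // amortized per-launch time
```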

Even so, your code seems perfect for a CUDA GPU, as you only suffer the kernel initialization once; the rest of the time you get the best of the performance offered by CUDA and NVIDIA's GPUs.

Kind regards,

David Lisin