Do more parameters passed to a kernel make it slower?

ONeill · December 15, 2009, 10:08am

As far as I have seen, passing more parameters to a kernel does make it run a LOT slower.
The slowdown is the same whether the parameter resides in host or device memory. Am I right in thinking that the parameters get implicitly copied to the device, the multiprocessors, or wherever they are needed? I have even seen a kernel slow down from 0.050 ms to 0.080 ms because of a single additional int argument, which seems quite strange to me.
So I'm wondering whether there is a good way to optimize this when you need more parameters. Something like coalescing: passing one struct or array containing all the parameters instead of copying them from different “locations”.

Any suggestions or ideas on what's going on inside the black box?
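As a rough illustration of the struct idea, here is a minimal sketch that launches the same computation once with separate scalar arguments and once with a single struct passed by value. The names `ScaleParams`, `scaleSeparate`, `scalePacked` and `runBoth` are hypothetical and not from this thread. Note that the struct still occupies the same bytes in the kernel's parameter space, so this mainly tidies the signature; whether it changes launch time is something to measure, not assume.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical parameter bundle; the field names are illustrative only.
struct ScaleParams {
    float scale;
    float offset;
    int   n;
};

// Variant A: every value is its own kernel argument.
__global__ void scaleSeparate(float *data, float scale, float offset, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * scale + offset;
}

// Variant B: the same values travel as one struct passed by value.
__global__ void scalePacked(float *data, ScaleParams p) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < p.n) data[i] = data[i] * p.scale + p.offset;
}

// Runs both variants on a zero-filled buffer and returns element 0,
// or -1 if any CUDA call failed.
float runBoth(int n) {
    float *d = nullptr;
    if (cudaMalloc(&d, n * sizeof(float)) != cudaSuccess) return -1.0f;
    cudaMemset(d, 0, n * sizeof(float));

    ScaleParams p = {2.0f, 1.0f, n};
    const int threads = 256, blocks = (n + threads - 1) / threads;
    scaleSeparate<<<blocks, threads>>>(d, p.scale, p.offset, p.n);
    scalePacked<<<blocks, threads>>>(d, p);

    float first = -1.0f;
    cudaError_t err = cudaMemcpy(&first, d, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);
    return err == cudaSuccess ? first : -1.0f;
}

int main() {
    // Starting from 0: (0 * 2 + 1) * 2 + 1 = 3 after both kernels.
    printf("first element: %.1f\n", runBoth(1 << 20));
    return 0;
}
```

In both variants the scalar values are copied as part of the launch itself; no explicit cudaMemcpy is needed for arguments passed by value.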

_Big_Mac · December 15, 2009, 10:46am

Kernel arguments land in shared memory. If you already use smem, you might run into issues with occupancy.

And yes, obviously parameters need to be copied to the device.
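For context: on compute capability 1.x devices kernel arguments are passed through shared memory, while on 2.x and later they go through constant memory. A minimal sketch for checking what a kernel actually consumes is to ask the runtime with `cudaFuncGetAttributes`. The kernels `fourArgs` and `fiveArgs` and the helper `report` are hypothetical names; whether the argument bytes show up in the reported numbers depends on the device generation, so treat the output as something to inspect rather than a guaranteed breakdown.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Two toy kernels that differ only by one extra int argument.
__global__ void fourArgs(float *out, int a, int b, int c) {
    out[threadIdx.x] = a + b + c;
}

__global__ void fiveArgs(float *out, int a, int b, int c, int d) {
    out[threadIdx.x] = a + b + c + d;
}

// Prints the per-kernel resource usage reported by the runtime.
cudaError_t report(const char *name, const void *kernel) {
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, kernel);
    if (err != cudaSuccess) {
        printf("%s: %s\n", name, cudaGetErrorString(err));
        return err;
    }
    printf("%s: static shared = %zu B, constant = %zu B, registers/thread = %d\n",
           name, attr.sharedSizeBytes, attr.constSizeBytes, attr.numRegs);
    return err;
}

int main() {
    report("fourArgs", (const void *)fourArgs);
    report("fiveArgs", (const void *)fiveArgs);
    return 0;
}
```

If the shared memory and register figures are identical for both kernels, occupancy is unlikely to explain a difference between them.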

Do you consider 0.03ms “a LOT slower”? How do you measure that time anyway?

Cygnus_X1 · December 15, 2009, 11:25am

0.03ms x 400 kernel launches… then it becomes “a lot slower” ;)

Anyway, I removed a redundant parameter from some of my kernel calls. There are about 150 calls of those kernels, with very little host code in between. Nevertheless, I observed no performance gain compared with the calls that still had the redundant parameter.

ONeill · December 15, 2009, 1:19pm

Those 0.03 ms are a significant increase compared to the execution time without that parameter (the time goes up by 60%, from 0.050 ms to 0.080 ms)! Don't forget that how crucial this extra time is always depends on the problem…

The time was measured with cudaEvents, and there was still a lot of free shared memory available.
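For anyone who wants to reproduce this kind of measurement, here is a minimal sketch of event-based timing that averages over many launches instead of timing a single one. The kernels `kernel4` and `kernel5` and the helper `averageMs` are hypothetical stand-ins, not the kernels discussed above, and the 400 repetitions simply mirror the launch count mentioned earlier in the thread.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Two toy kernels that differ only by one extra int argument.
__global__ void kernel4(float *out, int n, int a, int b) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a + b;
}

__global__ void kernel5(float *out, int n, int a, int b, int c) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a + b + c;
}

// Average time in ms per call of `launch`, measured with CUDA events.
template <typename Launch>
float averageMs(Launch launch, int reps) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    launch();                 // untimed first call absorbs one-time setup
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    for (int r = 0; r < reps; ++r) launch();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / reps;
}

int main() {
    const int n = 1 << 16, threads = 256, blocks = (n + threads - 1) / threads;
    const int reps = 400;
    float *d = nullptr;
    cudaMalloc(&d, n * sizeof(float));

    float ms4 = averageMs([&] { kernel4<<<blocks, threads>>>(d, n, 1, 2); }, reps);
    float ms5 = averageMs([&] { kernel5<<<blocks, threads>>>(d, n, 1, 2, 3); }, reps);

    printf("4 args: %.4f ms per launch\n", ms4);
    printf("5 args: %.4f ms per launch\n", ms5);
    cudaFree(d);
    return 0;
}
```

Averaging matters here because a single launch of a few hundredths of a millisecond is close to the timer resolution and to launch-to-launch jitter.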

This was just an example to show how big a difference one more parameter can make. In contrast, doubling the computation performed by each thread increases the time by only about 10%…

So I think developers should consider reducing the number of parameters passed to kernels, where possible, to improve performance.

ONeill · December 15, 2009, 1:23pm

Yeah, looks like it doesn't increase every kernel's execution time.

And hell, you must be deeply in love with that cute bunch of kernels you've got there.

Cygnus_X1 · December 15, 2009, 2:40pm

It’s not that I have 400 different kernels, no :) It is the number of kernel calls!

Each call takes 3 or 4 parameters and does some serious computation, and in total it takes about 40 ms on a GTX 285 to get my work done. So even if launching kernels and passing parameters takes some time (well, it has to), it is not as severe and painful as one might think.

David_Lisin · December 17, 2009, 12:16pm

Hi,

The extra time is probably a result of the cudaMemcpy of the extra parameter, and not of the kernel launch.

So you will probably find that:

CUDA execution with 4 params and 1 kernel call: 0.03 ms

CUDA execution with 5 params and 1 kernel call: 0.08 ms

BUT!!!

CUDA execution with 4 params and 400 kernel calls: 40 ms

CUDA execution with 5 params and 400 kernel calls: 41 ms

If you could try this out, you may find that it is not that much slower, and that the extra time is a result of the cudaMemcpy.

You can confirm this using the visual profiler.
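Without a profiler, a rough way to separate the two costs is to bracket the copy and the launch with three events. The sketch below assumes the extra parameter is read through a device pointer (hypothetical names `addOffset`, `dOffset`, `timeCopyAndKernel`), which is the case where an explicit cudaMemcpy is needed at all; a scalar passed by value needs no such copy. Event timestamps are taken on the GPU, so treat the two intervals as approximations.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// The extra parameter is read through a device pointer, so it must be copied first.
__global__ void addOffset(float *data, int n, const int *offset) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += *offset;
}

// Fills copyMs and kernelMs; returns false if a CUDA error was reported.
bool timeCopyAndKernel(float *copyMs, float *kernelMs) {
    const int n = 1 << 16, threads = 256, blocks = (n + threads - 1) / threads;
    float *dData = nullptr;
    int *dOffset = nullptr, hOffset = 5;
    cudaMalloc(&dData, n * sizeof(float));
    cudaMalloc(&dOffset, sizeof(int));
    cudaMemset(dData, 0, n * sizeof(float));

    cudaEvent_t t0, t1, t2;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    cudaEventCreate(&t2);

    // Untimed pass so one-time setup does not pollute either interval.
    cudaMemcpy(dOffset, &hOffset, sizeof(int), cudaMemcpyHostToDevice);
    addOffset<<<blocks, threads>>>(dData, n, dOffset);
    cudaDeviceSynchronize();

    cudaEventRecord(t0);
    cudaMemcpy(dOffset, &hOffset, sizeof(int), cudaMemcpyHostToDevice);
    cudaEventRecord(t1);
    addOffset<<<blocks, threads>>>(dData, n, dOffset);
    cudaEventRecord(t2);
    cudaEventSynchronize(t2);

    cudaEventElapsedTime(copyMs, t0, t1);    // interval around the copy
    cudaEventElapsedTime(kernelMs, t1, t2);  // interval around the kernel
    bool ok = (cudaGetLastError() == cudaSuccess);

    cudaEventDestroy(t0);
    cudaEventDestroy(t1);
    cudaEventDestroy(t2);
    cudaFree(dData);
    cudaFree(dOffset);
    return ok;
}

int main() {
    float copyMs = 0.0f, kernelMs = 0.0f;
    if (timeCopyAndKernel(&copyMs, &kernelMs))
        printf("copy: %.4f ms, kernel: %.4f ms\n", copyMs, kernelMs);
    return 0;
}
```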

Kind regards,

David Lisin

ONeill · December 17, 2009, 1:37pm

This confirms what I suspected, because I see that if I launch my app a second time, it is way faster than before. So I think something is stored in memory between program runs and kernel executions, as you said too.

Cygnus_X1 · December 17, 2009, 2:37pm

I doubt anything is stored, as the GPU has no way of knowing whether the host variable you are passing as an argument has changed. However, it could be that something is initialised, and that takes some time.

David_Lisin · December 17, 2009, 3:22pm

It's not that something is stored as such; it's a result of initialization.

You see, kernel initialization takes extra time, as does the initialization of cudaMemcpy.

If you think it through, the bandwidth test (in the NVIDIA SDK) shows that host-to-device and device-to-host transfers have a bandwidth of approximately 1.5-2 GB/s. This is true, but what you don't see is that the initialization takes a long time. You can see this when you copy 50 KB to the device and when you copy 500 MB to the device. According to the bandwidth test, the 50 KB should take 0.025 ms and the 500 MB should take 0.25 seconds, but you find that the 50 KB memcpy actually takes a lot more time than that, and the gap between the two copy times is smaller than the sizes suggest.

So when we perform a cudaMemcpy, we have a large initialization time, but once we get past the initialization the copy is actually quite fast.

The same happens with kernel initialization.
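A minimal sketch for checking this on your own machine: time one host-to-device copy at several sizes and print the effective bandwidth. The helper name `timeCopyMs` and the chosen sizes are assumptions for illustration (the largest size is 50 MB rather than 500 MB so that it fits on small cards), and pageable host memory is used, as in a plain cudaMemcpy from a malloc'ed buffer.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Time in ms for one host-to-device copy of `bytes` bytes, or -1 on allocation failure.
float timeCopyMs(size_t bytes) {
    void *h = malloc(bytes);
    void *d = nullptr;
    if (!h || cudaMalloc(&d, bytes) != cudaSuccess) {
        free(h);
        return -1.0f;
    }
    memset(h, 0, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Untimed copy so one-time setup is not counted.
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);

    cudaEventRecord(start);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    free(h);
    return ms;
}

int main() {
    const size_t sizes[] = {50u << 10, 500u << 10, 5u << 20, 50u << 20};
    for (size_t bytes : sizes) {
        float ms = timeCopyMs(bytes);
        if (ms <= 0.0f) continue;  // skip failed or unmeasurably short copies
        // bytes / (ms * 1e6) converts bytes per millisecond to GB/s.
        printf("%10zu bytes: %8.4f ms, %6.2f GB/s effective\n",
               bytes, ms, bytes / (ms * 1.0e6));
    }
    return 0;
}
```

If fixed per-copy overhead dominates, the effective bandwidth printed for the small sizes should come out well below the figure for the large ones.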

Why does 1 kernel invocation take 10 ms, while 100 kernel invocations take only 17 ms?

Take into account that the kernel code has to be copied to the different cores of the GPU. This takes a long time. But once it is copied, the actual kernel execution is lightning fast.

So in short:

1 kernel invocation:

9.8 ms to initialize the cores with the kernel code (copying it to the 240 cores of the GPU) + 0.2 ms execution.

100 kernel invocations:

9.8 ms to initialize + 100 × 0.2 ms execution.

This is why the code in the NVIDIA SDK uses the term “warming up the GPU”. If you look at the source code, they execute the kernel once (this first run pays for the initialization, and note that they don't count its time), and then they execute the kernel 1000 times.

Even so, your code seems a perfect fit for a CUDA GPU, as you will only pay for the kernel initialization once, and after that you will get the best of the performance offered by CUDA and NVIDIA's GPUs.
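The warm-up pattern can be sketched as follows. This is not the SDK's code, just a minimal illustration with hypothetical names (`work`, `timeLaunchesMs`): the first launch is timed on its own, then 1000 further launches are timed and averaged. A host wall clock is used here so that host-side one-time setup is included in the first measurement.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void work(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

// Wall-clock time in ms for `reps` back-to-back launches, including the final sync.
double timeLaunchesMs(float *d, int n, int reps) {
    const int threads = 256, blocks = (n + threads - 1) / threads;
    cudaDeviceSynchronize();  // make sure no earlier work is still running
    auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; ++r) work<<<blocks, threads>>>(d, n);
    cudaDeviceSynchronize();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    const int n = 1 << 16, reps = 1000;
    float *d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    // First launch: includes whatever one-time setup the runtime still has to do.
    double firstMs = timeLaunchesMs(d, n, 1);
    // Subsequent launches: the steady-state cost, averaged over many calls.
    double steadyMs = timeLaunchesMs(d, n, reps) / reps;

    printf("first launch:        %.4f ms\n", firstMs);
    printf("steady-state launch: %.4f ms (average of %d)\n", steadyMs, reps);
    cudaFree(d);
    return 0;
}
```

Reporting the two numbers separately makes it clear how much of a single-launch measurement is one-time cost and how much is the kernel itself.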

Kind regards,

David Lisin
