As far as I have seen, passing more parameters to a kernel does make it run a LOT slower.
The slowdown is the same whether that parameter resides in host or device memory. Am I right in thinking that those parameters get implicitly copied to the device / the MPs / wherever they are needed? I have even seen kernels slow down from 0.050 ms to 0.080 ms because of only one additional int being passed, which is quite strange, I think.
So I’m wondering if there’s a good way to bring in some optimizations in case you need more parameters. Something like coalescing: passing one struct or array containing all parameters instead of copying them from different “locations”.
Any suggestions or ideas on what’s going on inside the black box?
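For reference, the "one struct" idea from the question could look roughly like the sketch below. The names `KernelParams`, `scaleAndShift` and `runScaleAndShift` are made up for illustration, error checking is left out, and nothing in this thread establishes that bundling the arguments this way actually changes the launch cost; it only shows that a struct can be passed by value like any other kernel argument.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical parameter bundle; the field names are illustrative only.
struct KernelParams {
    int   n;       // number of elements
    float scale;   // multiplier applied to each element
    float offset;  // constant added to each element
};

// The struct is passed by value, exactly like a plain int or float argument.
__global__ void scaleAndShift(float *data, KernelParams p)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < p.n)
        data[i] = data[i] * p.scale + p.offset;
}

// Runs the kernel on p.n elements that all start at 1.0f and returns the
// first result. Error checking is omitted to keep the sketch short.
float runScaleAndShift(KernelParams p)
{
    std::vector<float> host(p.n, 1.0f);
    size_t bytes = p.n * sizeof(float);

    float *dev = nullptr;
    cudaMalloc(&dev, bytes);
    cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);

    scaleAndShift<<<(p.n + 255) / 256, 256>>>(dev, p);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return host[0];
}

int main()
{
    KernelParams p = { 1024, 2.0f, 3.0f };
    printf("first element: %f\n", runScaleAndShift(p));  // 1 * 2 + 3 if the launch succeeded
    return 0;
}
```

Whether this is any faster than passing the three values separately would have to be measured on the hardware in question.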
0.03ms x 400 kernel launches… then it becomes “a lot slower” ;)
Anyway, I removed a redundant parameter from some of my kernel calls. There are about 150 calls of those kernels, with very little host code in between. Nevertheless, I observed no performance gain compared with the calls that had the redundant parameter present.
Those 0.03 ms are a significant increase compared with the execution time without that parameter (going from 0.050 ms to 0.080 ms, the time needed goes up by 60%)! Don’t forget that how crucial this increase is always depends on the problem…
The time was measured with cudaEvents, and there was still a lot of free shared memory available.
This was just an example to show what a big difference one more parameter can make. In contrast, doubling the computations performed by each thread means an increase of only, say, 10%…
So I think developers should try to reduce the number of parameters passed to kernels where possible to improve performance.
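Since the posters in this thread report different results, a comparison along these lines is easy to repeat. The following is only a sketch of how such a measurement with CUDA events might be set up: the two kernels, the helper `averageLaunchMs` and the grid size are assumptions made for illustration, and error checking is omitted.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel with three value parameters.
__global__ void addThree(float *out, float a, float b, float c)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = a + b + c;
}

// Same kernel with one additional int parameter (used, so it is not a dead argument).
__global__ void addThreeExtra(float *out, float a, float b, float c, int extra)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = a + b + c + (float)extra;
}

// Average time per launch in milliseconds, measured with CUDA events.
float averageLaunchMs(bool withExtra, int launches)
{
    const int threads = 256, blocks = 64;
    float *out = nullptr;
    cudaMalloc(&out, threads * blocks * sizeof(float));

    // One untimed launch first, so one-time setup cost is not counted.
    if (withExtra) addThreeExtra<<<blocks, threads>>>(out, 1.0f, 2.0f, 3.0f, 0);
    else           addThree<<<blocks, threads>>>(out, 1.0f, 2.0f, 3.0f);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int k = 0; k < launches; ++k) {
        if (withExtra) addThreeExtra<<<blocks, threads>>>(out, 1.0f, 2.0f, 3.0f, 0);
        else           addThree<<<blocks, threads>>>(out, 1.0f, 2.0f, 3.0f);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until all timed launches have finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(out);
    return ms / launches;
}

int main()
{
    const int launches = 400;  // the launch count mentioned in this thread
    printf("3 value params: %f ms per launch\n", averageLaunchMs(false, launches));
    printf("4 value params: %f ms per launch\n", averageLaunchMs(true, launches));
    return 0;
}
```

Averaging over many launches smooths out timer noise; a single launch of 0.05 ms is close to the resolution at which individual measurements start to vary from run to run.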
It’s not that I have 400 different kernels, no :) It is the number of kernel calls!
Each call takes 3 or 4 parameters and does some serious computation, and in total it takes about 40 ms on a GTX285 to get my work done. So even if calling kernels and passing parameters takes some time (well, it has to), it is not as severe and painful as one might think.
This confirms what I suspected, because I see that if I launch my app a second time, it is way faster than before. So I think there’s something stored in memory between multiple program runs and kernel executions, as you told us, too.
I doubt anything is stored, as the GPU has no way of knowing whether a host variable that you are passing as an argument has changed or not. However, it could be that something is initialised and that takes some time.
It’s not that something is stored as such; it is a result of initialization.
You see, kernel initialization takes extra time, as does the initialization of cudaMemcpy.
If you think it through, the bandwidth test (in the NVIDIA SDK) shows that host-to-device and device-to-host transfers have a bandwidth of approx. 1.5-2 GB/s. This is true, but what you don’t see is that the initialization takes a long time. This can be seen when you copy 50 KB to the device and then 500 MB. According to the bandwidth test (at 2 GB/s), the 50 KB should take 0.025 ms and the 500 MB should take 0.25 seconds, but you find that the 50 KB memcpy actually takes a lot more time than that, so the gap between the two copy times is smaller than the sizes suggest.
So when we perform a cudaMemcpy, we have a large initialization time, but once we get past the initialization it is actually quite fast.
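The arithmetic above, and the comparison with real copy times, can be sketched as follows. The helper names `idealCopyMs` and `measuredCopyMs` are made up, the 2 GB/s figure is simply the upper end of the range quoted above, and the 500 MB case needs that much free host and device memory (reduce it if necessary). Error checking is omitted.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Ideal transfer time in ms if only bandwidth mattered (no fixed overhead).
// gbPerSec is in units of 1e9 bytes per second.
double idealCopyMs(double bytes, double gbPerSec)
{
    return bytes / (gbPerSec * 1e9) * 1e3;
}

// Measured host-to-device copy time in ms, using CUDA events.
float measuredCopyMs(size_t bytes)
{
    std::vector<char> host(bytes);  // zero-initialized pageable host buffer
    void *dev = nullptr;
    cudaMalloc(&dev, bytes);        // the first CUDA call also pays for runtime setup

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dev);
    return ms;
}

int main()
{
    const double assumedGBps = 2.0;                      // assumption taken from the thread
    const size_t sizes[] = { 50 * 1000, 500 * 1000 * 1000 };  // 50 KB and 500 MB

    for (size_t bytes : sizes) {
        printf("%zu bytes: ideal %.3f ms, measured %.3f ms\n",
               bytes, idealCopyMs((double)bytes, assumedGBps), measuredCopyMs(bytes));
    }
    return 0;
}
```

With the assumed 2 GB/s, the ideal figures come out at 0.025 ms and 250 ms, matching the numbers in the post; how far the measured values sit above them shows how much of each copy is fixed overhead.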
The same happens with kernel initialization.
Why does 1 kernel invocation take 10 ms, while 100 kernel invocations take 17 ms?
Take into account that the kernel code has to be copied to the different cores of the GPU. This takes a long time. But once it is copied, the actual kernel execution is lightning fast.
So in short:
1 kernel invocation:
9.8 ms to initialize the cores with kernel code (copying it to the 240 cores of the GPU) + 0.2 ms execution.
100 kernel invocations:
9.8 ms to initialize + 100 × 0.2 ms execution.
This is why the code in the NVIDIA SDK uses the term “warming up the GPU”. If you look at the source code, they execute the kernel once (this is the first time, so it initializes, and notice that they don’t count this time), and then they execute the kernel 1000 times.
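The warm-up pattern described here can be tried out with a sketch like the one below. It is not the SDK code; `emptyKernel` and `timeLaunchesMs` are hypothetical names, and host wall-clock time is used so that whatever one-time setup happens on the very first launch in the process is included in the first measurement.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel that does no work, so only launch cost is measured.
__global__ void emptyKernel() {}

// Wall-clock time in ms for `launches` back-to-back launches,
// including the final synchronization with the device.
double timeLaunchesMs(int launches)
{
    auto t0 = std::chrono::steady_clock::now();
    for (int k = 0; k < launches; ++k)
        emptyKernel<<<1, 1>>>();
    cudaDeviceSynchronize();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main()
{
    // Warm-up: the first launch in the process, including one-time setup.
    double firstMs = timeLaunchesMs(1);

    // Timed runs: setup has already been paid for by the warm-up launch.
    const int runs = 1000;  // the repeat count mentioned above
    double steadyMs = timeLaunchesMs(runs) / runs;

    printf("first launch:   %f ms\n", firstMs);
    printf("later launches: %f ms each (average over %d)\n", steadyMs, runs);
    return 0;
}
```

If the first number is much larger than the second, the one-time cost is being paid on the first launch, which is the reason for not counting the warm-up run.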
Even so, your code seems perfect for a CUDA GPU, as you will only suffer the kernel initialization once, and the rest of the time you will be able to get the best of the performance offered by CUDA and NVIDIA’s GPUs.