Any applicable kernel size constraints..?

Hello,

Are there any kernel size limitations/constraints that apply?
(Either the programming guides do not mention this, or I did not spot it)

The allowed kernel size would of course influence overall program structuring/design.
For instance, if I can complete my algorithm in 2k lines of code, and can keep all necessary data in shared memory as opposed to global memory, why use multiple kernels to complete the algorithm?
The general view is that multiple kernel launches increase overhead.
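
To make the idea concrete, here is a rough single-kernel sketch of what I mean, assuming the working set fits in shared memory; the two processing phases are just placeholders for the real algorithm steps:

// Rough sketch only: the "phases" are placeholders for real algorithm steps.
// One launch keeps the working set in shared memory throughout, instead of
// writing intermediates to global memory between separate kernel launches.
__global__ void fused_kernel(const float* in, float* out, int n)
{
    extern __shared__ float buf[];      // working set, sized at launch time
    int i = threadIdx.x;

    if (i < n) buf[i] = in[i];          // stage the data once
    __syncthreads();

    if (i < n) buf[i] = buf[i] * 2.0f;  // "phase 1" (placeholder work)
    __syncthreads();

    if (i < n) buf[i] = buf[i] + 1.0f;  // "phase 2" (placeholder work)
    __syncthreads();

    if (i < n) out[i] = buf[i];         // write the final result
}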

I find that, when exceeding roughly 1k lines of kernel code, CUDA program execution becomes erratic around or after that mark.

500 million instructions for anything with compute capability 2.x or higher, 2 million instructions for older GPUs

See the Wikipedia entry on CUDA for other details.

What exactly does “program execution becomes erratic” mean? Lengthy code could also correlate with lengthy run-time. If you are executing the kernels on a GPU that is also used for display, you may run into issues with the operating system’s watchdog timer kicking in to prevent freezing the GUI for a prolonged period of time, causing the kernel to be terminated abnormally (“killed”). I do not know exactly what time limits are imposed by the different OSes, but they tend to be in the single-digit second range, e.g. 2-5 seconds.

With kbam’s help, I see the kernel max instruction limit is indeed in the programming guide

njuffa: “program execution becoming erratic” means that, when debugging (stepping through the code), after a number of inlined function executions within the kernel, a simple instruction like if (i == 0) [with i = threadIdx.x] hardly returns; the program goes off the rails.

I initially believed it to be related to the kernel instruction count, because the same instruction would be fine earlier on in the kernel program, and the variable “i” itself is fine.
But clearly this cannot be, given that I am still well below the max instruction limit.
I also tried to reproduce the “error” by inlining a simple add function numerous times in a test kernel, to push the instruction count beyond my perceived 1k mark, but the simple if (i == 0) works perfectly fine afterwards in the test kernel.
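
For reference, the test kernel was along these lines (simplified sketch; the add helper name and the repetition count are illustrative):

// Simplified sketch of the test kernel: a force-inlined add helper, repeated
// (unrolled) many times to inflate the instruction count, followed by the
// same if (i == 0) check that misbehaved in the original kernel.
__device__ __forceinline__ int add_one(int x) { return x + 1; }

__global__ void inline_test_kernel(int* out)
{
    int i = threadIdx.x;
    int v = i;

    #pragma unroll
    for (int k = 0; k < 1000; ++k)
        v = add_one(v);         // inlined call, replicated by the unroll

    if (i == 0)                 // works fine here, unlike in the real kernel
        out[0] = v;
}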

I have 2 GPUs in my PC, and the one not driving the display is adequately strong (a GeForce GTX 780 Ti), so it cannot be the watchdog timer either.

I have now learnt that, in my original kernel, I also manage to have threads jump a __syncthreads() call, which is really a matter of concern; I know this because, after the __syncthreads(), warps are still in places they should not be.
I think it is perhaps because I do not explicitly synchronize within warps.

You would want to make sure there are no __syncthreads() calls in divergent code paths; that leads to undefined behavior, which in my experience includes the observation of threads seemingly jumping the barrier.
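
To illustrate the problematic pattern with a contrived sketch:

// Problematic: the barrier sits inside a condition that not all threads of
// the block evaluate the same way -> undefined behavior.
__global__ void bad_sync(float* data)
{
    int i = threadIdx.x;
    if (i < 32) {
        data[i] *= 2.0f;
        __syncthreads();        // only part of the block reaches this barrier
    }
}

// Safe: every thread of the block executes the same __syncthreads() call.
__global__ void good_sync(float* data)
{
    int i = threadIdx.x;
    if (i < 32)
        data[i] *= 2.0f;
    __syncthreads();            // all threads of the block hit the barrier
}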

You might want to run the application under cuda-memcheck to have it check for race conditions and out of bounds memory accesses. That is usually the first thing I do when I see something “weird” happening with a kernel.
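
As a contrived example of the kind of defect the racecheck tool (cuda-memcheck --tool racecheck ./app) reports, consider a shared-memory write and a read by another thread with no barrier in between:

// Contrived race: thread i reads buf[(i + 1) % 256], which a different thread
// writes, without a __syncthreads() in between; racecheck flags this hazard.
__global__ void racy_kernel(float* out)
{
    __shared__ float buf[256];
    int i = threadIdx.x;

    buf[i] = (float)i;
    // __syncthreads();          // the missing barrier causes the race
    out[i] = buf[(i + 1) % 256];
}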

njuffa:

noted, thanks

I believe you are spot on in that I have managed to run into undefined behavior; what I am experiencing is neither time-bound nor instruction-count-bound, but more “kernel-footprint”-bound.

I have finally managed to shrug off the undefined behavior; the algorithm implementation is now a beautiful beast, and the same kernel easily remains on the device with an execution time of a few minutes.

In the end, I removed all inlining of kernel functions, so that no device function is inlined at all, and increased the number of functions within the kernel to make it cleaner.
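
For what it is worth, the pattern simply amounts to keeping the device functions out of line; a trivial sketch (the function names are illustrative):

// Sketch of the pattern only: device functions are kept out of line with
// __noinline__ instead of being force-inlined into one huge kernel body.
__device__ __noinline__ float step_a(float x) { return x * 2.0f; }
__device__ __noinline__ float step_b(float x) { return x + 1.0f; }

__global__ void restructured_kernel(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = step_b(step_a(in[i]));   // calls remain separate functions
}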