Any applicable kernel size constraints..?

Hello,

Are there any kernel size limitations/constraints that apply?
(Either the programming guides do not mention this, or I did not spot it)

The allowed kernel size would of course influence overall program structuring/design.
For instance, if I can complete my algorithm in 2k lines of code, and can keep all necessary data in shared memory as opposed to global memory, why use multiple kernels to complete the algorithm?
The general view is that multiple kernel launches increase overhead.
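
To make the idea concrete, here is a rough single-kernel sketch of what I mean, assuming the working set fits in shared memory; the two processing phases are just placeholders for the real algorithm steps:

// Rough sketch only: the "phases" are placeholders for real algorithm steps.
// One launch keeps the working set in shared memory throughout, instead of
// writing intermediates to global memory between separate kernel launches.
__global__ void fused_kernel(const float* in, float* out, int n)
{
    extern __shared__ float buf[];      // working set, sized at launch time
    int i = threadIdx.x;

    if (i < n) buf[i] = in[i];          // stage the data once
    __syncthreads();

    if (i < n) buf[i] = buf[i] * 2.0f;  // "phase 1" (placeholder work)
    __syncthreads();

    if (i < n) buf[i] = buf[i] + 1.0f;  // "phase 2" (placeholder work)
    __syncthreads();

    if (i < n) out[i] = buf[i];         // write the final result
}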

I find that, when exceeding roughly 1k lines of kernel code, CUDA program execution becomes erratic around or after that mark.

500 million instructions for anything with compute capability 2.x or higher, 2 million instructions for older GPUs

See the Wikipedia entry on CUDA for other details.

What exactly does “program execution becomes erratic” mean? Lengthy code could also correlate with lengthy run-time. If you are executing the kernels on a GPU that is also used for display, you may run into issues with the operating system’s watchdog timer kicking in to prevent freezing the GUI for a prolonged period of time, causing the kernel to be terminated abnormally (“killed”). I do not know exactly what time limits are imposed by the different OSes, but they tend to be in the single-digit second range, e.g. 2-5 seconds.

With kbam’s help, I see the kernel max instruction limit is indeed in the programming guide

njuffa: “program execution becoming erratic” means that, when debugging (stepping through the code), after a number of inlined function executions within the kernel, a simple instruction like if (i == 0) [with i = threadIdx.x] hardly returns; the program goes off the rails.

I initially believed it to be related to the kernel instruction count, because the same instruction would be fine earlier on in the kernel program, and the variable “i” itself is fine.
But clearly this cannot be, given that I am still well below the max instruction limit.
I also tried to reproduce the “error” by inlining a simple add function numerous times in a test kernel, to push the instruction count beyond my perceived 1k mark, but the simple if (i == 0) works perfectly fine afterwards in the test kernel.
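
For reference, the test kernel was along these lines (simplified sketch; the add helper name and the repetition count are illustrative):

// Simplified sketch of the test kernel: a force-inlined add helper, repeated
// (unrolled) many times to inflate the instruction count, followed by the
// same if (i == 0) check that misbehaved in the original kernel.
__device__ __forceinline__ int add_one(int x) { return x + 1; }

__global__ void inline_test_kernel(int* out)
{
    int i = threadIdx.x;
    int v = i;

    #pragma unroll
    for (int k = 0; k < 1000; ++k)
        v = add_one(v);         // inlined call, replicated by the unroll

    if (i == 0)                 // works fine here, unlike in the real kernel
        out[0] = v;
}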

I have 2 GPUs in my PC, and the one not driving the display is adequately strong (a GeForce GTX 780 Ti), so it cannot be the watchdog timer either.

I have now learnt that, in my original kernel, I also manage to have threads jump a __syncthreads() call, which is really a matter of concern; I know this because, after the __syncthreads(), warps are still in places they should not be.
I think it is perhaps because I do not explicitly synchronize within warps.

You would want to make sure there are no __syncthreads() calls in divergent code paths; that leads to undefined behavior, which in my experience includes the observation of threads seemingly jumping the barrier.
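
To illustrate the problematic pattern with a contrived sketch:

// Problematic: the barrier sits inside a condition that not all threads of
// the block evaluate the same way -> undefined behavior.
__global__ void bad_sync(float* data)
{
    int i = threadIdx.x;
    if (i < 32) {
        data[i] *= 2.0f;
        __syncthreads();        // only part of the block reaches this barrier
    }
}

// Safe: every thread of the block executes the same __syncthreads() call.
__global__ void good_sync(float* data)
{
    int i = threadIdx.x;
    if (i < 32)
        data[i] *= 2.0f;
    __syncthreads();            // all threads of the block hit the barrier
}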

You might want to run the application under cuda-memcheck to have it check for race conditions and out of bounds memory accesses. That is usually the first thing I do when I see something “weird” happening with a kernel.
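
As a contrived example of the kind of defect the racecheck tool (cuda-memcheck --tool racecheck ./app) reports, consider a shared-memory write and a read by another thread with no barrier in between:

// Contrived race: thread i reads buf[(i + 1) % 256], which a different thread
// writes, without a __syncthreads() in between; racecheck flags this hazard.
__global__ void racy_kernel(float* out)
{
    __shared__ float buf[256];
    int i = threadIdx.x;

    buf[i] = (float)i;
    // __syncthreads();          // the missing barrier causes the race
    out[i] = buf[(i + 1) % 256];
}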

njuffa:

noted, thanks

I believe you are spot on in that I have managed to run into undefined behavior; what I am experiencing is neither time-bound nor instruction-count-bound, but more “kernel-footprint”-bound.

I have finally managed to shrug off the undefined behavior; the algorithm implementation is now a beautiful beast, and the same kernel easily remains on the device with an execution time of a few minutes.

In the end, I removed all inlining of kernel functions, so that no device function is inlined at all, and increased the number of functions within the kernel to make it cleaner.
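
For what it is worth, the pattern simply amounts to keeping the device functions out of line; a trivial sketch (the function names are illustrative):

// Sketch of the pattern only: device functions are kept out of line with
// __noinline__ instead of being force-inlined into one huge kernel body.
__device__ __noinline__ float step_a(float x) { return x * 2.0f; }
__device__ __noinline__ float step_b(float x) { return x + 1.0f; }

__global__ void restructured_kernel(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = step_b(step_a(in[i]));   // calls remain separate functions
}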