Could kernel size limit performance?

Dear All

Is there a kernel size beyond which performance degrades? Does each core in each SM have a maximum program size it can hold without swapping code in from “graphics memory” or SM program memory?


Luis Gonçalves

One thing to do would be to write one large kernel and compare its runtime to several smaller kernels.

My guess is that one larger kernel would be faster than several smaller ones, provided the code has no global sync points. If it does, splitting the work into separate kernels would probably be your best option.
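A rough way to run that comparison is to time both variants with CUDA events. The following is only a sketch: the three kernels and the helper timeVariant() are hypothetical stand-ins for your own code, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>

// Hypothetical two-step workload, written as two separate kernels.
__global__ void scaleKernel(float* data, int n, float s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= s;
}

__global__ void offsetKernel(float* data, int n, float o) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += o;
}

// The same two steps fused into one kernel (one pass over memory).
__global__ void fusedKernel(float* data, int n, float s, float o) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * s + o;
}

// Returns the GPU time in milliseconds for `reps` repetitions of either
// the fused kernel (fused == true) or the two-kernel sequence.
float timeVariant(float* d_data, int n, bool fused, int reps) {
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int r = 0; r < reps; ++r) {
        if (fused) {
            fusedKernel<<<blocks, threads>>>(d_data, n, 1.0f, 0.0f);
        } else {
            scaleKernel<<<blocks, threads>>>(d_data, n, 1.0f);
            offsetKernel<<<blocks, threads>>>(d_data, n, 0.0f);
        }
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until all launches have finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```

Call timeVariant() once as a warm-up before taking measurements, then compare the fused and split timings on a cudaMalloc'ed buffer of realistic size. The numbers will depend heavily on the actual workload, so treat this only as a template for the measurement.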

There is a maximum program size. The limit is architecture dependent and, I think, listed in an appendix of the Programming Guide. I have never heard of any project hitting that limit.

In practical terms, I am aware of two potential issues with large codes:

(1) Very large kernels with tens of thousands of lines of code may compile quite slowly. Template instantiation and function inlining can expand code to this size even if the actual source looks much smaller. Compilation speed for large kernels has improved in recent CUDA versions, but it may still be an issue for specific kinds of large code.

(2) The GPU has a fairly small instruction cache, I think 4 KB or 8 KB; I do not recall exactly. Combined with the fact that the GPU has no branch prediction, this means that if you have a loop whose body exceeds the size of the instruction cache, you may see some performance degradation due to instruction cache misses. I have observed up to 3% slowdown, but it will depend on the use case.

Another question: program compilation is done when the host program starts, not at kernel launch, right?


Luis Gonçalves

Compilation occurs in two stages. First, the high-level language (HLL) code is compiled into an intermediate language called PTX, which looks a lot like an architecture-independent assembly language. In the second stage, PTX is compiled to machine code for a specific GPU architecture. There are two common scenarios:

(1) Off-line compilation. Both compilation stages occur when you run nvcc, and the resulting binary contains the machine code (SASS) for one or several GPU architectures. You can run cuobjdump --dump-sass on the executable to inspect the embedded machine code.

(2) JIT compilation. The first compilation stage occurs when you run nvcc, and the resulting binary contains the PTX code. At CUDA context initialization time, the PTX code is JIT compiled to SASS by the compiler component inside the CUDA driver. You can use cuobjdump --dump-ptx to inspect the embedded PTX code.

Variations of these two schemes are possible. For example, you could use the -gencode switch of nvcc to generate machine code for some architectures, but PTX for others. Also, you could write your own PTX code generator, then use the CUDA driver API to compile and load this PTX code (I am aware of applications that use this for dynamic code generation).
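To illustrate the last variation, here is a minimal sketch of loading PTX through the driver API. The PTX string, the kernel name set_value, and the helper runPtxKernel() are made up for this example; a real code generator would produce the PTX text at run time. The JIT compilation from PTX to SASS takes place inside cuModuleLoadDataEx().

```cuda
#include <cuda.h>  // CUDA driver API; link with -lcuda

// Hand-written PTX for a hypothetical kernel that stores 42 to *out.
static const char* kPtx = R"(.version 6.0
.target sm_50
.address_size 64

.visible .entry set_value(.param .u64 out)
{
    .reg .b32 %r<2>;
    .reg .b64 %rd<3>;
    ld.param.u64 %rd1, [out];
    cvta.to.global.u64 %rd2, %rd1;
    mov.u32 %r1, 42;
    st.global.u32 [%rd2], %r1;
    ret;
}
)";

// JIT-compiles the PTX above, runs the kernel once, and returns the
// value it wrote. Returns -1 if any driver API call fails.
int runPtxKernel() {
    CUdevice dev;
    CUcontext ctx;
    CUmodule mod;
    CUfunction fn;
    CUdeviceptr d_out;
    int result = -1;

    if (cuInit(0) != CUDA_SUCCESS) return -1;
    if (cuDeviceGet(&dev, 0) != CUDA_SUCCESS) return -1;
    if (cuDevicePrimaryCtxRetain(&ctx, dev) != CUDA_SUCCESS) return -1;
    cuCtxSetCurrent(ctx);

    // PTX -> SASS compilation happens here, under the caller's control.
    if (cuModuleLoadDataEx(&mod, kPtx, 0, nullptr, nullptr) == CUDA_SUCCESS) {
        if (cuModuleGetFunction(&fn, mod, "set_value") == CUDA_SUCCESS &&
            cuMemAlloc(&d_out, sizeof(int)) == CUDA_SUCCESS) {
            void* args[] = { &d_out };
            // One block with one thread, no shared memory, default stream.
            if (cuLaunchKernel(fn, 1, 1, 1, 1, 1, 1, 0, nullptr,
                               args, nullptr) == CUDA_SUCCESS &&
                cuCtxSynchronize() == CUDA_SUCCESS) {
                cuMemcpyDtoH(&result, d_out, sizeof(int));
            }
            cuMemFree(d_out);
        }
        cuModuleUnload(mod);
    }
    cuDevicePrimaryCtxRelease(dev);
    return result;
}
```

Because the application calls cuModuleLoadDataEx() itself, it knows exactly where the compilation cost is incurred and can time or defer that call as needed.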

The CUDA documentation (nvcc documentation and Programming Guide) describes the standard scenarios in sufficient detail, I think.

Thanks for the explanation. But with JIT, when does the compilation happen: at the beginning of the host code or at kernel launch?


Luis Gonçalves

As I stated in #5 above, in a standard JIT-compilation scenario the PTX code is JIT compiled to SASS at CUDA context initialization time.

Generally, the first CUDA API call in an app triggers context creation. If there is a lot of code to compile from PTX to SASS, your app may be slow to start up. Subsequent kernel launches will use the generated code. Furthermore, the generated SASS is cached by the CUDA driver in a directory on your disk, for future use.
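If you want to see where the startup cost lands on your own system, you can time the first CUDA call and the first couple of kernel launches separately. This is just a sketch: emptyKernel, wallMs(), and measureStartup() are names invented for the example, and the exact split between the phases may differ between driver and toolkit versions (newer ones can defer module loading until a kernel is first used).

```cuda
#include <chrono>
#include <cuda_runtime.h>

// To force the JIT path, build with PTX only, for example:
//   nvcc -gencode arch=compute_70,code=compute_70 startup.cu
// (file name and architecture are placeholders; adjust for your GPU).
// Setting the environment variable CUDA_CACHE_DISABLE=1 bypasses the
// driver's on-disk JIT cache so that every run pays the compile cost.

__global__ void emptyKernel() {}

// Wall-clock time in milliseconds taken by a callable.
template <typename F>
double wallMs(F&& f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

struct StartupTimes {
    double contextMs;       // first CUDA runtime call (context creation)
    double firstLaunchMs;   // first kernel launch plus synchronization
    double secondLaunchMs;  // second launch, reusing the generated code
};

StartupTimes measureStartup() {
    StartupTimes t;
    // cudaFree(0) is a common no-op call used to trigger context creation.
    t.contextMs = wallMs([] { cudaFree(0); });

    auto launch = [] {
        emptyKernel<<<1, 1>>>();
        cudaDeviceSynchronize();
    };
    t.firstLaunchMs = wallMs(launch);
    t.secondLaunchMs = wallMs(launch);
    return t;
}
```

Call measureStartup() before any other CUDA work in the process and print the three fields. Running the program twice, with and without the JIT cache, should show how much of the startup time is due to PTX compilation.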

The one exception to the above would be an application that, during normal operation, continuously creates, compiles, and loads new code. In this case you would know exactly where the app incurs overhead as it would manually initiate compilation via appropriate CUDA driver API calls such as cuModuleLoadFatBinary().