Could Kernel size limit performance?

Dear All

Is there a kernel size limit beyond which performance degrades? Does each core in each SM have a maximum program size it can run without having to swap program code in from “graphics memory” or SM program memory?

Thanks

Luis Gonçalves

One thing to do would be to write one large kernel and compare its runtime to several smaller kernels.

My guess is that one larger kernel would be faster than several smaller ones, assuming the code has no global synchronization points; if it does, splitting it into separate kernels is probably your best option.
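For what it’s worth, here is a minimal timing sketch of that comparison (the kernels stage1, stage2 and fused, and the problem size, are hypothetical placeholders, not anything from your code; error checking is omitted):

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void stage1(float *d, int n) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) d[i] *= 2.0f; }
    __global__ void stage2(float *d, int n) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) d[i] += 1.0f; }
    __global__ void fused (float *d, int n) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) d[i] = d[i] * 2.0f + 1.0f; }

    int main()
    {
        const int n = 1 << 24;
        const int block = 256, grid = (n + block - 1) / block;
        float *d;
        cudaMalloc(&d, n * sizeof(float));
        cudaMemset(d, 0, n * sizeof(float));

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        float msSplit, msFused;

        // Two small kernels: the intermediate result round-trips through global memory.
        cudaEventRecord(start);
        stage1<<<grid, block>>>(d, n);
        stage2<<<grid, block>>>(d, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        cudaEventElapsedTime(&msSplit, start, stop);

        // One larger kernel doing the same work in a single pass.
        cudaEventRecord(start);
        fused<<<grid, block>>>(d, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        cudaEventElapsedTime(&msFused, start, stop);

        printf("split: %.3f ms   fused: %.3f ms\n", msSplit, msFused);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(d);
        return 0;
    }

In the split version the intermediate result goes through global memory, which is usually why the fused version wins when no global synchronization is required between the stages.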

There is a maximum program size. The limit is architecture dependent and listed in an appendix of the Programming Guide, I think. I have never heard of any project hitting that limit.

In practical terms, I am aware of two potential issues with large codes:

(1) Very large kernels with tens of thousands of lines of code may compile quite slowly. Template instantiation and function inlining can expand code to this size even if the actual source looks much smaller. Compilation speed for large kernels has been improved in recent CUDA versions but this may still be an issue for specific kinds of large code.

(2) The GPU has an instruction cache that is fairly small (4 KB or 8 KB, I think; I do not recall exactly). Combined with the fact that the GPU has no branch prediction, this leads to the following: if you have a loop whose body exceeds the size of the instruction cache, you may see some performance degradation due to instruction-fetch cache misses. I have observed up to a 3% slowdown, but it will depend on the use case.
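To illustrate the loop-body size point, here is a hedged sketch (the kernel and the unroll factors are invented for illustration): fully unrolling a long loop replicates its body in the generated SASS, while a small unroll factor keeps the hot loop compact. The resulting code sizes can be compared with cuobjdump --dump-sass.

    // Hypothetical kernel: with "#pragma unroll" (full unrolling) the loop body is
    // replicated 2048 times in SASS and can exceed the instruction cache; a small
    // unroll factor keeps the hot loop short so instruction fetches stay cached.
    __global__ void big_loop(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float acc = 0.0f;
        #pragma unroll 4     // try "#pragma unroll" instead and compare SASS sizes
        for (int k = 0; k < 2048; ++k) {
            acc += __sinf(in[i] + k) * __cosf(in[i] - k);
        }
        out[i] = acc;
    }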

Another question: is program compilation done when the host program starts, rather than at kernel launch?

Thanks

Luis Gonçalves

Compilation occurs in two stages. First, the high-level language (HLL) code is compiled into an intermediate language called PTX, which looks a lot like an architecture-independent assembly language. In the second stage, PTX is compiled to machine code for a specific GPU architecture. There are two common scenarios:

(1) Off-line compilation. Both compilation stages occur when you run nvcc, and the resulting binary contains the machine code (SASS) for one or several GPU architectures. You can run cuobjdump --dump-sass on the executable to inspect the embedded machine code.

(2) JIT compilation. The first compilation stage occurs when you run nvcc, and the resulting binary contains the PTX code. At CUDA context initialization time, the PTX code is JIT compiled to SASS by the compiler component inside the CUDA driver. You can use cuobjdump --dump-ptx to inspect the embedded PTX code.

Variations of these two schemes are possible. For example, you could use the --generate-code (-gencode) switch of nvcc to generate machine code for some architectures, but PTX for others. Also, you could write your own PTX code generator, then use the CUDA driver API to compile and load this PTX code (I am aware of applications that use this for dynamic code generation).
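As a minimal sketch of that last driver API scenario (the PTX source is assumed to come from a file or your own generator, the kernel name "scale" is hypothetical, and error checking is omitted), cuModuleLoadDataEx is what triggers the JIT compilation of the PTX to SASS:

    #include <cstdio>
    #include <cuda.h>

    int main()
    {
        cuInit(0);

        CUdevice dev;
        CUcontext ctx;
        cuDeviceGet(&dev, 0);
        cuCtxCreate(&ctx, 0, dev);

        // PTX text read from a file or produced at run time by your generator.
        const char *ptx = "..." /* placeholder */;

        // JIT-compile the PTX to SASS for the current device and load it.
        CUmodule mod;
        cuModuleLoadDataEx(&mod, ptx, 0, NULL, NULL);

        // Look up the (hypothetical) kernel and launch it via the driver API.
        CUfunction fn;
        cuModuleGetFunction(&fn, mod, "scale");

        CUdeviceptr d;
        int n = 1 << 20;
        cuMemAlloc(&d, n * sizeof(float));
        void *params[] = { &d, &n };
        cuLaunchKernel(fn, (n + 255) / 256, 1, 1, 256, 1, 1, 0, NULL, params, NULL);
        cuCtxSynchronize();

        cuMemFree(d);
        cuModuleUnload(mod);
        cuCtxDestroy(ctx);
        return 0;
    }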

The CUDA documentation (nvcc documentation and Programming Guide) describes the standard scenarios in sufficient detail, I think.

Thanks for the explanation. But with JIT compilation, when does the compilation happen: at the start of the host program, or at kernel launch?

Thanks

Luis Gonçalves

As I stated in #5 above, in a standard JIT-compilation scenario:

Generally, the first CUDA API call in an app triggers context creation. If there is a lot of code to compile from PTX to SASS, your app may be slow to start up. Subsequent kernel launches use the already generated code. Furthermore, the generated SASS is cached by the CUDA driver in a directory on disk for future use.
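If startup behavior matters, the JIT (and context creation) can be forced to happen at a convenient point early in the host program rather than at the first kernel launch. A minimal sketch using the common cudaFree(0) idiom:

    #include <chrono>
    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        auto t0 = std::chrono::steady_clock::now();
        cudaSetDevice(0);
        cudaFree(0);   // forces context creation; any PTX-to-SASS JIT happens here
        auto t1 = std::chrono::steady_clock::now();
        printf("context creation took %.1f ms\n",
               std::chrono::duration<double, std::milli>(t1 - t0).count());

        // ... subsequent kernel launches reuse the already compiled SASS ...
        return 0;
    }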

The one exception to the above would be an application that, during normal operation, continuously creates, compiles, and loads new code. In that case you would know exactly where the app incurs the overhead, since it initiates compilation explicitly via the appropriate CUDA driver API calls, such as cuModuleLoadFatBinary().