I have two questions that I could not answer from the documentation.
First: how large can a kernel be? What limits exist?
Second: kernels get cached, so you can warm up a kernel by launching it twice. How many kernels will be cached? Is there a “first in, last out” strategy?
The limit on kernel size is 2 MB of native instructions. In practice this is not much of a limitation; we’re not aware of anyone who has hit it yet.
Kernels aren’t really cached; the instructions are stored in video memory. The reason we sometimes “warm up” kernels is that the CUDA runtime performs some one-time initialization the first time you execute a kernel, which can skew timing measurements.
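To illustrate, here is a minimal sketch of that warm-up pattern: launch the kernel once untimed to absorb the runtime’s one-time initialization cost, then time a second launch with CUDA events. The `scale` kernel and the sizes are placeholders I made up for the example, not anything from this thread.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel used only to illustrate the warm-up pattern.
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    // Warm-up launch: the first kernel execution triggers one-time
    // runtime initialization, so its timing is not representative.
    scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);
    cudaDeviceSynchronize();

    // Timed launch: measured with CUDA events after the warm-up.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```

Without the warm-up launch, the event-timed run would also include the initialization overhead and look much slower than steady-state.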