How does CUDA cache kernel launches?

I’m new to CUDA and I’m using nvcc to develop a particle system.

One thing that immediately comes to mind is that I’m unsure how the GPU loads and stores the kernel code. What triggers the upload of the code, and is it cached on the GPU?

In particular, if I launch several kernels in a loop, does the device code persist on the GPU?

For example, in this loop, when do Fn0, Fn1, and Fn2 get loaded onto the GPU? Are Fn1 and Fn2 reloaded on each iteration?

Fn0<<<Dg, Db, Ns>>>(a, b, c);
for (int i = 0; i < 10000; i++)
{
    Fn1<<<Dg, Db, Ns>>>(d, e, f);
    Fn2<<<Dg, Db, Ns>>>(h, j, k);
}

The CUDA runtime takes care of this for you: the kernels compiled into your program are loaded onto the device when the context is initialized and stay resident for the lifetime of the application, so nothing is re-uploaded inside your loop. In the rare case that the device code takes up a noticeable amount of memory, you can use the driver API (cuModuleLoad/cuModuleUnload) to get fine-grained control over what is loaded and unloaded, and when.
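For illustration, a minimal sketch of that driver-API control might look like the following. The file name "particles.cubin" and the kernel name "Fn1" are placeholders for your own compiled module and kernel; error checking is elided for brevity.

```cuda
// Sketch: explicit load/unload of device code via the CUDA driver API.
// Assumes a module compiled separately, e.g.: nvcc -cubin -o particles.cubin particles.cu
#include <cuda.h>

int main()
{
    cuInit(0);

    CUdevice dev;
    cuDeviceGet(&dev, 0);

    CUcontext ctx;
    cuCtxCreate(&ctx, 0, dev);

    // The device code is uploaded to the GPU here, not at launch time.
    CUmodule mod;
    cuModuleLoad(&mod, "particles.cubin");   // placeholder file name

    CUfunction fn;
    cuModuleGetFunction(&fn, mod, "Fn1");    // placeholder kernel name

    // ... launch fn with cuLaunchKernel as many times as needed;
    // the module stays resident on the device between launches ...

    // Explicitly free the device-side code when it is no longer needed.
    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
    return 0;
}
```

With the runtime API you never see this step: the equivalent of cuModuleLoad happens for you behind the scenes, which is why your Fn1 and Fn2 are not reloaded on each loop iteration.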