template kernels (kernel "loading" by the GPU)

Following up on the reduction example, I’m tempted to write a kernel similar to “Reduction #6” (a kernel templated on an unsigned int).
However, I have one question I would like answered before doing so:

Are kernels loaded onto the GPU only when they are called, or do they always reside on the GPU? If every kernel is loaded right away, wouldn’t it be a bad idea to use a template kernel such as “Reduction #6”, since each template value it is instantiated with produces a separate kernel?
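
To make the concern concrete, here is a rough sketch of the pattern I mean. It is not the SDK code: it is a host-only C++ stand-in (no `__global__`, no shared memory) so that it compiles without nvcc, and the names `reduceBlock`, `reduce` and `blockSize` are my own placeholders. It assumes the block size is a power of two. The point is only that every `case` in the dispatch switch forces the compiler to emit another copy of the templated function.

```cpp
// Hypothetical stand-in for a templated reduction kernel.
// blockSize is a compile-time constant, so each value used in reduce()
// below yields its own instantiation of this function.
template <unsigned int blockSize>
unsigned int reduceBlock(const unsigned int* data, unsigned int n) {
    unsigned int partial[blockSize] = {};   // one slot per simulated thread

    // Strided accumulation: "thread" t sums elements t, t + blockSize, ...
    for (unsigned int i = 0; i < n; ++i)
        partial[i % blockSize] += data[i];

    // Tree reduction; the loop bounds are known at compile time.
    // Assumes blockSize is a power of two.
    for (unsigned int s = blockSize / 2; s > 0; s >>= 1)
        for (unsigned int t = 0; t < s; ++t)
            partial[t] += partial[t + s];

    return partial[0];
}

// Runtime dispatch on the requested thread count.
// Ten cases means ten separately compiled copies of reduceBlock.
unsigned int reduce(unsigned int threads, const unsigned int* data, unsigned int n) {
    switch (threads) {
        case 512: return reduceBlock<512>(data, n);
        case 256: return reduceBlock<256>(data, n);
        case 128: return reduceBlock<128>(data, n);
        case  64: return reduceBlock<64>(data, n);
        case  32: return reduceBlock<32>(data, n);
        case  16: return reduceBlock<16>(data, n);
        case   8: return reduceBlock<8>(data, n);
        case   4: return reduceBlock<4>(data, n);
        case   2: return reduceBlock<2>(data, n);
        case   1: return reduceBlock<1>(data, n);
        default:  return 0;  // unsupported size in this sketch
    }
}
```

Calling `reduce(256, data, n)` only ever runs one of the ten instantiations, but all ten are compiled into the binary, which is why I am asking whether all of them would also sit on the GPU.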