When are kernels loaded onto device memory? Any CUDA guru know this?

Does anyone know the answer to this question:

When are kernels loaded on to the device memory? Are they loaded as soon as the host program starts or are they loaded only when invoked?

I already posted this on the CUDA Programming and Development forum, but nobody there seems to be able to answer it.

They’re loaded at module load time. If you’re using CUDART, that’s when the context is created.
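With the driver API that load point is explicit, which may help clarify the answer. A minimal sketch (the file name `kernels.cubin` and kernel name `saxpy` are hypothetical; with CUDART, the context creation and module load below happen implicitly at startup):

```cuda
// Driver API sketch: kernel code is copied to the device at
// cuModuleLoad time, not at launch time.
#include <cuda.h>

int main(void) {
    CUdevice dev;
    CUcontext ctx;
    CUmodule mod;
    CUfunction fn;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);           // CUDART does this implicitly

    cuModuleLoad(&mod, "kernels.cubin"); // module (kernel code) loaded here
    cuModuleGetFunction(&fn, mod, "saxpy");

    /* ... set up parameters and launch with cuLaunchGrid ... */

    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
    return 0;
}
```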

Did this behavior change since CUDA 1.1?

My experiments with renouveau at the time suggested that kernel code was only loaded after the cuLaunchGrid call.

Anyway, does this mean that if I use only one kernel from a module containing a few hundred kernels (typical of template-based libraries like CUBLAS, CUDPP…), all of them will be loaded into device memory?
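One way to check this empirically is to compare free device memory before and after the module load. A hedged sketch, assuming a hypothetical `allkernels.cubin` containing many kernels (the exact `cuMemGetInfo` parameter type has varied across CUDA versions; `size_t` is used here):

```cuda
// Experiment sketch: measure how much device memory a module load
// consumes, to see whether the whole module lands on the device.
#include <cuda.h>
#include <stdio.h>

int main(void) {
    CUdevice dev;
    CUcontext ctx;
    CUmodule mod;
    size_t freeBefore, freeAfter, total;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    cuMemGetInfo(&freeBefore, &total);
    cuModuleLoad(&mod, "allkernels.cubin");  // module with many kernels
    cuMemGetInfo(&freeAfter, &total);

    printf("device memory consumed by module load: %zu bytes\n",
           freeBefore - freeAfter);

    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
    return 0;
}
```

If the drop in free memory roughly matches the cubin size, the whole module was loaded; a much smaller drop would suggest lazier, per-kernel loading.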

(I don’t say it’s a bad thing, just asking…)

I sure hope kernels are loaded only when they are called (at runtime). That is what makes sense to me.

However I would like to know this for sure.

Anyway, thank you Sylvain Collange and tmurray

Since API calls are asynchronous, it may also make sense to start loading the kernel(s) as soon as possible, to overlap the load with the initialization phases.

Even loading more kernels than strictly necessary might not cause performance degradation. Actually, I was hoping for an answer from Tim along the lines of: “Benchmarking shows that sending 100 KB over PCIe is only marginally slower than sending 1 KB, and always much faster than sending 1 KB 100 times, so we decided to aggressively prefetch all kernels in advance.”

I guess I’ll never know. ;)

(Well, it’s certainly more complicated than that, since each kernel probably needs to be aligned on a 4 KB page boundary…)

Well, that’s obviously true: there’s some driver overhead associated with memory allocation, etc., so doing it all in one go instead of X times is certainly faster.