My experiments with renouveau at that time were suggesting that the kernel code was only loaded after the cuLaunchGrid call.
Anyway, does this mean that if I use only one kernel from a module containing a few hundred kernels (typical of template-based libraries like CUBLAS, CUDPP…), all the kernels will be loaded into device memory?
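For reference, here is a minimal driver-API sketch of the sequence I have in mind; the module and kernel names are placeholders and error checking is omitted. The open question is whether the code for every kernel in the module reaches device memory at cuModuleLoad time, or only when cuLaunchGrid fires:

```c
#include <cuda.h>

int main(void)
{
    CUdevice   dev;
    CUcontext  ctx;
    CUmodule   mod;
    CUfunction fun;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    /* The module may contain hundreds of kernels (CUBLAS/CUDPP style)... */
    cuModuleLoad(&mod, "module.cubin");          /* placeholder file name */

    /* ...but we only ever ask for one of them. */
    cuModuleGetFunction(&fun, mod, "myKernel");  /* placeholder kernel name */

    cuFuncSetBlockShape(fun, 256, 1, 1);
    cuParamSetSize(fun, 0);
    cuLaunchGrid(fun, 1, 1);   /* the point where the renouveau traces saw the upload */
    cuCtxSynchronize();

    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
    return 0;
}
```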
Since API calls are asynchronous, it may also make sense to start loading the kernel(s) as soon as possible to overlap the initialization phases.
Even loading more kernels than strictly necessary might not cause performance degradation. Actually I was hoping for an answer from Tim along the lines of: “Benchmarking shows that sending 100K through the PCIe bus is only marginally slower than sending 1K, and always much faster than sending 100 times 1K, so we decided to aggressively prefetch all kernels in advance.”
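If someone wants to test that hypothesis themselves, something like the rough sketch below would do; the 100 KB vs. 100 × 1 KB sizes are just the hypothetical numbers above, and the coarse host-side timer only shows the order of magnitude, not precise PCIe throughput:

```c
#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Coarse wall-clock timer in milliseconds (POSIX clock_gettime). */
static double now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e3 + ts.tv_nsec / 1e6;
}

int main(void)
{
    CUdevice dev; CUcontext ctx; CUdeviceptr dptr;
    const size_t chunk = 1024, chunks = 100, total = chunk * chunks;
    unsigned char *host = malloc(total);
    double t0, t1;
    size_t i;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);
    cuMemAlloc(&dptr, total);

    /* One big transfer: 1 x 100 KB. */
    t0 = now_ms();
    cuMemcpyHtoD(dptr, host, total);
    t1 = now_ms();
    printf("1 x 100 KB : %.3f ms\n", t1 - t0);

    /* Many small transfers: 100 x 1 KB. */
    t0 = now_ms();
    for (i = 0; i < chunks; ++i)
        cuMemcpyHtoD(dptr + i * chunk, host + i * chunk, chunk);
    t1 = now_ms();
    printf("100 x 1 KB : %.3f ms\n", t1 - t0);

    cuMemFree(dptr);
    free(host);
    cuCtxDestroy(ctx);
    return 0;
}
```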
I guess I’ll never know. ;)
(Well, it’s certainly more complicated than that, because each kernel probably needs to be aligned on a 4K-page boundary…)
Well, that’s obviously true: there’s some driver overhead associated with memory allocation, etc., so doing it all in one go instead of X times is certainly faster.