Since API calls are asynchronous, it may also make sense to start loading the kernel(s) as soon as possible to overlap the initialization phases.
Even loading more kernels than strictly necessary might not cause any performance degradation. Actually, I was hoping for an answer from Tim along the lines of: “Benchmarking shows that sending 100K over PCIe is only marginally slower than sending 1K, and always much faster than sending 1K 100 times, so we decided to aggressively prefetch all kernels in advance.”
I guess I’ll never know. ;)
(Well, it’s certainly more complicated than that, because each kernel probably needs to be aligned on a 4K-page boundary…)
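
To make the trade-off a bit more concrete, here is a toy cost model in Python. It is only a sketch: the fixed per-transfer overhead, the bandwidth, and the 4K alignment are all numbers I assumed for illustration, not measurements, and the function names are hypothetical.

```python
# Toy cost model; every constant here is an assumption, not a measurement.
PAGE = 4096            # assumed alignment granularity (4K pages)
LATENCY_US = 10.0      # assumed fixed overhead per transfer, in microseconds
BYTES_PER_US = 4000.0  # assumed sustained bandwidth (roughly 4 GB/s)

def align_up(size, page=PAGE):
    """Round size up to the next multiple of page."""
    return (size + page - 1) // page * page

def transfer_us(size):
    """Estimated time for a single transfer of `size` bytes."""
    return LATENCY_US + size / BYTES_PER_US

def batched_us(kernel_sizes):
    """One transfer carrying every kernel, each padded to a page boundary."""
    return transfer_us(sum(align_up(s) for s in kernel_sizes))

def separate_us(kernel_sizes):
    """One transfer per kernel, each paying the fixed overhead again."""
    return sum(transfer_us(s) for s in kernel_sizes)

kernels = [1024] * 100  # a hundred hypothetical 1K kernels
print("one 1K transfer :", transfer_us(1024), "us")
print("batched, padded :", batched_us(kernels), "us")
print("100 separate    :", separate_us(kernels), "us")
```

Under these assumptions the single batched transfer still beats a hundred separate ones, because the fixed overhead is paid only once. The padding does quadruple the payload, though (100K of kernels becomes 400K once each is rounded up to 4K), so how "marginal" the extra cost is depends entirely on the real latency and bandwidth figures.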