Okay, so I’m trying to profile a kernel which really should be blazing fast… and having profiled it, the profiler agrees, at least as far as GPU time goes.
I see a GPU Time of 57-58 microseconds (very consistent), but a CPU Time of 3 milliseconds, which is roughly 50x the GPU time.
So I’m quite concerned as to why it’s taking 3 milliseconds for the CPU to launch this kernel (launch, not execute - remember, kernel launches are asynchronous).
This is running on a Fermi card, using 512 threads (single block) for the kernel and 2 KB of shared memory… nothing special at all. In fact it uses fewer resources than chunkier kernels which have more GPU time but less CPU time (~300 microseconds).
I’m really confused here, and I need to fix this problem ASAP, as it’s dragging down the performance of our app dramatically: the 3 ms “CPU Time” is almost the entire per-iteration budget of our app, and this one simple kernel is using it all up.
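
To check the profiler's numbers independently, here is a minimal sketch (not my actual code) that times the launch call on the host and the execution on the GPU separately. `smallKernel` is a hypothetical stand-in with the same launch configuration (one block, 512 threads, 2 KB of static shared memory); the kernel body, buffer, and variable names are assumptions made for illustration only.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel: 512 threads, 512 * 4 bytes = 2 KB shared memory.
__global__ void smallKernel(float *out)
{
    __shared__ float scratch[512];
    scratch[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    out[threadIdx.x] = scratch[511 - threadIdx.x];
}

int main()
{
    float *d_out = nullptr;
    if (cudaMalloc(&d_out, 512 * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up launch so one-time context and module setup is not measured.
    smallKernel<<<1, 512>>>(d_out);
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    auto t0 = std::chrono::steady_clock::now();
    smallKernel<<<1, 512>>>(d_out);              // asynchronous: returns before the kernel finishes
    auto t1 = std::chrono::steady_clock::now();  // host-side cost of the launch call only
    cudaEventRecord(stop);

    cudaEventSynchronize(stop);                  // wait until the kernel and stop event complete
    auto t2 = std::chrono::steady_clock::now();

    float gpuMs = 0.0f;
    cudaEventElapsedTime(&gpuMs, start, stop);   // GPU-side time between the two events, in ms

    double launchUs = std::chrono::duration<double, std::micro>(t1 - t0).count();
    double totalUs  = std::chrono::duration<double, std::micro>(t2 - t0).count();

    printf("host launch call : %.1f us\n", launchUs);
    printf("GPU (events)     : %.1f us\n", gpuMs * 1000.0f);
    printf("launch + wait    : %.1f us\n", totalUs);

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_out);
    return 0;
}
```

If the "host launch call" figure here is small while the profiler still reports about 3 ms of CPU time for the real kernel, then that number probably includes something other than the launch call itself, though that is only a guess on my part.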