I have median/select-Nth code that needs global synchronization, hence multiple kernel launches, but I want to reduce the launch overhead.
I timed 10,000 launches (gridDim=1, blockDim=32) of a kernel that does nothing, and it took about 10^-1 s on a Tesla C2050 — roughly 10^-5 s per call. But my median/select-Nth function is expected to run on the order of 10^-4 s, so the launch premium seems pretty big.
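For reference, a minimal sketch of the kind of measurement I mean (assumes the CUDA runtime API; the `noop` kernel name is just illustrative — timing an empty kernel isolates pure launch/driver overhead):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Empty kernel: anything we measure is pure launch overhead.
__global__ void noop() {}

int main() {
    const int N = 10000;

    noop<<<1, 32>>>();               // warm-up: first launch pays extra init cost
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < N; ++i)
        noop<<<1, 32>>>();           // asynchronous launches queued back-to-back
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("avg launch overhead: %.2f us\n", 1000.0f * ms / N);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```

Note this times back-to-back asynchronous launches; adding a `cudaDeviceSynchronize()` inside the loop would instead measure the (larger) synchronous round-trip per call.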
Can someone give me a breakdown of how that time is spent? How much is for:
- computation (nothing?)
- waiting for memory writes to complete
- other stuff?
Only the NVIDIA guys can give you that breakdown… And it's not going to help you in any way either, unless they fix it in the next release…
Maybe Fermi can be of help here… with its support for concurrent kernels…
Are you using streams? There seems to be an extra 25 µs overhead (that may vary) for each call using a non-zero stream. Thanks to that performance bug, it's sometimes much less expensive to use synchronous calls… V. Volkov did some experiments to measure kernel launch overhead in his SC'08 paper: with asynchronous launches he reported 3-7 µs of overhead, and 10-14 µs for synchronous calls. Even if these numbers are pretty old now, they are still quite relevant.
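To check whether that non-zero-stream penalty shows up on your setup, a sketch along these lines (assumes the CUDA runtime API; `avg_launch_us` is a hypothetical helper, not part of any library) compares per-launch cost on the default stream versus a created stream:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void noop() {}

// Hypothetical helper: average per-launch time of an empty kernel on stream s.
static float avg_launch_us(cudaStream_t s, int n) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, s);
    for (int i = 0; i < n; ++i)
        noop<<<1, 32, 0, s>>>();     // 4th launch parameter selects the stream
    cudaEventRecord(stop, s);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 1000.0f * ms / n;
}

int main() {
    noop<<<1, 32>>>();               // warm-up
    cudaDeviceSynchronize();

    cudaStream_t s;
    cudaStreamCreate(&s);
    printf("default stream:  %.2f us/launch\n", avg_launch_us(0, 10000));
    printf("non-zero stream: %.2f us/launch\n", avg_launch_us(s, 10000));
    cudaStreamDestroy(s);
    return 0;
}
```

If the non-zero-stream numbers come out much higher than the default-stream ones, you are hitting the overhead described above and may be better off with plain synchronous launches.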
Unfortunately, it seems that not enough of us are asking for better latency everywhere in CUDA…