Hi,
I’m using CUDA 5.0 on Linux. While profiling on a K10 card I’ve noticed that an empty
kernel (with no parameters and no code in it) takes ~70 us as shown by NVVP.
My app is a multi-GPU one, so when I run the same kernel on 4 GPUs (so the
grid size for that empty kernel is 1/4 of the original), NVVP reports ~30 us.
Is this overhead acceptable/reasonable? Also, I was under the impression that
the compiler would remove an empty kernel entirely and never launch it…
On single-GPU systems under 64-bit Linux I typically see launch overhead for empty kernels (i.e. no code and no kernel arguments) of 5 us or less. It differs a bit based on GPU type, CUDA version, and host system; the lowest times I have observed were about 3 us.
From that perspective the reported launch overheads seem larger than expected. However, I have not run with a K10, measured launch overhead when more than one GPU is present in a system, or checked exactly how NVVP reports launch overhead. I have always used my own simple app for measuring launch overhead using batches of back-to-back kernel launches. I wouldn’t be surprised if NVVP needs to add instrumentation to launches which then increases the launch overhead, so you may want to run your own test.
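For what it’s worth, the measurement approach I described can be sketched roughly as follows. This is not my exact app, just a minimal illustration of the back-to-back-launch idea; the kernel name, launch count, and use of CUDA events are all my choices here, and you would want to adapt it to your own setup:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// deliberately empty kernel: no arguments, no code
__global__ void empty_kernel(void) {}

int main(void)
{
    const int N = 10000;   // number of back-to-back launches to average over

    // warm-up launch so one-time initialization cost is excluded
    empty_kernel<<<1, 1>>>();
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int i = 0; i < N; i++) {
        empty_kernel<<<1, 1>>>();
    }
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("average launch overhead: %.2f us\n", ms * 1000.0f / N);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```

Running a loop like this without the profiler attached gives you a baseline to compare against the numbers NVVP reports.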
I have never seen the compiler optimize out empty kernel launches, and I am not sure that would be a good idea (for one thing, it would eliminate a convenient way of measuring the minimum launch overhead :-)