Order of magnitude of the overhead of 1k kernels? Kernel startup overhead

Hi, I have to call something like this:
for (i = 0; i < 32; i++)
    for (j = 0; j < 32; j++)
        func(kernel1(i, j), kernel2(i, j));
The two “for” loops can’t be combined. Each kernel takes about 20 ms.
May I ask: what’s the order of magnitude of the launch overhead for this many kernels?
Does the overhead outweigh the kernel run time (20 ms) itself? Thanks!

No replies yet? Any help would be appreciated.

I ran some kernel invocation overhead tests last week and found that the invocation overhead has a lot to do with the grid and block sizes and the dimensionality involved. I ran one test with a large number of 2-D blocks, which gave me a peak “null” kernel invocation rate of only 512 per second. With 1-D blocks and a smaller grid, I get numbers as high as 55,000 kernel invocations per second. So the answer currently appears to be “it depends”. I don’t know which factors play the biggest role, as I haven’t had time to go further with testing yet.

John

I just ran some more tests with both 1-D and 2-D blocks/grids, and it seems to me that the main factor in kernel invocation overhead is probably the number of blocks. If the block size doesn’t divide evenly into 768 threads, that may also be an issue, but my quick test code shows a decline in kernel invocation rate that primarily tracks the number of blocks. It appears to me that once you have more than 64 blocks, they are no longer “free”, and you incur additional scheduling overhead as the number of blocks increases beyond 64.

In my test code, using 16x16 thread blocks in a square grid, I get 50,000 kernels/sec at 256 blocks, 40,000 kernels/sec at 1024 blocks, 20,000 kernels/sec at 4096 blocks, and 10,000 kernels/sec at 16384 blocks. These kernels do no work at all and just call “return”. If your kernels are doing any sort of useful work, I would think that having 1024 blocks would be perfectly fine.

John
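
For anyone who wants to repeat this kind of measurement, here is a minimal sketch of a null-kernel launch-rate test. It is not John's test code, which was not posted; it only assumes the setup he describes (16x16 thread blocks, square grids of 256 to 16384 blocks, kernels that return immediately). The names null_kernel, launches_per_second, and check, the repeat count of 1000, and the use of CUDA events with a current runtime (cudaDeviceSynchronize) are all assumptions made for illustration.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Empty kernel: it does no work, so the time measured is launch and
// block-scheduling overhead only.
__global__ void null_kernel() {}

// Convert a launch count and an elapsed time in milliseconds
// into launches per second.
static double launches_per_second(int launches, float elapsed_ms) {
    return launches / (elapsed_ms / 1000.0);
}

// Abort with a message if a CUDA runtime call failed.
static void check(cudaError_t err, const char *what) {
    if (err != cudaSuccess) {
        fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

int main() {
    const int launches = 1000;                   // hypothetical repeat count
    const dim3 block(16, 16);                    // 16x16 threads per block
    const int grid_sides[] = {16, 32, 64, 128};  // 256, 1024, 4096, 16384 blocks

    cudaEvent_t start, stop;
    check(cudaEventCreate(&start), "cudaEventCreate(start)");
    check(cudaEventCreate(&stop), "cudaEventCreate(stop)");

    // Warm-up launch so one-time context setup is not included in the timing.
    null_kernel<<<1, 1>>>();
    check(cudaDeviceSynchronize(), "warm-up launch");

    for (int side : grid_sides) {
        const dim3 grid(side, side);

        check(cudaEventRecord(start), "cudaEventRecord(start)");
        for (int i = 0; i < launches; ++i) {
            null_kernel<<<grid, block>>>();
        }
        check(cudaEventRecord(stop), "cudaEventRecord(stop)");
        check(cudaEventSynchronize(stop), "cudaEventSynchronize");
        check(cudaGetLastError(), "kernel launch");

        float ms = 0.0f;
        check(cudaEventElapsedTime(&ms, start, stop), "cudaEventElapsedTime");
        printf("%6d blocks: %.0f launches/sec\n",
               side * side, launches_per_second(launches, ms));
    }

    check(cudaEventDestroy(start), "cudaEventDestroy(start)");
    check(cudaEventDestroy(stop), "cudaEventDestroy(stop)");
    return 0;
}
```

Applied to the original question, the rates quoted above work out to roughly 20 to 100 microseconds per launch (50,000 down to 10,000 launches/sec). The 32x32 loop with two kernels per iteration is about 2048 launches, so the launch overhead would be somewhere around 0.04 to 0.2 seconds in total, against about 41 seconds of kernel time at 20 ms each. Those figures come from one person's hardware and empty kernels, so treat them as an order-of-magnitude estimate only.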

Thanks very much for your analysis and suggestions!