Order of magnitude of the overhead of 1k kernels? (kernel startup overhead)

yk_cadcg · March 15, 2007, 11:01am

Hi, I have to call something like this:
for (i = 0; i < 32; i++)
    for (j = 0; j < 32; j++)
        func(kernel1(i, j), kernel2(i, j));
The two “for” loops can’t be combined. Each kernel takes about 20 ms.
May I ask: what is the order of magnitude of the launch overhead for this many kernels?
Does the overhead outweigh the kernels themselves (20 ms each)? Thanks!

yk_cadcg · March 17, 2007, 2:37pm

No replies yet? Any help would be appreciated.

tachyon_john · March 17, 2007, 7:16pm

I ran some kernel invocation overhead tests last week and found that the invocation overhead has a lot to do with the grid and block sizes and dimensionality involved. I ran one test with a large number of 2-D blocks which gave me a peak “null” kernel invocation rate of only 512 per second. With 1-D blocks and grid of a smaller size, I get numbers as high as 55,000 kernel invocations per second. So, the answer currently appears to be “it depends”. I don’t know which factors play the biggest role as I haven’t had time to go further with testing yet.

John
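A test of this kind can be sketched roughly as follows. This is not John's actual test code, only a minimal illustration of measuring a “null” kernel invocation rate for a given grid and block shape. The names `null_kernel` and `launch_rate` are made up for this example, and it uses current CUDA runtime calls (`cudaDeviceSynchronize`, CUDA events), which may differ from what the 2007 toolkit offered.

```cuda
#include <cuda_runtime.h>

// Empty kernel: it does no work, so the measured time is dominated by
// launch and block-scheduling cost rather than computation.
__global__ void null_kernel() {}

// Launches null_kernel `launches` times with the given grid/block shape and
// returns the sustained rate in kernel invocations per second.
// Returns a negative value if the runtime reports a launch error.
double launch_rate(dim3 grid, dim3 block, int launches) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up launch, excluded from the timing (context setup, module load).
    null_kernel<<<grid, block>>>();
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    for (int i = 0; i < launches; ++i) {
        null_kernel<<<grid, block>>>();
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until every queued launch has finished

    float elapsed_ms = 0.0f;
    cudaEventElapsedTime(&elapsed_ms, start, stop);
    cudaError_t err = cudaGetLastError();

    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    if (err != cudaSuccess) {
        return -1.0;
    }
    return launches / (elapsed_ms * 1e-3);
}
```

For example, `launch_rate(dim3(16, 16), dim3(16, 16), 1000)` times 1000 launches of a 16x16 grid of 16x16-thread blocks (256 blocks), while `launch_rate(dim3(64), dim3(256), 1000)` does the same for a small 1-D configuration. Sweeping the grid size while holding the block size fixed is one way to separate the effect of block count from the effect of dimensionality.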

tachyon_john · March 17, 2007, 7:41pm

I just ran some more tests in both 1-D and 2-D blocks/grids and it seems to me that the main issue in kernel invocation overhead is probably the number of blocks. If the number of threads per block doesn’t divide evenly into 768, that may also be an issue, but my quick test code seems to show a decline in kernel invocation rate that primarily corresponds with the number of blocks. It appears to me that once you have more than 64 blocks, they are no longer “free”, and you begin to incur additional scheduling overhead as the number of blocks continues to increase beyond 64.

In my test code, I get 50,000 kernels/sec for 16x16 thread blocks with 256 blocks in a square grid. I get 40,000 kernels/sec at 1024 blocks, 20,000 kernels/sec for 4096 blocks, and 10,000 kernels/sec for 16384 blocks. These kernels are doing no work at all and are just calling “return”. If your kernels are doing any sort of useful work at all, I would think that having 1024 blocks would be perfectly fine.

John

yk_cadcg · March 18, 2007, 3:10am

Thanks very much for your analysis and suggestions!

