Cost of launching a kernel function

Hi:
My current kernel function is heavily divergent, so I am considering decomposing it into smaller kernel functions to avoid divergence and to reduce register use in each kernel. My only worry is whether the cost of launching a kernel is so high that I would not gain anything by doing this.
I just want to get a rough idea about the cost before spending significant time decomposing the code. Thanks!

The exact overhead varies, but a rough estimate of the launch overhead is about 20 microseconds. I’m always impressed how low this actually is.
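If you would rather measure than estimate, here is a minimal sketch for timing launches on your own machine. The names `emptyKernel` and `averageLaunchMicroseconds` are made up for this example. Note that it times a queue of back-to-back asynchronous launches, so it reports the average cost per launch under load, not the latency of one isolated launch, and the result depends on the GPU, driver, and OS.

```cuda
#include <chrono>
#include <cuda_runtime.h>

// Hypothetical empty kernel: it does no work, so the measured time is mostly launch cost.
__global__ void emptyKernel() {}

// Returns the average wall-clock time per launch in microseconds,
// or -1.0 if a CUDA call fails (for example, when no GPU is present).
double averageLaunchMicroseconds(int launches) {
    emptyKernel<<<1, 1>>>();  // warm-up launch: keeps one-time setup out of the timing
    if (cudaDeviceSynchronize() != cudaSuccess) return -1.0;

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < launches; ++i) {
        emptyKernel<<<1, 1>>>();  // launches are asynchronous and queue up
    }
    // Wait until every queued kernel has finished before stopping the clock.
    if (cudaDeviceSynchronize() != cudaSuccess) return -1.0;
    auto stop = std::chrono::steady_clock::now();

    return std::chrono::duration<double, std::micro>(stop - start).count() / launches;
}
```

Calling `averageLaunchMicroseconds(1000)` a few times and comparing the numbers gives a feel for how stable the figure is on your setup.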

Multiple kernels can indeed help reduce register use. But I’m curious how you’ll reduce divergence this way… perhaps by applying compaction steps to your data in one kernel for your next kernel to process?
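To make the compaction idea concrete, here is a rough sketch, assuming a per-element flag array decides which elements need the expensive branch. All names (`compactIndices`, `processCompacted`, `doubleFlagged`) are hypothetical, the multiply by two is only a stand-in for the real branch body, and error checking is cut down to the final copy. The first kernel gathers the indices of flagged elements; the second kernel then works only on that dense list.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Pass 1: collect the indices of the elements that need the expensive branch.
__global__ void compactIndices(const int* flags, int n, int* outIdx, int* outCount) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && flags[i] != 0) {
        int slot = atomicAdd(outCount, 1);  // reserve a slot; the output order is not deterministic
        outIdx[slot] = i;
    }
}

// Pass 2: threads 0..count-1 all run the same branch body; the rest exit at once.
__global__ void processCompacted(const int* idx, const int* count, float* data) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < *count) data[idx[t]] *= 2.0f;  // stand-in for the real work
}

// Doubles every flagged element of `data` in place.
// Returns the number of flagged elements, or -1 on a size mismatch or CUDA error.
int doubleFlagged(const std::vector<int>& flags, std::vector<float>& data) {
    if (data.size() != flags.size()) return -1;
    int n = static_cast<int>(flags.size());
    if (n == 0) return 0;

    int *dFlags = nullptr, *dIdx = nullptr, *dCount = nullptr;
    float* dData = nullptr;
    cudaMalloc(&dFlags, n * sizeof(int));
    cudaMalloc(&dIdx, n * sizeof(int));
    cudaMalloc(&dCount, sizeof(int));
    cudaMalloc(&dData, n * sizeof(float));
    cudaMemcpy(dFlags, flags.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dData, data.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(dCount, 0, sizeof(int));

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    compactIndices<<<blocks, threads>>>(dFlags, n, dIdx, dCount);
    // The count stays on the device, so the second launch needs no host round trip.
    // The grid is sized for the worst case (every element flagged).
    processCompacted<<<blocks, threads>>>(dIdx, dCount, dData);

    int count = 0;
    cudaMemcpy(&count, dCount, sizeof(int), cudaMemcpyDeviceToHost);
    cudaError_t err = cudaMemcpy(data.data(), dData, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(dFlags);
    cudaFree(dIdx);
    cudaFree(dCount);
    cudaFree(dData);
    return err == cudaSuccess ? count : -1;
}
```

The atomic counter keeps the sketch short; a prefix-sum based compaction would preserve element order, which may matter for memory coalescing in the second kernel.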

Hi, that is really impressively short.

My code has a lot of “if” statements, but if I decompose it into several kernel functions, I think I can avoid them by having each kernel function take care of one “if” block. It is worth trying now.

Thanks!

Watch out: if you have to reload data at each kernel launch, it will be slow.
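A small sketch of why the reload happens: registers and shared memory do not survive across kernel launches, so anything one kernel hands to the next has to go through global memory. The kernels and the helper `squarePlusOne` below are invented for illustration. The split version writes the intermediate value to a `tmp` buffer and reads it back, while the fused version keeps it in a register.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Split, stage 1: the intermediate result must be written to global memory.
__global__ void stageOne(const float* in, float* tmp, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tmp[i] = in[i] * in[i];
}

// Split, stage 2: the intermediate result is reloaded from global memory.
__global__ void stageTwo(const float* tmp, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = tmp[i] + 1.0f;
}

// Fused: the intermediate value never leaves a register.
__global__ void fused(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float t = in[i] * in[i];
        out[i] = t + 1.0f;
    }
}

// Computes in[i] * in[i] + 1 with either the split or the fused kernels.
// Returns an empty vector if a CUDA call fails.
std::vector<float> squarePlusOne(const std::vector<float>& in, bool split) {
    int n = static_cast<int>(in.size());
    std::vector<float> out(n);
    if (n == 0) return out;

    size_t bytes = n * sizeof(float);
    float *dIn = nullptr, *dTmp = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dTmp, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemcpy(dIn, in.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    if (split) {
        stageOne<<<blocks, threads>>>(dIn, dTmp, n);   // extra global write
        stageTwo<<<blocks, threads>>>(dTmp, dOut, n);  // extra global read
    } else {
        fused<<<blocks, threads>>>(dIn, dOut, n);
    }

    cudaError_t err = cudaMemcpy(out.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dTmp);
    cudaFree(dOut);
    if (err != cudaSuccess) out.clear();
    return out;
}
```

Both paths give the same result; the split one just pays for one more write and one more read per element, on top of the second launch.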

If you are able to avoid divergence through decomposition into different kernels, you will most likely also be able to avoid divergence by rearranging threads so that within one warp the same codepath is executed.
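Here is one way the rearranging could look, as a sketch only. It assumes each element has a key that decides its code path, and it uses a permutation array so that elements with the same key end up next to each other. The kernel does no real work; it just counts warps whose 32 lanes disagree on the branch, which makes the effect of regrouping visible. `countMixedWarps` and `mixedWarps` are hypothetical names, and the regrouping is done on the host with `std::stable_partition` purely to keep the example short (in a real code it would more likely be a device-side sort or partition).

```cuda
#include <algorithm>
#include <numeric>
#include <vector>
#include <cuda_runtime.h>

// Counts warps whose lanes do not all agree on the branch condition.
// Assumes a 1D launch where the element count is a multiple of the block size
// and the block size is a multiple of 32, so every lane of every warp is active.
__global__ void countMixedWarps(const int* key, const int* perm, int* mixed) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    bool takesBranch = key[perm[i]] != 0;  // thread i handles element perm[i]
    unsigned ballot = __ballot_sync(0xffffffffu, takesBranch);
    bool uniform = (ballot == 0u) || (ballot == 0xffffffffu);
    if ((threadIdx.x % 32) == 0 && !uniform) atomicAdd(mixed, 1);
}

// Returns the number of mixed (divergent) warps, or -1 on bad input or CUDA error.
// With regroup == true, elements that take the branch are moved to the front.
int mixedWarps(const std::vector<int>& key, bool regroup) {
    const int threads = 128;
    int n = static_cast<int>(key.size());
    if (n == 0 || n % threads != 0) return -1;

    std::vector<int> perm(n);
    std::iota(perm.begin(), perm.end(), 0);  // identity: thread i handles element i
    if (regroup) {
        std::stable_partition(perm.begin(), perm.end(),
                              [&key](int i) { return key[i] != 0; });
    }

    int *dKey = nullptr, *dPerm = nullptr, *dMixed = nullptr;
    cudaMalloc(&dKey, n * sizeof(int));
    cudaMalloc(&dPerm, n * sizeof(int));
    cudaMalloc(&dMixed, sizeof(int));
    cudaMemcpy(dKey, key.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dPerm, perm.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(dMixed, 0, sizeof(int));

    countMixedWarps<<<n / threads, threads>>>(dKey, dPerm, dMixed);

    int mixed = 0;
    cudaError_t err = cudaMemcpy(&mixed, dMixed, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dKey);
    cudaFree(dPerm);
    cudaFree(dMixed);
    return err == cudaSuccess ? mixed : -1;
}
```

With keys that alternate 0, 1, 0, 1 over 1024 elements, every warp is mixed in the original order, and none is after regrouping. The price is the indirection through `perm`, which can make the memory accesses less coalesced, so it is worth measuring both versions.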

That is true. I need to pay more attention.

I was thinking about this, but so far I haven’t figured out a way to do it in my kernel function without decomposing it.

It’s way less than 20 microseconds to launch a kernel.

I read in an early paper by Volkov that it was around 4 microseconds…

Yeah it’s less than that now.

Well hats off to you guys then :)