host time for N asynchronous kernel starts?


If one asynchronous kernel launch needs about 10 µs of host time, how much time do N kernel launches need?
Does it scale linearly with N, i.e. (10 µs * N) host time for N kernels?
Is there a maximum for N, beyond which the host time increases non-linearly?
If there is such a maximum, where can I look it up? (device properties, specification, experimentally, …)

I tried launching different numbers of asynchronous kernels in a for loop and measured the launch time on the host.

For about N < 1000 asynchronous kernel launches, the required host time is N * 10 µs.

For N > 1000 launches the host time increases non-linearly (very fast), and for N = 1100 the host time is nearly equal to the kernel run time.

The problem seems very trivial, but I could not find any answers yet.
My guess is that the GPU has a kernel scheduler that queues about 1024 kernels at a time.
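For reference, the measurement loop can be sketched like this (the dummy kernel and N = 1000 are just placeholders for my test setup):

```cuda
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

// Dummy kernel; the body is irrelevant, only the launch cost is measured.
__global__ void dummy() {}

int main() {
    const int N = 1000;   // number of asynchronous launches to time
    dummy<<<1, 1>>>();    // warm-up: the first launch includes context setup
    cudaDeviceSynchronize();

    auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < N; ++i)
        dummy<<<1, 1>>>();           // asynchronous: returns immediately
    auto t1 = std::chrono::high_resolution_clock::now();  // host time only
    cudaDeviceSynchronize();         // now wait for the GPU to finish

    double us = std::chrono::duration<double, std::micro>(t1 - t0).count();
    printf("%d launches: %.1f us total, %.2f us per launch\n", N, us, us / N);
    return 0;
}
```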

GeForce GTX 470 (compute capability 2.0)

On my PC the kernel launch time is about 5 µs (Win7 x64). And yes, there is a queue of launched kernels which, when full, will stall the CPU.

Thank you sergeyn.

Any idea where to look up this queue size? Is it possible to query its current capacity at runtime?

Just write a very heavy kernel and keep launching it asynchronously, timing the launches, until you detect that a launch takes significantly longer than the ~10 µs baseline.
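A minimal probe along these lines might look as follows; the 100 µs threshold for "significantly longer" and the kernel's iteration count are arbitrary assumptions you would tune for your machine:

```cuda
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

// A kernel slow enough that the queue cannot drain while we probe.
__global__ void heavy() {
    // volatile keeps the compiler from optimizing the loop away
    for (volatile long long i = 0; i < 10000000LL; ++i) {}
}

int main() {
    heavy<<<1, 1>>>();            // warm up
    cudaDeviceSynchronize();

    for (int i = 0; i < 4096; ++i) {
        auto t0 = std::chrono::high_resolution_clock::now();
        heavy<<<1, 1>>>();
        auto t1 = std::chrono::high_resolution_clock::now();
        double us = std::chrono::duration<double, std::micro>(t1 - t0).count();
        if (us > 100.0) {         // this launch blocked -> queue was full
            printf("queue filled after ~%d pending launches\n", i);
            break;
        }
    }
    cudaDeviceReset();            // discard the pending work
    return 0;
}
```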

I also think you should not rely on the queue size in your code, since it may differ between driver versions.

Thank you sergeyn.

My program launches much more than 1000 kernels right at the beginning and defines dependencies between them with the help of streams, events, and cudaStreamWaitEvent.
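A minimal sketch of one such dependency, assuming two dummy kernels where kernelB in stream s2 must not start before kernelA in stream s1 has finished:

```cuda
#include <cuda_runtime.h>

__global__ void kernelA() {}
__global__ void kernelB() {}

int main() {
    cudaStream_t s1, s2;
    cudaEvent_t aDone;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    cudaEventCreateWithFlags(&aDone, cudaEventDisableTiming);

    kernelA<<<1, 1, 0, s1>>>();
    cudaEventRecord(aDone, s1);         // marks completion of kernelA
    cudaStreamWaitEvent(s2, aDone, 0);  // s2 waits on the GPU, not the host
    kernelB<<<1, 1, 0, s2>>>();

    cudaDeviceSynchronize();
    return 0;
}
```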

While the launched kernels are running, I wanted to use the CPU for a different task. But if there is a limit on the queue size, my approach is not suitable.

At the very least I have to reduce the number of asynchronous kernel launches.

I guess the kernel queue is not designed to receive all the work at once.

Is there any other technique where I can define an “inactive” queue, add kernels to it, and start that queue some time later?

You could submit 100 async kernels, then record an event, then submit another 100 kernels and start your CPU work, polling for the event. When the event gets signaled, you record it again (to track the second batch of 100 you submitted earlier) and submit the next 100 kernels, and so on. Note that while you do this the GPU will always be busy, since you are always 100 kernels ahead.
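This scheme can be sketched roughly as below; the chunk size of 100, the totals, and doSomeCpuWork are placeholders for your own workload:

```cuda
#include <cuda_runtime.h>

__global__ void work() { /* real work here */ }

// Hypothetical CPU-side task done between event polls.
void doSomeCpuWork() {}

int main() {
    const int CHUNK = 100, TOTAL = 2000;
    cudaEvent_t mark;
    cudaEventCreateWithFlags(&mark, cudaEventDisableTiming);

    int submitted = 0;
    for (int i = 0; i < CHUNK; ++i, ++submitted) work<<<1, 1>>>();
    cudaEventRecord(mark);                  // end of the first chunk
    for (int i = 0; i < CHUNK; ++i, ++submitted) work<<<1, 1>>>();
    // One whole chunk is now queued behind the marked one.

    while (submitted < TOTAL) {
        if (cudaEventQuery(mark) == cudaSuccess) {  // marked chunk drained
            cudaEventRecord(mark);   // track the chunk still in flight
            for (int i = 0; i < CHUNK && submitted < TOTAL; ++i, ++submitted)
                work<<<1, 1>>>();    // top up: stay one chunk ahead
        } else {
            doSomeCpuWork();         // GPU busy -> use the CPU meanwhile
        }
    }
    cudaDeviceSynchronize();
    return 0;
}
```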

You could also use multithreading to keep things even simpler: one thread submits kernels (and gets blocked whenever the driver decides), another does the CPU calculations. You can even bump the priority of the GPU thread so it pushes more kernels the moment there is empty space in the queue.
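The two-thread variant could look roughly like this (the kernel count and the body of the CPU loop are placeholders):

```cuda
#include <thread>
#include <atomic>
#include <cuda_runtime.h>

__global__ void work() {}

int main() {
    std::atomic<bool> done(false);

    // Submitter thread: blocks inside the launch call whenever the
    // driver's queue is full, without stalling the compute thread.
    std::thread gpuThread([&] {
        for (int i = 0; i < 2000; ++i)
            work<<<1, 1>>>();
        cudaDeviceSynchronize();
        done = true;
    });

    while (!done) {
        // CPU calculations run here, concurrently with kernel submission.
    }
    gpuThread.join();
    return 0;
}
```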

That is a good idea; I’ll try both approaches.

If I use the thread approach and the thread stalls because the kernel queue is full, how busy is the CPU core the thread runs on? 100%?

I.e., is the stall a busy wait or an idle wait (some kind of polling)? How can the idleness be measured?

By default, the stall is a spin-wait with 100% CPU utilization. You can change that default, though: look up cudaSetDeviceFlags in the reference manual. Be warned that using blocking sync dramatically increases the latency at which an event on the GPU is noticed on the CPU (from ~5 µs to 200+ µs).
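For reference, the flag has to be set before the context is created, i.e. before the first runtime call that touches the device:

```cuda
#include <cuda_runtime.h>

int main() {
    // Must run before the context exists; afterwards the call fails
    // with cudaErrorSetOnActiveProcess.
    cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);

    // From here on, synchronization calls (cudaDeviceSynchronize,
    // cudaEventSynchronize, ...) yield the CPU instead of spin-waiting,
    // at the cost of higher wakeup latency.
    return 0;
}
```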

I think it is controlled by a flags parameter when you create a context with cuCtxCreate (in the driver API; I’m not sure what the equivalent is if you are not using the driver API).

Thank you DrAnderson42 and sergeyn,

I’ll try the cudaSetDeviceFlags function, look for the corresponding functionality in the driver API, and summarize my results!