One asynchronous kernel start needs about 10 µs of host time; how much time do N kernel starts need?
Does it scale linearly with N, i.e. (10 µs * N) host time for N kernels?
Is there a maximum for N after which the host time increases non-linearly?
If there is a maximum for N, where can I look it up? (device properties, specification, experimentally, …)
I tried starting different numbers of asynchronous kernels in a for loop and measured the launch time on the host.
For roughly N < 1000 asynchronous kernel starts, the required host time is N * 10 µs.
For N > 1000 starts, the host time increases non-linearly (very fast), and for N = 1100 the host time is nearly equal to the kernel run time.
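A minimal sketch of this kind of measurement; the kernel name, launch configuration and N are placeholders, not the original code:

```
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

__global__ void dummyKernel() { /* heavy work would go here */ }

int main() {
    const int N = 1100;                 // number of asynchronous launches to time
    cudaFree(0);                        // force context creation before timing

    auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < N; ++i)
        dummyKernel<<<1, 1>>>();        // asynchronous launch, returns immediately
    auto t1 = std::chrono::high_resolution_clock::now();   // host time for the launches only

    cudaDeviceSynchronize();            // wait for the kernels themselves to finish

    double us = std::chrono::duration<double, std::micro>(t1 - t0).count();
    printf("N = %d launches took %.1f us on the host (%.2f us per launch)\n",
           N, us, us / N);
    return 0;
}
```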
The problem seems very trivial, but I could not find any answers yet.
I guess the GPU has a kernel scheduler which queues about 1024 kernels at a time.
Just write a very heavy kernel and keep launching it asynchronously, timing the launches, until you detect that the launch time is significantly bigger than Niter * 10 µs.
At the beginning, my program starts well over 1000 kernels and defines dependencies between them with the help of streams, events and cudaStreamWaitEvent.
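A minimal sketch of that dependency mechanism; kernelA/kernelB and the launch configurations are placeholders, not the original program:

```
#include <cuda_runtime.h>

__global__ void kernelA() {}
__global__ void kernelB() {}

void launchWithDependency()
{
    cudaStream_t s0, s1;
    cudaEvent_t  aDone;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);
    cudaEventCreateWithFlags(&aDone, cudaEventDisableTiming);

    kernelA<<<1, 1, 0, s0>>>();         // producer in stream s0
    cudaEventRecord(aDone, s0);         // marks the point where kernelA has finished
    cudaStreamWaitEvent(s1, aDone, 0);  // s1 does not run past this point until aDone fires
    kernelB<<<1, 1, 0, s1>>>();         // consumer, guaranteed to start after kernelA

    cudaDeviceSynchronize();            // let everything drain before cleanup
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaEventDestroy(aDone);
}
```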
While the started kernels are working, I wanted to utilize the CPU for a different task. But if there is a limit on the queue size, my approach is not suitable.
At the very least, I have to decrease the number of asynchronous kernel starts.
I guess the kernel queue is not designed to receive all the work at once.
Is there any other technique where I can define an “inactive” queue, add kernels to it, and start the queue at some later time?
You could submit 100 async kernels, then record an event, then submit another 100 kernels and start your CPU work, polling for the event. When the event gets signaled, you record it again (so it now tracks the second pack of 100 you submitted earlier) and submit the next 100 kernels, and so on. Note that while you do this, the GPU will always be busy, since you are always submitting 100 kernels ahead.
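A rough sketch of that scheme, polling with cudaEventQuery between batches; the kernel, batch size and batch count are placeholders:

```
#include <cuda_runtime.h>

__global__ void work() {}

static void submitBatch(int count, cudaStream_t s)
{
    for (int i = 0; i < count; ++i)
        work<<<1, 1, 0, s>>>();
}

void run(int totalBatches)
{
    const int batch = 100;
    cudaStream_t s;
    cudaEvent_t  batchDone;
    cudaStreamCreate(&s);
    cudaEventCreateWithFlags(&batchDone, cudaEventDisableTiming);

    submitBatch(batch, s);              // first batch
    cudaEventRecord(batchDone, s);      // event marks the end of the first batch
    submitBatch(batch, s);              // second batch, so the GPU stays one batch ahead

    for (int b = 2; b < totalBatches; ++b) {
        while (cudaEventQuery(batchDone) == cudaErrorNotReady) {
            // do a slice of CPU work here while the GPU drains the queue
        }
        cudaEventRecord(batchDone, s);  // now tracks the batch submitted last time
        submitBatch(batch, s);          // keep one batch ahead
    }

    cudaStreamSynchronize(s);
    cudaEventDestroy(batchDone);
    cudaStreamDestroy(s);
}
```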
You could also go with multithreading to keep things even simpler: one thread submits kernels (and gets blocked whenever the driver decides), another does the CPU calculations. You can even bump the priority of the GPU thread so that it pushes more kernels the moment there is empty space in the queue.
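A rough sketch of the two-thread variant, assuming a runtime version where the CUDA context is shared between host threads (CUDA 4.0 and later); the kernel and launch count are placeholders:

```
#include <thread>
#include <cuda_runtime.h>

__global__ void work() {}

int main()
{
    const int totalLaunches = 5000;

    // Launcher thread: just keeps submitting kernels; it blocks transparently
    // whenever the driver's launch queue is full.
    std::thread gpuThread([&] {
        for (int i = 0; i < totalLaunches; ++i)
            work<<<1, 1>>>();
        cudaDeviceSynchronize();
    });

    // ... CPU-side work runs here concurrently with the launches ...

    gpuThread.join();
    return 0;
}
```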
By default, the stall is a spin-wait with 100% CPU utilization. You can change the default, though. Look up cudaSetDeviceFlags in the reference manual. Be warned that using the blocking sync dramatically increases the latency at which the event on the GPU is noticed on the CPU (from 5 microseconds to 200+).
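A minimal sketch of switching to blocking sync; the flag has to be set before the context is created, i.e. before the first call that touches the device:

```
#include <cuda_runtime.h>

int main()
{
    cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync); // yield the CPU instead of spin-waiting
    // ... normal CUDA work; event/stream synchronization now blocks instead of spinning ...
    return 0;
}
```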
I think it is controlled by a flag when you create a context with cuCtxCreate (in the driver API; not sure what it is if you are not using the driver API).
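For the driver API, a minimal sketch would pass the scheduling flag directly to cuCtxCreate:

```
#include <cuda.h>

int main()
{
    CUdevice  dev;
    CUcontext ctx;
    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, CU_CTX_SCHED_BLOCKING_SYNC, dev);  // block instead of spin-wait
    // ... driver-API work ...
    cuCtxDestroy(ctx);
    return 0;
}
```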