When multiple host threads call a kernel, do all host threads get control back at once?

If my understanding is right, concurrent kernel function calls are serialized on the device.
In the CUDA Programming Guide (section, it says that “Control is returned to the application before the device has completed the requested task”.

My question is this:
If multiple host threads call the same kernel or different kernel functions, what happens?

I can think of two cases:

  1. All kernel launches return control to their host threads at the same time.
  2. One kernel starts to run and returns control, but the other host threads have to wait until the running kernel function has finished.

Does anyone have experience with this situation?


The Programming Manual says clearly that a kernel launch is asynchronous, i.e. control is returned to the host immediately. A launch will not wait for previous launches to complete unless you try to queue too many kernel calls (the limit is 140 or so for CUDA 2.0, if I remember correctly, after which an implicit synchronization occurs).
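If you want to check this on your own setup, here is a rough sketch, not a definitive test. It launches a slow kernel twice from a single host thread and asks whether the first kernel is still running once the second launch has returned. The names `spin` and `second_launch_returns_early` and the cycle count are made up for illustration. It assumes a current CUDA runtime (`cudaDeviceSynchronize`; CUDA 2.0-era code would call `cudaThreadSynchronize` instead) and a device of compute capability 2.0 or later for `clock64()`. Error checking is left out for brevity.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper kernel: busy-waits for roughly `cycles` GPU clock
// ticks so that it stays on the device long enough to be observed.
__global__ void spin(long long cycles)
{
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

// Returns true if the second launch handed control back to the host
// while the first kernel was still running on the device.
bool second_launch_returns_early()
{
    const long long cycles = 100000000LL;  // assumed long enough to observe

    cudaEvent_t first_done;
    cudaEventCreate(&first_done);

    spin<<<1, 1>>>(cycles);            // first launch
    cudaEventRecord(first_done, 0);    // marks the end of the first kernel
    spin<<<1, 1>>>(cycles);            // second launch, queued behind the first

    // cudaEventQuery does not block: cudaErrorNotReady means the first
    // kernel has not finished yet, although both launches have returned.
    bool first_still_running =
        (cudaEventQuery(first_done) == cudaErrorNotReady);

    cudaDeviceSynchronize();           // wait for both kernels before cleanup
    cudaEventDestroy(first_done);
    return first_still_running;
}
```

Call `second_launch_returns_early()` from your own `main` and build with `nvcc`. If it returns true, the host got control back from both launches before the first kernel had finished, which matches your case 1 for the launch calls themselves, even though the two kernels may still execute one after the other on the device.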