CUDA 7.0, Maxwell (Titan X)
Create 2 non-blocking queues (CUDA streams), HostQ1 and HostQ2
Async launch kernel DyParK into queue HostQ1
Query HostQ1 to make sure DyParK starts
DyParK creates a non-blocking queue DeviceQ1 and periodically queues small kernels into it
After about 500 ms, all children of DyParK complete and DyParK exits
Meanwhile, the host code waits 100 ms (while DyParK and its children are still running), then asynchronously queues a few hundred small kernels into HostQ2, querying the queue after each launch to make sure it starts.
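A minimal sketch of the setup described above, for anyone who wants to reproduce it. Only DyParK, HostQ1, HostQ2 and DeviceQ1 come from my description. The rest is assumed for illustration: smallKernel, runExperiment, the CHECK macro, the launch counts (50 device launches, 300 host launches), and the spin length (10,000,000 cycles per gap, which is only roughly 10 ms at about 1 GHz). The parent here uses a single thread in its single block, and the build line in the first comment is also an assumption.

```cuda
// Assumed build line: nvcc -arch=sm_52 -rdc=true repro.cu -o repro -lcudadevrt
#include <cstdio>
#include <cstdlib>
#include <chrono>
#include <thread>
#include <cuda_runtime.h>

// Hypothetical helper: abort on any CUDA runtime error.
#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            fprintf(stderr, "%s failed: %s\n", #call,                   \
                    cudaGetErrorString(err_));                          \
            exit(EXIT_FAILURE);                                         \
        }                                                               \
    } while (0)

// Hypothetical small kernel, used for both device-side and host-side launches.
__global__ void smallKernel(int *counter) { atomicAdd(counter, 1); }

// Parent kernel: creates DeviceQ1 and periodically queues small kernels into it.
__global__ void DyParK(int *childCount, int launches, long long gapCycles)
{
    cudaStream_t deviceQ1;
    cudaStreamCreateWithFlags(&deviceQ1, cudaStreamNonBlocking);
    for (int i = 0; i < launches; ++i) {
        smallKernel<<<1, 1, 0, deviceQ1>>>(childCount);
        // Busy-wait between launches (assumed stand-in for the real work).
        long long start = clock64();
        while (clock64() - start < gapCycles) { }
    }
    cudaStreamDestroy(deviceQ1);
    // The parent is not considered complete until all its children finish.
}

// Runs the scenario once; writes how many child and host kernels executed.
int runExperiment(int *childCount, int *hostCount)
{
    const int kChildLaunches = 50;           // assumed
    const int kHostLaunches  = 300;          // "a few hundred"
    const long long kGapCycles = 10000000LL; // assumed

    cudaStream_t hostQ1, hostQ2;
    CHECK(cudaStreamCreateWithFlags(&hostQ1, cudaStreamNonBlocking));
    CHECK(cudaStreamCreateWithFlags(&hostQ2, cudaStreamNonBlocking));

    int *dCounters;  // [0] = device launches, [1] = host launches
    CHECK(cudaMalloc(&dCounters, 2 * sizeof(int)));
    CHECK(cudaMemset(dCounters, 0, 2 * sizeof(int)));
    CHECK(cudaDeviceSynchronize());  // non-blocking streams do not wait on the memset

    DyParK<<<1, 1, 0, hostQ1>>>(dCounters, kChildLaunches, kGapCycles);
    CHECK(cudaGetLastError());
    cudaStreamQuery(hostQ1);  // cudaErrorNotReady is expected while DyParK runs

    std::this_thread::sleep_for(std::chrono::milliseconds(100));

    for (int i = 0; i < kHostLaunches; ++i) {
        smallKernel<<<1, 1, 0, hostQ2>>>(dCounters + 1);
        cudaStreamQuery(hostQ2);  // query after each launch
    }

    CHECK(cudaDeviceSynchronize());
    int h[2];
    CHECK(cudaMemcpy(h, dCounters, sizeof(h), cudaMemcpyDeviceToHost));
    *childCount = h[0];
    *hostCount  = h[1];

    CHECK(cudaFree(dCounters));
    CHECK(cudaStreamDestroy(hostQ1));
    CHECK(cudaStreamDestroy(hostQ2));
    return 0;
}

int main()
{
    int childCount = 0, hostCount = 0;
    runExperiment(&childCount, &hostCount);
    printf("device-launched: %d, host-launched: %d\n", childCount, hostCount);
    return 0;
}
```

The counters only confirm that every launch ran. Whether the HostQ2 kernels overlap with DyParK still has to be read off the profiler timeline.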
I had hoped that the device-launched and host-launched kernels would overlap, but the Visual Profiler shows that they do not. All the host launches into HostQ2 are queued long before DyParK finishes, yet the first one only starts executing immediately after DyParK finishes.
DyParK is a single block and, according to the profiler, there are large gaps of time during which none of the device-launched kernels are running.
Why do none of the host-launched kernels execute until after the DyParK kernel finishes?