I'm seeing an odd problem, and I suspect it comes down to the order in which kernels are executed.
Consider this scenario. We have two host threads (A and B) and three streams (1 to 3).
- Thread A enqueues a large asynchronous device->host data transfer in stream 1.
- Thread A enqueues 100 asynchronous kernels in stream 2. They start executing.
- Simultaneously, thread B enqueues a data transfer and a kernel execution in stream 3.
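For concreteness, the three steps above could be reproduced with something like the following. This is only a minimal sketch: the kernel body, the buffer sizes, and the names `busyKernel` and `runScenario` are placeholders invented for illustration, not the real workload, and the direction of the stream 3 transfer (host to device) is an assumption.

```cuda
#include <cstdio>
#include <cstring>
#include <thread>
#include <cuda_runtime.h>

// Placeholder kernel standing in for the real work (hypothetical).
__global__ void busyKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

static bool ok(cudaError_t e) { return e == cudaSuccess; }

// Returns 0 if the whole scenario ran without a CUDA error, 1 otherwise.
int runScenario() {
    const size_t nBig = 1 << 24;   // assumed size of the large transfer
    const int nSmall = 1 << 20;    // assumed size of the kernel buffers
    const size_t bigBytes = nBig * sizeof(float);
    const size_t smallBytes = nSmall * sizeof(float);

    cudaStream_t s1, s2, s3;
    float *hBig = nullptr, *dBig = nullptr, *dWork = nullptr;
    float *hSmall = nullptr, *dSmall = nullptr;

    // Pinned host memory is required for copies to be truly asynchronous.
    if (!ok(cudaStreamCreate(&s1)) || !ok(cudaStreamCreate(&s2)) ||
        !ok(cudaStreamCreate(&s3)) ||
        !ok(cudaMallocHost((void **)&hBig, bigBytes)) ||
        !ok(cudaMallocHost((void **)&hSmall, smallBytes)) ||
        !ok(cudaMalloc((void **)&dBig, bigBytes)) ||
        !ok(cudaMalloc((void **)&dWork, smallBytes)) ||
        !ok(cudaMalloc((void **)&dSmall, smallBytes)) ||
        !ok(cudaMemset(dBig, 0, bigBytes)) ||
        !ok(cudaMemset(dWork, 0, smallBytes)))
        return 1;
    std::memset(hSmall, 0, smallBytes);

    const dim3 block(256), grid((nSmall + 255) / 256);

    // Thread A: large device->host copy in stream 1, then 100 kernels in stream 2.
    std::thread a([&] {
        cudaMemcpyAsync(hBig, dBig, bigBytes, cudaMemcpyDeviceToHost, s1);
        for (int k = 0; k < 100; ++k)
            busyKernel<<<grid, block, 0, s2>>>(dWork, nSmall);
    });
    // Thread B: a transfer followed by one kernel, both in stream 3.
    std::thread b([&] {
        cudaMemcpyAsync(dSmall, hSmall, smallBytes, cudaMemcpyHostToDevice, s3);
        busyKernel<<<grid, block, 0, s3>>>(dSmall, nSmall);
    });
    a.join();
    b.join();

    // Wait for all three streams and pick up any asynchronous error.
    cudaError_t err = cudaDeviceSynchronize();

    cudaFree(dBig); cudaFree(dWork); cudaFree(dSmall);
    cudaFreeHost(hBig); cudaFreeHost(hSmall);
    cudaStreamDestroy(s1); cudaStreamDestroy(s2); cudaStreamDestroy(s3);
    return ok(err) ? 0 : 1;
}

int main() {
    int rc = runScenario();
    std::printf("scenario %s\n", rc == 0 ? "completed" : "failed");
    return rc;
}
```

Running this under a timeline profiler should show whether the stream 2 kernels stall while stream 3 waits for its transfer. The relative order in which the two threads submit their work is not deterministic, so the effect may not appear on every run.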
Is it correct to assume that the stream 3 kernel is queued for execution in the order it was submitted (that is, somewhere among the stream 2 kernels), and that this position does not change afterwards? If so, is there any way to prevent that?
The reason it matters: if the device has only one DMA unit (e.g. GeForce), the stream 3 transfer has to wait for the stream 1 transfer to finish. The device would then execute, say, half of the stream 2 kernels and then stall everything, because the stream 3 kernel is next in line and it cannot launch until the stream 3 transfer has completed.
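Whether a given device really has a single DMA unit can be checked from the `asyncEngineCount` field of `cudaDeviceProp`. The sketch below assumes device 0 is the one in question; the helper name `copyEngineCount` is made up for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Returns the number of asynchronous copy engines on the given device,
// or -1 if the properties could not be queried.
int copyEngineCount(int device) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return -1;
    return prop.asyncEngineCount;
}

int main() {
    int n = copyEngineCount(0);  // device 0 is an assumption
    if (n < 0) {
        std::printf("could not query device 0\n");
        return 1;
    }
    // 1: copies in one direction at a time can overlap with kernels.
    // 2: host->device and device->host copies can also overlap each other.
    std::printf("async copy engines: %d\n", n);
    return 0;
}
```

A value of 1 would match the single-DMA case described above, where the stream 3 transfer queues behind the stream 1 transfer.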