Overlapping execution / data transfer & kernel execution order

I’m seeing some odd behavior and I suspect it can be traced to kernel execution order.

Consider this scenario (sketched in code after the list). We have two threads (A and B) and three streams (1 to 3).

  • Thread A enqueues a large asynchronous device->host data transfer in stream 1.
  • Thread A enqueues 100 asynchronous kernels in stream 2. They start executing.
  • Simultaneously, thread B enqueues a data transfer and a kernel execution in stream 3.
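
Roughly, this is what gets issued; a minimal sketch where the kernels, buffer names and sizes are placeholders I made up, and error checking is omitted:

    #include <cuda_runtime.h>

    __global__ void kernelA(float *d) { d[threadIdx.x] += 1.0f; }
    __global__ void kernelB(float *d) { d[threadIdx.x] *= 2.0f; }

    // Thread A: one large D2H copy in stream 1, then 100 kernels in stream 2.
    void threadA(cudaStream_t stream1, cudaStream_t stream2,
                 float *hostDst, const float *devSrc, size_t largeBytes,
                 float *devWork)
    {
        cudaMemcpyAsync(hostDst, devSrc, largeBytes,
                        cudaMemcpyDeviceToHost, stream1);
        for (int i = 0; i < 100; ++i)
            kernelA<<<1, 256, 0, stream2>>>(devWork);
    }

    // Thread B (running concurrently): a copy and then a kernel, both in
    // stream 3; within the stream the kernel waits for the copy to finish.
    void threadB(cudaStream_t stream3, float *devBuf, const float *hostSrc,
                 size_t bytes)
    {
        cudaMemcpyAsync(devBuf, hostSrc, bytes,
                        cudaMemcpyHostToDevice, stream3);
        kernelB<<<1, 256, 0, stream3>>>(devBuf);
    }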

Would it be correct to assume that the stream 3 kernel is enqueued to execute in the order in which it was received, and that this order remains fixed afterwards? And is there any way to prevent this from happening?

The reason this matters is that, if the device has only one copy (DMA) engine (e.g. GeForce), the stream 3 transfer has to wait for the stream 1 transfer to finish. So the device would end up executing, say, half of the stream 2 kernels, then stop everything and wait, because it needs to launch the stream 3 kernel, and that in turn requires the stream 3 transfer to finish first.

“Would it be correct to assume that the stream 3 kernel is enqueued to execute in the order in which it was received, and that this order remains fixed afterwards? And is there any way to prevent this from happening?”

Streams are generally asynchronous with respect to each other, and synchronous with respect to themselves.
Work within a stream is executed in order; however, there are few guarantees about ordering across multiple streams.
You could assign priorities to streams, but I am not certain whether that would remedy the matter.
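
If you do want to try priorities, the general shape would be something like this (a sketch only; whether it actually helps on your hardware is another matter):

    // Query the supported priority range first; numerically lower values
    // mean higher priority.
    int leastPrio = 0, greatestPrio = 0;
    cudaDeviceGetStreamPriorityRange(&leastPrio, &greatestPrio);

    // Create one low-priority and one high-priority stream.
    cudaStream_t lowPrioStream, highPrioStream;
    cudaStreamCreateWithPriority(&lowPrioStream,  cudaStreamNonBlocking, leastPrio);
    cudaStreamCreateWithPriority(&highPrioStream, cudaStreamNonBlocking, greatestPrio);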

I think you have essentially, perhaps unwittingly, made the case that it is sometimes preferable not to schedule a large transfer all at once, but rather to schedule it in blocks.
Priorities assigned to streams are more likely to have an effect when the work in a stream can be ‘paused’ or ‘partitioned’.
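
For instance, something along these lines (a rough sketch; the chunk size is arbitrary, and hostDst/devSrc/largeBytes/stream1 are meant to be the buffers from your original scenario):

    // Break the one large D2H copy into smaller chunks, each enqueued
    // separately; with a single copy engine, another stream's copy can then
    // be serviced between chunks instead of waiting behind the whole thing.
    const size_t chunkBytes = 1 << 20;  // 1 MiB per chunk, arbitrary
    for (size_t off = 0; off < largeBytes; off += chunkBytes) {
        size_t n = (largeBytes - off < chunkBytes) ? (largeBytes - off) : chunkBytes;
        cudaMemcpyAsync((char *)hostDst + off, (const char *)devSrc + off, n,
                        cudaMemcpyDeviceToHost, stream1);
    }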

Also, I am of the opinion that the order in which you issue work is important.
You may have 2 host threads, but in my mind the work is still serviced by 1 driver, and the driver generally services work on a first-come, first-served basis.
This is also evident with a single host thread, when issuing work within loops (see the sketch below).
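
A toy illustration of the single-thread case; myKernel, the streams and buffers are placeholders, and the point is only that the two loop orders present the same work to the driver in a different order:

    // Depth-first: all of stream[0]'s launches are issued before stream[1]'s.
    for (int s = 0; s < nStreams; ++s)
        for (int i = 0; i < nLaunches; ++i)
            myKernel<<<1, 256, 0, stream[s]>>>(devBuf[s]);

    // Breadth-first: launches are interleaved across streams; the work is the
    // same, but the order in which the driver/hardware sees it is not.
    for (int i = 0; i < nLaunches; ++i)
        for (int s = 0; s < nStreams; ++s)
            myKernel<<<1, 256, 0, stream[s]>>>(devBuf[s]);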

Hence, if the D2H transfer is of lower priority, perhaps schedule it last instead of first.
Also, if thread A's kernels depend on thread B's kernel, perhaps schedule only a portion of A's kernels before scheduling thread B's kernel, rather than all of them at once.
And if thread B's kernel waits on an H2D transfer, that particular transfer should perhaps enjoy a higher, or the highest, priority.

I would even question whether you really need multiple host threads; it seems only to complicate prioritizing the stream work.

Yes, streams are supposed to be asynchronous; that’s why this behavior has me stumped.

I’m seeing this in the context of nvcuvid. One thread is internal to nvcuvid and is in charge of hardware-accelerated video decoding; the other is mine and does further work on the decoded frames. So dropping down to a single host thread isn’t an option (it would require a lot of rearchitecting).

The API reference guide seems to say that stream priorities are not available on GeForce.
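
For what it’s worth, one way to see what the device actually supports is to query the priority range at run time; as far as I can tell, a device without stream priority support reports 0 for both ends of the range:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        int leastPrio = 0, greatestPrio = 0;
        cudaDeviceGetStreamPriorityRange(&leastPrio, &greatestPrio);
        // Both values coming back as 0 would mean only a single priority level.
        printf("stream priority range: least=%d greatest=%d\n", leastPrio, greatestPrio);
        return 0;
    }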