Overlapping execution / data transfer & kernel execution order

I’m seeing some odd behavior and I suspect it can be traced to kernel execution order.

Consider this scenario (sketched in code after the list). We have two threads (A and B) and three streams (1 to 3).

  • Thread A enqueues a large asynchronous device->host data transfer in stream 1.
  • Thread A enqueues 100 asynchronous kernels in stream 2. They start executing.
  • Simultaneously, thread B enqueues a data transfer and a kernel execution in stream 3.
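
Roughly, this is what gets issued; a minimal sketch where the kernels, buffer names and sizes are placeholders I made up, and error checking is omitted:

    #include <cuda_runtime.h>

    __global__ void kernelA(float *d) { d[threadIdx.x] += 1.0f; }
    __global__ void kernelB(float *d) { d[threadIdx.x] *= 2.0f; }

    // Thread A: one large D2H copy in stream 1, then 100 kernels in stream 2.
    void threadA(cudaStream_t stream1, cudaStream_t stream2,
                 float *hostDst, const float *devSrc, size_t largeBytes,
                 float *devWork)
    {
        cudaMemcpyAsync(hostDst, devSrc, largeBytes,
                        cudaMemcpyDeviceToHost, stream1);
        for (int i = 0; i < 100; ++i)
            kernelA<<<1, 256, 0, stream2>>>(devWork);
    }

    // Thread B (running concurrently): a copy and then a kernel, both in
    // stream 3; within the stream the kernel waits for the copy to finish.
    void threadB(cudaStream_t stream3, float *devBuf, const float *hostSrc,
                 size_t bytes)
    {
        cudaMemcpyAsync(devBuf, hostSrc, bytes,
                        cudaMemcpyHostToDevice, stream3);
        kernelB<<<1, 256, 0, stream3>>>(devBuf);
    }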

Would it be correct to assume that the stream 3 kernel is enqueued to execute in the order in which it was received, and that this order remains fixed afterwards? And is there any way to prevent this from happening?

The reason this matters is that, if the device has only one copy (DMA) engine (e.g. GeForce), the stream 3 transfer has to wait for the stream 1 transfer to finish. So the device would end up executing, say, half of the stream 2 kernels, then stop everything and wait, because it needs to launch the stream 3 kernel, and that in turn requires the stream 3 transfer to finish first.

“Would it be correct to assume that the stream 3 kernel is enqueued to execute in the order in which it was received, and that this order remains fixed afterwards? And is there any way to prevent this from happening?”

Streams are generally asynchronous with respect to each other, and synchronous with respect to themselves.
Work within a stream is executed in order; however, there are few guarantees about ordering across multiple streams.
You could assign priorities to streams, but I am not certain whether that would remedy the matter.
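
If you do want to try priorities, the general shape would be something like this (a sketch only; whether it actually helps on your hardware is another matter):

    // Query the supported priority range first; numerically lower values
    // mean higher priority.
    int leastPrio = 0, greatestPrio = 0;
    cudaDeviceGetStreamPriorityRange(&leastPrio, &greatestPrio);

    // Create one low-priority and one high-priority stream.
    cudaStream_t lowPrioStream, highPrioStream;
    cudaStreamCreateWithPriority(&lowPrioStream,  cudaStreamNonBlocking, leastPrio);
    cudaStreamCreateWithPriority(&highPrioStream, cudaStreamNonBlocking, greatestPrio);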

I think you have essentially, perhaps unwittingly, made the case that it is sometimes preferable not to schedule a large transfer all at once, but rather to schedule it in blocks.
Priorities assigned to streams are more likely to have an effect when the work in a stream can be ‘paused’ or ‘partitioned’.
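
For instance, something along these lines (a rough sketch; the chunk size is arbitrary, and hostDst/devSrc/largeBytes/stream1 are meant to be the buffers from your original scenario):

    // Break the one large D2H copy into smaller chunks, each enqueued
    // separately; with a single copy engine, another stream's copy can then
    // be serviced between chunks instead of waiting behind the whole thing.
    const size_t chunkBytes = 1 << 20;  // 1 MiB per chunk, arbitrary
    for (size_t off = 0; off < largeBytes; off += chunkBytes) {
        size_t n = (largeBytes - off < chunkBytes) ? (largeBytes - off) : chunkBytes;
        cudaMemcpyAsync((char *)hostDst + off, (const char *)devSrc + off, n,
                        cudaMemcpyDeviceToHost, stream1);
    }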

Also, I am of the opinion that the order in which you issue work is important.
You may have 2 host threads, but in my mind the work is still serviced by 1 driver, and the driver generally services work on a first-come, first-served basis.
This is also evident with a single host thread, when issuing work within loops (see the sketch below).
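
A toy illustration of the single-thread case; myKernel, the streams and buffers are placeholders, and the point is only that the two loop orders present the same work to the driver in a different order:

    // Depth-first: all of stream[0]'s launches are issued before stream[1]'s.
    for (int s = 0; s < nStreams; ++s)
        for (int i = 0; i < nLaunches; ++i)
            myKernel<<<1, 256, 0, stream[s]>>>(devBuf[s]);

    // Breadth-first: launches are interleaved across streams; the work is the
    // same, but the order in which the driver/hardware sees it is not.
    for (int i = 0; i < nLaunches; ++i)
        for (int s = 0; s < nStreams; ++s)
            myKernel<<<1, 256, 0, stream[s]>>>(devBuf[s]);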

Hence, if the D2H transfer is of lower priority, perhaps schedule it last instead of first.
Also, if thread A's kernels depend on thread B's kernel, perhaps schedule only a portion of A's kernels before scheduling thread B's kernel, rather than all of them at once.
And if thread B's kernel waits on an H2D transfer, that particular transfer should perhaps enjoy a higher, or the highest, priority.

I would even question whether you really need multiple host threads; it seems only to complicate prioritizing the stream work.

Yes, streams are supposed to be asynchronous; that’s why this behavior has me stumped.

I’m seeing this in the context of nvcuvid. One thread is internal to nvcuvid and is in charge of hardware-accelerated video decoding; the other is mine and does further work on the decoded frames. So dropping down to a single host thread isn’t an option (it would require a lot of rearchitecting).

The API reference guide seems to say that stream priorities are not available on GeForce.
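
For what it’s worth, one way to see what the device actually supports is to query the priority range at run time; as far as I can tell, a device without stream priority support reports 0 for both ends of the range:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        int leastPrio = 0, greatestPrio = 0;
        cudaDeviceGetStreamPriorityRange(&leastPrio, &greatestPrio);
        // Both values coming back as 0 would mean only a single priority level.
        printf("stream priority range: least=%d greatest=%d\n", leastPrio, greatestPrio);
        return 0;
    }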