Stream Job Scheduling

lars · November 25, 2009, 3:10am

Hi,

How are asynchronous jobs in different streams scheduled? Let's say I have 100 kernel/memcpyAsync jobs posted and ready to launch in “stream0”. If I post a single job in “stream1”, will that job have to wait for all jobs in stream0 to finish, or are jobs from different streams launched in some kind of round-robin fashion (assuming there are jobs ready to launch in both streams)?

Thanks,

Lars

Smokey · November 25, 2009, 3:21am

My understanding is that, in your case, stream1 will get a turn before stream0 finishes.

Though it’s not documented, I always assumed (eep) that each stream is essentially a FIFO list of ‘jobs’: the stream pulls the ‘next’ job from its list, sends it to the GPU scheduler, waits until that job is complete, and then repeats with the next job (until the stream’s FIFO is empty).

As for how jobs are scheduled (when you have multiple streams sending jobs to a single GPU) - I’m assuming that’s up to the driver/hardware scheduler, and less to do with streams themselves.

lars · November 25, 2009, 11:24pm

Thanks for the reply. I hope that if you have one stream with lots of queued-up kernels and launch a single async memcpy in another stream, at least this memcpy will get a chance to launch before all the kernels in the other stream finish.

ONeill · November 26, 2009, 9:24am

I have written a small test app where I put commands into 2 streams and measured the time needed to execute them. I compared this to a regular sequential execution of all those commands in stream 0. Somehow the order in which I put commands into the 2 streams makes a significant difference!
If I do not alternate between the streams when issuing commands (e.g. streamA: memcpy, kernel, memcpy; then streamB: memcpy, kernel, memcpy), it takes considerably more time than simple sequential execution!

Here are some results of my test:
2 streams, alternating: 288.23 ms
2 streams, not alternating: 384.07 ms
sequential: 356.04 ms

So I think there IS a difference between feeding one stream completely before the next and alternating between them after each command!
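The test app itself was not posted, so the following is only a rough sketch of what such a comparison might look like. The kernel `scale`, the helper `time_issue_order`, the buffer size, and the use of pinned host memory are all assumptions made for illustration. It issues the same copy-in, kernel, copy-out work to two streams in either alternating or non-alternating order and reports host wall-clock time.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a runtime call fails.
#define CHECK(call)                                                 \
    do {                                                            \
        cudaError_t e_ = (call);                                    \
        if (e_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "CUDA error: %s\n",                \
                         cudaGetErrorString(e_));                   \
            std::exit(1);                                           \
        }                                                           \
    } while (0)

// Hypothetical placeholder workload: doubles each element in place.
__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f;
}

// Issues H2D copy, kernel, D2H copy on two streams; returns wall time in ms.
// alternate == true : A.h2d, B.h2d, A.kernel, B.kernel, A.d2h, B.d2h
// alternate == false: A.h2d, A.kernel, A.d2h, then B.h2d, B.kernel, B.d2h
float time_issue_order(bool alternate) {
    const int n = 1 << 22;  // assumed problem size
    const size_t bytes = n * sizeof(float);
    const int threads = 256, blocks = (n + threads - 1) / threads;
    cudaStream_t s[2];
    float *host[2], *dev[2];
    for (int i = 0; i < 2; ++i) {
        CHECK(cudaStreamCreate(&s[i]));
        CHECK(cudaMallocHost(&host[i], bytes));  // pinned, so copies can be async
        CHECK(cudaMalloc(&dev[i], bytes));
        for (int j = 0; j < n; ++j) host[i][j] = 1.0f;
    }
    CHECK(cudaDeviceSynchronize());
    auto t0 = std::chrono::steady_clock::now();
    if (alternate) {
        for (int i = 0; i < 2; ++i)
            CHECK(cudaMemcpyAsync(dev[i], host[i], bytes, cudaMemcpyHostToDevice, s[i]));
        for (int i = 0; i < 2; ++i)
            scale<<<blocks, threads, 0, s[i]>>>(dev[i], n);
        for (int i = 0; i < 2; ++i)
            CHECK(cudaMemcpyAsync(host[i], dev[i], bytes, cudaMemcpyDeviceToHost, s[i]));
    } else {
        for (int i = 0; i < 2; ++i) {
            CHECK(cudaMemcpyAsync(dev[i], host[i], bytes, cudaMemcpyHostToDevice, s[i]));
            scale<<<blocks, threads, 0, s[i]>>>(dev[i], n);
            CHECK(cudaMemcpyAsync(host[i], dev[i], bytes, cudaMemcpyDeviceToHost, s[i]));
        }
    }
    CHECK(cudaDeviceSynchronize());  // wait for both streams to drain
    auto t1 = std::chrono::steady_clock::now();
    for (int i = 0; i < 2; ++i) {
        CHECK(cudaFreeHost(host[i]));
        CHECK(cudaFree(dev[i]));
        CHECK(cudaStreamDestroy(s[i]));
    }
    return std::chrono::duration<float, std::milli>(t1 - t0).count();
}

int main() {
    std::printf("alternating:     %.2f ms\n", time_issue_order(true));
    std::printf("not alternating: %.2f ms\n", time_issue_order(false));
    return 0;
}
```

The timings will depend heavily on the GPU and driver, so this sketch is not expected to reproduce the numbers above.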

jack · November 26, 2009, 4:23pm

The alternating version is faster because on some hardware (I can’t remember what the requirements are, but you can look them up on the forum), the driver can simultaneously do a memcpy + kernel execution, as long as they are from different streams.

Now, as for why the non-alternating version is slower than the non-streamed, sequential version…I can’t answer that, but a wild guess would be that there is some overhead where the driver is trying to figure out which kernel to run first, or there is some optimization where the driver determines that there is no streaming (other than the default stream 0) and can execute things a bit faster.
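Whether a given device can overlap copies with kernel execution can be checked at runtime. The sketch below queries `cudaGetDeviceProperties` and reads the `asyncEngineCount` and `concurrentKernels` fields. These are fields of more recent CUDA runtime versions than the one current when this thread was written, and the helper name `copy_engines` is made up for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: returns the number of asynchronous copy engines on the
// given device, or -1 if the properties cannot be queried.
int copy_engines(int device) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return -1;
    std::printf("%s (compute capability %d.%d)\n", prop.name, prop.major, prop.minor);
    // 0: no copy/kernel overlap; 1: one copy can overlap with kernel execution;
    // 2: an upload and a download can both overlap with kernel execution.
    std::printf("asyncEngineCount:  %d\n", prop.asyncEngineCount);
    // Non-zero if kernels from different streams may run concurrently.
    std::printf("concurrentKernels: %d\n", prop.concurrentKernels);
    return prop.asyncEngineCount;
}

int main() {
    return copy_engines(0) < 0 ? 1 : 0;
}
```

Note that overlap also requires the host buffers to be page-locked (for example allocated with `cudaMallocHost`); otherwise `cudaMemcpyAsync` may not run asynchronously.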

lars · November 27, 2009, 12:00pm

From the limited tests I’ve just done, this doesn’t seem to be the case…

If I launch something like this (S=stream):

S0: memcpyAsyncH2D
S0: kernel1
S0: kernel2
…
S0: kernelN
S1: memcpyAsyncH2D
S1: kernel1
S1: kernel2
…
S1: kernelN
S0: memcpyAsyncD2H

In this case, the last D2H memcpy in S0 seems to wait for all or most of the operations in S1 to finish before launching. In an ideal world it would overlap with the kernel calls in S1, and possibly finish long before stream S1 is done. Any ideas why this doesn’t seem to happen (other than that we don’t live in an ideal world)?

/L

Smokey · November 29, 2009, 10:13pm

Either the current implementation of streams in CUDA is flawed, or your test case is flawed.

But all things considered, it’s pretty trivial to test: after queueing all the stream commands, all you should be doing is looping and testing the status of each stream to determine which one finishes first. Alternatively, if you want more detail, use events to determine which parts of which streams are reached first.

I’m assuming it’s just poor implementation in CUDA.
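The polling test described above could be sketched roughly as follows. It queues work in the S0/S1 order from the earlier post, then loops on `cudaStreamQuery` and records when each stream first reports completion. The kernel `spin`, the helper `queue_and_poll`, and all sizes and iteration counts are assumptions chosen for illustration, not anyone's actual test case.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical busy-work kernel so each launch takes measurable time.
__global__ void spin(float *data, int n, int iters) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        for (int k = 0; k < iters; ++k) v = v * 1.0001f + 0.5f;
        data[i] = v;
    }
}

// Queues S0: H2D, N kernels; S1: H2D, N kernels; S0: D2H, then polls both
// streams. doneMs[i] receives the host time (ms) at which stream i was first
// seen idle. Returns false on a CUDA error. Error checks on setup are omitted
// for brevity.
bool queue_and_poll(double doneMs[2]) {
    const int n = 1 << 20, kernelsPerStream = 20, iters = 1000;
    const size_t bytes = n * sizeof(float);
    const int threads = 256, blocks = (n + threads - 1) / threads;
    cudaStream_t s[2];
    float *host[2], *dev[2];
    for (int i = 0; i < 2; ++i) {
        cudaStreamCreate(&s[i]);
        cudaMallocHost(&host[i], bytes);  // pinned host memory
        cudaMalloc(&dev[i], bytes);
        for (int j = 0; j < n; ++j) host[i][j] = 1.0f;
    }
    for (int i = 0; i < 2; ++i) {
        cudaMemcpyAsync(dev[i], host[i], bytes, cudaMemcpyHostToDevice, s[i]);
        for (int k = 0; k < kernelsPerStream; ++k)
            spin<<<blocks, threads, 0, s[i]>>>(dev[i], n, iters);
    }
    // Final download in S0, issued after all of S1's work.
    cudaMemcpyAsync(host[0], dev[0], bytes, cudaMemcpyDeviceToHost, s[0]);

    auto t0 = std::chrono::steady_clock::now();
    doneMs[0] = doneMs[1] = -1.0;
    bool ok = true;
    while (ok && (doneMs[0] < 0 || doneMs[1] < 0)) {
        for (int i = 0; i < 2; ++i) {
            if (doneMs[i] >= 0) continue;
            cudaError_t st = cudaStreamQuery(s[i]);
            if (st == cudaSuccess) {
                auto now = std::chrono::steady_clock::now();
                doneMs[i] = std::chrono::duration<double, std::milli>(now - t0).count();
            } else if (st != cudaErrorNotReady) {
                std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(st));
                ok = false;
            }
        }
    }
    for (int i = 0; i < 2; ++i) {
        cudaFreeHost(host[i]);
        cudaFree(dev[i]);
        cudaStreamDestroy(s[i]);
    }
    return ok;
}

int main() {
    double doneMs[2];
    if (!queue_and_poll(doneMs)) return 1;
    std::printf("S0 idle after %.2f ms, S1 idle after %.2f ms\n", doneMs[0], doneMs[1]);
    return 0;
}
```

If S0 only becomes idle at about the same time as S1, that would be consistent with the D2H copy waiting behind S1's kernels. For finer detail, events recorded with `cudaEventRecord` at chosen points in each stream can be polled with `cudaEventQuery` in the same way.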
