How are asynchronous jobs in different streams scheduled? Let’s say I have 100 kernel/cudaMemcpyAsync jobs posted and ready to launch in “stream0”. If I post a single job in “stream1”, does that job have to wait for all jobs in stream0 to finish, or are jobs from different streams launched in some kind of round-robin fashion (assuming there are jobs ready to launch in both streams)?
My understanding is that, in your case, stream1 will get a turn before stream0 finishes.
Though it’s not documented, I always assumed (eep) that streams were essentially a FIFO list of ‘jobs’: each stream pulls the ‘next’ job from its list, sends it to the GPU scheduler, waits until that job is complete, and then repeats with the next job (until the stream’s FIFO is empty).
As for how jobs are scheduled (when you have multiple streams sending jobs to a single GPU) - I’m assuming that’s up to the driver/hardware scheduler, and less to do with streams themselves.
Thanks for the reply. I hope that if you have one stream with lots of queued-up kernels and launch a single async memcpy in another stream, at least this memcpy will get a chance to launch before all the kernels in the other stream finish.
I have written a small test app where I put commands into 2 streams and measured the time needed to execute them. I compared this to a regular sequential execution of all those commands in stream 0. Somehow the order in which I put commands into the 2 streams makes a significant difference!
If I do not alternate the streams I give commands to (e.g. streamA: memcpy, kernel, memcpy; then streamB: memcpy, kernel, memcpy), it takes way more time than a simple sequential execution!
Here are some results of my test:
2 streams, alternating: 288.23 ms
2 streams, not alternating: 384.07 ms
sequential: 356.04 ms
So I think there IS a difference between completely feeding one stream after another and alternating between them after each command!
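For anyone who wants to try this, here is a minimal sketch of the two issue orders described above. It is not the original test app: the kernel `busyKernel`, the helper names `enqueue` and `timeTwoStreams`, and the buffer size are all made up for illustration, and the timings will depend on the GPU and driver.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel: some arithmetic per element to keep the GPU busy.
__global__ void busyKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        for (int k = 0; k < 500; ++k) v = v * 1.0001f + 0.5f;
        data[i] = v;
    }
}

// Enqueue one command (0: H2D copy, 1: kernel, 2: D2H copy) on stream s.
static void enqueue(int cmd, float *h, float *d, int n, cudaStream_t s) {
    size_t bytes = (size_t)n * sizeof(float);
    if (cmd == 0) cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, s);
    if (cmd == 1) busyKernel<<<(n + 255) / 256, 256, 0, s>>>(d, n);
    if (cmd == 2) cudaMemcpyAsync(h, d, bytes, cudaMemcpyDeviceToHost, s);
}

// Times memcpy, kernel, memcpy on two streams; returns ms, or -1 on failure.
float timeTwoStreams(bool alternate, int n) {
    int devices = 0;
    if (cudaGetDeviceCount(&devices) != cudaSuccess || devices == 0) return -1.0f;

    cudaStream_t s[2];
    float *h[2] = {nullptr, nullptr}, *d[2] = {nullptr, nullptr};
    size_t bytes = (size_t)n * sizeof(float);
    bool ok = true;
    for (int i = 0; i < 2; ++i) {
        ok = ok && cudaStreamCreate(&s[i]) == cudaSuccess;
        ok = ok && cudaMallocHost((void **)&h[i], bytes) == cudaSuccess;  // pinned host memory
        ok = ok && cudaMalloc((void **)&d[i], bytes) == cudaSuccess;
        for (int j = 0; ok && j < n; ++j) h[i][j] = 1.0f;
    }
    if (!ok) return -1.0f;  // sketch only: resources are not released on this path

    cudaDeviceSynchronize();
    auto t0 = std::chrono::steady_clock::now();
    if (alternate) {
        // streamA memcpy, streamB memcpy, streamA kernel, streamB kernel, ...
        for (int cmd = 0; cmd < 3; ++cmd)
            for (int i = 0; i < 2; ++i) enqueue(cmd, h[i], d[i], n, s[i]);
    } else {
        // everything for streamA first, then everything for streamB
        for (int i = 0; i < 2; ++i)
            for (int cmd = 0; cmd < 3; ++cmd) enqueue(cmd, h[i], d[i], n, s[i]);
    }
    ok = cudaDeviceSynchronize() == cudaSuccess;
    auto t1 = std::chrono::steady_clock::now();

    for (int i = 0; i < 2; ++i) {
        cudaFree(d[i]);
        cudaFreeHost(h[i]);
        cudaStreamDestroy(s[i]);
    }
    if (!ok) return -1.0f;
    return std::chrono::duration<float, std::milli>(t1 - t0).count();
}

int main() {
    const int n = 1 << 22;
    float warm = timeTwoStreams(true, n);  // warm-up run, result discarded
    if (warm < 0.0f) { printf("no usable CUDA device\n"); return 0; }
    printf("alternating:     %.2f ms\n", timeTwoStreams(true, n));
    printf("not alternating: %.2f ms\n", timeTwoStreams(false, n));
    return 0;
}
```

The host buffers are allocated with cudaMallocHost because cudaMemcpyAsync can only overlap with other work when the host memory is page-locked. Timing is done on the host around cudaDeviceSynchronize so it does not depend on how events interact with the default stream.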
The alternating version is faster because on some hardware (I can’t remember what the requirements are, but you can look them up on the forum), the driver can simultaneously do a memcpy + kernel execution, as long as they are from different streams.
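One way to check whether a given card can overlap copies with kernel execution is to query its async engine count. This is only a sketch: the helper name `asyncEngineCount` is made up, and it assumes a runtime recent enough to expose `cudaDevAttrAsyncEngineCount`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Returns the number of asynchronous copy engines on `device`,
// or -1 if the query fails (for example, no CUDA device is present).
int asyncEngineCount(int device) {
    int count = 0;
    cudaError_t err = cudaDeviceGetAttribute(&count, cudaDevAttrAsyncEngineCount, device);
    return (err == cudaSuccess) ? count : -1;
}

int main() {
    int n = asyncEngineCount(0);
    if (n < 0) {
        printf("could not query device 0\n");
    } else if (n == 0) {
        printf("device 0 cannot overlap memcpy with kernel execution\n");
    } else {
        // 1: copy in one direction can overlap a kernel; 2: both directions at once.
        printf("device 0 has %d async copy engine(s)\n", n);
    }
    return 0;
}
```

Even on a device that reports one or more engines, the copy only overlaps if it uses cudaMemcpyAsync in a non-default stream with page-locked host memory.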
Now, as for why the non-alternating version is slower than the non-streamed, sequential version…I can’t answer that, but a wild guess would be that there is some overhead where the driver is trying to figure out which kernel to run first, or there is some optimization where the driver determines that there is no streaming (other than the default stream 0) and can execute things a bit faster.
From the limited tests I’ve just done, this doesn’t seem to be the case…
If I launch something like this (S=stream)
In this case, the last D2H memcpy in S0 seems to wait for all or most of the operations in S1 to finish before launching, although in an ideal world, it would be overlapped with the kernel calls in S1 and possibly finished long before stream S1 would be done. Any ideas why this doesn’t seem to happen? (except that we don’t live in an ideal world).
Either the current implementation of streams in CUDA is flawed, or your test case is flawed.
But all things considered, it’s pretty trivial to test: after queueing all the stream commands, just loop, testing the status of each stream to determine which one finishes first. Alternatively, if you want more detail, use events to determine which parts of which streams are reached first.
I’m assuming it’s just poor implementation in CUDA.
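A rough sketch of the polling approach, set up like the original question (100 kernels in one stream, a single async memcpy in another). The names `busyKernel` and `firstToFinish` are made up for this example, and the answer it prints will depend on the hardware and driver.

```cuda
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel that just burns some time per element.
__global__ void busyKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        for (int k = 0; k < 500; ++k) v = v * 1.0001f + 0.5f;
        data[i] = v;
    }
}

// Polls both streams until one has drained. Returns 0 if `a` was seen idle
// first (ties also report 0), 1 if `b` was, or -1 if a query reports an error.
int firstToFinish(cudaStream_t a, cudaStream_t b) {
    for (;;) {
        cudaError_t qa = cudaStreamQuery(a);
        cudaError_t qb = cudaStreamQuery(b);
        if (qa != cudaSuccess && qa != cudaErrorNotReady) return -1;
        if (qb != cudaSuccess && qb != cudaErrorNotReady) return -1;
        if (qa == cudaSuccess) return 0;
        if (qb == cudaSuccess) return 1;
    }
}

int main() {
    int devices = 0;
    if (cudaGetDeviceCount(&devices) != cudaSuccess || devices == 0) {
        printf("no usable CUDA device\n");
        return 0;
    }
    const int n = 1 << 20;
    const size_t bytes = (size_t)n * sizeof(float);
    cudaStream_t s0, s1;
    float *dWork = nullptr, *dCopy = nullptr, *hCopy = nullptr;
    bool ok = cudaStreamCreate(&s0) == cudaSuccess &&
              cudaStreamCreate(&s1) == cudaSuccess &&
              cudaMalloc((void **)&dWork, bytes) == cudaSuccess &&
              cudaMalloc((void **)&dCopy, bytes) == cudaSuccess &&
              cudaMallocHost((void **)&hCopy, bytes) == cudaSuccess;  // pinned host memory
    if (!ok) { printf("setup failed\n"); return 1; }
    cudaMemset(dWork, 0, bytes);
    memset(hCopy, 0, bytes);

    // stream0: 100 queued kernels; stream1: a single async memcpy posted afterwards.
    for (int k = 0; k < 100; ++k)
        busyKernel<<<(n + 255) / 256, 256, 0, s0>>>(dWork, n);
    cudaMemcpyAsync(dCopy, hCopy, bytes, cudaMemcpyHostToDevice, s1);

    int first = firstToFinish(s0, s1);
    if (first < 0) printf("a stream reported an error\n");
    else printf("stream%d finished first\n", first);

    cudaDeviceSynchronize();
    cudaFree(dWork);
    cudaFree(dCopy);
    cudaFreeHost(hCopy);
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    return 0;
}
```

If stream1 is reported first, the memcpy did not have to wait for the whole kernel queue in stream0. For finer detail, the same idea works with cudaEventRecord after individual commands and cudaEventQuery in the polling loop.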