Stream Ordering

Does Stream 0 (the default stream) offer any guarantee of execution ordering over other streams, which I understand can run in any order?

I’m looking at doing the following pseudo code:

for(i = 0; i < 1024; i++) {
    kern1<<<Stream 0>>>
    for(j = 1; j < 9; j++){
         kern2<<<Stream j>>>
         kern3<<<Stream j>>>
         kern4<<<Stream j>>>
    }
}

Kernels 2, 3 and 4 all depend on kernel 1 having completed prior to running.

Is this safe?

Yes, it is safe. If we ignore the possibility of modifying default-stream behavior, then with respect to a particular device, activity issued into the default stream will prevent any activity subsequently issued into other streams from beginning until the default-stream activity is complete. Furthermore, items issued into the default stream will not begin until all previously issued activity on that device is complete.

All streams (and events) have an implicit device association, which means that the above statements basically do not apply to work issued to separate devices.
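For illustration, here is a minimal sketch of that device association. It assumes a machine with two visible devices and omits error checking; the point is that a launch into device 0's default stream imposes no ordering on work issued to a stream created on device 1:

__global__ void work(){ }

int main(){
    cudaStream_t s1;
    cudaSetDevice(0);
    work<<<1,1>>>();            // legacy default stream, but only with respect to device 0
    cudaSetDevice(1);           // assumes a second device is present
    cudaStreamCreate(&s1);      // s1 is implicitly associated with device 1
    work<<<1,1,0,s1>>>();       // not ordered against the device-0 default-stream launch
    cudaSetDevice(0);
    cudaDeviceSynchronize();
    cudaSetDevice(1);
    cudaDeviceSynchronize();
    cudaStreamDestroy(s1);
}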

3 rules for stream semantics:

  1. Items issued into the same stream will execute in issue order.
  2. Items issued into separate, non-default streams have no ordering prescribed by CUDA.
  3. Items issued into a legacy default stream will not begin until all previously issued activity to that device is complete, and will prevent any subsequently issued activity from beginning, until that item is complete.

Again, all of the above statements are with respect to a particular device.

For clarity, to give an example based on your pseudo code, the following should run as expected (producing no "oops" output) when compiled and run this way:

$ cat t51.cu
#include <cstdio>

__device__ int d = 0;

// kern1 plays the role of kernel 1: it runs in the legacy default stream in main()
__global__ void kern1(){
        d++;
}

// kern2 stands in for kernels 2, 3 and 4: it checks that kern1 has already run i+1 times
__global__ void kern2(int i){
        if (d != i+1) printf("oops: d: %d, i: %d\n", d, i);
}

int main(){
    cudaStream_t str;
    cudaStreamCreate(&str);
    for(int i = 0; i < 1024; i++) {
      kern1<<<1,1>>>();           // default stream: waits for all previously issued work
      for(int j = 0; j < 9; j++){
         kern2<<<1,1,0,str>>>(i); // created stream: still cannot begin before kern1 completes
         kern2<<<1,1,0,str>>>(i);
         kern2<<<1,1,0,str>>>(i);
      }
    }
    cudaDeviceSynchronize();
}
$ nvcc -o t51 t51.cu
$ ./t51
$
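As a side note, not something the example above relies on: if you could not lean on legacy default-stream semantics (for instance when compiling with --default-stream per-thread), the same kern1 → kern2 dependency could be expressed explicitly with events. This is just a sketch; the stream array and event names are my own, and kern1/kern2 are the same kernels as in t51.cu:

#include <cstdio>

__device__ int d = 0;
__global__ void kern1(){ d++; }
__global__ void kern2(int i){ if (d != i+1) printf("oops: d: %d, i: %d\n", d, i); }

int main(){
    const int ns = 8;
    cudaStream_t s0, s[ns];
    cudaEvent_t kern1_done, kern2_done[ns];
    cudaStreamCreate(&s0);
    cudaEventCreateWithFlags(&kern1_done, cudaEventDisableTiming);
    for (int j = 0; j < ns; j++){
        cudaStreamCreate(&s[j]);
        cudaEventCreateWithFlags(&kern2_done[j], cudaEventDisableTiming);
    }
    for (int i = 0; i < 1024; i++){
        // kern1 of this iteration must wait for the previous iteration's kern2 launches
        // (waiting on a never-recorded event is a no-op, so iteration 0 is fine)
        for (int j = 0; j < ns; j++) cudaStreamWaitEvent(s0, kern2_done[j], 0);
        kern1<<<1,1,0,s0>>>();
        cudaEventRecord(kern1_done, s0);
        for (int j = 0; j < ns; j++){
            cudaStreamWaitEvent(s[j], kern1_done, 0);  // each stream waits for kern1
            kern2<<<1,1,0,s[j]>>>(i);
            cudaEventRecord(kern2_done[j], s[j]);
        }
    }
    cudaDeviceSynchronize();
}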

Thanks Robert for the comprehensive reply. One minor observation.

In your worked example, you started the inner loop at zero instead of one. Was this intentional? The only issue I'd see with starting the first round in stream 0 would be a lack of overlap with the following one.

Later: Oops, I hadn’t read it properly. Does cudaStreamCreate start at stream 1?

You can start your inner loop at any value you want. It has no connection to the streams usage. (Try it!)

A stream is a stream. It doesn't start at "stream 1" or anything else. Every created stream is simply logically separate from every other created stream. There is no ordering among streams, nor are they counted or given ordinal values, from the programmer's perspective. The stream variable is an opaque handle.
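To make that concrete, here is a small illustration (the array, its indices, and someKernel are just my own bookkeeping for this sketch, not anything CUDA assigns or requires):

__global__ void someKernel(){ }

int main(){
    cudaStream_t streams[4];
    for (int j = 0; j < 4; j++)
        cudaStreamCreate(&streams[j]);       // each handle is opaque; CUDA attaches no ordinal to it

    // the loop variable could start at 0, 1, or 37; it only selects an array element
    for (int j = 0; j < 4; j++)
        someKernel<<<1,1,0,streams[j]>>>();

    cudaDeviceSynchronize();
    for (int j = 0; j < 4; j++)
        cudaStreamDestroy(streams[j]);
}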

Yes, if you look in the profiler, you will see the streams numbered. But that has essentially no direct connection to the source code and is not a programmer-visible concept. And generally, in my experience, neither created streams nor the default stream is given the ordinal 1 when viewed in the profiler.

Thanks. I'm new to all this and was looking at the code sample here: Programming Guide :: CUDA Toolkit Documentation, and thought they had to be specified.

Ah, I get it. You observed that the default stream is sometimes called the null stream or stream 0, so it is logical to conclude that there is some kind of numbering. However, there isn't. "Null stream" or "stream 0" is used to distinguish the default stream from created streams, but it has nothing to do with ordering (unless you feel that the 3 stream-semantics rules I stated communicate some form of "ordering"). There isn't really a counting or numbering of streams, except as viewed from the profiler (and except for "stream 0", if you happen to call the default stream that).
