Streams not running concurrently

Hi everyone,
I was wondering if anyone had any idea why my streams are not running concurrently.
Here is the context:

BLOCKSIZE: 1
GRIDSIZE: 8

FunctionA(streamA){
Kernel1<BLOCK SIZE ,GRID SIZE, .., streamA>(...)
Kernel2<BLOCK SIZE ,GRID SIZE, .., streamA>(...)
Kernel3<BLOCK SIZE ,GRID SIZE, .., streamA>(...)
}

FunctionB(streamB){
Kernel2<BLOCK SIZE ,GRID SIZE,.., streamB>(...)
Kernel3<BLOCK SIZE ,GRID SIZE, ..., streamB>(...)
}

run(){
    //create streams
    streams = new cudaStream_t[32];
    for(int i = 0; i < num_streams; i++) {
      cudaStreamCreate(&streams[i]);
        }

    num_iterations = 150

    for(int i = 0: i<num_iterations: i++){
        functionB( 0 );
        for(int j = 0: j<i : j++){
             int stream_id = (j % (num_streams - 1)) + 1; // Skip stream 0 
             functionA(stream_id);

           }

    }
    for (int s = 1; s < num_streams; s++) {
        cudaStreamSynchronize(streams[s]);
        
    }

Each functionA call in the nested for loop should run in parallel with the others. Within functionA and functionB, however, the 2 or 3 kernels run sequentially. So each call is a group of sequential kernels, and we need those groups to run in parallel with each other.
Now, I would expect these to run in parallel streams, but they all run in sequential order, just in different streams. Why is that? Nothing I do seems to work. I modified the grid size and block size for each kernel, but nothing changed. Hopefully someone can help.
Thank you.

Is there any special reason why you’re calling FunctionB() with the default stream? As far as I understand it, workloads being processed in the default (0) stream can’t run concurrently with workloads in other streams.

Meaning: Whenever a workload within the default stream is executed, any other workloads will be stalled until the work in the default stream has finished. So, the way I see it, you’ll have to call FunctionB() with a non-default stream, too, if you want to achieve “reliable” concurrency.

In addition, kernels launched within the same stream (e.g. Kernel2 and Kernel3 in FunctionB()) will always run sequentially. But this should be obvious, as it’s true for non-default streams as well as for the default stream.
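To illustrate the default-stream behaviour described above, here is a minimal sketch. It assumes the program is compiled with the legacy default stream (nvcc's default, i.e. without `--default-stream per-thread`). The `spin` kernel, the stream names, and the cycle count are hypothetical and not taken from the original post. The overlap itself is best checked in a profiler timeline rather than from the program output.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical busy-wait kernel, long enough to be visible in a profiler timeline.
__global__ void spin(long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

int main() {
    const long long cycles = 100000000LL;  // arbitrary duration, chosen for illustration

    cudaStream_t blockingStream, nonBlockingStream;
    // Ordinary created stream: synchronizes implicitly with the legacy default stream.
    cudaStreamCreate(&blockingStream);
    // Non-blocking stream: exempt from implicit synchronization with the default stream.
    cudaStreamCreateWithFlags(&nonBlockingStream, cudaStreamNonBlocking);

    spin<<<1, 1, 0, blockingStream>>>(cycles);
    // Legacy default stream: waits for the launch above to finish...
    spin<<<1, 1>>>(cycles);
    // ...and this launch in the blocking stream waits for the default-stream kernel.
    spin<<<1, 1, 0, blockingStream>>>(cycles);
    // This launch is free to overlap with any of the kernels above.
    spin<<<1, 1, 0, nonBlockingStream>>>(cycles);

    cudaError_t err = cudaDeviceSynchronize();
    printf("status: %s\n", cudaGetErrorString(err));

    cudaStreamDestroy(blockingStream);
    cudaStreamDestroy(nonBlockingStream);
    return err == cudaSuccess ? 0 : 1;
}
```

If FunctionB() really does have to interleave with the other streams, putting it on its own created stream avoids this implicit serialization.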

By my read of the code, that is not the default stream. It is the stream that was created that corresponds to index 0 in the array where created stream handles are stored.

That may be what was meant. But by the looks of the code, at least FunctionB() clearly launches its kernels using the stream 0. Or am I mistaken here?

[EDIT:]
Okay, you can’t really tell. With respect to how the functions and their calls work together, this seems to be more pseudocode than actually executable code.

Sorry, you are correct. I misread the code. That is really how the indicated code for Function B would work (although the code overall doesn’t appear to be syntactically sensible.) If we take that literal interpretation, then the loop that calls FunctionA is doing illegal things.

FunctionA has a similar definition to FunctionB, in terms of its stream usage. If we take the direct interpretation, then the loop is calling FunctionA with streams of 1, 2, 3, etc.

That is illegal. You must actually use a stream handle when invoking a kernel, not an integer. The NULL stream is a special case, where the literal 0 will select the NULL stream.
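As a rough sketch of what passing real stream handles could look like, the following reworks the structure from the question. The kernel bodies, the per-stream buffers, the stream count of 4, and the iteration count are all placeholders, since the thread does not show them. Each stream gets its own buffer here so that calls in different streams do not race on shared data.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int kGridSize   = 8;  // blocks per launch, as in the question
constexpr int kBlockSize  = 1;  // threads per block, as in the question
constexpr int kNumStreams = 4;  // hypothetical; the question allocates 32
constexpr int kN = kGridSize * kBlockSize;

// Stand-in kernels: the real Kernel1..Kernel3 bodies are not shown in the thread.
__global__ void Kernel1(float *d) { d[blockIdx.x * blockDim.x + threadIdx.x] += 1.0f; }
__global__ void Kernel2(float *d) { d[blockIdx.x * blockDim.x + threadIdx.x] *= 2.0f; }
__global__ void Kernel3(float *d) { d[blockIdx.x * blockDim.x + threadIdx.x] -= 1.0f; }

// Takes a cudaStream_t handle, not an integer index.
// Launch configuration order is <<<grid, block, sharedMemBytes, stream>>>.
void functionA(cudaStream_t stream, float *d) {
    Kernel1<<<kGridSize, kBlockSize, 0, stream>>>(d);
    Kernel2<<<kGridSize, kBlockSize, 0, stream>>>(d);
    Kernel3<<<kGridSize, kBlockSize, 0, stream>>>(d);
}

int main() {
    cudaStream_t streams[kNumStreams];
    float *buffers[kNumStreams];
    for (int i = 0; i < kNumStreams; i++) {
        cudaStreamCreate(&streams[i]);
        cudaMalloc(&buffers[i], kN * sizeof(float));
        cudaMemsetAsync(buffers[i], 0, kN * sizeof(float), streams[i]);
    }

    // Index into the handle array instead of passing the integer itself.
    for (int j = 0; j < 16; j++) {
        int id = j % kNumStreams;
        functionA(streams[id], buffers[id]);
    }

    // Wait for all streams and report any launch or execution error.
    cudaError_t err = cudaDeviceSynchronize();
    printf("status: %s\n", cudaGetErrorString(err));

    for (int i = 0; i < kNumStreams; i++) {
        cudaFree(buffers[i]);
        cudaStreamDestroy(streams[i]);
    }
    return err == cudaSuccess ? 0 : 1;
}
```

Even with correct handles, kernels this small (8 blocks of 1 thread) may finish so quickly that launch overhead dominates, so a profiler timeline can still look nearly sequential.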
