Computation and PCIe tranfers overlaping with callbacks and events.

Hello

We have a toy example code where we try to reproduce some runtime behaviors we observed in our aplication.

We understand the following:

1 When we add a callback in a stream, anything after it will wait until the callback finishes.
2 If after the callback we add an event, and we force another stream to wait on that event, any thing after the wait event on that second stream, will wait until the callback is finished.

But we found some situations, that we don’t know how to explain, or how to interpret wat can we expect from the CUDA runtime or what are the reasons for that behavior.

Here is the code:

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <iostream>
#include <cassert>

#define BUFFER_SIZE 20000000

__global__ void kernelDummy1() {
    int b;
    for (int j = 0; j < 100; j++)
        for (int i = 0; i < 100; i++)
            b = i * 2;
}

__global__ void kernelDummy2() {
    int b;
    for (int j = 0; j < 100; j++)
        for (int i = 0; i < 100; i++)
            b = i * 2;
}

__global__ void kernelDummy3() {
    int b;
    for (int j = 0; j < 100; j++)
        for (int i = 0; i < 100; i++)
            b = i * 2;
}

void CUDART_CB MyCallback (cudaStream_t stream, cudaError_t status, void *userData){
    _sleep(0);
}

int main() {
    cudaStream_t streamA, streamB, streamC, streamD;
    cudaError_t err;
    
    int* buffer = 0;

    err = cudaMallocHost(&buffer, sizeof(int) * BUFFER_SIZE, 0);
    assert(err == cudaSuccess);

    for (int i = 0; i < BUFFER_SIZE; i++)
        buffer[i] = i;

    int* d_bufferA;
    int* d_bufferB;

    err = cudaMalloc(&d_bufferA, BUFFER_SIZE * sizeof(int));
    assert(err == cudaSuccess);

    err = cudaMalloc(&d_bufferB, BUFFER_SIZE * sizeof(int));
    assert(err == cudaSuccess);


    cudaEvent_t eventA;
    err = cudaEventCreate(&eventA, 0);
    assert(err == cudaSuccess);

    err = cudaStreamCreate(&streamA);
    assert(err == cudaSuccess);

    err = cudaStreamCreate(&streamB); 
    assert(err == cudaSuccess);

    err = cudaStreamCreate(&streamC);
    assert(err == cudaSuccess);
    
    err = cudaStreamCreate(&streamD);
    assert(err == cudaSuccess);
    
    //cudaDeviceSynchronize();
    for (int i = 0; i < 10; i++) {
        //comment this, we have different behaviour, and worse!!!
        kernelDummy1<<<32, 1, 0, streamA>>>();   
        for (int j = 0; j < 10; j++)
            cudaMemcpyAsync(d_bufferA, buffer, BUFFER_SIZE * sizeof(int), cudaMemcpyHostToDevice, streamA);
        
        kernelDummy3<<<32, 1, 0, streamC>>>();   
        kernelDummy3<<<32, 1, 0, streamC>>>(); 
        kernelDummy3<<<32, 1, 0, streamC>>>(); 
        kernelDummy3<<<32, 1, 0, streamC>>>(); 
        kernelDummy3<<<32, 1, 0, streamC>>>(); 
        kernelDummy3<<<32, 1, 0, streamC>>>(); 
        cudaStreamAddCallback(streamA, MyCallback, 0, 0);
        
        err = cudaEventRecord(eventA, streamA);
        assert(err == cudaSuccess);
             
        err = cudaStreamWaitEvent(streamB, eventA, 0);
        assert(err == cudaSuccess);
        
        kernelDummy2<<<32, 1, 0, streamB>>>();
//      kernelDummy<<<32, 1, 0, streamC>>>();        
        

        cudaDeviceSynchronize();
    }

    err = cudaStreamDestroy(streamA);
    assert(err == cudaSuccess);

    err = cudaStreamDestroy(streamB);
    assert(err == cudaSuccess);

    err = cudaStreamDestroy(streamC);
    assert(err == cudaSuccess);

    err = cudaStreamDestroy(streamD);
    assert(err == cudaSuccess);

    err = cudaFree(d_bufferA);
    assert(err == cudaSuccess);

    err = cudaFree(d_bufferB);
    assert(err == cudaSuccess);

    err = cudaFreeHost(buffer);
    assert(err == cudaSuccess);

    return 0;
}

In this example, the behavior is almost as the expected.

  • First kernel scheduled on streamA executes before the callback.
  • The execution of kernels in streamC is independent of all other streams, so they execute concurrently with the rest of the code.
  • The kernel scheduled on stream B will be executed after the callcback.

External Media

But, only commenting the first kernel launch on streamA (line 75), makes all the kernels on ALL streams to be executed after all the memory transfers finished. WHY?

Nsight:

External Media

Code:

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <iostream>
#include <cassert>

#define BUFFER_SIZE 20000000

__global__ void kernelDummy1() {
    int b;
    for (int j = 0; j < 100; j++)
        for (int i = 0; i < 100; i++)
            b = i * 2;
}

__global__ void kernelDummy2() {
    int b;
    for (int j = 0; j < 100; j++)
        for (int i = 0; i < 100; i++)
            b = i * 2;
}

__global__ void kernelDummy3() {
    int b;
    for (int j = 0; j < 100; j++)
        for (int i = 0; i < 100; i++)
            b = i * 2;
}

void CUDART_CB MyCallback (cudaStream_t stream, cudaError_t status, void *userData){
    _sleep(0);
}

int main() {
    cudaStream_t streamA, streamB, streamC, streamD;
    cudaError_t err;
    
    int* buffer = 0;

    err = cudaMallocHost(&buffer, sizeof(int) * BUFFER_SIZE, 0);
    assert(err == cudaSuccess);

    for (int i = 0; i < BUFFER_SIZE; i++)
        buffer[i] = i;

    int* d_bufferA;
    int* d_bufferB;

    err = cudaMalloc(&d_bufferA, BUFFER_SIZE * sizeof(int));
    assert(err == cudaSuccess);

    err = cudaMalloc(&d_bufferB, BUFFER_SIZE * sizeof(int));
    assert(err == cudaSuccess);


    cudaEvent_t eventA;
    err = cudaEventCreate(&eventA, 0);
    assert(err == cudaSuccess);

    err = cudaStreamCreate(&streamA);
    assert(err == cudaSuccess);

    err = cudaStreamCreate(&streamB); 
    assert(err == cudaSuccess);

    err = cudaStreamCreate(&streamC);
    assert(err == cudaSuccess);
    
    err = cudaStreamCreate(&streamD);
    assert(err == cudaSuccess);
    
    //cudaDeviceSynchronize();
    for (int i = 0; i < 10; i++) {
        //comment this, we have different behaviour, and worse!!!
        //kernelDummy1<<<32, 1, 0, streamA>>>();   
        for (int j = 0; j < 10; j++)
            cudaMemcpyAsync(d_bufferA, buffer, BUFFER_SIZE * sizeof(int), cudaMemcpyHostToDevice, streamA);
        
        kernelDummy3<<<32, 1, 0, streamC>>>();   
        kernelDummy3<<<32, 1, 0, streamC>>>(); 
        kernelDummy3<<<32, 1, 0, streamC>>>(); 
        kernelDummy3<<<32, 1, 0, streamC>>>(); 
        kernelDummy3<<<32, 1, 0, streamC>>>(); 
        kernelDummy3<<<32, 1, 0, streamC>>>(); 
        cudaStreamAddCallback(streamA, MyCallback, 0, 0);
        
        err = cudaEventRecord(eventA, streamA);
        assert(err == cudaSuccess);
             
        err = cudaStreamWaitEvent(streamB, eventA, 0);
        assert(err == cudaSuccess);
        
        kernelDummy2<<<32, 1, 0, streamB>>>();
//      kernelDummy<<<32, 1, 0, streamC>>>();        
        

        cudaDeviceSynchronize();
    }

    err = cudaStreamDestroy(streamA);
    assert(err == cudaSuccess);

    err = cudaStreamDestroy(streamB);
    assert(err == cudaSuccess);

    err = cudaStreamDestroy(streamC);
    assert(err == cudaSuccess);

    err = cudaStreamDestroy(streamD);
    assert(err == cudaSuccess);

    err = cudaFree(d_bufferA);
    assert(err == cudaSuccess);

    err = cudaFree(d_bufferB);
    assert(err == cudaSuccess);

    err = cudaFreeHost(buffer);
    assert(err == cudaSuccess);

    return 0;
}

Following this discussion, here is another example:

If we remove both the first kernel launch and the sychronization between streams using events (just comenting out the cudaStreamWaitEvent), there is overlaping as expected, with all the kernels.

Without comenting this line, we would expect to see overlaping with the kernels schedulend on streamC but ther isn’t.

WHY?

External Media

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <iostream>
#include <cassert>

#define BUFFER_SIZE 20000000

__global__ void kernelDummy1() {
    int b;
    for (int j = 0; j < 100; j++)
        for (int i = 0; i < 100; i++)
            b = i * 2;
}

__global__ void kernelDummy2() {
    int b;
    for (int j = 0; j < 100; j++)
        for (int i = 0; i < 100; i++)
            b = i * 2;
}

__global__ void kernelDummy3() {
    int b;
    for (int j = 0; j < 100; j++)
        for (int i = 0; i < 100; i++)
            b = i * 2;
}

void CUDART_CB MyCallback (cudaStream_t stream, cudaError_t status, void *userData){
    _sleep(0);
}

int main() {
    cudaStream_t streamA, streamB, streamC, streamD;
    cudaError_t err;
    
    int* buffer = 0;

    err = cudaMallocHost(&buffer, sizeof(int) * BUFFER_SIZE, 0);
    assert(err == cudaSuccess);

    for (int i = 0; i < BUFFER_SIZE; i++)
        buffer[i] = i;

    int* d_bufferA;
    int* d_bufferB;

    err = cudaMalloc(&d_bufferA, BUFFER_SIZE * sizeof(int));
    assert(err == cudaSuccess);

    err = cudaMalloc(&d_bufferB, BUFFER_SIZE * sizeof(int));
    assert(err == cudaSuccess);


    cudaEvent_t eventA;
    err = cudaEventCreate(&eventA, 0);
    assert(err == cudaSuccess);

    err = cudaStreamCreate(&streamA);
    assert(err == cudaSuccess);

    err = cudaStreamCreate(&streamB); 
    assert(err == cudaSuccess);

    err = cudaStreamCreate(&streamC);
    assert(err == cudaSuccess);
    
    err = cudaStreamCreate(&streamD);
    assert(err == cudaSuccess);
    
    //cudaDeviceSynchronize();
    for (int i = 0; i < 10; i++) {
        //comment this, we have different behaviour, and worse!!!
        //kernelDummy1<<<32, 1, 0, streamA>>>();   
        for (int j = 0; j < 10; j++)
            cudaMemcpyAsync(d_bufferA, buffer, BUFFER_SIZE * sizeof(int), cudaMemcpyHostToDevice, streamA);
        
        kernelDummy3<<<32, 1, 0, streamC>>>();   
        kernelDummy3<<<32, 1, 0, streamC>>>(); 
        kernelDummy3<<<32, 1, 0, streamC>>>(); 
        kernelDummy3<<<32, 1, 0, streamC>>>(); 
        kernelDummy3<<<32, 1, 0, streamC>>>(); 
        kernelDummy3<<<32, 1, 0, streamC>>>(); 
        cudaStreamAddCallback(streamA, MyCallback, 0, 0);
        
        err = cudaEventRecord(eventA, streamA);
        assert(err == cudaSuccess);
             
        //err = cudaStreamWaitEvent(streamB, eventA, 0);
        //assert(err == cudaSuccess);
        
        kernelDummy2<<<32, 1, 0, streamB>>>();
//      kernelDummy<<<32, 1, 0, streamC>>>();        
        

        cudaDeviceSynchronize();
    }

    err = cudaStreamDestroy(streamA);
    assert(err == cudaSuccess);

    err = cudaStreamDestroy(streamB);
    assert(err == cudaSuccess);

    err = cudaStreamDestroy(streamC);
    assert(err == cudaSuccess);

    err = cudaStreamDestroy(streamD);
    assert(err == cudaSuccess);

    err = cudaFree(d_bufferA);
    assert(err == cudaSuccess);

    err = cudaFree(d_bufferB);
    assert(err == cudaSuccess);

    err = cudaFreeHost(buffer);
    assert(err == cudaSuccess);

    return 0;
}

What is the platform? Windows/Linux? Which GPU? Which CUDA version? If on windows, is the GPU using TCC driver mode?
Are you compiling in debug mode or release mode?

Windows 8 and 10 with WDDM (no TCC, since we also do OpenGL and need a screen plugged)
Visual Studio 2012
CUDA 7.0
Quadro k4200 and Quadro M4000
Compiling in both Debug and Release gives same effect.

Have you seen this (very relevant) thread?

https://devtalk.nvidia.com/default/topic/822942/why-does-cudastreamaddcallback-serialize-kernel-execution-and-break-concurrency-/

Thankyou cbuchner1

I have seen that post before, and I hadn’t had read all of it before, because the topic is different. Now I read it and I can confirm, the topic is different.

I’m trying to understand the Compute/Transfer overlapping mechanism behavior, specifically when introducing a callback, instead of having several kernels executing at the same time.

The reason I’m not looking into overlapping kernels, is because my kernels (the real ones, not the ones in the example), usually have 100% occupancy, and as I understand, if a kernel uses most of the resources of the GPU, there won’t be kernel concurrency, it would not make sense anyway.

So, returning to the topic of this post, the documentation (in my opinion) is pretty clear about how to overlap transfers and computation.

  • For simplicity, having tasks defined as “transfer_in” + “kernels” + “transfer_out”
  • We are using Kepler and Maxwell Quadro GPU’s.
  • You need to issue first all “transfer_in” a non-default stream, and then “kernels” in another non-default stream.
  • You need to use pinned memory for the transfers.

In our example we only have the “tranfer_in” and the “kernels”, to simplify.

Then, we introduced the callback (line 85). The documentation seem’s also pretty clear about callbacks, but maybe we are missing something.

  • A callback issued on a non-default stream, will execute only after all commands issued on that stream are completed. That is equivalent to an streamSynchronice.
  • A callback will block all the following tasks ONLY on that stream, until the callback is finished. It means that it is NOT ASYNCHRONOUS to the stream where it was issued.
  • The stream will not be active until the callback has finished execution. That is not so clear. Is it just the same thing as in the previous sentence? Or does it mean (for instance) that a thread trying to issue something on that stream will block until the callback has finished?
  • The execution of different callbacks in different streams is random and can be serialized by the runtime. Meaning that some callbacks might have to wait until others have finished before they can be executed, even if they are in different streams. In our example, we only use one callback.

So, the question is:

Why are we observing the following behavior?

kernel1 => streamA
multiple copies => streamA
multiple kernel3 => streamC
callback => streamA
event record => streamA
wait on previous event => streamB
kernel2 => streamB

This sequence works as we expected, kernel3 executions are concurrent with the memory transfers and kernel2 is executed after the callback finishes.

But then:

multiple copies => streamA
multiple kernel3 => streamC
callback => streamA
event record => streamA
wait on previous event => streamB
kernel2 => streamB

With this sequence, the execution of kernel3 in streamC waits until all transfers are finished. Why?

Without the callback, the overlapping works fine again:

multiple copies => streamA
multiple kernel3 => streamC
event record => streamA
wait on previous event => streamB
kernel2 => streamB

That behavior makes us wonder what are we missing from the callback documentation?

I updated the previous post, hoping for it to be clearer.

Still waiting for an answer.

Thanks

The following manual sentence…

Adding device work to any stream does not have the effect of making the stream active until all preceding callbacks have executed. Thus, for example, a callback might use global attached memory even if work has been added to another stream, if it has been properly ordered with an event.

… may be considered as “callback is a sort of kernel launch executed on CPU, so it will wait for all preceding operations in the stream finished, and all subsequent operations will wait for its execution”

Quadro k4200 and Quadro M4000

k4200 is SM 3.0 device, it has only one queue for all kernels, so if callbacks are served as kernel launches, it may have preceded dummy streamC kernels in the queue. are you got the same behavior with maxwells?

another possibility is that after adding a callback cuda runtime tries to expose memory copy results to the streamC