Failing to synchronize cudaMemcpyAsync using a cudaEvent across two streams

Hello!

Here is a quick description of my problem and code:

A: Kernel 1, input: din, output: dout1, dout2 (stream1)

B: Kernel 2, input: dout1, output: res1 (stream1)

C: Copy, input: dout2, output: res2 (stream2)

D: Kernel 3, input: res1 and res2, output: res3 (stream1)

The intended execution order is A → (B/C) → D, with B and C running asynchronously (i.e. potentially concurrently) in their two streams.

Code

// test.cu
#include <cstdio>
#include <cstdlib>

#define cudaErrChk(ans) { cudaAssert((ans), __FILE__, __LINE__); }
inline void cudaAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
    if (code != cudaSuccess)
    {
        fprintf(stderr, "CUDA Error: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) exit(code);
    }
}

__global__ 
void incKernel(int* dout1, int* dout2, const int *din, int N) 
{
      int idx = blockIdx.x * blockDim.x + threadIdx.x;

      if (idx < N) 
      {
          dout1[idx] = din[idx] + 1;
          dout2[idx] = din[idx] + 2;
      }
}  

__global__ 
void incKernel2(int* dout1, const int *din, int N) 
{
      int idx = blockIdx.x * blockDim.x + threadIdx.x;

      if (idx < N) 
      {
          dout1[idx] = din[idx] + 1;
      }
} 

__global__ 
void incKernel3(int* dout1, const int* din1, const int* din2, int N) 
{
      int idx = blockIdx.x * blockDim.x + threadIdx.x;

      if (idx < N) 
      {
          dout1[idx] = 2*din1[idx] + 3*din2[idx];
      }
} 

int main()
{
    int* hostin;
    int* din;
    int* dout1;
    int* dout2;
    
    int* res1;
    int* res2;
    int* res3;
    
    int  datanum = 2000000;

    hostin = new int[datanum];
    
    for(int i=0; i<datanum; ++i) hostin[i] = i;
    
    cudaErrChk(cudaMalloc((void**)&din,   datanum*sizeof(int)));
    cudaErrChk(cudaMalloc((void**)&dout1, datanum*sizeof(int)));
    cudaErrChk(cudaMalloc((void**)&dout2, datanum*sizeof(int)));
    
    cudaErrChk(cudaMalloc((void**)&res1, datanum*sizeof(int)));
    cudaErrChk(cudaMalloc((void**)&res2, datanum*sizeof(int)));
    cudaErrChk(cudaMalloc((void**)&res3, datanum*sizeof(int)));
    
    cudaErrChk(cudaMemcpy(din, hostin, datanum*sizeof(int), cudaMemcpyHostToDevice));
    
    int thnum = 512;
    int bnum = (datanum + thnum - 1)/thnum;
    
    cudaStream_t stream1;
    cudaStream_t stream2;
    
    cudaEvent_t event1;
    cudaEventCreate(&event1);
    
    cudaErrChk(cudaStreamCreate(&stream1));
    cudaErrChk(cudaStreamCreate(&stream2));
    
    incKernel<<<bnum, thnum, 0, stream1>>>(dout1, dout2, din, datanum);                                // A
    
    cudaErrChk(cudaMemcpyAsync(res2, dout2, datanum*sizeof(int), cudaMemcpyDeviceToDevice, stream2));  // C
        
    incKernel2<<<bnum, thnum, 0, stream1>>>(res1, dout1, datanum);                                     // B
    
    cudaEventRecord(event1, stream2);       // intended to mark completion of the copy in stream2
    cudaEventSynchronize(event1);           // intended to make the host wait for that copy
        
    incKernel3<<<bnum, thnum, 0, stream1>>>(res3, res1, res2, datanum);                                // D
    
    int* hostres;
    hostres = new int[datanum];
    
    cudaErrChk(cudaMemcpy(hostres, res3, datanum*sizeof(int), cudaMemcpyDeviceToHost));
    
    for(int i=0; i<datanum; ++i)
    {
        int ana = 2*(i+2)+3*(i+2);
        if(hostres[i] != ana)
            printf("Err Num: %d Data: %d Ana: %d\n", i, hostres[i], ana);
    }
        
        
    delete[] hostin;
    delete[] hostres;
    
    cudaFree(din);
    cudaFree(dout1);
    cudaFree(dout2);
    
    cudaFree(res1);
    cudaFree(res2);
    cudaFree(res3);

}

Built with

nvcc -o test -O0 test.cu

Run with

compute-sanitizer ./test

on Ubuntu 20.04 + CUDA 12.2 + gcc 9.4 + RTX 4070 Ti.

Results

The verification loop at the end reports incorrect results, so my attempt to run incKernel2 and the cudaMemcpyAsync asynchronously (while still synchronizing correctly before incKernel3) has apparently failed.

Would you mind clarifying what cudaEventRecord(event1, stream2) actually checks: whether stream2 is idle, or something else? And if stream2 is idle, does cudaEventSynchronize let the host proceed immediately, or does it busy-wait on stream2?

Thank you!

C and A can execute at the same time; in fact, C can finish before A has even started.

Your attempt to use streams looks broken to me. Streams, and their basic usage and rules, are covered in section 7 of this online training series.

CUDA activities issued into separate created streams have no defined ordering with respect to each other; no such ordering is prescribed by CUDA.

This is one of the two basic rules of stream usage. Since A and C are issued into separate created streams in your code, any of the following orders is possible:

  1. A first, then C
  2. C first, then A
  3. some sort of overlap between C and A.

There is nothing in your code to prevent any of the above 3 possibilities.
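To give the general flavor of what is needed (a sketch only, reusing the names from your code): you could record an event in stream1 immediately after A, and make stream2 wait on that event before the copy is issued, so that C cannot begin until A has completed:

    incKernel<<<bnum, thnum, 0, stream1>>>(dout1, dout2, din, datanum);   // A, in stream1
    cudaEventRecord(event1, stream1);         // marker placed in stream1 right after A
    cudaStreamWaitEvent(stream2, event1, 0);  // stream2 will not proceed past this point until A is done
    cudaErrChk(cudaMemcpyAsync(res2, dout2, datanum*sizeof(int),
                               cudaMemcpyDeviceToDevice, stream2));       // C, in stream2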

Regarding whether cudaEventRecord(event1, stream2) checks that stream2 is idle: yes, roughly speaking. The event will not trigger any activity until stream2's processing reaches the point at which the event was recorded.

Yes, cudaEventSynchronize halts the host thread until the referenced event has been “reached” in the stream processing, i.e. until all operations issued into that stream before the event was recorded have completed.
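As a small illustration of both points (a sketch with placeholder names, not taken from your code), assume kernels workA and workB, a buffer d_buf of n elements, a stream s, and an event ev that have already been created:

    workA<<<grid, block, 0, s>>>(d_buf, n);  // issued into s before the record: covered by ev
    cudaEventRecord(ev, s);                  // ev completes once everything issued into s above it has finished
    workB<<<grid, block, 0, s>>>(d_buf, n);  // issued after the record: NOT covered by ev
    cudaEventSynchronize(ev);                // host blocks here until workA is done; workB may still be running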

Thank you so much, Robert Crovella!

My updated code is:

    cudaStream_t stream1;
    cudaStream_t stream2;
    
    cudaEvent_t event1;
    cudaEvent_t event2;
    cudaEventCreate(&event1);
    cudaEventCreate(&event2);
    
    cudaErrChk(cudaStreamCreate(&stream1));
    cudaErrChk(cudaStreamCreate(&stream2));
    
    incKernel<<<bnum, thnum, 0, stream1>>>(dout1, dout2, din, datanum);   // A
    cudaEventRecord(event1, stream1);       // mark the point in stream1 right after A
    
    cudaStreamWaitEvent(stream2, event1);   // stream2 will not proceed until A has finished
    
    cudaErrChk(cudaMemcpyAsync(res2, dout2, datanum*sizeof(int), cudaMemcpyDeviceToDevice, stream2));   // C
    cudaEventRecord(event2, stream2);       // mark the point in stream2 right after the copy
    
    incKernel2<<<bnum, thnum, 0, stream1>>>(res1, dout1, datanum);        // B
    
    cudaEventSynchronize(event2);           // host waits here until the copy (C) has completed
    incKernel3<<<bnum, thnum, 0, stream1>>>(res3, res1, res2, datanum);   // D

Revision

I added a new event recorded on stream1 and called cudaStreamWaitEvent on stream2 to synchronize the two streams, and the program now works as expected.

May I ask a bit more about the different options for inter-stream synchronization?

As noted in the CUDA API documentation, there are other functions, such as cudaStreamSynchronize, that can achieve stream synchronization similar to cudaEventSynchronize or cudaStreamWaitEvent.

My understanding is that cudaEventSynchronize works at the host level, while the other two functions work at a finer-grained level (and can be asynchronous with respect to the host).

Could you please explain whether cudaStreamSynchronize is preferred over cudaEventSynchronize in terms of low-level execution? I am concerned about the overhead of event-related operations in more complex scenarios.

Thank you!

cudaStreamSynchronize is not asynchronous with respect to the host.

cudaEventSynchronize causes the host thread to wait until the referenced event is reached, as already described.

cudaStreamSynchronize causes the host thread to wait until the referenced stream is idle (all previously issued work in that stream has finished processing on the GPU).

In this respect they are similar: both potentially block the host thread.

cudaStreamWaitEvent, as you have already noted, has somewhat different behavior. It causes a particular stream's processing on the GPU to wait until a referenced event, recorded in another stream, has been “reached” in that other stream's processing; the host thread itself is not blocked.

So recording an event into a stream and then doing cudaEventSynchronize on that event is effectively equivalent to cudaStreamSynchronize on that stream, and speaking for myself I would prefer the cudaStreamSynchronize approach unless I had some other need for the event in question, such as for timing or some other purpose.
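In code, the claimed equivalence looks roughly like this (a sketch, assuming a stream s and an event ev have already been created):

    // Option 1: host waits until everything previously issued into s has finished
    cudaStreamSynchronize(s);

    // Option 2: behaviorally about the same for this purpose, but requires an event
    cudaEventRecord(ev, s);        // marker placed after all work currently issued into s
    cudaEventSynchronize(ev);      // host blocks until that marker is reached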

Regarding this:

I’m not sure what “preferred … in terms of low-level execution” means. I’m not sure what “the overhead of event-related operations in more complex scenarios” means. I would prefer cudaStreamSynchronize just from a general resource principle, unless the usage of events was otherwise needed.

A “general resource principle” means “don’t unnecessarily use resources”. As I have already indicated, for a specific use case, cudaStreamSynchronize and the use of cudaEventSynchronize could be equivalent in terms of their program behavior.

cudaStreamSynchronize achieves this without the need for extra functionality. cudaEventSynchronize achieves this with extra functionality - the creation of an event, which presumably uses resources (to create it, at least). The general principle says to me “don’t incur that cost unless it is needed”.

However I do not have a specific description of what “that cost” is or how to judge its impact. It’s a general principle that works for me, and makes sense to me. Do as you wish of course. Functionally/behaviorally, I would claim there is approximately no difference between cudaStreamSynchronize and launching/recording an event into a stream and then doing cudaEventSynchronize.
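One further note on the cost question: if an event is only needed for synchronization and never for timing, it can be created with timing disabled via cudaEventCreateWithFlags, which is the usual suggestion for keeping event overhead to a minimum. A sketch:

    cudaEvent_t syncEv;
    cudaErrChk(cudaEventCreateWithFlags(&syncEv, cudaEventDisableTiming));  // no timing data is captured
    // ... record syncEv / cudaStreamWaitEvent / cudaEventSynchronize as before ...
    cudaErrChk(cudaEventDestroy(syncEv));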