How does graph launch concurrency work with executable graphs?

Suppose I instantiate two executable graphs, exec1 and exec2, from a graph g. What is the difference between launching them on two streams versus launching the same executable graph exec1 on two different streams? Some results I encountered recently seem to imply that the graph launches are concurrent in the former case but not in the latter. Why the difference?

Additionally, what happens if the two graphs being launched try to access the same memory area?

I wrote the code below:

    __device__ int mutex;
    __device__ int acc;
    __device__ size_t global_flags[200];

With a kernel like this:

    __global__ void test_mul(size_t frac = 1) {
        auto x = threadIdx.x * frac;
        int lock = 1;

        // spin until the lock is acquired
        while (lock != 0) {
            lock = atomicExch(&mutex, 1);
        }
        global_flags[acc] = x;
        // atomicAdd(&acc, 1);
        ++acc;

        atomicExch(&mutex, 0);  // release the lock
    }

So each thread tries to obtain the lock, writes its value x (threadIdx.x * frac) to global_flags[acc], and increments acc. Since acc is incremented while the lock is held, I assumed atomicAdd is not needed and ++acc is enough. When I launch a graph containing this kernel, or launch the same executable graph on two streams, it works; but when I launch two executable graphs on two streams, the result is random: the final acc is not equal to the total number of threads but some random value. I have to use atomicAdd instead of ++acc in this case. Why does this happen? Is the lock somehow obtained by both sides?

The rest of the code, used to demonstrate the results, is below:

    cudaGraph_t g;
    cudaStream_t stream[2];
    cudaEvent_t a, b;
    cudaEventCreate(&a); cudaEventCreate(&b);
    for (size_t i = 0; i < 2; i++)
        cudaStreamCreate(stream + i);

    // Capture three dependent kernel launches across the two streams.
    cudaStreamBeginCapture(stream[0], cudaStreamCaptureModeGlobal);
    test_mul<<<1, 20, 0, stream[0]>>>(1);
    cudaEventRecord(a, stream[0]);
    cudaStreamWaitEvent(stream[1], a);
    test_mul<<<1, 20, 0, stream[1]>>>(10);
    cudaEventRecord(b, stream[1]);
    cudaStreamWaitEvent(stream[0], b);
    test_mul<<<1, 20, 0, stream[0]>>>(100);
    cudaStreamEndCapture(stream[0], &g);

    // Two separate executable graphs instantiated from the same graph.
    cudaGraphExec_t g_exec1, g_exec2;
    cudaGraphInstantiate(&g_exec1, g, NULL, NULL, 0);
    cudaGraphInstantiate(&g_exec2, g, NULL, NULL, 0);

test 1:

    size_t const TEST_ARR_LEN = 200;
    cudaGraphLaunch(g_exec1, stream[0]);
    cudaGraphLaunch(g_exec1, stream[1]);
    size_t res[TEST_ARR_LEN];
    cudaDeviceSynchronize();  // wait for both launches before reading back
    cudaMemcpyFromSymbol(res, global_flags, TEST_ARR_LEN * sizeof(size_t));
    for (size_t i = 0; i < TEST_ARR_LEN; i++)
        std::cout << res[i] << " ";

test 2:

    cudaGraphLaunch(g_exec1, stream[0]);
    cudaGraphLaunch(g_exec2, stream[1]);

The results are listed below; colors are used to highlight each sequence produced by the same kernel launch.

test 1 result: (output image not preserved)

test 2 result: (output image not preserved)

test 2 with atomicAdd(&acc, 1): (output image not preserved)
A cudaGraphExec_t cannot run concurrently with itself. A launch of a cudaGraphExec_t will be ordered after previous launches of the same executable graph, even when those launches were issued to different streams. That is why test 1 serializes the two launches while test 2 allows them to overlap.
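
Concretely, with the handles from the code above, the two tests differ only in which executable graph the second launch uses:

    // test 1: both launches use g_exec1, so the second launch is ordered
    // after the first even though it targets a different stream.
    cudaGraphLaunch(g_exec1, stream[0]);
    cudaGraphLaunch(g_exec1, stream[1]);  // serialized after the launch above

    // test 2: separate instantiations of the same graph may overlap.
    cudaGraphLaunch(g_exec1, stream[0]);
    cudaGraphLaunch(g_exec2, stream[1]);  // free to run concurrently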

Regarding the random results: you need to ensure that the memory operations performed inside the critical section are visible to all threads; otherwise some threads will observe outdated values of acc. Atomics are one possible solution; another is the explicit use of memory fences.
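
Below is a minimal sketch of the fence-based variant (the kernel name test_mul_fenced is made up; it reuses the globals declared above). A __threadfence() after acquiring the lock orders the critical section's accesses after the acquisition, and one before releasing publishes this thread's writes before the next holder can enter:

    __global__ void test_mul_fenced(size_t frac = 1) {
        auto x = threadIdx.x * frac;

        // Spin until the lock is acquired; atomicExch returns the old value.
        // Note: a spin lock contended within a warp needs independent thread
        // scheduling (Volta and newer) to make forward progress.
        while (atomicExch(&mutex, 1) != 0) { }
        __threadfence();        // acquire: observe the previous holder's writes

        global_flags[acc] = x;  // plain accesses are fine while the lock is held
        ++acc;

        __threadfence();        // release: publish our writes before unlocking
        atomicExch(&mutex, 0);
    }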


So a custom lock mechanism using atomicExch doesn’t work between streams? I thought a stream is only a kind of software abstraction, and that the atomic operation on acc is not needed, since each thread only reads and writes it between the two atomicExch calls.

This isn’t really a stream issue. Thread ordering and memory ordering are two different concepts. With the lock, only one thread at a time can modify acc. However, there is no guarantee that the modification is visible to the next thread. This is explained in the CUDA C++ Programming Guide, Section 7.5 (Memory Fence Functions):

The CUDA programming model assumes a device with a weakly-ordered memory model, that is the order in which a CUDA thread writes data to shared memory, global memory, page-locked host memory, or the memory of a peer device is not necessarily the order in which the data is observed being written by another CUDA or host thread.
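
For this particular pattern the lock can also be avoided entirely: atomicAdd returns the value before the increment, so it can hand each thread a unique slot directly. A minimal sketch (the kernel name test_mul_atomic is made up; it reuses the globals from above):

    __global__ void test_mul_atomic(size_t frac = 1) {
        auto x = threadIdx.x * frac;
        // Reserve a unique index; atomicAdd returns the pre-increment value.
        int slot = atomicAdd(&acc, 1);
        global_flags[slot] = x;
    }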
