How does graph launch concurrency work with executable graphs?

Suppose I instantiate two executable graphs, exec1 and exec2, from a graph g. What is the difference between launching them on two streams versus launching the same executable graph exec1 on two different streams? Some results I encountered recently seem to imply that the graph launches are concurrent in the former case but not in the latter. Why the difference?

Additionally, what happens if the two graphs being launched try to access the same memory area?

I wrote the code below:

    __device__ int mutex;
    __device__ int acc;
    __device__ size_t global_flags[200];

With a kernel like this:

    __global__ void test_mul(size_t frac = 1) {
        auto x = threadIdx.x * frac;
        int lock = 1;

        // spin until the lock is acquired
        while (lock != 0) {
            lock = atomicExch(&mutex, 1);
        }
        global_flags[acc] = x;
        // atomicAdd(&acc, 1);
        ++acc;

        atomicExch(&mutex, 0);  // release the lock
    }

So each thread tries to obtain the lock, writes its value x (threadIdx.x * frac) to global_flags[acc], and increments acc. Since acc is incremented while the lock is held, I assumed atomicAdd is not needed and ++acc is enough. When I launch a graph containing this kernel, or launch the same executable graph on two streams, it works; but when I launch two executable graphs on two streams, the result is random: the final acc is not equal to the total number of threads but some random value. I have to use atomicAdd instead of ++acc in this case. Why does this happen? Is the lock somehow obtained by both sides?

The rest of the code, used to demonstrate the results, is below:

    cudaGraph_t g;
    cudaStream_t stream[2];
    cudaEvent_t a, b;
    cudaEventCreate(&a); cudaEventCreate(&b);
    for (size_t i = 0; i < 2; i++)
        cudaStreamCreate(stream + i);

    // Capture three dependent kernel launches across the two streams.
    cudaStreamBeginCapture(stream[0], cudaStreamCaptureModeGlobal);
    test_mul<<<1, 20, 0, stream[0]>>>(1);
    cudaEventRecord(a, stream[0]);
    cudaStreamWaitEvent(stream[1], a);
    test_mul<<<1, 20, 0, stream[1]>>>(10);
    cudaEventRecord(b, stream[1]);
    cudaStreamWaitEvent(stream[0], b);
    test_mul<<<1, 20, 0, stream[0]>>>(100);
    cudaStreamEndCapture(stream[0], &g);

    // Two separate executable graphs instantiated from the same graph.
    cudaGraphExec_t g_exec1, g_exec2;
    cudaGraphInstantiate(&g_exec1, g, NULL, NULL, 0);
    cudaGraphInstantiate(&g_exec2, g, NULL, NULL, 0);

test 1:

    size_t const TEST_ARR_LEN = 200;
    cudaGraphLaunch(g_exec1, stream[0]);
    cudaGraphLaunch(g_exec1, stream[1]);
    size_t res[TEST_ARR_LEN];
    cudaDeviceSynchronize();  // wait for both launches before reading back
    cudaMemcpyFromSymbol(res, global_flags, TEST_ARR_LEN * sizeof(size_t));
    for (size_t i = 0; i < TEST_ARR_LEN; i++)
        std::cout << res[i] << " ";

test 2:

    cudaGraphLaunch(g_exec1, stream[0]);
    cudaGraphLaunch(g_exec2, stream[1]);

The results are listed below; colors are used to highlight each sequence produced by the same kernel launch.

test 1 result: (output image not preserved)

test 2 result: (output image not preserved)

test 2 with atomicAdd(&acc, 1): (output image not preserved)
A cudaGraphExec_t cannot run concurrently with itself. A launch of a cudaGraphExec_t will be ordered after previous launches of the same executable graph, even when those launches were issued to different streams. That is why test 1 serializes the two launches while test 2 allows them to overlap.
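
Concretely, with the handles from the code above, the two tests differ only in which executable graph the second launch uses:

    // test 1: both launches use g_exec1, so the second launch is ordered
    // after the first even though it targets a different stream.
    cudaGraphLaunch(g_exec1, stream[0]);
    cudaGraphLaunch(g_exec1, stream[1]);  // serialized after the launch above

    // test 2: separate instantiations of the same graph may overlap.
    cudaGraphLaunch(g_exec1, stream[0]);
    cudaGraphLaunch(g_exec2, stream[1]);  // free to run concurrently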

Regarding the random results: you need to ensure that the memory operations performed inside the critical section are visible to all threads; otherwise some threads will observe outdated values of acc. Atomics are one possible solution; another is the explicit use of memory fences.
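
Below is a minimal sketch of the fence-based variant (the kernel name test_mul_fenced is made up; it reuses the globals declared above). A __threadfence() after acquiring the lock orders the critical section's accesses after the acquisition, and one before releasing publishes this thread's writes before the next holder can enter:

    __global__ void test_mul_fenced(size_t frac = 1) {
        auto x = threadIdx.x * frac;

        // Spin until the lock is acquired; atomicExch returns the old value.
        // Note: a spin lock contended within a warp needs independent thread
        // scheduling (Volta and newer) to make forward progress.
        while (atomicExch(&mutex, 1) != 0) { }
        __threadfence();        // acquire: observe the previous holder's writes

        global_flags[acc] = x;  // plain accesses are fine while the lock is held
        ++acc;

        __threadfence();        // release: publish our writes before unlocking
        atomicExch(&mutex, 0);
    }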


So a custom lock mechanism using atomicExch doesn’t work between streams? I thought a stream is only a kind of software abstraction, and that the atomic operation on acc is not needed, since each thread only reads and writes it between the two atomicExch calls.

This isn’t really a stream issue. Thread ordering and memory ordering are two different concepts. With the lock, only one thread at a time can modify acc. However, there is no guarantee that the modification is visible to the next thread. This is explained in the CUDA C++ Programming Guide, Section 7.5 (Memory Fence Functions):

The CUDA programming model assumes a device with a weakly-ordered memory model, that is the order in which a CUDA thread writes data to shared memory, global memory, page-locked host memory, or the memory of a peer device is not necessarily the order in which the data is observed being written by another CUDA or host thread.
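
For this particular pattern the lock can also be avoided entirely: atomicAdd returns the value before the increment, so it can hand each thread a unique slot directly. A minimal sketch (the kernel name test_mul_atomic is made up; it reuses the globals from above):

    __global__ void test_mul_atomic(size_t frac = 1) {
        auto x = threadIdx.x * frac;
        // Reserve a unique index; atomicAdd returns the pre-increment value.
        int slot = atomicAdd(&acc, 1);
        global_flags[slot] = x;
    }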
