Suppose I instantiate two executable graphs exec1 and exec2 from a graph g. What is the difference between launching them on two streams and launching the same executable graph exec1 on two different streams? A result I encountered recently seems to imply that the graph launches run concurrently in the former case, but not in the latter. Why the difference?
Additionally, what happens if the two graphs being launched try to access the same memory area?
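Concretely, the two cases I am comparing are (a sketch; exec1/exec2 here correspond to g_exec1/g_exec2 in the full code below):

// the former case: two executable graphs, one per stream
cudaGraphLaunch(exec1, stream[0]);
cudaGraphLaunch(exec2, stream[1]);

// the latter case: the same executable graph launched on two streams
cudaGraphLaunch(exec1, stream[0]);
cudaGraphLaunch(exec1, stream[1]);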
I wrote the code below:
__device__ int mutex;
__device__ int ac;
__device__ size_t global_flags[200];
With a kernel like this:
__global__ void test_mul(size_t frac=1){
    auto x=threadIdx.x*frac;
    // spin until this thread owns the lock
    int lock=1;
    while(lock!=0){
        lock=atomicExch(&mutex,1);
    }
    // critical section: record the value and bump the counter
    global_flags[ac]=x;
    //atomicAdd(&ac,1);
    ++ac;
    // release the lock
    atomicExch(&mutex,0);
}
So each thread tries to obtain the lock, then writes its value x (its thread id times frac) to global_flags[ac] and increments ac. Since the increment of ac happens while the lock is held, I assumed atomicAdd is not needed and ++ac is enough. When I launch a graph containing this kernel, or launch the same executable graph on two streams, that works; but when I launch two executable graphs on two streams, the result is random: the final ac is not equal to the total number of threads, but some random number. I have to use atomicAdd instead of ++ac in that case. Why does this happen? Is the lock somehow obtained by both sides at once?
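For reference, the variant that does give the correct count in that case only changes the increment inside the critical section (the commented-out line in the kernel above):

    global_flags[ac]=x;
    atomicAdd(&ac,1);   // instead of ++ac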
The rest of the code, used to demonstrate the results, is below:
cudaGraph_t g;
cudaStream_t stream[2];
cudaEvent_t a,b;
cudaEventCreate(&a);cudaEventCreate(&b);
for(size_t i=0;i<2;i++)
cudaStreamCreate(stream+i);
cudaStreamBeginCapture(stream[0],cudaStreamCaptureModeGlobal);
// captured graph: three test_mul launches chained by events
test_mul<<<1,20,0,stream[0]>>>(1);
cudaEventRecord(a,stream[0]);
cudaStreamWaitEvent(stream[1],a);
test_mul<<<1,20,0,stream[1]>>>(10);
cudaEventRecord(b,stream[1]);
cudaStreamWaitEvent(stream[0],b);
test_mul<<<1,20,0,stream[0]>>>(100);
cudaStreamEndCapture(stream[0],&g);
cudaGraphExec_t g_exec1,g_exec2;
cudaGraphInstantiate(&g_exec1,g,NULL,NULL,0);
cudaGraphInstantiate(&g_exec2,g,NULL,NULL,0);
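When running the tests back to back, the device globals have to be zeroed again before each run; a minimal reset sketch using cudaMemcpyToSymbol (not part of the timing question itself):

// zero the device-side counters and flags before a test run
int zero_i=0;
size_t zero_flags[200]={};
cudaMemcpyToSymbol(mutex,&zero_i,sizeof(int));
cudaMemcpyToSymbol(ac,&zero_i,sizeof(int));
cudaMemcpyToSymbol(global_flags,zero_flags,sizeof(zero_flags));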
test 1:
size_t const TEST_ARR_LEN=200;
cudaGraphLaunch(g_exec1,stream[0]);
cudaGraphLaunch(g_exec1,stream[1]);
size_t res[TEST_ARR_LEN];
cudaMemcpyFromSymbol(res,global_flags,TEST_ARR_LEN*sizeof(size_t));
for(size_t i=0;i<TEST_ARR_LEN;i++)
    std::cout<<res[i]<<" ";
test 2:
cudaGraphLaunch(g_exec1,stream[0]);
cudaGraphLaunch(g_exec2,stream[1]);
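The readback and printing are the same as in test 1; to see the final counter I also copy ac back (a small sketch):

int final_ac=0;
cudaMemcpyFromSymbol(&final_ac,ac,sizeof(int));
std::cout<<"ac = "<<final_ac<<std::endl; // expected 120: 3 kernels x 20 threads x 2 graph launches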
The results are listed below; colors highlight each sequence produced by the same kernel launch:
test 1 result:
test 2 result:
test 2 with atomicAdd(&ac, 1):