Hi,
I'm launching a CUDA kernel (16 blocks, 1024 threads) that builds linked lists dynamically in global memory. The lists are shared between blocks, and I'm running into a problem: updates to the lists and to other global data are not always visible, so other blocks end up using stale data.
From my understanding, __threadfence() should make writes to global memory visible to all blocks, but when I debug, this is not always the case.
In the simple example below, data[1] = 9 always, but data[0] is usually 7 and sometimes 13; it is completely inconsistent. Strangely, when I add the cudaMalloc or regular malloc calls (the commented-out lines), data[0] = 7 always.
I've tried the volatile keyword and also disabling the L1 cache, but both still give inconsistent results.
How can I fix this so that whenever anything in global memory is updated, all blocks see the updated data? Would cudaLaunchCooperativeKernel help by synchronizing across blocks? Or would splitting the work into multiple kernel launches?
    __global__ void kernel(int *data)
    {
        // Writer: one thread in block 2 stores 7 into data[0].
        if (blockIdx.x == 2 && threadIdx.x == 1)
        {
            // int *x = reinterpret_cast<int*>(malloc(sizeof(int)));
            // int *x;
            // cudaMalloc(&x, sizeof(int) * 5);
            // cudaFree(x);
            data[0] = 7;
            __threadfence_system();
        }
        // Reader: one thread in block 14 adds 6 only if it already sees the 7.
        if (blockIdx.x == 14 && threadIdx.x == 2)
        {
            data[1] = 9;
            if (data[0] == 7)
                data[0] += 6;
            __threadfence_system();
        }
    }
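
For reference, here is a minimal sketch of how the ordering between the two blocks could be made explicit. A memory fence only orders one thread's own writes; it does not make block 14 wait for block 2, so in the kernel above either block may run first. The sketch adds a hypothetical `ready` flag (not part of the original code) that block 2 sets after its write and that block 14 spins on before reading. The kernel name `kernelWithFlag`, the `int` element type, and the `<<<16, 1024>>>` launch shape are assumptions for illustration, and the spin loop assumes both blocks are resident on the GPU at the same time.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical variant of the kernel above. `ready` is an extra flag
// (an assumption, not in the original post) used to publish data[0].
__global__ void kernelWithFlag(int *data, int *ready)
{
    if (blockIdx.x == 2 && threadIdx.x == 1)
    {
        data[0] = 7;
        __threadfence();        // order the data write before the flag write
        atomicExch(ready, 1);   // publish: data[0] is now valid
    }
    if (blockIdx.x == 14 && threadIdx.x == 2)
    {
        // Wait for the producer. Assumes block 2 is resident or has already run;
        // otherwise this loop would never terminate.
        while (atomicAdd(ready, 0) == 0) { }
        __threadfence();        // order the flag read before the data read

        volatile int *vdata = data;  // volatile so the read is not served from a stale copy
        data[1] = 9;
        if (vdata[0] == 7)
            vdata[0] += 6;
    }
}

int main()
{
    int *data = nullptr;
    int *ready = nullptr;
    cudaMalloc(&data, 2 * sizeof(int));
    cudaMalloc(&ready, sizeof(int));
    cudaMemset(data, 0, 2 * sizeof(int));
    cudaMemset(ready, 0, sizeof(int));

    kernelWithFlag<<<16, 1024>>>(data, ready);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
    {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    int host[2] = {0, 0};
    cudaMemcpy(host, data, sizeof(host), cudaMemcpyDeviceToHost);
    printf("data[0]=%d data[1]=%d\n", host[0], host[1]);

    cudaFree(data);
    cudaFree(ready);
    return 0;
}
```

Under those assumptions the reader should only test data[0] after the 7 has been published, so the result no longer depends on which block happens to run first. The spin-wait is the fragile part: an ordinary launch does not guarantee that all blocks are resident together. A cooperative launch with a grid-wide sync, or splitting the producer and consumer into two kernel launches, are alternatives that avoid the hand-written wait.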