Having an issue with global memory access from a kernel. The code below represents what I am doing. Reading the documentation, I tried everything I could find, with no effect (notice volatile / __threadfence() below):
__global__
void kernel_1(volatile unsigned int *data)
{
    unsigned int src_row_idx = (blockIdx.y * blockDim.y) + threadIdx.y;
    unsigned int value = 42;
    // use 'value' after here: works ALWAYS
    data[src_row_idx] = value;
    // use data[src_row_idx] after here: works INTERMITTENTLY
    // 'volatile', __threadfence(), etc. do NOT solve the issue

    // spin-waiting on the data WORKS
    unsigned int num = 0;
    while (num == 0)
    {
        num = data[src_row_idx];
    }
    // use 'num' or 'value' here
}
To communicate effectively from one thread to another, you need several things:
appropriate memory fencing/memory barriers
appropriate execution barriers (synchronization between threads)
making sure caches (e.g. the L1) do not interfere
When the two threads doing the communicating are in the same threadblock, the use of __syncthreads(), for example, is generally sufficient:
thread x writes to a global or shared location
__syncthreads()
thread y reads from the global or shared location
volatile is generally not necessary for the above paradigm, because the L1 cache is coherent with respect to a given threadblock.
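As a minimal sketch of that pattern, assuming two threads of the same block communicate through a global buffer (the kernel name, the choice of threads 0 and 1, and the printf are just for illustration):

#include <cstdio>

__global__ void block_comm(unsigned int *data)
{
    // thread 0 writes to a global location
    if (threadIdx.x == 0)
        data[0] = 42;

    // execution barrier; also makes the block's prior global/shared
    // writes visible to the other threads of the block
    __syncthreads();

    // thread 1 reads from the same global location
    if (threadIdx.x == 1)
        printf("thread 1 read %u\n", data[0]);  // prints 42
}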
When the threads in question are not in the same threadblock, you still need the same 3 bases covered, but __syncthreads() is no longer sufficient, by itself, to satisfy the needs. A CUDA-defined method for this would be to use a cooperative groups (CG) grid-wide sync, and do something like:
thread x writes to a global location
this_grid().sync()
thread y reads from the global location
Depending on how exactly you provide for the general needs 1, 2, and 3 in the grid case, volatile may or may not be necessary. If you review the CG sample codes, I believe you will find some that do grid-wide sync, with thread-to-thread communication, without volatile.
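As a hedged sketch of that approach (the kernel and variable names are made up; error checking omitted):

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void grid_comm(unsigned int *data)
{
    cg::grid_group grid = cg::this_grid();

    // one thread anywhere in the grid writes to a global location
    if (grid.thread_rank() == 0)
        data[0] = 42;

    // grid-wide execution barrier with the required memory ordering
    grid.sync();

    // any thread in any block can now read the written value
    unsigned int v = data[0];
    (void)v;  // suppress unused-variable warning
}

Note that grid.sync() only works when the kernel is launched via cudaLaunchCooperativeKernel and the device reports support for cudaDevAttrCooperativeLaunch.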
Note that load/store instructions in CUDA are serviced as a single memory transaction per warp only when the warp's accesses are coalesced; whether that happens depends on what kind of hardware you are running and on how you access the data.
If any of those accesses is not coalesced with the rest, it is split off into an additional transaction and serviced later.
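To make the contrast concrete, a small sketch (kernel names are made up; sizes and bounds checks omitted):

__global__ void copy_coalesced(const float *in, float *out)
{
    // adjacent threads read adjacent addresses:
    // the warp's 32 loads coalesce into few transactions
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];
}

__global__ void copy_strided(const float *in, float *out, int stride)
{
    // adjacent threads read addresses 'stride' elements apart:
    // the warp's loads split into many separate transactions
    unsigned int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    out[i] = in[i];
}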
For your code, __syncwarp() can make it work at the warp level.
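A minimal sketch of that suggestion, assuming the writer and the readers are lanes of the same warp (the kernel name and indexing are just for illustration):

__global__ void warp_comm(volatile unsigned int *data)
{
    unsigned int lane = threadIdx.x % 32;

    // one lane of the warp writes to a global location
    if (lane == 0)
        data[blockIdx.x] = 42;

    // execution barrier for the warp's lanes, with memory ordering
    // among the participating threads
    __syncwarp();

    // the other lanes of the same warp can now read the value
    unsigned int v = data[blockIdx.x];
    (void)v;  // suppress unused-variable warning
}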