Access Global memory from kernel

Having issue on global memory access from a kernel…the code below represents what I am doing…reading the documentation I tried all I could find with no effect ( notice volatile/threadfence below)…

I have encountered this problem many times, been all over the web and I found this just today, but it does NOT have an effect…

What other factors control red/write access to global memory?

On the host:

.h file
volatile unsigned int* data;

.cpp file
main ()
unsigned int size = (sizeof(unsigned int) * 10 * 32;
cudaMalloc(&data, size);

	CUDA_test(volatile unsigned int* data)

.cu file
void CUDA_test(volatile unsigned int* data)
dim3 thr_p_block(32, 32);
dim3 num_blocks_(1, 10);

kernel_1 <<<num_blocks_, thr_p_block>>> (data);


void kernel_1(volatile unsigned int data)
unsigned int src_row_idx = ((blockIdx.y * blockDim.y) + threadIdx.y);

unsigned int value = 42;
//use value after here, works ALWAYS

data[src_row_idx] = value;
//use data[src_row_idx] after here, works INTERMITTENTLY

'volatile', thread_fence(), etc do NOT solve the issue 

//wait on the data WORKS
unsigned int num = 0;
while (num <= 0)
	num = data[src_row_idx]

    //use the num or value here


To be able to effectively communicate from one thread to another you need several things:

  1. appropriate memory fencing/memory barriers
  2. appropriate execution barriers (synchronization between threads)
  3. making sure (L1) caches do not interfere

When the two threads doing the communicating are in the same threadblock, the use of __syncthreads() for example is generally sufficient.

  • thread x writes to a global or shared location
  • __syncthreads()
  • thread y reads from the global or shared location

volatile is generally not necessary for the above paradigm, because the L1 cache is coherent with respect to a given threadblock.

when the threads in question are not in the same threadblock, you still need the same 3 bases covered, but __syncthreads() is no longer sufficient, by itself, to satisfy the needs. A cuda defined method for this would be to use cooperative groups (CG) grid-wide sync, and do something like:

  • thread x writes to a global location
  • this_grid.sync()
  • thread y reads from the global location

Depending on how exactly you provide for the general needs 1,2,3 in the grid case, volatile may or may not be necessary. If you review the CG sample codes, I believe you will find some that are doing grid-wide sync, with thread-to-thread communication, without volatile.

This may also be useful reading.

note that write/load instructions in cuda only run in batch if they are coalesced accessing, this will depend on what kind of hardware you running and how you access data.
If any of those accesses is not coalesced with the rest, it will be “queued” and access later.
For your code __syncwarp() can make it work in warp level.