Accessing global memory from a kernel

I'm having an issue with global memory access from a kernel. The code below represents what I am doing. Reading the documentation, I tried everything I could find with no effect (notice volatile/__threadfence() below).

I have encountered this problem many times and been all over the web. I found this just today: https://stackoverflow.com/questions/53048619/cuda-writes-to-global-memory-not-seen-by-other-warps, but it does NOT have an effect.

What other factors control read/write access to global memory?

On the host:

.h file
volatile unsigned int* data;

.cpp file
int main()
{
unsigned int size = sizeof(unsigned int) * 10 * 32;
cudaMalloc((void**)&data, size);

	CUDA_test(data);
}

.cu file
void CUDA_test(volatile unsigned int* data)
{
dim3 thr_p_block(32, 32);
dim3 num_blocks_(1, 10);

kernel_1 <<<num_blocks_, thr_p_block>>> (data);

}

__global__
void kernel_1(volatile unsigned int* data)
{
unsigned int src_row_idx = ((blockIdx.y * blockDim.y) + threadIdx.y);

unsigned int value = 42;
//use value after here, works ALWAYS

data[src_row_idx] = value;
//use data[src_row_idx] after here, works INTERMITTENTLY

//'volatile', __threadfence(), etc. do NOT solve the issue

//wait on the data WORKS
unsigned int num = 0;
while (num == 0)
{
	num = data[src_row_idx];
}

    //use the num or value here

}

To be able to effectively communicate from one thread to another you need several things:

  1. appropriate memory fencing/memory barriers
  2. appropriate execution barriers (synchronization between threads)
  3. making sure (L1) caches do not interfere

When the two threads doing the communicating are in the same threadblock, the use of __syncthreads() for example is generally sufficient.

  • thread x writes to a global or shared location
  • __syncthreads()
  • thread y reads from the global or shared location
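A minimal sketch of that intra-block pattern (the kernel name and the choice of index are illustrative, not from the original post):

```cuda
#include <cstdio>

__global__ void block_exchange(unsigned int* data)
{
    // one thread of the block writes to a global location
    if (threadIdx.x == 0)
        data[blockIdx.x] = 42;

    // execution barrier for the whole block; global/shared writes
    // made before it are visible to all threads of the block after it
    __syncthreads();

    // any other thread of the block can now read the value
    unsigned int value = data[blockIdx.x];
    if (threadIdx.x == 1)
        printf("block %d read %u\n", blockIdx.x, value);
}
```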

volatile is generally not necessary for the above paradigm, because the L1 cache is coherent with respect to a given threadblock.

When the threads in question are not in the same threadblock, you still need the same 3 bases covered, but __syncthreads() is no longer sufficient, by itself, to satisfy the needs. A CUDA-defined method for this would be to use a cooperative groups (CG) grid-wide sync, and do something like:

  • thread x writes to a global location
  • this_grid().sync()
  • thread y reads from the global location

Depending on how exactly you provide for the general needs 1,2,3 in the grid case, volatile may or may not be necessary. If you review the CG sample codes, I believe you will find some that are doing grid-wide sync, with thread-to-thread communication, without volatile.
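A sketch of the grid-wide version (kernel name is illustrative; note that grid.sync() requires the kernel to be launched with cudaLaunchCooperativeKernel on hardware that supports cooperative launch):

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void grid_exchange(unsigned int* data)
{
    cg::grid_group grid = cg::this_grid();

    // one thread anywhere in the grid writes to a global location
    if (grid.thread_rank() == 0)
        data[0] = 42;

    // grid-wide execution barrier; only valid when the kernel was
    // launched cooperatively (cudaLaunchCooperativeKernel)
    grid.sync();

    // any thread in any block can now read the value
    unsigned int value = data[0];
    (void)value;
}
```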

This may also be useful reading.

Note that load/store instructions in CUDA are only issued as a single batch when the accesses are coalesced; whether that happens depends on what hardware you are running and how you access the data.
If any of those accesses is not coalesced with the rest, it is effectively "queued" and serviced later.
For your code, __syncwarp() can make it work at the warp level.
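A minimal sketch of how __syncwarp() might be applied to the kernel from the question (with blockDim (32, 32), each warp spans the 32 x-lanes of one row, so the lanes of a warp write and read the same src_row_idx):

```cuda
__global__ void kernel_1(volatile unsigned int* data)
{
    unsigned int src_row_idx = (blockIdx.y * blockDim.y) + threadIdx.y;

    data[src_row_idx] = 42;

    // warp-level barrier: the write above becomes visible to the
    // other lanes of the same warp before any of them reads below
    __syncwarp();

    unsigned int value = data[src_row_idx];
    (void)value;
}
```

This only covers communication between lanes of the same warp; threads in different warps or blocks still need the block-wide or grid-wide barriers described above.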