blocking behavior of LD/ST from/to global memory

Hi,

Are loads and stores between global and shared memory blocking?

If not, how do I enforce it?

If yes, how do I avoid it (to do memory prefetching)?

I have the following kernel:

load first tile from global mem to shared mem;
for (each tile) {
    load next tile from global mem to shared mem; // perform prefetching
    compute current tile;
    store result to global mem;
    __syncthreads();
}

What I would like to achieve is the following:

  1. Overlap the compute step with the prefetching step, so the load latency does not stall my kernel.

  2. Ensure that the previous load (prefetch) has completed, so I can safely compute my next tile.

  3. Do fully asynchronous global stores; the kernel does not care when the store transaction has completed.

Does __syncthreads() already do the trick? I am not sure, since synchronizing threads has nothing to do with the blocking behaviour of the load itself.

Nice idea,

From what I have read from more expert people in this forum, the load causes a wait at the point where the data is needed, i.e. the stall happens as late as possible in the code. E.g. what will happen is:
LD issued for data from global mem
do whatever doesn't depend on the data
do the step that is dependent on the global data << there is a wait before this runs

From this I think that your idea isn't going to work if the loads store data into the same section of the shared array. I think it can be done with data pipelining (there was a forum thread a few months ago by Uncle Joe). It might also work if you make the loads alternate where they store data, but this means having a shared array that is twice as long, so check with the occupancy calculator whether that will reduce the number of blocks that can run.
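A minimal sketch of that alternating-buffer idea (all names and the trivial "compute" step are illustrative, not from any real codebase): shared memory holds two tiles, each iteration computes from one half while the prefetch fills the other, and the single __syncthreads() per iteration guarantees the prefetch has landed in all warps before the buffers swap.

```cuda
#define TILE 256  // hypothetical tile width = blockDim.x

__global__ void pipelined(const float *in, float *out, int nTiles)
{
    __shared__ float tile[2][TILE];  // twice the shared memory, as noted above
    int t = threadIdx.x;
    int buf = 0;

    tile[buf][t] = in[t];            // load first tile
    __syncthreads();                 // all warps finished the first load

    for (int i = 0; i < nTiles; ++i) {
        if (i + 1 < nTiles)          // issue prefetch into the *other* half;
            tile[buf ^ 1][t] = in[(i + 1) * TILE + t];  // no stall yet

        float r = tile[buf][t] * 2.0f;   // stand-in for "compute current tile"
        out[i * TILE + t] = r;           // fire-and-forget global store

        __syncthreads();             // prefetch visible to all warps before swap
        buf ^= 1;
    }
}
```

Note the prefetch writes the half that was only read in the previous iteration, and that iteration's barrier already guaranteed those reads were done, so there is no read/write hazard.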

Another likely problem: if a thread depends on data read by threads in another warp, then you need a __syncthreads() between the load and the point where the data is used, to make sure all warps have done the load before any warp tries to use the data.

If your grid has a lot more blocks than you have MPs (multiprocessors), then you are better off just letting the latency be hidden by the GPU running more blocks per MP, without doing anything fancy at all (except having __syncthreads() in the right place).


Hi kbam,
thanks a lot. Yes, I am using a double-buffering scheme, so I do not use the loaded data immediately but store it into a second set of registers first.

I suppose that writes are non-blocking as well then?

Is there any documentation on such things?
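For reference, the register double-buffering scheme described above might look like this (a hedged sketch with made-up names and a trivial compute step): each iteration issues the next global load into a register, computes on the tile already in shared memory, and only commits the register to shared memory at the top of the next iteration. The thread then stalls on the load only where the register is actually consumed, which is after the compute, not at the LD itself.

```cuda
#define TILE 256  // hypothetical tile width = blockDim.x

__global__ void regPrefetch(const float *in, float *out, int nTiles)
{
    __shared__ float tile[TILE];
    int t = threadIdx.x;

    float next = in[t];              // first load lands in a register
    for (int i = 0; i < nTiles; ++i) {
        __syncthreads();             // everyone done reading the old tile
        tile[t] = next;              // the wait happens here, not at the LD
        __syncthreads();             // new tile visible to all warps

        if (i + 1 < nTiles)
            next = in[(i + 1) * TILE + t];  // issue next load as early as possible

        out[i * TILE + t] = tile[t] * 2.0f; // compute + fire-and-forget store
    }
}
```

This needs only one shared buffer (at the cost of one extra register per thread and a second barrier), which may help occupancy compared with doubling the shared array.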