blocking behavior of LD/ST from/to global memory

Hi,

Are loads and stores between global and shared memory blocking?

If not, how do I enforce it?

If yes, how do I avoid it (to do memory prefetching)?

I have the following kernel:

load first tile from global mem to shared mem;
for (each tile) {
    load next tile from global mem to shared mem; // perform prefetching
    compute current tile;
    store result to global mem;
    __syncthreads();
}

What I would like to achieve is the following:

  1. Overlap the compute step with the prefetching step, so the load latency does not stall my kernel.

  2. Ensure that the previous load (prefetch) has completed, so I can safely compute my next tile.

  3. Do fully asynchronous global stores; the kernel does not care when the store transaction has completed.

Does __syncthreads() already do the trick? I am not sure, since synchronizing threads has nothing to do with the blocking behaviour of the load itself.

Nice idea,

From what I have read from more expert people in this forum, the load causes a wait at the point where the data is needed, i.e. the stall happens as late as possible in the code. E.g. what will happen is:
LD issued for data from global mem
do whatever doesn't depend on the data
do the step that is dependent on the global data << there is a wait before this runs

From this I think that your idea isn't going to work if the loads store data into the same section of the shared array. I think it can be done with data pipelining (there was a forum thread a few months ago by Uncle Joe). It might also work if you make the loads alternate where they store data, but this means having a shared array that is twice as long, so check with the occupancy calculator whether that will reduce the number of blocks that can run.
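A minimal sketch of that alternating-buffer idea (all names and the trivial "compute" step are illustrative, not from any real codebase): shared memory holds two tiles, each iteration computes from one half while the prefetch fills the other, and the single __syncthreads() per iteration guarantees the prefetch has landed in all warps before the buffers swap.

```cuda
#define TILE 256  // hypothetical tile width = blockDim.x

__global__ void pipelined(const float *in, float *out, int nTiles)
{
    __shared__ float tile[2][TILE];  // twice the shared memory, as noted above
    int t = threadIdx.x;
    int buf = 0;

    tile[buf][t] = in[t];            // load first tile
    __syncthreads();                 // all warps finished the first load

    for (int i = 0; i < nTiles; ++i) {
        if (i + 1 < nTiles)          // issue prefetch into the *other* half;
            tile[buf ^ 1][t] = in[(i + 1) * TILE + t];  // no stall yet

        float r = tile[buf][t] * 2.0f;   // stand-in for "compute current tile"
        out[i * TILE + t] = r;           // fire-and-forget global store

        __syncthreads();             // prefetch visible to all warps before swap
        buf ^= 1;
    }
}
```

Note the prefetch writes the half that was only read in the previous iteration, and that iteration's barrier already guaranteed those reads were done, so there is no read/write hazard.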

Another likely problem: if a thread depends on data read by threads in another warp, then you need a __syncthreads() between the load and the point where the data is used, to make sure all warps have done the load before any warp tries to use the data.

If your grid has a lot more blocks than you have MPs (multiprocessors), then you are better off just letting the latency be hidden by the GPU running more blocks per MP, without doing anything fancy at all (except having __syncthreads() in the right place).


Hi kbam,
thanks a lot. Yes, I am using a double-buffering scheme, so I do not use the loaded data immediately but store it into a second set of registers first.

I suppose that writes are non-blocking as well then?

Is there any documentation on such things?
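For reference, the register double-buffering scheme described above might look like this (a hedged sketch with made-up names and a trivial compute step): each iteration issues the next global load into a register, computes on the tile already in shared memory, and only commits the register to shared memory at the top of the next iteration. The thread then stalls on the load only where the register is actually consumed, which is after the compute, not at the LD itself.

```cuda
#define TILE 256  // hypothetical tile width = blockDim.x

__global__ void regPrefetch(const float *in, float *out, int nTiles)
{
    __shared__ float tile[TILE];
    int t = threadIdx.x;

    float next = in[t];              // first load lands in a register
    for (int i = 0; i < nTiles; ++i) {
        __syncthreads();             // everyone done reading the old tile
        tile[t] = next;              // the wait happens here, not at the LD
        __syncthreads();             // new tile visible to all warps

        if (i + 1 < nTiles)
            next = in[(i + 1) * TILE + t];  // issue next load as early as possible

        out[i * TILE + t] = tile[t] * 2.0f; // compute + fire-and-forget store
    }
}
```

This needs only one shared buffer (at the cost of one extra register per thread and a second barrier), which may help occupancy compared with doubling the shared array.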