Hiding global memory access: do I need 2 warps?

I am trying to hide a time-consuming global-memory atomic operation behind some independent computation.

Consider the following code:

__shared__ int shVar;

if (threadIdx.x == 0)
    shVar = atomicAdd(&global, 1);

[...]
// do some computation that does not read from global memory and does not depend on shVar
[...]

int something = globalArray[shVar];

[...]

If I assign a single warp to execute the above, will it stall at the atomicAdd until it returns, or is the compiler/driver/hardware smart enough to detect that it actually has to wait only later, at the first reference to shVar?

More generally, are all global memory operations completely synchronous, or can a warp continue its execution for a while when it is able to?

One could easily replace the above example with:

__shared__ int shVar;

if (threadIdx.x == 32)
    shVar = atomicAdd(&global, 1);

if (threadIdx.x < 32) {
    [...]
    // do some computation that does not read from global memory and does not depend on shVar
    [...]
}

__syncthreads();

int something = globalArray[shVar];

[...]

… and assign 2 warps to do it, but since we do not know in which order warps are executed, I have no guarantee that the atomic operation will be issued before my computation. In the worst-case scenario, warp 0 could reach the __syncthreads() barrier while warp 1 has not even started the atomic operation!

Or maybe there is third solution?
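For reference, here is a hedged sketch of one possible third approach (my own assumption, not confirmed anywhere in this thread): keep the atomic's return value in a register instead of shared memory, and broadcast it across the warp with a shuffle only after the independent work. The idea is that the only instruction that truly depends on the atomic's result is the shuffle, so the hardware's dependency tracking can, at best, defer the stall until that point. The kernel below assumes a single warp per block, CUDA 9 or later for `__shfl_sync`, and hypothetical names (`global`, `globalArray`, `out`).

```cuda
#include <cuda_runtime.h>

// Sketch only: one warp per block, ticket obtained by lane 0 via an atomic.
__global__ void ticketKernel(int *global, const int *globalArray, int *out)
{
    int ticket = 0;
    if (threadIdx.x == 0)
        ticket = atomicAdd(global, 1);   // long-latency atomic; result stays in a register

    // ... independent computation that does not read global memory
    //     and does not depend on the atomic's result ...

    // Broadcast lane 0's ticket to the whole warp. This is the first real
    // use of the atomic's return value, so any stall happens here at the
    // earliest, rather than at the atomicAdd itself.
    ticket = __shfl_sync(0xffffffffu, ticket, 0);

    out[threadIdx.x] = globalArray[ticket];
}
```

Whether this actually overlaps the atomic with the computation depends on how the compiler schedules the instructions; it only avoids the shared-memory store that the original version makes immediately dependent on the atomic's result.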

I also wanted to do something similar. However, it did not give the expected results. Maybe the compiler (or CUDA itself) automatically optimizes the global memory access so that the data is fetched beforehand.

Tricks of this kind usually do not give the expected results, probably due to optimizations you cannot see.