volatile breaks coalescing for vector types volatile trick can backfire.

cbuchner1 · November 2, 2009, 7:11pm

Be warned, the volatile trick sometimes backfires.

cuComplex *g_vectorptr; // points to a location in global memory

int index = threadIdx.x;
g_vectorptr[index] = make_cuComplex(1.0f, 2.0f);

volatile int index2 = threadIdx.x;
g_vectorptr[index2] = make_cuComplex(1.0f, 2.0f);

The first expression will be coalesced (a single st.global.v2.f32 write per element), the second write create two individual float writes (st.global.f32) which breaks coalescing. Took me a couple of hours to figure out why my code was running “suboptimally”

Same with float2, float4, and likely all other vector types.

stencil · November 4, 2009, 2:43am

Assuming your goal here is to ensure that the whole of the structure gets written to global memory, then you’ve put the volatile in the wrong place. You marked index as volatile, which means reads and writes to the index variable will not be optimized out. What you need to mark as volatile is one of the items involved in the copy (either the source or the destination). You can cast the rhs to (volatile cuComplex), or you can declare g_vectorptr as “volatile cuComplex*” (pointer to a volatile cuComplex object).

avidday · November 4, 2009, 6:34am

Apart from not making a whole lot of sense,that defeats the whole idea of the “volatile trick”, which is to make the compiler spill variables defined as volatile from register to local memory to reduce the kernel register footprint.

cbuchner1 · November 4, 2009, 9:34am

Both of you are not familiar with the trick.It does not affect use of local memory as far as I know.

Volatile forces allocation of a local variable to a register right away and its result to be evaluated immediately (if it that variable eligible to go into a register, i.e. not being a large data structure or array). It also disables the automatic expression inlining that the compiler likes to perform in such cases, thus keeping needless index and variable calculations out of a tight inner loop. It also can have the side effect of sometimes reducing the overall register use of a kernel. The trick also works for floating point constants, that would otherwise be loaded over and over into (possibly different) registers.

That it breaks coalescing for vector types is a potentially harmful side effect that one has to keep in mind. The above sample only demonstrates that effect, but does not make any reasonable use of the volatile keyword.

avidday · November 4, 2009, 10:31am

Mea culpa.

I have used it a couple of times before to save a register, but I presumed (almost always a bad thing) that it was functionally equivalent to the maxregccount option for the compiler, so that variable references get replaced by read instructions from local memory.

sergeyn · November 4, 2009, 3:34pm

That does work… sometimes. I’ve seen kernels where volatile variables get ‘unvolatiled’ back when translated from ptx into cuda binary.

Maybe reading an index first into a temp variable should help ?

Frankly speaking I consider such behavior a compiler bug.

stencil · November 4, 2009, 6:58pm

For the record, I was referring to the volatile trick discussed in this thread, which it is now clear is a different volatile trick than the one you’re using.

I have also seen behavior like this when the index calculation involves an element of an array of shared memory, for example in the following kernel:

[codebox]shared int index[1];

global void my_kernel(uint4 *gptr,int f) {

if(f) index[0] = 15;

else index[0] = 16;

uint4 m;

m.x = 0;

m.y = 1;

m.z = 2;

m.w = 3;

gptr[ index[0] ] = m;

}[/codebox]

cbuchner1 · November 4, 2009, 7:40pm

Wooha! I wasn’t aware of this version of the trick ;) There’s just too much trickery going on here.

Topic		Replies	Views
Coalescing issue, presumably due to the CUDA Optimizer CUDA Programming and Performance	18	3197	December 9, 2009
Improve performance using volatile CUDA Programming and Performance	8	7074	July 15, 2011
Cannot coalesce global memory reads using builtin vector types CUDA Programming and Performance	6	3344	July 14, 2010
Float2 Coalescing CUDA Programming and Performance	12	10755	December 9, 2010
Using volatile What have i actually done? CUDA Programming and Performance	12	4177	August 13, 2008
Volatile - when to use? (Regarding registers) CUDA Programming and Performance	3	12172	March 17, 2012
Getting nvcc to consolidate registers CUDA Programming and Performance	19	19544	November 19, 2012
read global memory conflict CUDA Programming and Performance	4	2677	July 16, 2009
Coalesced Memory Access to Structs CUDA Programming and Performance	11	4663	September 19, 2009
Is it possible to avoid optimization? Compiler breaks memory coalescing. CUDA Programming and Performance	7	3167	March 14, 2008

volatile breaks coalescing for vector types volatile trick can backfire.

Related topics