volatile breaks coalescing for vector types volatile trick can backfire.

Be warned, the volatile trick sometimes backfires.

cuComplex *g_vectorptr; // points to a location in global memory

int index = threadIdx.x;
g_vectorptr[index] = make_cuComplex(1.0f, 2.0f);

volatile int index2 = threadIdx.x;
g_vectorptr[index2] = make_cuComplex(1.0f, 2.0f);

The first expression will be coalesced (a single st.global.v2.f32 write per element), the second write create two individual float writes (st.global.f32) which breaks coalescing. Took me a couple of hours to figure out why my code was running “suboptimally”

Same with float2, float4, and likely all other vector types.

Assuming your goal here is to ensure that the whole of the structure gets written to global memory, then you’ve put the volatile in the wrong place. You marked index as volatile, which means reads and writes to the index variable will not be optimized out. What you need to mark as volatile is one of the items involved in the copy (either the source or the destination). You can cast the rhs to (volatile cuComplex), or you can declare g_vectorptr as “volatile cuComplex*” (pointer to a volatile cuComplex object).

Apart from not making a whole lot of sense,that defeats the whole idea of the “volatile trick”, which is to make the compiler spill variables defined as volatile from register to local memory to reduce the kernel register footprint.

Both of you are not familiar with the trick.It does not affect use of local memory as far as I know.

Volatile forces allocation of a local variable to a register right away and its result to be evaluated immediately (if it that variable eligible to go into a register, i.e. not being a large data structure or array). It also disables the automatic expression inlining that the compiler likes to perform in such cases, thus keeping needless index and variable calculations out of a tight inner loop. It also can have the side effect of sometimes reducing the overall register use of a kernel. The trick also works for floating point constants, that would otherwise be loaded over and over into (possibly different) registers.

That it breaks coalescing for vector types is a potentially harmful side effect that one has to keep in mind. The above sample only demonstrates that effect, but does not make any reasonable use of the volatile keyword.

Mea culpa.

I have used it a couple of times before to save a register, but I presumed (almost always a bad thing) that it was functionally equivalent to the maxregccount option for the compiler, so that variable references get replaced by read instructions from local memory.

That does work… sometimes. I’ve seen kernels where volatile variables get ‘unvolatiled’ back when translated from ptx into cuda binary.

Maybe reading an index first into a temp variable should help ?

Frankly speaking I consider such behavior a compiler bug.

For the record, I was referring to the volatile trick discussed in this thread, which it is now clear is a different volatile trick than the one you’re using.

I have also seen behavior like this when the index calculation involves an element of an array of shared memory, for example in the following kernel:

[codebox]shared int index[1];

global void my_kernel(uint4 *gptr,int f) {

if(f) index[0] = 15;

else index[0] = 16;

uint4 m;

m.x = 0;

m.y = 1;

m.z = 2;

m.w = 3;

gptr[ index[0] ] = m;


Wooha! I wasn’t aware of this version of the trick ;) There’s just too much trickery going on here.