Atomic functions and volatile shared memory declarations.

It is my understanding that Fermi lacks instructions to directly operate on shared memory locations, and instead operations are performed in register and then stored to shared memory with explicit store instructions. This can mean that compiler optimizations can break implicit synchronization between threads in a warp by removing the shared memory load and store instructions and maintaining intermediate values of a code section in register. The current documentation recommends declaring any shared memory used in this fashion to be volatile on Fermi, which will force the compiler to honor the shared memory load and store instructions required to make implicit synchronization within a warp work correctly.

As best as I can tell, this introduces a new problem: the templated declarations for atomic functions in the toolkit don’t include instantiations for volatile types. So while this code snippet compiles

__shared__ unsigned int Ltile [4][16]; // 4 x 16 tile 

	if (converged) { atomicExch(&Ltile[threadIdx.x][threadIdx.y], 0); }

neither this:

__shared__ unsigned int Ltile [4][16]; // 4 x 16 tile 

	volatile unsigned int * Lval = &Ltile[threadIdx.x][threadIdx.y];

	if (converged) { atomicExch(Lval, 0); }

nor this:

volatile __shared__ unsigned int Ltile [4][16]; // 4 x 16 tile 

	if (converged) { atomicExch(&Ltile[threadIdx.x][threadIdx.y], 0); }

will because of C++ pedantry surrounding the volatile keyword. This is using gcc 4.3 with the 3.2rc2 nvcc release, but I doubt it will be different on any other sensible C++ compiler.

Any change this could be fixed? It is proving to be a major pain in the code I am working on at the moment.

It is my understanding that Fermi lacks instructions to directly operate on shared memory locations, and instead operations are performed in register and then stored to shared memory with explicit store instructions. This can mean that compiler optimizations can break implicit synchronization between threads in a warp by removing the shared memory load and store instructions and maintaining intermediate values of a code section in register. The current documentation recommends declaring any shared memory used in this fashion to be volatile on Fermi, which will force the compiler to honor the shared memory load and store instructions required to make implicit synchronization within a warp work correctly.

As best as I can tell, this introduces a new problem: the templated declarations for atomic functions in the toolkit don’t include instantiations for volatile types. So while this code snippet compiles

__shared__ unsigned int Ltile [4][16]; // 4 x 16 tile 

	if (converged) { atomicExch(&Ltile[threadIdx.x][threadIdx.y], 0); }

neither this:

__shared__ unsigned int Ltile [4][16]; // 4 x 16 tile 

	volatile unsigned int * Lval = &Ltile[threadIdx.x][threadIdx.y];

	if (converged) { atomicExch(Lval, 0); }

nor this:

volatile __shared__ unsigned int Ltile [4][16]; // 4 x 16 tile 

	if (converged) { atomicExch(&Ltile[threadIdx.x][threadIdx.y], 0); }

will because of C++ pedantry surrounding the volatile keyword. This is using gcc 4.3 with the 3.2rc2 nvcc release, but I doubt it will be different on any other sensible C++ compiler.

Any change this could be fixed? It is proving to be a major pain in the code I am working on at the moment.

As a workaround, you could make an overloaded AtomicExch() with volatile arguments, then inside cast the volatile away and call the original AtomicExch.
Then your main code doesn’t need to do anything special or have awkward casts… it’s all taken care of by the overload.

Easy, but not elegant since you need to include a little header in all your files.

As a workaround, you could make an overloaded AtomicExch() with volatile arguments, then inside cast the volatile away and call the original AtomicExch.
Then your main code doesn’t need to do anything special or have awkward casts… it’s all taken care of by the overload.

Easy, but not elegant since you need to include a little header in all your files.

That is what I have done, but I am not very happy about having to overload “intrinsic” functions everywhere just to get things to compile…

That is what I have done, but I am not very happy about having to overload “intrinsic” functions everywhere just to get things to compile…

Old thread but the cleanest workaround I’ve come up with is to define both volatile and non-volatile fields and enclose them in an anonymous (or named) union. Problem solved without ugly casting:

You’ll notice I prefixed the atomic variant with “atomic” as a reminder that this field should only be referenced by atomic ops.

Also, for brevity, I don’t show the shared declaration or any enclosing structs. :)