Atomic functions and volatile shared memory declarations.

avidday · November 22, 2010, 7:18am

It is my understanding that Fermi lacks instructions to directly operate on shared memory locations, and instead operations are performed in register and then stored to shared memory with explicit store instructions. This can mean that compiler optimizations can break implicit synchronization between threads in a warp by removing the shared memory load and store instructions and maintaining intermediate values of a code section in register. The current documentation recommends declaring any shared memory used in this fashion to be volatile on Fermi, which will force the compiler to honor the shared memory load and store instructions required to make implicit synchronization within a warp work correctly.

As best as I can tell, this introduces a new problem: the templated declarations for atomic functions in the toolkit don’t include instantiations for volatile types. So while this code snippet compiles

__shared__ unsigned int Ltile [4][16]; // 4 x 16 tile 

	if (converged) { atomicExch(&Ltile[threadIdx.x][threadIdx.y], 0); }

neither this:

__shared__ unsigned int Ltile [4][16]; // 4 x 16 tile 

	volatile unsigned int * Lval = &Ltile[threadIdx.x][threadIdx.y];

	if (converged) { atomicExch(Lval, 0); }

nor this:

volatile __shared__ unsigned int Ltile [4][16]; // 4 x 16 tile 

	if (converged) { atomicExch(&Ltile[threadIdx.x][threadIdx.y], 0); }

will because of C++ pedantry surrounding the volatile keyword. This is using gcc 4.3 with the 3.2rc2 nvcc release, but I doubt it will be different on any other sensible C++ compiler.

Any change this could be fixed? It is proving to be a major pain in the code I am working on at the moment.

avidday · November 22, 2010, 7:18am

It is my understanding that Fermi lacks instructions to directly operate on shared memory locations, and instead operations are performed in register and then stored to shared memory with explicit store instructions. This can mean that compiler optimizations can break implicit synchronization between threads in a warp by removing the shared memory load and store instructions and maintaining intermediate values of a code section in register. The current documentation recommends declaring any shared memory used in this fashion to be volatile on Fermi, which will force the compiler to honor the shared memory load and store instructions required to make implicit synchronization within a warp work correctly.

As best as I can tell, this introduces a new problem: the templated declarations for atomic functions in the toolkit don’t include instantiations for volatile types. So while this code snippet compiles

__shared__ unsigned int Ltile [4][16]; // 4 x 16 tile 

	if (converged) { atomicExch(&Ltile[threadIdx.x][threadIdx.y], 0); }

neither this:

__shared__ unsigned int Ltile [4][16]; // 4 x 16 tile 

	volatile unsigned int * Lval = &Ltile[threadIdx.x][threadIdx.y];

	if (converged) { atomicExch(Lval, 0); }

nor this:

volatile __shared__ unsigned int Ltile [4][16]; // 4 x 16 tile 

	if (converged) { atomicExch(&Ltile[threadIdx.x][threadIdx.y], 0); }

will because of C++ pedantry surrounding the volatile keyword. This is using gcc 4.3 with the 3.2rc2 nvcc release, but I doubt it will be different on any other sensible C++ compiler.

Any change this could be fixed? It is proving to be a major pain in the code I am working on at the moment.

SPWorley · November 22, 2010, 7:27am

As a workaround, you could make an overloaded AtomicExch() with volatile arguments, then inside cast the volatile away and call the original AtomicExch.
Then your main code doesn’t need to do anything special or have awkward casts… it’s all taken care of by the overload.

Easy, but not elegant since you need to include a little header in all your files.

SPWorley · November 22, 2010, 7:27am

As a workaround, you could make an overloaded AtomicExch() with volatile arguments, then inside cast the volatile away and call the original AtomicExch.
Then your main code doesn’t need to do anything special or have awkward casts… it’s all taken care of by the overload.

Easy, but not elegant since you need to include a little header in all your files.

avidday · November 22, 2010, 8:12am

That is what I have done, but I am not very happy about having to overload “intrinsic” functions everywhere just to get things to compile…

avidday · November 22, 2010, 8:12am

That is what I have done, but I am not very happy about having to overload “intrinsic” functions everywhere just to get things to compile…

allanmac · December 14, 2013, 11:35pm

Old thread but the cleanest workaround I’ve come up with is to define both volatile and non-volatile fields and enclose them in an anonymous (or named) union. Problem solved without ugly casting:

You’ll notice I prefixed the atomic variant with “atomic” as a reminder that this field should only be referenced by atomic ops.

Also, for brevity, I don’t show the shared declaration or any enclosing structs. :)

Topic		Replies	Views
Unsynchronized shared memory access CUDA Programming and Performance	5	1108	April 11, 2017
shared memory atomics vs volatile does volatile eliminate the need for atomics in shared mem? CUDA Programming and Performance	1	2482	March 24, 2012
Shared memory issue CUDA Programming and Performance	4	802	July 18, 2013
volatile in Cuda Fortran Legacy PGI Compilers	2	3247	November 18, 2011
pointer to shared memory compiler problems CUDA Programming and Performance	19	14832	June 7, 2008
Mixing volatile writes and atomicAdd()s with shared memory produces unexpected results CUDA Programming and Performance	8	817	March 13, 2024
Shared memory coherance problem within a thread CUDA Programming and Performance	3	816	November 17, 2010
reduction issues moving from 2.3 to 3.1 why are volatiles apparently needed now? CUDA Programming and Performance	4	1073	September 3, 2010
Volatile Keyword What exactly does it do? CUDA Programming and Performance	3	2440	February 17, 2009
Shared memory atomics and other questions. CUDA Programming and Performance	19	13920	November 13, 2010

Atomic functions and volatile shared memory declarations.

Related topics