NVCC fails to generate volatile memory instructions

I’m running into a problem where NVCC is not generating volatile memory instructions for some variables declared ‘volatile’. The non-volatile instructions can be seen in the PTX output. Is this a bug, or am I using type qualifiers incorrectly? The source code below reproduces the problem.

In the real kernel, there are shared objects that can be read and written by any thread block. Ordinary loads don’t work because they may return stale data from cache.

Operating system is x86-64 Debian linux (squeeze)

Host compiler is gcc 4.4.5

CUDA compiler (nvcc --version) is Cuda compilation tools, release 3.2, V0.2.1221

Example code:

// nvcc file.cu -ptx -arch=sm_20

struct Foo {

  typedef volatile Foo *Elem;

  Elem next;

};

struct Queue {

  typedef volatile Foo *Elem;

  volatile Elem head;

  volatile Elem tail;

};

__global__ void

example_kernel(Queue *q, Queue::Elem new_tail) {

  // Load and store of "q->tail" is not volatile

  q->tail->next = new_tail;

  q->tail = new_tail;

}

extern void

run_example_kernel(Queue *head, Queue::Elem new_tail) {

  example_kernel<<<1, 1>>>(head, new_tail);

}