PTX: st.wt versus st.volatile

Just curious, is there a difference between using st.wt and st.volatile? Seems like they both just ignore caching.

I also wondered about that. The documentation seems unclear:

The trouble is, what is “normal write-back?” Notice it “writes through the L2 cache” as opposed to “bypass” L2 or L1.

I think this updates both L1 and L2 (a completely normal write-back as the PTX manual says) EXCEPT that it also updates System Memory to make it visible to the CPU(NOT GPU!)

In short, if you use cuobjdump -sass to view how nvcc handles the cuda volatile keyword, nvcc apparently uses st.wt to handle that by default. In other words, st.wt is apparently an “update all levels” instruction (which is how you might want to interpret the generic cuda volatile keyword.

volatile (both in C and PTX) is really treated as a “inhibit optimization” flag for the assembler. It inhibits any modify-after-write type optimizations that might eliminate intermediate writes to memory. If you do something like this (classic shared memory reduction):

if (threadIdx.x < 16) buff[threadIdx.x] +=buff[threadIdx.x + 16];

  if (threadIdx.x < 8)  buff[threadIdx.x] +=buff[threadIdx.x + 8];

  if (threadIdx.x < 4)  buff[threadIdx.x] +=buff[threadIdx.x + 4];

the assembler might well decide that the intermediate writes to buff can be eliminated, with the result held in register instead until the last write. That would break certain operations (like this reduction). volatile fixes that by forcing the intermediate writes to be honored.

Yes, but the OP’s question was much more specific than “what does volatile mean?” I would paraphrase/modify the OP’s question as follows: what does volatile mean for Fermi and the L2 cache?!