CUDA 4.0 provides support and documentation for asm() directives! These are invaluable for a lot of small, tight features. For example, I use simple 128-bit integer math for some of my PRNG designs. The hardware handles this great (it’s just a chain of adds), but it couldn’t be written in C before because not all PTX features were exposed (like add with carry). I’ve been using this for a year now. (The asm() feature has been in nvcc all along; it’s just been undocumented and unsupported.)
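For anyone who hasn't seen the "chain of adds" trick, here is a rough sketch of what a 128-bit add can look like with inline PTX. The `u128` struct, `add128`, the kernel and the host helper are names I made up for illustration; the only PTX used is `add.cc` / `addc.cc` / `addc`, which set and consume the carry flag.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Hypothetical 128-bit integer: four 32-bit limbs, least significant first.
struct u128 { unsigned int w[4]; };

// 128-bit add as one carry chain. All four instructions live in a single
// asm() statement so nothing can be scheduled between them and disturb
// the carry flag.
__device__ u128 add128(u128 a, u128 b)
{
    u128 r;
    asm("add.cc.u32  %0, %4, %8;\n\t"   // low limb, sets carry
        "addc.cc.u32 %1, %5, %9;\n\t"   // add with carry-in, sets carry
        "addc.cc.u32 %2, %6, %10;\n\t"
        "addc.u32    %3, %7, %11;\n\t"  // high limb, carry-in only
        : "=r"(r.w[0]), "=r"(r.w[1]), "=r"(r.w[2]), "=r"(r.w[3])
        : "r"(a.w[0]), "r"(a.w[1]), "r"(a.w[2]), "r"(a.w[3]),
          "r"(b.w[0]), "r"(b.w[1]), "r"(b.w[2]), "r"(b.w[3]));
    return r;
}

__global__ void add128_kernel(const u128 *a, const u128 *b, u128 *out)
{
    *out = add128(*a, *b);
}

// Host helper (illustrative only): runs one add on the GPU and returns it.
u128 add128_on_gpu(u128 a, u128 b)
{
    u128 *d = nullptr;  // d[0] = a, d[1] = b, d[2] = result
    cudaError_t err = cudaMalloc(&d, 3 * sizeof(u128));
    assert(err == cudaSuccess);
    cudaMemcpy(d,     &a, sizeof(u128), cudaMemcpyHostToDevice);
    cudaMemcpy(d + 1, &b, sizeof(u128), cudaMemcpyHostToDevice);
    add128_kernel<<<1, 1>>>(d, d + 1, d + 2);
    u128 r;
    err = cudaMemcpy(&r, d + 2, sizeof(u128), cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);
    cudaFree(d);
    return r;
}
```

No volatile or clobber is needed here, since the result depends only on the listed inputs. Adding 1 to a value whose two low limbs are 0xFFFFFFFF should carry through both of them and leave limbs {0, 0, 1, 0}.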
But now we have docs on asm() and I can ask questions!
asm() is not discussed in the Programming Guide (it’s not even mentioned) but there’s a separate .pdf in the toolkit: “Using Inline PTX”.
Page 7 has an interesting but confusing section “Incorrect Optimization”.
I see why the volatile can be useful for a clock query statement: you want the clock to be evaluated now since you’re likely bracketing some computation to time it.
But I don’t understand the use of the “memory” qualifier. In the example, the destination is clearly a write (by mere syntax, “=r” shows it’s a write). Is a memory write specifically to global or shared memory (not a register)? But that doesn’t make sense with the example… the mov.u32 statement isn’t going to have any side effects on memory, so why the memory clobber?
I guess what I’m asking is for a good example where the memory clobber is needed, or an explanation why it’s needed for this clock query example.
The Fermi compatibility guide discussion on intrawarp shared memory synchronization suggests that the sm_20 assembler might take it upon itself to eliminate an intermediate write and hold a partial result in a register. The “memory” specifier might be a guard against such an optimization. Otherwise it doesn’t seem to make much sense.
Maybe the use case is that you want to time a portion of code, and especially some loads and stores. By introducing an artificial memory clobber statement, you prevent the compiler from reordering loads and stores across the “mov %0, %%clock” instruction.
That would be the latter part of the description “if there is a hidden side effect on user memory, or if you want to stop any memory optimizations around the asm() statement”.
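A minimal sketch of that timing use case, assuming the clock read looks like the one in the “Using Inline PTX” example. The names `read_clock`, `time_copy` and the host helper are mine, not from the docs.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Clock read with volatile plus a "memory" clobber. The mov itself touches
// no memory; the clobber is only there to tell the compiler not to move
// loads or stores across this statement.
__device__ __forceinline__ unsigned int read_clock()
{
    unsigned int t;
    asm volatile("mov.u32 %0, %%clock;" : "=r"(t) :: "memory");
    return t;
}

__global__ void time_copy(const int *in, int *out, unsigned int *cycles)
{
    unsigned int start = read_clock();
    out[threadIdx.x] = in[threadIdx.x] + 1;  // the load and store being timed
    unsigned int stop = read_clock();
    cycles[threadIdx.x] = stop - start;
}

// Host helper (illustrative only): checks that the copy itself is correct.
// The cycle counts are hardware dependent, so they are not checked.
bool time_copy_ok()
{
    const int n = 32;
    int h_in[n], h_out[n];
    for (int i = 0; i < n; ++i) h_in[i] = i;

    int *d_in = nullptr, *d_out = nullptr;
    unsigned int *d_cycles = nullptr;
    cudaError_t err = cudaMalloc(&d_in, n * sizeof(int));
    assert(err == cudaSuccess);
    err = cudaMalloc(&d_out, n * sizeof(int));
    assert(err == cudaSuccess);
    err = cudaMalloc(&d_cycles, n * sizeof(unsigned int));
    assert(err == cudaSuccess);

    cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);
    time_copy<<<1, n>>>(d_in, d_out, d_cycles);
    err = cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);

    bool ok = (err == cudaSuccess);
    for (int i = 0; i < n; ++i) ok = ok && (h_out[i] == i + 1);

    cudaFree(d_in);
    cudaFree(d_out);
    cudaFree(d_cycles);
    return ok;
}
```

Note that the clobber only constrains what the compiler may reorder. It says nothing about when the hardware actually finishes the load or store, so the measured interval is still only approximate.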
Suppose that before your asm statement there is an instruction that modifies a memory location, which the optimizer transforms into a register write plus a register-to-memory move. The “memory” qualifier could stop the latter from moving past your asm statement…
“If your assembler instructions access memory in an unpredictable fashion, add ‘memory’ to the list of clobbered registers. This will cause GCC to not keep memory values cached in registers across the assembler instruction and not optimize stores or loads to that memory. You will also want to add the volatile keyword if the memory affected is not listed in the inputs or outputs of the asm, as the ‘memory’ clobber does not count as a side-effect of the asm.”
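As a concrete case of what that quote describes, here is a sketch where the asm writes memory through a pointer that appears only as an input operand. I'm assuming nvcc treats the clobber the same way GCC does, and a 64-bit build so the pointer fits the "l" constraint. `store_via_asm`, `hidden_store` and the host helper are made-up names.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Stores v to *p from inside the asm. p is only an input operand, so nothing
// in the operand list says that *p changes; the "memory" clobber is the only
// hint the compiler gets. st.u32 without a state space uses generic
// addressing, so an ordinary pointer can be passed in.
__device__ __forceinline__ void store_via_asm(unsigned int *p, unsigned int v)
{
    asm volatile("st.u32 [%0], %1;" :: "l"(p), "r"(v) : "memory");
}

__global__ void hidden_store(unsigned int *buf)
{
    buf[0] = 1;              // ordinary store, may be held in a register
    store_via_asm(buf, 42);  // overwrites buf[0] behind the compiler's back
    buf[1] = buf[0];         // has to be reloaded from memory after the asm
}

// Host helper (illustrative only): returns buf[1] after the kernel runs.
unsigned int run_hidden_store()
{
    unsigned int *d_buf = nullptr;
    cudaError_t err = cudaMalloc(&d_buf, 2 * sizeof(unsigned int));
    assert(err == cudaSuccess);
    cudaMemset(d_buf, 0, 2 * sizeof(unsigned int));
    hidden_store<<<1, 1>>>(d_buf);
    unsigned int h_buf[2] = {0u, 0u};
    err = cudaMemcpy(h_buf, d_buf, sizeof(h_buf), cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);
    cudaFree(d_buf);
    return h_buf[1];
}
```

With the clobber in place, buf[1] should come back as 42. Without it, the compiler would be free to forward the 1 it just stored and never reload buf[0], since as far as it can tell the asm only reads a pointer and an integer.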