CUDA 4.0 provides support and documentation for asm() directives! These are invaluable for a lot of small, tight features. For example, I use simple 128-bit integer math in some of my PRNG designs. The hardware handles this well (it’s just a chain of adds), but it couldn’t be written in C before because not all PTX features were exposed (add with carry, for instance). I’ve been using asm() for a year now. (The feature has been in nvcc all along; it was just undocumented and unsupported.)
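To show the kind of carry chain I mean, here is a minimal sketch (not my actual PRNG code). The u128 struct, add128(), add128_kernel and add128_on_device() are names made up for this post; the PTX instructions add.cc / addc.cc / addc are the real add-with-carry forms. Error checking is left out to keep it short.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical 128-bit integer: four 32-bit limbs, least significant first.
struct u128 { unsigned int w[4]; };

// 128-bit add as a chain of 32-bit adds that pass the carry flag along.
__device__ u128 add128(u128 a, u128 b)
{
    u128 r;
    asm("add.cc.u32  %0, %4, %8;\n\t"   // low limb, sets carry
        "addc.cc.u32 %1, %5, %9;\n\t"   // adds carry-in, sets carry-out
        "addc.cc.u32 %2, %6, %10;\n\t"
        "addc.u32    %3, %7, %11;"      // top limb, carry-out discarded
        : "=r"(r.w[0]), "=r"(r.w[1]), "=r"(r.w[2]), "=r"(r.w[3])
        : "r"(a.w[0]), "r"(a.w[1]), "r"(a.w[2]), "r"(a.w[3]),
          "r"(b.w[0]), "r"(b.w[1]), "r"(b.w[2]), "r"(b.w[3]));
    return r;
}

__global__ void add128_kernel(u128 a, u128 b, u128 *out)
{
    *out = add128(a, b);
}

// Host helper (hypothetical): run one add on the GPU and copy the result back.
u128 add128_on_device(u128 a, u128 b)
{
    u128 r = {{0u, 0u, 0u, 0u}};
    u128 *d_out = 0;
    cudaMalloc((void **)&d_out, sizeof(u128));
    add128_kernel<<<1, 1>>>(a, b, d_out);
    cudaMemcpy(&r, d_out, sizeof(u128), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return r;
}

int main()
{
    u128 a = {{0xFFFFFFFFu, 0xFFFFFFFFu, 0u, 0u}};  // 2^64 - 1
    u128 b = {{1u, 0u, 0u, 0u}};                    // 1
    u128 r = add128_on_device(a, b);                // expect 2^64
    printf("%08x %08x %08x %08x\n", r.w[3], r.w[2], r.w[1], r.w[0]);
    return 0;
}
```

With a working device, the sum 2^64 should show up as a 1 in the third limb and zeros elsewhere. The whole chain sits in one asm() statement so the compiler cannot put anything between the adds that would disturb the carry flag.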
But now we have docs on asm() and I can ask questions!
asm() is not discussed in the Programming Guide (it’s not even mentioned), but there’s a separate .pdf in the toolkit: “Using Inline PTX”.
Page 7 has an interesting but confusing section, “Incorrect Optimization”.
I see why the volatile can be useful for a clock query statement: you want the clock to be evaluated now since you’re likely bracketing some computation to time it.
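Here is a rough sketch of the bracketing pattern I have in mind. read_clock(), time_loop and measure_cycles() are hypothetical names for this post, and the loop is just filler work; only the mov.u32 from %clock and the volatile keyword come from the doc.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Read the per-SM cycle counter. "volatile" asks the compiler not to delete
// or move this statement when it generates PTX.
__device__ __forceinline__ unsigned int read_clock()
{
    unsigned int t;
    asm volatile("mov.u32 %0, %%clock;" : "=r"(t));
    return t;
}

__global__ void time_loop(float *data, int n, unsigned int *elapsed)
{
    unsigned int start = read_clock();

    // Filler computation to be timed.
    float acc = data[0];
    for (int i = 0; i < n; ++i)
        acc = acc * 1.0001f + 0.5f;
    data[0] = acc;

    unsigned int stop = read_clock();
    *elapsed = stop - start;  // unsigned subtraction tolerates one wraparound
}

// Host helper (hypothetical): returns true if the kernel ran without error.
bool measure_cycles(int n, unsigned int *cycles)
{
    float one = 1.0f;
    float *d_data = 0;
    unsigned int *d_elapsed = 0;
    cudaMalloc((void **)&d_data, sizeof(float));
    cudaMalloc((void **)&d_elapsed, sizeof(unsigned int));
    cudaMemcpy(d_data, &one, sizeof(float), cudaMemcpyHostToDevice);
    time_loop<<<1, 1>>>(d_data, n, d_elapsed);
    cudaError_t err = cudaMemcpy(cycles, d_elapsed, sizeof(unsigned int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_elapsed);
    return err == cudaSuccess;
}

int main()
{
    unsigned int cycles = 0;
    if (measure_cycles(1000, &cycles))
        printf("elapsed clock ticks: %u\n", cycles);
    return 0;
}
```

The number printed depends on the device and compiler, so treat it as a relative measure only.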
But I don’t understand the use of the “memory” qualifier. In the example, the destination is clearly a write (by mere syntax, “=r” shows it’s a write). Does “memory” refer specifically to writes to global or shared memory (as opposed to a register)? That doesn’t make sense with the example, though: the mov.u32 statement isn’t going to have any side effects on memory, so why the memory clobber?
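For concreteness, this is the form I’m asking about, sketched from the doc’s example as best I can reproduce it. The wrapper read_clock_clobber(), the kernel, and the host helper are hypothetical names added so it compiles on its own.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__device__ __forceinline__ unsigned int read_clock_clobber()
{
    unsigned int x;
    // "=r"(x): x is a 32-bit register output (the write).
    // The input list is empty, hence the two colons in a row.
    // "memory" sits in the clobber list; this is the part in question.
    asm volatile("mov.u32 %0, %%clock;" : "=r"(x) :: "memory");
    return x;
}

__global__ void sample_clock(unsigned int *out)
{
    *out = read_clock_clobber();
}

// Host helper (hypothetical): returns true if the kernel ran without error.
bool sample_clock_on_device(unsigned int *value)
{
    unsigned int *d_out = 0;
    cudaMalloc((void **)&d_out, sizeof(unsigned int));
    sample_clock<<<1, 1>>>(d_out);
    cudaError_t err = cudaMemcpy(value, d_out, sizeof(unsigned int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return err == cudaSuccess;
}

int main()
{
    unsigned int t = 0;
    if (sample_clock_on_device(&t))
        printf("clock sample: %u\n", t);
    return 0;
}
```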
I guess what I’m asking for is a good example where the memory clobber is needed, or an explanation of why it’s needed for this clock query example.