low level hardware documentation

Doesn’t show up under level 0. Tried looking for it in an sm_35 cubin and still no luck. Maybe I’m not using it right:

asm volatile("{\n\t"
            ".reg .u64 addr;\n\t"
            "cvta.global.u64 addr, %0;\n\t"
            "prefetchu.L1 [addr];\n\t" 
            "}" :: "l"(prefetch));

Doh… just realized that’s the constant version… hold on.

Ok, I see it now… of course adding that one instruction completely changed how my code was being unrolled. I hate how fickle ptxas is. Makes one want to reverse engineer it and write one’s own… oh wait… :)

Anyway, code is now slowed down by a factor of 4. Maybe I’ll wrestle with ptxas some more or maybe I’ll give up.

PREFETCHU should be the analog to LDU, and as far as I know uniform accesses were abolished after Fermi, so it would make sense that PREFETCHU got dropped on the floor by ptxas.

How did the loop change after you added PREFETCH? The unroll factor you can control manually, but there is no control over how the loop body is structured. The structure of the loop body may have changed because the injection of an “asm volatile” section reduced the degrees of freedom, possibly limiting code motion including load scheduling.
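For reference, here is a rough sketch of pinning the unroll factor by hand while injecting a prefetch hint. Everything in it is made up for illustration (the prefetch_l1 helper, the strided_sum kernel, the run_sum host wrapper, and the launch configuration), and it uses the generic-address prefetch.L1 form rather than prefetchu. Whether ptxas keeps the prefetch, and how it schedules the loads around it, is not something the source controls, so check the SASS with cuobjdump -sass before drawing conclusions.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: issue an L1 prefetch hint on a generic address via inline PTX.
__device__ __forceinline__ void prefetch_l1(const void* p)
{
    asm volatile("prefetch.L1 [%0];" :: "l"(p));
}

// Hypothetical kernel: each thread sums a grid-strided slice of `in`.
__global__ void strided_sum(const float* in, float* out, int n)
{
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    float acc  = 0.0f;

    // Request an unroll factor of 4; the layout of the unrolled body is still up to the compiler.
    #pragma unroll 4
    for (int i = tid; i < n; i += stride) {
        // Hint the element this thread will read on its next iteration, if there is one.
        if (i + stride < n)
            prefetch_l1(in + i + stride);
        acc += in[i];
    }
    atomicAdd(out, acc);
}

// Hypothetical host wrapper: fill the input with ones, run the kernel, return the total.
float run_sum(int n)
{
    float* in  = nullptr;
    float* out = nullptr;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;
    *out = 0.0f;

    strided_sum<<<4, 64>>>(in, out, n);  // arbitrary launch configuration
    cudaDeviceSynchronize();

    float result = *out;
    cudaFree(in);
    cudaFree(out);
    return result;
}
```

Comparing the SASS for a few unroll factors, with and without the prefetch_l1 call, should show whether the asm volatile statement is what restricts the load scheduling.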

Well I had to step out for a bit, but I’m not having any more luck with the instruction. To do this properly I’ll need to write some assembly, but I don’t really have the time for that at the moment. I think I’ll shelve this for a later time.