I have a little issue with my code (CUDA 2.1 beta, GF8600GT). When I look at the code in the profiler, I see an uncomfortable number of local memory accesses, but when I look at the PTX output, there is not a single ld.local or st.local in sight.
By playing around with -maxrregcount, I can get the number of accesses to go up and down, and the kernel time changes accordingly, so this doesn’t seem to be a profiler artefact. While building a minimal test case, I discovered that this seems to have something to do with loops that the compiler chokes on. My test code basically calls an eigenvector computation a number of times; if I let the for loop iterate only once, no local memory is used. If I iterate more than once, each iteration causes one load and four or more stores. My guess that the compiler is somehow choking on the loop comes from the fact that it flat-out refuses to unroll the loop when I add the unroll pragma.
If anyone has an idea what causes this behaviour, I’d be more than interested. It’s kinda unsatisfying to have to frob the number of registers till nice code is generated.
Ah yes, one more thing: there are a few bytes of local mem allocated in the cubin, so this isn’t just something that happens when the actual microcode is generated by the driver.
Yeah, it lists some local mem entries (lmem is 2 or 4, depending on -maxrregcount). What’s causing me trouble is that I cannot tell from the PTX where the local mem access happens. I also have no real idea how to remove it…
PTX is not machine code (even if it looks like one). The PTX manual says:
“This document describes PTX, a low-level parallel thread execution virtual machine and
instruction set architecture (ISA).”
Whether a given piece of PTX code will use lmem may depend on the device. For example, GT200 cards have twice the registers of G80/G92, so some variables that end up in registers on a GT200 may spill to local memory on a G80. That’s why you might not be able to tell reliably what your lmem usage is by looking at the PTX. Also, PTX is not yet the optimized code that hits your GPU: some registers might be reused and other modifications can follow. If you want a true register and lmem count, look at the .cubin files. And just to add more confusion, there are also .gpu files generated by the compiler. I don’t know what they are, to be honest :)
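For what it's worth, the same numbers that end up in the .cubin can also be read back at run time. The following is only a rough sketch, assuming a toolkit recent enough to provide `cudaFuncGetAttributes` (which may well not include the 2.1 beta discussed here). `demo_kernel` and `report_kernel_resources` are made-up names for illustration; swap in your own kernel.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel (not from this thread). The small per-thread
// array indexed with a runtime value is the kind of thing that may be placed
// in local memory; whether it actually is depends on compiler and target.
__global__ void demo_kernel(float *out, int n)
{
    float scratch[8];
    for (int i = 0; i < 8; ++i)
        scratch[i] = i * 0.5f;

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        out[tid] = scratch[tid & 7];   // index only known at run time
}

// Prints the per-thread register count and local-memory size of the compiled
// kernel as reported by the runtime. Returns the CUDA status code.
cudaError_t report_kernel_resources()
{
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, demo_kernel);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes failed: %s\n",
                cudaGetErrorString(err));
        return err;
    }
    printf("registers per thread : %d\n", attr.numRegs);
    printf("local mem per thread : %zu bytes\n", attr.localSizeBytes);
    printf("static shared mem    : %zu bytes\n", attr.sharedSizeBytes);
    return cudaSuccess;
}
```

Calling `report_kernel_resources()` once from host code is enough; no kernel launch is needed. Building with `nvcc --ptxas-options=-v` should print comparable register and lmem figures at compile time, which is handy for checking the effect of different register limits.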
Sure. But as I said in the original post, the compiler is already allocating local-mem for the loop variables it will spill. So this is surely not something that happens exclusively during the translation phase in the driver.
My question was more “how can I predict this” and “how can I prevent this”. Maybe it doesn’t happen on a 2xx, which would then solve my problem. As I’m working on something that will be a pretty big production system, I’m a bit concerned by the unpredictability of the compiler in this case. If I later need to make a small change which results in a lot of spillage and a doubling of my computation time, I’m not sure I’ll be happy to double our cluster size accordingly.
When I started out, the indexing was the problem. I then moved the matrix over to shared mem, which removed those local mem accesses completely. Note that if the compiler is using local mem because of array usage, it will be listed in the PTX, as you would expect. The accesses I am seeing now are almost certainly from loop counters and do not appear in the PTX. The reason I think it is spillage is that if I set -maxrregcount to 64, the reg count increases from 60 to 64 and a good chunk of the local mem accesses go away (cutting the computation time in half, basically).
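To illustrate the indexing case for anyone reading along, here is a minimal sketch and not the actual code from my project. The kernel names, `DIM`, `THREADS`, the `perm` index table and the `max_difference` helper are all made up for the example. Variant A keeps a per-thread matrix in an ordinary array and indexes it with a runtime value, which the compiler may put in local memory; variant B gives each thread its own column of a shared-memory array instead.

```cuda
#include <cassert>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

#define DIM 3        // assumed matrix size, for illustration only
#define THREADS 64   // assumed block size; the launch must use exactly this

// Variant A: per-thread array with runtime indexing. Registers are not
// addressable, so `m` may be placed in local memory.
__global__ void diag_sum_local(const float *in, const int *perm, float *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;
    float m[DIM * DIM];
    for (int i = 0; i < DIM * DIM; ++i)
        m[i] = in[tid * DIM * DIM + i];
    float acc = 0.0f;
    for (int i = 0; i < DIM; ++i) {
        int r = perm[i];              // index only known at run time
        acc += m[r * DIM + r];
    }
    out[tid] = acc;
}

// Variant B: same computation, but each thread owns column `t` of a shared
// array, so runtime indexing goes to shared memory instead.
__global__ void diag_sum_shared(const float *in, const int *perm, float *out, int n)
{
    __shared__ float m[DIM * DIM][THREADS];
    int t = threadIdx.x;
    int tid = blockIdx.x * blockDim.x + t;
    if (tid >= n) return;             // safe: no __syncthreads() is used
    for (int i = 0; i < DIM * DIM; ++i)
        m[i][t] = in[tid * DIM * DIM + i];
    float acc = 0.0f;
    for (int i = 0; i < DIM; ++i) {
        int r = perm[i];
        acc += m[r * DIM + r][t];
    }
    out[tid] = acc;
}

// Runs both variants on the same input and returns the largest absolute
// difference between their outputs, or -1 if a CUDA call failed.
float max_difference(int n)
{
    assert(n > 0);
    std::vector<float> h_in((size_t)n * DIM * DIM);
    for (size_t i = 0; i < h_in.size(); ++i)
        h_in[i] = 0.25f * (float)(i % 17);
    const int h_perm[DIM] = {2, 0, 1};

    float *d_in = 0, *d_a = 0, *d_b = 0;
    int *d_perm = 0;
    cudaMalloc((void **)&d_in, h_in.size() * sizeof(float));
    cudaMalloc((void **)&d_a, n * sizeof(float));
    cudaMalloc((void **)&d_b, n * sizeof(float));
    cudaMalloc((void **)&d_perm, sizeof(h_perm));
    cudaMemcpy(d_in, &h_in[0], h_in.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_perm, h_perm, sizeof(h_perm), cudaMemcpyHostToDevice);

    int blocks = (n + THREADS - 1) / THREADS;
    diag_sum_local<<<blocks, THREADS>>>(d_in, d_perm, d_a, n);
    diag_sum_shared<<<blocks, THREADS>>>(d_in, d_perm, d_b, n);

    std::vector<float> a(n), b(n);
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess)
        err = cudaMemcpy(&a[0], d_a, n * sizeof(float), cudaMemcpyDeviceToHost);
    if (err == cudaSuccess)
        err = cudaMemcpy(&b[0], d_b, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in); cudaFree(d_a); cudaFree(d_b); cudaFree(d_perm);
    if (err != cudaSuccess) return -1.0f;

    float worst = 0.0f;
    for (int i = 0; i < n; ++i)
        worst = fmaxf(worst, fabsf(a[i] - b[i]));
    return worst;
}
```

The shared array is laid out element-major (`m[i][t]`) so that neighbouring threads touch neighbouring words. Whether variant A really ends up in local memory depends on the compiler version and target, so compare the lmem figures of the two kernels rather than taking my word for it.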
Thanks for the compiler option. Makes reading PTXs much easier.