I’m trying to optimize some GPU code. The nvcc compiler reports 36 bytes of constant memory usage. As far as I know (please correct me if I’m wrong), the latency of a constant memory access is roughly the same as that of a global memory access (about 400-600 cycles); the only difference between the two is that constant memory is backed by a constant cache.
In my kernel I use a constant TILE_DIM (which has the value 16). The code also needs the value TILE_DIM-1, which I calculate in the kernel. Unfortunately the compiler puts this value in constant memory as well (probably because it is the same for all threads). This increases the number of 32-bit reads from about 150 to 550 (gld32 as reported by the Visual Profiler). I think (know) that the kernel would run faster if TILE_DIM-1 were stored in a register after being calculated from TILE_DIM. I have register space left, so the increase in register usage wouldn’t affect occupancy.
I’m pretty sure the kernel would run faster because my application is bandwidth-limited, not compute-limited.
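To make the pattern concrete, here is a minimal sketch in CUDA. It is not my actual kernel: the kernel `wrapToTile`, the host helper `runWrapToTile`, and the variable `tileMask` are made up for illustration. The assumption is that TILE_DIM is a compile-time macro, so TILE_DIM-1 is a literal expression that the compiler could fold and keep in a register or as an immediate. Whether nvcc really does that is exactly what I’m unsure about.

```cuda
#include <cuda_runtime.h>

#define TILE_DIM 16  // compile-time constant, value taken from the post

// Hypothetical kernel: writes each thread's global index wrapped to the tile width.
__global__ void wrapToTile(int *out, int n)
{
    // Literal expression (16 - 1); the hope is that it is folded at compile
    // time and held in a register instead of being read from memory.
    const int tileMask = TILE_DIM - 1;

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = i & tileMask;  // same as i % TILE_DIM because TILE_DIM is a power of two
}

// Hypothetical host helper: runs the kernel and copies n results back to host.
cudaError_t runWrapToTile(int *host, int n)
{
    int *dev = NULL;
    cudaError_t err = cudaMalloc((void **)&dev, n * sizeof(int));
    if (err != cudaSuccess)
        return err;

    int blocks = (n + TILE_DIM - 1) / TILE_DIM;  // round up to whole blocks
    wrapToTile<<<blocks, TILE_DIM>>>(dev, n);

    // cudaMemcpy waits for the kernel and also reports launch errors.
    err = cudaMemcpy(host, dev, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return err;
}
```

Compiling something like this with `nvcc -ptx` and reading the output, or with `--ptxas-options=-v` to see the reported constant memory usage, should show where the value actually ends up.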
I’m using Visual Studio 2005. The build rule lets me choose between four optimization options, but all four yield the same result. I also can’t really understand or modify the PTX code well enough to control the constant memory usage…
Does anyone have an idea?
Thanks very much!