nvcc and maximum local size - performance issue

I ran into a significant performance issue with some fairly complex code that I ported from a small piece of a supercomputer application to run on a C2050 GPU. I had previously ported another piece of the application successfully and obtained reasonable results. After profiling with computeprof, I determined that very little coalescing was happening for accesses to local variables. This kernel uses a large amount of local variable data (> 10k). I looked at the PTX files generated for the two kernels and noticed that the kernel that performed well used the following load and store instructions for the local variables:

ld.local.f64
st.local.f64

and for the kernel that ran slowly:

ld.f64
st.f64

So, the ld.f64/st.f64 instructions still seemed to generate “local” accesses, but in such a fashion that the data was not coalesced. I am not sure whether this was due to the instruction itself or to the way the data region used by the instruction was laid out; either way, the coalescing was not occurring. So I “hacked” in a change to reduce the size of the local variables, and the output PTX file then contained only ld.local.f64/st.local.f64 instructions and no ld.f64/st.f64 instructions. This new code ran 20 times faster, since it could now use coalescing. I also discovered that including printf increases the size of the local variables: by removing all calls to printf, I was likewise able to get nvcc to produce PTX files with only ld.local.f64 instructions.
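
For anyone who wants to reproduce the symptom, here is a minimal sketch (not my actual kernel, and the sizes are purely illustrative) of a kernel whose per-thread scratch array cannot fit in registers and therefore spills to “local” memory, together with the nvcc invocations I used to inspect the spill statistics and the generated PTX:

// Minimal spill sketch: a 2 KB per-thread array is far too large for registers,
// so nvcc places it in "local" memory (per-thread, but physically in global memory).
__global__ void spill_demo(const double *in, double *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    double scratch[256];                              // 256 * 8 bytes per thread
    for (int i = 0; i < 256; ++i)
        scratch[(i * 37 + tid) & 255] = in[tid] + i;  // dynamic indexing forces local memory

    double acc = 0.0;
    for (int i = 0; i < 256; ++i)
        acc += scratch[i];                            // ld.local.f64 (or plain ld.f64) in the PTX
    out[tid] = acc;
}

// Report per-thread local memory usage and spill counts at compile time:
//   nvcc -arch=sm_20 -Xptxas -v -c spill_demo.cu
// Keep the PTX to see which load/store variants were emitted:
//   nvcc -arch=sm_20 -ptx spill_demo.cu -o spill_demo.ptx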

So, my questions are:

  1. At what amount of local variable storage does nvcc start to generate ld.f64 instructions, causing the lack of coalescing?
  2. How much local variable storage does printf need?
  3. Is there a way to control the size at which nvcc switches to this other code generation method?

Any feedback would be greatly appreciated.

So the ld.f64 and st.f64 instructions would load and store from/to registers, whereas ld.local.f64 and st.local.f64 might not. How do you define “local memory”? “Local” variables are stored in the core’s registers as long as there is enough space for them; otherwise you go to the next memory level, which on my GPU is global (-: Dunno how it works on Tesla.

As to printf, its assembly code is enormous. But it should be; that’s an incredibly powerful function.

I should have been clearer in my initial description; I was sloppy in my use of the word “local”. “Local” memory has a specific definition within CUDA: it is memory that is specific (i.e. local) to each thread. As you mentioned, where that “local” memory physically resides can be anything but local (i.e. it sits in global memory). On the Fermi GPU (the chip used in the C2050), “local” memory still resides in global memory, but thanks to the L1/L2 caches many accesses to “local” memory are served from cache at the same access speed as the GPU’s shared memory. My use of the term “local variables” made my description of the issue less clear.

So, yes, variables that belong to the subroutine (i.e. local variables) are generally placed in registers, except when there are not enough registers to hold them, in which case they are stored in “local” memory (NVIDIA’s term for this is register spilling). It is these accesses to “local” memory due to spilling that I am referring to above; they go through the ld.local.f64/st.local.f64 instructions, which are transfers between “local” memory and registers.
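
Incidentally, on Fermi you can bias the on-chip memory split toward L1, which gives spilled “local” data more cache to live in. A minimal sketch, assuming a hypothetical kernel named myKernel (not the actual routine from my port):

#include <cuda_runtime.h>

// Stand-in for the real kernel that spills registers to "local" memory.
__global__ void myKernel(double *data)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    data[tid] += 1.0;
}

void configure_and_launch(double *d_data, int nBlocks, int nThreads)
{
    // Fermi splits 64 KB of on-chip memory per multiprocessor between L1 and shared
    // memory; preferring 48 KB of L1 helps cached accesses to spilled "local" data.
    cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);
    myKernel<<<nBlocks, nThreads>>>(d_data);
}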

So, with that clarification of my initial description, here are my questions again, restated and with one addition:

  1. At what quantity of “local” memory used for register spilling does nvcc start to generate ld.f64 instructions, causing the lack of coalescing?

  2. How much register spilling is there in printf, i.e. how much “local” memory does printf use?

  3. Is there a way to tell nvcc how much “local” memory it may use before it switches to this other code generation method?

  4. Does calling the printf subroutine cause significant extra register spilling in the code that makes the call, and therefore a significant increase in its “local” memory requirements?

  1. This is hardware dependent. The number of registers per core changes with each generation. For example: http://en.wikipedia.org/wiki/GeForce_400_Series#Current_limitations_and_trade-offs
  2. Dunno. I guess you can manipulate it a little bit with CU_LIMIT_PRINTF_FIFO_SIZE? (There is a small sketch of this at the end of this post.)
  3. I’m not aware of such a method, and I would guess it doesn’t exist, since it might lead to runtime exceptions.
  4. I do believe so, yes (-:
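
To make 2 and 3 a bit more concrete, here is a rough sketch of the knobs I know of. The kernel name and the numbers are made up. As far as I know, cudaLimitPrintfFifoSize (CU_LIMIT_PRINTF_FIFO_SIZE in the driver API) only resizes the device-side printf output buffer, not the local memory the printf machinery itself needs, and __launch_bounds__ / --maxrregcount do not set a local memory size directly either; they cap registers per thread, which in turn changes how much spilling the compiler does:

#include <cstdio>
#include <cuda_runtime.h>

// Requiring at least 4 resident blocks of 256 threads per multiprocessor caps the
// registers available per thread, so the compiler may spill more to "local" memory.
// (Alternatively, pass --maxrregcount=N on the nvcc command line.)
__global__ void __launch_bounds__(256, 4) boundedKernel(double *data)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    data[tid] *= 2.0;
    printf("thread %d done\n", tid);   // device-side printf needs sm_20 or later
}

int main()
{
    // Enlarge the printf FIFO (output buffer) to 8 MB before any printing kernel runs.
    // On older toolkits the call is cudaThreadSetLimit instead of cudaDeviceSetLimit.
    cudaDeviceSetLimit(cudaLimitPrintfFifoSize, 8 * 1024 * 1024);

    // ... allocate device memory and launch boundedKernel as usual ...
    return 0;
}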