I ran into a significant performance problem with some fairly complex code that I ported to a C2050 GPU from a small piece of a supercomputer application. I had already successfully ported another piece of the application and obtained reasonable results. Using the computeprof application, I determined that very little coalescing was happening for accesses to local variables. This kernel uses a large amount of local variable storage (> 10k). I compared the PTX files generated for the two kernels and noticed that in the kernel that performed well, the following load and store instructions were used for the local variables:
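(Only the opcodes matter here; the operands are elided.)

```
ld.local.f64 ...
st.local.f64 ...
```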
and for the kernel that ran slowly:
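(Again, operands elided — note the missing `.local` state-space qualifier.)

```
ld.f64 ...
st.f64 ...
```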
So, the ld.f64/st.f64 instructions still seemed to be producing "local" accesses, but in such a fashion that the data was not coalesced. I am not sure whether this was due to the instructions themselves or to how the data region they access was generated; either way, coalescing was not occurring. As an experiment, I "hacked" in a change to reduce the size of the local variables, and the resulting PTX file contained only ld.local.f64/st.local.f64 instructions and no ld.f64/st.f64. The new code ran 20 times faster, since its accesses could now be coalesced. I also discovered that calling printf increases local variable usage: by removing all printf calls, I was likewise able to get nvcc to emit PTX containing only ld.local.f64 instructions.
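To illustrate the kind of change involved, here is a simplified sketch — the kernel name, array sizes, and access pattern are made up for illustration and are not the actual application code:

```cuda
// Hypothetical sketch of the pattern described above (names/sizes invented).
// In my case, once per-thread local storage grew past some threshold, nvcc
// emitted generic ld.f64/st.f64 instead of ld.local.f64/st.local.f64, and
// the accesses stopped coalescing.
__global__ void bigLocals(double *out, int n)
{
    double scratch[2048];   // ~16 KB per thread: slow path (generic ld.f64/st.f64)
    // double scratch[512]; // ~4 KB per thread: after the "hack", only ld.local.f64

    for (int i = 0; i < 2048; ++i)
        scratch[i] = i * 0.5;  // per-thread local-memory traffic

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        out[tid] = scratch[tid % 2048];
}
```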
So, my questions are:
- At what local-variable size does nvcc start to generate ld.f64 instructions, causing the loss of coalescing?
- How much local variable storage does printf need?
- Is there a way to control the size at which nvcc switches to this other code-generation method?
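In case it helps anyone reproduce the measurement, this is roughly how I have been inspecting the generated code (file names are placeholders for my actual sources):

```shell
# Report per-thread local memory (lmem) and register usage at compile time
nvcc -arch=sm_20 -Xptxas -v -c kernel.cu

# Emit PTX and count local vs. generic double-precision loads
nvcc -arch=sm_20 -ptx kernel.cu -o kernel.ptx
grep -c 'ld\.local\.f64' kernel.ptx
grep -c 'ld\.f64' kernel.ptx
```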
Any feedback would be greatly appreciated.