Hi,
I am running into a problem when compiling CUDA C code with the
nvcc option -G0, which I need in order to debug the kernel remotely with the NVIDIA Nsight tool.
When I compile the .cu file without -G0, ptxas reports that the function uses 51 registers and
128 + 0 bytes of local memory. With the flag turned on, however, it reports 26 registers
and 92896 + 0 bytes of local memory, which is a huge increase, I think…
ptxas then reports an error, because the maximum size of local memory is 16384 bytes, and the compilation process is terminated.
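For reference, the two builds can be compared with something like the commands below. This is only a sketch: kernel.cu and the output file names are placeholders, and --ptxas-options=-v is the nvcc option that makes ptxas print the register and local memory usage per kernel.

```shell
# Release build: ptxas prints registers and local memory (lmem) per kernel
nvcc -arch=sm_13 --ptxas-options=-v -c kernel.cu -o kernel.obj

# Debug build for Nsight: same command with -G0 added
nvcc -arch=sm_13 --ptxas-options=-v -G0 -c kernel.cu -o kernel_debug.obj
```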
Any suggestions or general hints on how to reduce the amount of memory needed for debug symbols (or whatever else requires this much space) are very welcome.
I’m kind of stuck at the moment.
System configuration:
CUDA 3.0
Windows 7 x64
Visual Studio 9 (2008) x64
The driver version etc. should not be relevant because I’m compiling on the host machine (from Nsight’s point of view). However,
the compiler option -arch=sm_13 is used to instruct ptxas to build for a target device with compute capability 1.3.
I’m not quite sure exactly what information one would need in order to answer my question, so
let me know if you need any specific info.
Thanks in advance.
Tobi
[EDIT]
I narrowed the problem down to a switch-case block. When I leave this block out, the memory required with debug symbols is 1680 bytes.
However, at the moment I don’t see any other way to implement this. I also tried replacing the switch-case block with an if-else chain but, as expected,
the result is the same: it again requires about 92896 bytes of local memory…