I know; in fact, I also use the -deviceemu option. Anyway, I have just tried your solution, and these are the results:
nvcc […] -deviceemu -g -G main.cu → same problem
nvcc […] -deviceemu -G main.cu → I cannot set a breakpoint in the kernel code, and I cannot step into the kernel either
Note that I don’t use cuda-gdb but standard gdb, because the program runs on my own PC and I’ve read that cuda-gdb completely blocks the video card while debugging.
Sorry, everyone.
I don’t know how to reproduce my problem. It only appears in the code that I’m working on. It’s still a work in progress and I’m not allowed to publish the source code.
I can only say that the kernel uses many registers (ptxas reported 22 in the last compile), one fixed-size three-dimensional shared-memory array per block, and a variable-size grid of blocks.
I know this is probably not enough to track down the problem, but I can’t say anything more.
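For illustration only, a kernel with the general shape described above might look like the sketch below. It is not the code in question: the names `tileKernel` and `runTileDemo`, the tile dimensions, and the doubling operation are all hypothetical, and it is written against the current CUDA runtime API. It only shows a fixed-size three-dimensional shared array per block combined with a grid whose size is computed at run time.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical tile dimensions; the original post does not state them.
#define TILE_X 8
#define TILE_Y 8
#define TILE_Z 4

// Hypothetical kernel: each block stages its elements in a fixed-size
// three-dimensional shared array, then writes them back doubled.
__global__ void tileKernel(const float *in, float *out, int n)
{
    __shared__ float tile[TILE_Z][TILE_Y][TILE_X];

    // Flatten the 3D thread index, then offset by the block index.
    int local = (threadIdx.z * TILE_Y + threadIdx.y) * TILE_X + threadIdx.x;
    int idx = blockIdx.x * (TILE_X * TILE_Y * TILE_Z) + local;

    tile[threadIdx.z][threadIdx.y][threadIdx.x] = (idx < n) ? in[idx] : 0.0f;
    __syncthreads();  // reached by every thread in the block

    if (idx < n)
        out[idx] = 2.0f * tile[threadIdx.z][threadIdx.y][threadIdx.x];
}

// Host helper: the block shape is fixed, the grid size depends on n.
// Returns the number of mismatching elements, or -1 on a CUDA error.
int runTileDemo(int n)
{
    std::vector<float> hIn(n), hOut(n, 0.0f);
    for (int i = 0; i < n; ++i)
        hIn[i] = (float)i;

    size_t bytes = (size_t)n * sizeof(float);
    float *dIn = NULL;
    float *dOut = NULL;
    if (cudaMalloc((void **)&dIn, bytes) != cudaSuccess)
        return -1;
    if (cudaMalloc((void **)&dOut, bytes) != cudaSuccess) {
        cudaFree(dIn);
        return -1;
    }
    cudaMemcpy(dIn, hIn.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE_X, TILE_Y, TILE_Z);
    int perBlock = TILE_X * TILE_Y * TILE_Z;
    dim3 grid((n + perBlock - 1) / perBlock);  // variable-size grid

    tileKernel<<<grid, block>>>(dIn, dOut, n);
    cudaError_t err = cudaDeviceSynchronize();
    if (err == cudaSuccess)
        err = cudaMemcpy(hOut.data(), dOut, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dIn);
    cudaFree(dOut);
    if (err != cudaSuccess)
        return -1;

    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (hOut[i] != 2.0f * hIn[i])
            ++mismatches;
    return mismatches;
}
```

Calling `runTileDemo(1000)` from a `main` launches four blocks of 256 threads. A stripped-down stand-in like this can be posted and debugged when the real source cannot.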