Different output when compiled for emulation, device, and device with -g -G

My code produces different output when compiled in emulation mode, in device mode, and in device mode with the -g -G debugging flags. The output is very close to the known correct output in emulation mode and in device mode with debug info, but on the 11th and 12th iterations it is way off when run in plain device mode. (Emulation with and without -g -G produces identical output.)

Emulation:

[codebox]There are 4 devices supporting CUDA

Device 0: "Tesla C1060"
  CUDA Driver Version:                           2.30
  CUDA Runtime Version:                          2.30
  CUDA Capability Major revision number:         1
  CUDA Capability Minor revision number:         3
  Total amount of global memory:                 4294705152 bytes
  Number of multiprocessors:                     30
  Number of cores:                               240
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 16384
  Warp size:                                     32
  Maximum number of threads per block:           512
  Maximum sizes of each dimension of a block:    512 x 512 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          262144 bytes
  Texture alignment:                             256 bytes
  Clock rate:                                    1.44 GHz
  Concurrent copy and execution:                 Yes
  Run time limit on kernels:                     No
  Integrated:                                    No
  Support host page-locked memory mapping:       Yes
  Compute mode:                                  Default (multiple host threads can use this device simultaneously)

Devices 1, 2, and 3: "Tesla C1060" (identical properties to Device 0)

Test PASSED[/codebox]

I’ve tried using the Linux debugger, but it keeps crashing on me, so I’ve been unable to step through the code while it’s running on the device. At any rate, it seems really odd to me that including the debug info should change the results (the results are consistent between runs), and it really makes me think I’ve screwed something up somewhere. Anyone have any ideas as to what might be causing this behavior? :blink:

Thinking that this might be caused by compiler optimization, I’ve tried compiling with:

$ nvcc main.cu -arch=sm_13 -o testnoopt.o -O0

and

$ nvcc main.cu -arch=sm_13 -o testnoopt.o -Xptxas -O0

and even

$ nvcc main.cu -arch=sm_13 -o testnoopt.o -Xptxas -no-bb-merge

But it made no difference. It does seem to be the -G flag that makes the difference, though. Are there any other optimizations I can try turning off?

So no one has any idea why putting device debug info into the build results in correct output while excluding it results in wrong output, or how to overcome the problem?

/me pokes NVIDIA with a stick…

I’ve gone through the best practices guide, the debugger guide, the programming guide, etc. I’ve tried turning off optimization in both the host compiler and the PTX assembler (see the commands below), and nothing seems to be working. I did a verbose build both with and without the device debug info, compared the two, and was unable to find any extra flags that would do anything to solve the problem.
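To be explicit, the optimization-disabling variants I’ve tried look roughly like this (-Xcompiler is nvcc’s pass-through to the host compiler, as I understand it):

[codebox]# Host compiler (gcc) optimization off:
$ nvcc main.cu -arch=sm_13 -Xcompiler -O0 -o testnoopt.o

# PTX assembler (ptxas) optimization off:
$ nvcc main.cu -arch=sm_13 -Xptxas -O0 -o testnoopt.o[/codebox]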

I’ve pretty much run out of ideas as far as troubleshooting goes, so some good ole NVIDIA guidance would be much appreciated.

Sorry to keep bugging the forum about this.

Thanks,

Paul

Did you try the following?

  • Run nvcc with the --dryrun option, with and without -G. This will print all the intermediate compilation commands that nvcc uses.
  • Note the differences between the command-line options of the two versions (especially during the nvopencc and ptxas phases).
  • Then run each command manually in order, trying various combinations of parameters until you find exactly which compiler flag causes the difference in behavior (a sketch of the workflow follows this list).
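Something like this, assuming --dryrun writes its command list to stderr (file names are just placeholders):

[codebox]# Capture the tool invocations with and without device debug info:
$ nvcc --dryrun -arch=sm_13 main.cu -o test 2> cmds_nodebug.txt
$ nvcc --dryrun -arch=sm_13 -g -G main.cu -o test 2> cmds_debug.txt

# The interesting lines are the nvopencc and ptxas invocations:
$ diff cmds_nodebug.txt cmds_debug.txt[/codebox]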

Good luck. ;)
(My bet is that the culprit is the -O0 flag to nvopencc.)
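If that is it, forwarding the flag directly to nvopencc might confirm the theory (an untested guess on my part):

[codebox]# Pass -O0 to nvopencc only, leaving the other stages alone:
$ nvcc main.cu -arch=sm_13 -Xopencc -O0 -o testnoopt.o[/codebox]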

Currently, -g -G and -O0 are not the same thing in terms of what the compiler will generate. I’ll ping some people to see if they can compare the code generated from -O0 and -g -G for hints, but it’s far more likely that they will murder me for showing them a CUDA kernel generated primarily through f2c.

If you can narrow things down to a repro case that doesn’t have a 196KB device function associated with it, that would probably be really helpful…
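In the meantime, one quick way to see what the front end emits differently is to stop compilation at the PTX stage and diff the two versions (hypothetical file names):

[codebox]$ nvcc -arch=sm_13 -ptx main.cu -o nodebug.ptx
$ nvcc -arch=sm_13 -G -ptx main.cu -o debug.ptx
$ diff nodebug.ptx debug.ptx[/codebox]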

tmurray: Haha, sorry about that. I hated to use f2c, but it was my last resort when my efforts at doing a quality conversion showed signs of taking about ten times as much time as I had left for the project. Fortunately, I was able to strip all the f2c dependencies out, but there are still those GOTOs… I will see if I can locate the offending portion and strip everything else out so we have a good example.

Sylvain: I’ll go ahead and try your suggestion, though I expect it won’t give me any more info than when I did the verbose build and compared the output. Neither sending -O0 to nvcc nor to ptxas made any difference; the only thing I’ve been able to identify as significant is the -G flag. I’ll definitely give it a shot, though.

Thanks both for your responses,

Paul

EDIT

Problem solved. It turned out to be the optimization in nvopencc. When I added -Xopencc -O0, I got correct output. I’m going to keep playing with it to see whether I can use -O1, -O2, or -O3 without the output going wonky, but this is good regardless.
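For anyone who lands here later, the build line that gives me correct output (same source as before):

[codebox]$ nvcc main.cu -arch=sm_13 -Xopencc -O0 -o test.o

# Next step: see whether a less drastic level still behaves correctly
$ nvcc main.cu -arch=sm_13 -Xopencc -O1 -o test.o
$ nvcc main.cu -arch=sm_13 -Xopencc -O2 -o test.o[/codebox]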

Thanks again for your help, guys. :)

Dear PTThompson,

Thanks a lot for the solution, it helped me a lot. :thumbup: But I wonder: should such behavior be considered a bug in CUDA somewhere, or a bug in the code we run?
