Code works under emulation, but fails on the device

Ok, so I’ve got code that compiles fine for both device and emulation, and works correctly under emulation, but in device mode it spits out incorrect data.

Proper output when in emulation mode (compiled with ‘nvcc main.cu -arch=sm_13 -o test8 -deviceemu’):

[codebox]CUDA Device Query (Runtime API) version (CUDART static linking)

There are 4 devices supporting CUDA

Device 0: “Tesla C1060”

CUDA Driver Version: 2.30

CUDA Runtime Version: 2.30

CUDA Capability Major revision number: 1

CUDA Capability Minor revision number: 3

Total amount of global memory: 4294705152 bytes

Number of multiprocessors: 30

Number of cores: 240

Total amount of constant memory: 65536 bytes

Total amount of shared memory per block: 16384 bytes

Total number of registers available per block: 16384

Warp size: 32

Maximum number of threads per block: 512

Maximum sizes of each dimension of a block: 512 x 512 x 64

Maximum sizes of each dimension of a grid: 65535 x 65535 x 1

Maximum memory pitch: 262144 bytes

Texture alignment: 256 bytes

Clock rate: 1.44 GHz

Concurrent copy and execution: Yes

Run time limit on kernels: No

Integrated: No

Support host page-locked memory mapping: Yes

Compute mode: Default (multiple host threads can use this device simultaneously)

Device 1: “Tesla C1060”

CUDA Driver Version: 2.30

CUDA Runtime Version: 2.30

CUDA Capability Major revision number: 1

CUDA Capability Minor revision number: 3

Total amount of global memory: 4294705152 bytes

Number of multiprocessors: 30

Number of cores: 240

Total amount of constant memory: 65536 bytes

Total amount of shared memory per block: 16384 bytes

Total number of registers available per block: 16384

Warp size: 32

Maximum number of threads per block: 512

Maximum sizes of each dimension of a block: 512 x 512 x 64

Maximum sizes of each dimension of a grid: 65535 x 65535 x 1

Maximum memory pitch: 262144 bytes

Texture alignment: 256 bytes

Clock rate: 1.44 GHz

Concurrent copy and execution: Yes

Run time limit on kernels: No

Integrated: No

Support host page-locked memory mapping: Yes

Compute mode: Default (multiple host threads can use this device simultaneously)

Device 2: “Tesla C1060”

CUDA Driver Version: 2.30

CUDA Runtime Version: 2.30

CUDA Capability Major revision number: 1

CUDA Capability Minor revision number: 3

Total amount of global memory: 4294705152 bytes

Number of multiprocessors: 30

Number of cores: 240

Total amount of constant memory: 65536 bytes

Total amount of shared memory per block: 16384 bytes

Total number of registers available per block: 16384

Warp size: 32

Maximum number of threads per block: 512

Maximum sizes of each dimension of a block: 512 x 512 x 64

Maximum sizes of each dimension of a grid: 65535 x 65535 x 1

Maximum memory pitch: 262144 bytes

Texture alignment: 256 bytes

Clock rate: 1.44 GHz

Concurrent copy and execution: Yes

Run time limit on kernels: No

Integrated: No

Support host page-locked memory mapping: Yes

Compute mode: Default (multiple host threads can use this device simultaneously)

Device 3: “Tesla C1060”

CUDA Driver Version: 2.30

CUDA Runtime Version: 2.30

CUDA Capability Major revision number: 1

CUDA Capability Minor revision number: 3

Total amount of global memory: 4294705152 bytes

Number of multiprocessors: 30

Number of cores: 240

Total amount of constant memory: 65536 bytes

Total amount of shared memory per block: 16384 bytes

Total number of registers available per block: 16384

Warp size: 32

Maximum number of threads per block: 512

Maximum sizes of each dimension of a block: 512 x 512 x 64

Maximum sizes of each dimension of a grid: 65535 x 65535 x 1

Maximum memory pitch: 262144 bytes

Texture alignment: 256 bytes

Clock rate: 1.44 GHz

Concurrent copy and execution: Yes

Run time limit on kernels: No

Integrated: No

Support host page-locked memory mapping: Yes

Compute mode: Default (multiple host threads can use this device simultaneously)

Test PASSED

[/codebox]

EDIT

I’ve removed the files since I’ve resolved the issue, and have updated the code accordingly. For what it’s worth, the problem was that the code (generated by f2c, no less) changed pointers so that they no longer pointed to legal memory addresses, and when these were fed back to the kernel on the second and later iterations, it barfed.

/EDIT

I’m seriously kinda losing my mind over this program. :/

Thanks much,

Paul

Ok, this is odd. It looks like it’s not even getting into the kernel on the second call to the kernel. I set up some error codes and passed an int back and forth, and they prove it. WTF is going on here?

EDIT I thought for a moment that it might be due to the asynchronous execution of kernel calls, so I threw in a __syncthreads(); call, but that didn’t change anything. :argh: <img src=‘http://hqnveipbwb20/public/style_emoticons/<#EMO_DIR#>/crying.gif’ class=‘bbc_emoticon’ alt=’:’(’ />

check the return value of cudaGetLastError immediately after the kernel launches.

int error = cudaGetLastError(); and
int error2 = cudaThreadSynchronize();

right after the kernel call both give a value of zero for the first itteration, and 4 for all subsequent iterations. That’s a ‘cudaErrorLaunchFailure’, isn’t it? If that’s right, then that really doesn’t tell me anything new…