Code works under emulation, but fails on the device

PTThompson · July 30, 2009, 5:22pm

Ok, so I’ve got code that compiles fine for both device and emulation, and works correctly under emulation, but in device mode it spits out incorrect data.

Proper output when in emulation mode (compiled with ‘nvcc main.cu -arch=sm_13 -o test8 -deviceemu’):

[codebox]CUDA Device Query (Runtime API) version (CUDART static linking)
There are 4 devices supporting CUDA

Device 0: "Tesla C1060"
CUDA Driver Version: 2.30
CUDA Runtime Version: 2.30
CUDA Capability Major revision number: 1
CUDA Capability Minor revision number: 3
Total amount of global memory: 4294705152 bytes
Number of multiprocessors: 30
Number of cores: 240
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 16384
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 262144 bytes
Texture alignment: 256 bytes
Clock rate: 1.44 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: No
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Default (multiple host threads can use this device simultaneously)

Device 1: "Tesla C1060"
CUDA Driver Version: 2.30
CUDA Runtime Version: 2.30
CUDA Capability Major revision number: 1
CUDA Capability Minor revision number: 3
Total amount of global memory: 4294705152 bytes
Number of multiprocessors: 30
Number of cores: 240
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 16384
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 262144 bytes
Texture alignment: 256 bytes
Clock rate: 1.44 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: No
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Default (multiple host threads can use this device simultaneously)

Device 2: "Tesla C1060"
CUDA Driver Version: 2.30
CUDA Runtime Version: 2.30
CUDA Capability Major revision number: 1
CUDA Capability Minor revision number: 3
Total amount of global memory: 4294705152 bytes
Number of multiprocessors: 30
Number of cores: 240
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 16384
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 262144 bytes
Texture alignment: 256 bytes
Clock rate: 1.44 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: No
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Default (multiple host threads can use this device simultaneously)

Device 3: "Tesla C1060"
CUDA Driver Version: 2.30
CUDA Runtime Version: 2.30
CUDA Capability Major revision number: 1
CUDA Capability Minor revision number: 3
Total amount of global memory: 4294705152 bytes
Number of multiprocessors: 30
Number of cores: 240
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 16384
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 262144 bytes
Texture alignment: 256 bytes
Clock rate: 1.44 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: No
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Default (multiple host threads can use this device simultaneously)

Test PASSED
[/codebox]

EDIT

I’ve removed the files since I’ve resolved the issue, and have updated the code accordingly. For what it’s worth, the problem was that the code (generated by f2c, no less) changed pointers so that they no longer pointed to legal memory addresses, and when these were fed back to the kernel on the second and later iterations, it barfed.

/EDIT
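
The offending code was removed from the thread, so the following is only a guess at the general shape of the problem, not a reconstruction of it. f2c typically emits a "parameter adjustment" such as `--x;` so that Fortran's 1-based `x[1]` maps to the first element. That is harmless while the adjusted pointer is a local copy, but it becomes a problem if the adjusted value is written back somewhere and reused on the next launch. A minimal sketch, assuming hypothetical names `scale_f2c`, `iterate`, and `run_iterations`, and using `cudaDeviceSynchronize` (on CUDA 2.3 the equivalent call is `cudaThreadSynchronize`):

```cuda
#include <cuda_runtime.h>

// Hypothetical f2c-style routine (not the poster's code).
// The "--x" adjustment only changes this function's local copy of the pointer.
__device__ void scale_f2c(float *x, int n, float a)
{
    --x;                               // f2c parameter adjustment: x[1] is now the first element
    for (int i = 1; i <= n; ++i)
        x[i] *= a;
}

// The pointer is passed by value, so the host's copy is never modified.
__global__ void iterate(float *x, int n, float a)
{
    if (blockIdx.x == 0 && threadIdx.x == 0)
        scale_f2c(x, n, a);
}

// Every iteration launches with the same, untouched base pointer d_x.
cudaError_t run_iterations(float *d_x, int n, int iters)
{
    for (int it = 0; it < iters; ++it) {
        iterate<<<1, 1>>>(d_x, n, 2.0f);
        cudaError_t err = cudaGetLastError();      // launch-time errors
        if (err != cudaSuccess) return err;
        err = cudaDeviceSynchronize();             // errors raised while the kernel ran
        if (err != cudaSuccess) return err;
    }
    return cudaSuccess;
}
```

The point of the sketch is only that the base device pointer held by the host stays the same from one launch to the next; any adjusted pointer lives and dies inside a single call.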

I’m seriously kinda losing my mind over this program. :/

Thanks much,

Paul

PTThompson · July 30, 2009, 8:29pm

Ok, this is odd. It looks like it’s not even getting into the kernel on the second call to the kernel. I set up some error codes and passed an int back and forth, and they prove it. WTF is going on here?

EDIT: I thought for a moment that it might be due to the asynchronous execution of kernel calls, so I threw in a __syncthreads(); call, but that didn’t change anything. :'(

tmurray · July 30, 2009, 8:46pm

check the return value of cudaGetLastError immediately after the kernel launches.
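
A minimal sketch of that check, assuming a hypothetical kernel `step` and helper `launch_checked` (neither comes from the original code). It uses `cudaDeviceSynchronize`, the newer name for `cudaThreadSynchronize`; on a CUDA 2.3 toolkit substitute the older call. Note that `__syncthreads()` is only a barrier between the threads of one block inside a kernel, so it does not make the host wait for the kernel to finish.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel: adds 1 to every element.
__global__ void step(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1.0f;
}

// Launch once and report both launch-time and execution-time failures.
cudaError_t launch_checked(float *d_data, int n)
{
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    step<<<blocks, threads>>>(d_data, n);

    // Errors detected when the launch is issued (e.g. a bad configuration).
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "launch error: %s\n", cudaGetErrorString(err));
        return err;
    }

    // The launch is asynchronous; errors raised while the kernel runs
    // only surface once the host waits for it.
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        fprintf(stderr, "execution error: %s\n", cudaGetErrorString(err));
    return err;
}
```

Calling `launch_checked` once per iteration shows which launch is the first to fail and whether the failure came from the launch itself or from the kernel's execution.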

PTThompson · July 30, 2009, 11:02pm

int error = cudaGetLastError(); and
int error2 = cudaThreadSynchronize();

right after the kernel call both give a value of zero for the first iteration, and 4 for all subsequent iterations. That’s a ‘cudaErrorLaunchFailure’, isn’t it? If that’s right, then that really doesn’t tell me anything new…
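
Rather than reading the status as a bare `int`, it may be easier to keep it as a `cudaError_t` and let the runtime name it with `cudaGetErrorString`; the numeric values live in the toolkit's headers and are not something to rely on across versions. A small sketch, where the helper name `report` is an assumption for illustration:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: print a runtime status by name instead of by number.
// Returns true when the call succeeded.
bool report(const char *what, cudaError_t err)
{
    if (err == cudaSuccess)
        return true;
    fprintf(stderr, "%s: %s (code %d)\n", what, cudaGetErrorString(err), (int)err);
    return false;
}
```

Usage would look like `report("kernel launch", cudaGetLastError());` followed by `report("sync", cudaThreadSynchronize());`. A launch failure generally means the kernel faulted while running, and dereferencing an invalid device pointer is one of the common causes.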
