Bug with pointer arithmetic: EmuDebug/EmuRelease results don’t match Debug/Release results

I have come across the following issue when using pointer arithmetic and increment operators on the device. In C compiled on the host, the following line of code effectively does nothing: it assigns to the location pointed to by p the value already stored at that location, then increments the pointer and proceeds…

for(i = 0; i < length_of_p_data; i++){
    *p++ = *p;
}

where p is a pointer to some data of length length_of_p_data.

However, when code like this is executed on the device, it has the following effect.

for(i = 0; i < length_of_p_data; i++){
    p[i] = p[i+1];
}

If the device code is compiled in emulation mode (EmuDebug/EmuRelease), however, it has the expected effect.

for(i = 0; i < length_of_p_data; i++){
    p[i] = p[i];
}

The following test program illustrates the problem in a somewhat roundabout way. The program should simply fill g_odata with each element’s index. The problem only occurs when a single statement both reads from and assigns to the same memory location.

The code launches a kernel with one block of one thread and has it fill an array of 32 elements with their corresponding indices.

I’m not sure whether this behavior is defined or undefined in the C standard, but the results are always the same when run on the CPU and consistently different when run on the GPU.

[codebox]// includes, system
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>

// includes, project
#include <cutil.h>

////////////////////////////////////////////////////////////////////////////////
// declaration, forward
void runTest( int argc, char** argv);

// We launch one block of one thread and have it process 32 elements in a loop
__global__ void
testKernel(float* g_odata)
{
    int i;
    float temp_data[32];

    // initialize local memory
    for(i = 0; i < 32; i++){
        temp_data[i] = i;
    }

    // copy each element onto itself via the post-incremented pointer
    float *p_odata = &temp_data[0];
    for( i = 0; i < 32; ++i)
    {
        *p_odata++ = *p_odata;
    }

    // write the result out to global memory
    p_odata = &temp_data[0];
    for(i = 0; i < 32; ++i)
    {
        g_odata[i] = p_odata[i];
    }
}

void
computeGold( float* reference)
{
    // perform the same computation with pointer arithmetic on the host
    int i;
    float temp_data[32];
    for(i = 0; i < 32; i++){
        temp_data[i] = (float)i;
    }

    float *p_odata = &temp_data[0];
    for( i = 0; i < 32; ++i)
    {
        *p_odata++ = *p_odata;
    }

    p_odata = &temp_data[0];
    for(i = 0; i < 32; ++i)
    {
        reference[i] = p_odata[i];
    }
}

////////////////////////////////////////////////////////////////////////////////
// Program main
////////////////////////////////////////////////////////////////////////////////
int
main( int argc, char** argv)
{
    runTest( argc, argv);
    CUT_EXIT(argc, argv);
}

////////////////////////////////////////////////////////////////////////////////
//! Run a simple test for CUDA
////////////////////////////////////////////////////////////////////////////////
void
runTest( int argc, char** argv)
{
    CUT_DEVICE_INIT(argc, argv);

    unsigned int num_threads = 1;
    unsigned int num_elements_per_thread = 32;
    unsigned int mem_size = num_elements_per_thread*num_threads*sizeof(float);

    // allocate device memory for result
    float* d_odata;
    CUDA_SAFE_CALL( cudaMalloc( (void**) &d_odata, mem_size));

    // setup execution parameters
    dim3 grid( 1, 1, 1);
    dim3 threads( num_threads, 1, 1);

    // execute the kernel
    testKernel<<< grid, threads >>>(d_odata);

    // check if kernel execution generated an error
    CUT_CHECK_ERROR("Kernel execution failed");

    // allocate mem for the result on host side
    float* h_odata = (float*) malloc( mem_size);

    // copy result from device to host
    CUDA_SAFE_CALL( cudaMemcpy( h_odata, d_odata, mem_size,
                                cudaMemcpyDeviceToHost) );

    // compute reference solution
    float* reference = (float*) malloc( mem_size);
    computeGold( reference);

    // check if the result is equivalent to the expected solution
    CUTBoolean res = cutComparef( reference, h_odata, num_threads*num_elements_per_thread);
    printf( "Test %s\n", (1 == res) ? "PASSED" : "FAILED");

    // cleanup memory
    free( h_odata);
    free( reference);
    CUDA_SAFE_CALL(cudaFree(d_odata));
}[/codebox]
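For anyone trying to reproduce this outside Visual Studio, the device-vs-emulation difference also shows up with a command-line build. Something like the following should work, assuming the CUDA 2.x SDK layout for cutil; the file name pointer_test.cu and the $CUDA_SDK path are placeholders for your own setup:

[codebox]# device build (exhibits the shifted results)
nvcc -o pointer_test pointer_test.cu -I$CUDA_SDK/common/inc -L$CUDA_SDK/lib -lcutil

# emulation build (matches the CPU results)
nvcc -deviceemu -o pointer_test_emu pointer_test.cu -I$CUDA_SDK/common/inc -L$CUDA_SDK/lib -lcutil[/codebox]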

I am using CUDA 2.0 on Windows XP 32-bit SP3 with the following device:

[codebox]There is 1 device supporting CUDA

Device 0: "Quadro FX 3600M"
Major revision number: 1
Minor revision number: 1
Total amount of global memory: 536543232 bytes
Number of multiprocessors: 16
Number of cores: 128
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 262144 bytes
Texture alignment: 256 bytes
Clock rate: 1.25 GHz
Concurrent copy and execution: Yes

Test PASSED

Press ENTER to exit…[/codebox]

Can you try this with CUDA 2.2 or 2.3?

Sure thing…

I have compiled and tested the above code in release mode on Linux with CUDA 2.2 using the following device:

[codebox]There is 1 device supporting CUDA

Device 0: "GeForce 9800 GT"
CUDA Capability Major revision number: 1
CUDA Capability Minor revision number: 1
Total amount of global memory: 536150016 bytes
Number of multiprocessors: 14
Number of cores: 112
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 262144 bytes
Texture alignment: 256 bytes
Clock rate: 1.51 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: Yes
Integrated: No
Support host page-locked memory mapping: No
Compute mode: Default (multiple host threads can use this device simultaneously)

Test PASSED

Press ENTER to exit…[/codebox]

The test does not pass, and the results still do not match the CPU version of the code.

Are you able to reproduce the problem on any of your machines?

I can reproduce this on 2.3 and on Ocelot. This suggests to me that it is a compiler bug.

EDIT: The bug looks like it is in nvcc rather than the JIT compiler:
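For reference, the mangled name _Z3fooPij demangles to foo(int*, unsigned int), so the PTX below was presumably generated from a reduced kernel along these lines (my reconstruction, not the exact source; nvcc -ptx will emit the PTX for inspection):

[codebox]// reconstruction of the reduced test kernel implied by _Z3fooPij
__global__ void foo(int *p, unsigned int elements)
{
    unsigned int i;
    for(i = 0; i < elements; ++i){
        *p++ = *p; // the problematic read/increment/store
    }
}[/codebox]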

[codebox].entry _Z3fooPij (
		.param .u64 __cudaparm__Z3fooPij___val_paramp,
		.param .u32 __cudaparm__Z3fooPij_elements)
	{
	.reg .u32 %r<8>;
	.reg .u64 %rd<3>;
	.reg .pred %p<4>;
	.loc	15	2	0
$LBB1__Z3fooPij:
	ld.param.u32 	%r1, [__cudaparm__Z3fooPij_elements];
	mov.u32 	%r2, 0;
	setp.eq.u32 	%p1, %r1, %r2;
	@%p1 bra 	$Lt_0_1282;
	mov.s32 	%r3, %r1;
	ld.param.u64 	%rd1, [__cudaparm__Z3fooPij___val_paramp];
	mov.s32 	%r4, 0;
	mov.s32 	%r5, %r3;
$Lt_0_1794:
 //<loop> Loop body line 2, nesting depth: 1, estimated iterations: unknown
	.loc	15	6	0
	ld.global.s32 	%r6, [%rd1+4];
	st.global.s32 	[%rd1+0], %r6;
	add.s32 	%r4, %r4, 1;
	add.u64 	%rd1, %rd1, 4;
	setp.ne.u32 	%p2, %r4, %r1;
	@%p2 bra 	$Lt_0_1794;
$Lt_0_1282:
	.loc	15	8	0
	exit;
$LDWend__Z3fooPij:
	} // _Z3fooPij[/codebox]

In particular,

ld.global.s32 	%r6, [%rd1+4];
st.global.s32 	[%rd1+0], %r6;

should be

ld.global.s32 	%r6, [%rd1+0];
st.global.s32 	[%rd1+0], %r6;

I think…

That’s really interesting. It looks to me like you have hit it on the nose. Any comment from NVIDIA?

Yeah, I did repro it. No ETA on a fix yet.

I thought the order of evaluation of the operands of individual operators was undefined. That is, the standard does not specify whether the lvalue in your assignment should be evaluated before the rvalue, so you can get either behavior (or any behavior, really). Perhaps it’s good that you get different results, since that’s what can happen across architectures; otherwise you could blissfully continue to write undefined code until something changes and it blows up?

Yeah, I thought I replied to this earlier, but I guess the forum ate my reply. The compiler team has informed me that the ordering there is undefined according to the C spec, so you shouldn’t really depend on this behavior in the first place.

I don’t believe this is true; the order of operands is governed by operator precedence as defined by the C standard. In this case the ++ (post-increment) has higher precedence than the dereference operator, so it binds to the pointer itself, and because it is a post-increment, the original value of the pointer is used for the assignment before the pointer is incremented. The problem here, I believe, is that it may be undefined "when" the pointer gets incremented. The C standard seems to allow it to happen any time after the value is read and before the next sequence point, not necessarily after all other expressions in the statement have been evaluated, which could explain the observed behavior (a version of the loop that sidesteps the question is sketched below).

Adam
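Either way, a version of the loop with an explicit sequence point avoids the question entirely and should produce the intended no-op on the host, on the device, and in emulation. A minimal sketch, using the same p and length_of_p_data as above:

[codebox]for(i = 0; i < length_of_p_data; i++){
    *p = *p; /* read and write through p; p itself is not modified in this statement */
    p++;     /* advance the pointer only after the assignment completes */
}[/codebox]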