nvcc compiler bug nvcc generates incorrect code when accessing int[] through char*

Nvcc seems to generate incorrect code when accessing an int array through a char pointer:

__global__ void mixedaccesskernelok(int *result)

{

	int ai[] = { 0x01020304 };

	char* ac = (char*)ai;

	*result = ac[0] + ac[1] + ac[2] + ac[3];

}

__global__ void mixedaccesskernelbug(int *result)

{

	int ai[] = { 0x01020304 };

	char* ac = (char*)ai;

	int s = 0;

	for (int i = 0; i != sizeof(ai); i++) 

		s += ac[i];

	*result = s;

}

mixedaccesskernelok and mixedaccesskernelbug should give the same result “10”. The former does, but the latter returns 0x01020304.

When looking at the generated PTX code, I see strange operations like “shl.b32 %r4, %r2, -8;”, i.e. shifts by a negative amount. I assume the shifts are required because the loop is unrolled and ai placed in a register.

The output does not change if I prevent the compiler from unrolling the loop with “#pragma unroll 1”, although ai is placed in local memory and I cannot spot an error in the generated PTX code.

I compiled the attached program with

nvcc -O2 -o mixedcharintaccess  mixedcharintaccess.cu

and

nvcc -O2 --ptx  mixedcharintaccess.cu
  • Operating System: Debian Lenny amd64, 2.6.31 kernel

  • CUDA toolkit 2.3

  • SDK 2.3

  • Compiler for CPU host code: gcc version 4.3.2 (Debian 4.3.2-1.1)
    mixedcharintaccess.cu (1.22 KB)

I don’t think that is a compiler bug at all. In the second kernel, sizeof(ai) will not be equal sizeof(int) when compiled on a 64 bit machine.

In addition using char is usually not a good idea as might might be signed or unsigned AFAIK.

On my system sizeof(int) is 4 on the host as well as on the GPU. Try the attached example program to verify.

The generated PTX code is nonsense as far as I can tell:

.entry _Z20mixedaccesskernelbugPi (

		.param .u64 __cudaparm__Z20mixedaccesskernelbugPi_result)

	{

	mov.s32 	%r1, 16909060;

	mov.s32 	%r2, %r1;

	.loc	15	12	0

	mov.s32 	%r3, %r2;

	shl.b32 	%r4, %r2, -8;

	mov.u32 	%r5, %r4;

	mov.s32 	%r6, %r5;

	add.s32 	%r7, %r3, %r6;

	shl.b32 	%r8, %r2, -16;

	mov.u32 	%r9, %r8;

	mov.s32 	%r10, %r9;

	shl.b32 	%r11, %r2, -24;

	mov.u32 	%r12, %r11;

	mov.s32 	%r13, %r12;

	add.s32 	%r14, %r10, %r13;

	add.s32 	%r15, %r7, %r14;

	ld.param.u64 	%rd1, [__cudaparm__Z20mixedaccesskernelbugPi_result];

	st.global.s32 	[%rd1+0], %r15;

	.loc	15	13	0

	exit;

	} // _Z20mixedaccesskernelbugPi

Please note the shift operations with negative shift amount!
mixedcharintaccess.cu (1.55 KB)

I was meaning that pointers are size 8 on the device in 64 bit mode, and sizeof(ai) might be be resolved as 8 rather than 4. Compiling it without optimisation gets rid of the problem, so you are right it does seem to be a compiler issue.

I played a bit more with the example code for the compiler bug and now get a fatal ptxas error message. Here is the code:

__global__

void mixedaccesskernelbug2(int *result)

{

	char ac[4];

	*(int*)ac =  0x01020304;

	int s = 0;

	for (int i = 0; i < sizeof(ac); i++)

		s += ac[i];

	*result = s;

}

And here the output of ptxas:

$ nvcc -O2 --keep -o mixedcharintaccess mixedcharintaccess.cu

ptxas mixedcharintaccess.ptx, line 91; error   : Arguments mismatch for instruction 'shl'

ptxas mixedcharintaccess.ptx, line 92; error   : Arguments mismatch for instruction 'shr'

ptxas mixedcharintaccess.ptx, line 95; error   : Arguments mismatch for instruction 'shl'

ptxas mixedcharintaccess.ptx, line 96; error   : Arguments mismatch for instruction 'shr'

ptxas mixedcharintaccess.ptx, line 98; error   : Arguments mismatch for instruction 'shl'

ptxas mixedcharintaccess.ptx, line 99; error   : Arguments mismatch for instruction 'shr'

ptxas fatal   : Ptx assembly aborted due to errors

The generated ptx code:

.entry _Z21mixedaccesskernelbug2Pi (

		.param .u64 __cudaparm__Z21mixedaccesskernelbug2Pi_result)

	{

	.reg .u16 %rh<9>;

	.reg .u32 %r<10>;

	.reg .u64 %rd<3>;

	.loc	15	16	0

$LBB1__Z21mixedaccesskernelbug2Pi:

	.loc	15	19	0

	mov.s32 	%r1, 16909060;

	cvt.s8.s32 	%rh1, %r1;

	.loc	15	24	0

	cvt.s32.s8 	%r2, %rh1;

	shl.b8 	%rh2, %rh1, -32; // line 91

	shr.u8 	%rh3, %rh2, -24; // line 92

	cvt.s32.s8 	%r3, %rh3;

	add.s32 	%r4, %r2, %r3;

	shl.b8 	%rh4, %rh1, -40; // line 95

	shr.u8 	%rh5, %rh4, -24; // line 96

	cvt.s32.s8 	%r5, %rh5;

	shl.b8 	%rh6, %rh1, -48; // line 98

	shr.u8 	%rh7, %rh6, -24; // line 99

	cvt.s32.s8 	%r6, %rh7;

	add.s32 	%r7, %r5, %r6;

	add.s32 	%r8, %r4, %r7;

	ld.param.u64 	%rd1, [__cudaparm__Z21mixedaccesskernelbug2Pi_result];

	st.global.s32 	[%rd1+0], %r8;

	.loc	15	25	0

	exit;

$LDWend__Z21mixedaccesskernelbug2Pi:

	} // _Z21mixedaccesskernelbug2Pi

mixedcharintaccess2.cu (1.67 KB)