wrong result with release build (cvt.u32.u8 missing)

Hello,

I get correct results in the DEBUG build, but wrong results in the optimized RELEASE build.

Below you can find the source and both PTX files of a minimized demo kernel that shows the effect (of course my real kernel is much bigger).

The release build will show the wrong output 10h.

The debug build will show the correct output 00h.

It seems that the following line is ignored in the release build, which appears to cause the error:

byte = byte & 0xFF;

It seems that “cvt.u32.u8” is missing in the release PTX before the shift-right (shr) instruction.

I’m using a “GeForce GTX 470” and the latest “SDK 3.2”.

Is this a compiler bug or have I done something wrong?

//PTX release build (will show the wrong output 10h)

	.entry _Z6KernelPh (

		.param .u32 __cudaparm__Z6KernelPh_d_A)

	{

	.reg .u16 %rh<6>;

	.reg .u32 %r<3>;

	.loc	28	3	0

$LDWbegin__Z6KernelPh:

	.loc	28	8	0

	ld.param.u32 	%r1, [__cudaparm__Z6KernelPh_d_A];

	ld.global.u8 	%rh1, [%r1+0];

	add.u16 	%rh2, %rh1, 128;

	mov.u16 	%rh3, %rh2;

	shr.u16 	%rh4, %rh3, 4;

	st.global.u8 	[%r1+0], %rh4;

	.loc	28	9	0

	exit;

$LDWend__Z6KernelPh:

	} // _Z6KernelPh
//PTX debug build (will show the correct output 00h)

	.entry _Z6KernelPh (

		.param .u32 __cudaparm__Z6KernelPh_d_A)

	{

	.reg .u16 %rh<3>;

	.reg .u32 %r<10>;

	.loc	28	3	0

$LDWbegin__Z6KernelPh:

$LDWbeginblock_203_1:

	.loc	28	6	0

	ld.param.u32 	%r1, [__cudaparm__Z6KernelPh_d_A];

	ld.global.u8 	%r2, [%r1+0];

	add.u32 	%r3, %r2, 128;

	cvt.u8.u32 	%r4, %r3;

	cvt.u8.u32 	%rh1, %r4;

	.loc	28	7	0

	cvt.u32.u8 	%r5, %rh1;

	cvt.u8.u32 	%rh1, %r5;

	.loc	28	8	0

	cvt.u32.u8 	%r6, %rh1;

	shr.u32 	%r7, %r6, 4;

	ld.param.u32 	%r8, [__cudaparm__Z6KernelPh_d_A];

	st.global.u8 	[%r8+0], %r7;

$LDWendblock_203_1:

	.loc	28	9	0

	exit;

$LDWend__Z6KernelPh:

	} // _Z6KernelPh
//.cu src file

#include <stdio.h>

__global__ void Kernel(unsigned char* d_A)

{

	unsigned char byte;

	byte = d_A[0] + 0x80;

	byte = byte & 0xFF;

	d_A[0] =  byte >> 4;

}

int main(int argc, char** argv)

{

	unsigned char *h_A;

	unsigned char *d_A;

	int N = 1;

	size_t size = N * sizeof(unsigned char);

	h_A = (unsigned char*)malloc(size);

	cudaMalloc((void**)&d_A, size);

	h_A[0] = 0x80;

	cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);

	Kernel<<<1, 1>>>(d_A);

	cudaMemcpy(h_A, d_A, size, cudaMemcpyDeviceToHost);

	printf("Result: %02Xh\n", h_A[0]);

	cudaFree(d_A);

	free(h_A);

	getchar();

}
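
For reference, the correct result 00h follows from plain C semantics: d_A[0] + 0x80 is computed as int (0x80 + 0x80 = 0x100), the assignment to the unsigned char truncates it to 0x00, and 0x00 >> 4 is 0x00. A small host-only check of this (just a sketch, no CUDA involved):

#include <stdio.h>

int main(void)
{
	unsigned char byte;
	byte = 0x80 + 0x80;		// int result 0x100 is truncated to 0x00
	printf("%02Xh\n", byte >> 4);	// prints 00h, matching the debug build
	return 0;
}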

unsigned char has the range 0…255, so byte = byte & 0xFF is pointless. And the compiler is too smart and optimizes away byte = byte & 0xFF because byte is an unsigned char. It looks like a 32-bit register is assigned to the unsigned char by the compiler in debug and release mode, and that is the error. The workaround is to use a 32-bit variable for byte; however, it looks like a compiler bug anyway.
Interesting, if you write byte = byte & 0xf0, will the compiler remove it or not?

The “byte = byte & 0xf0” is kept in the release build and the result 00h is correct.

shl.b32 	%r4, %r3, 24;

shr.u32 	%r5, %r4, 28;
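
For clarity, the kernel variant that produces this PTX is presumably the original demo kernel with the mask changed to 0xf0:

__global__ void Kernel(unsigned char* d_A)
{
	unsigned char byte;
	byte = d_A[0] + 0x80;
	byte = byte & 0xf0;	// this mask is kept by the release compiler
	d_A[0] = byte >> 4;
}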

The problem is that my real kernel is very big and uses many different bit shifts on unsigned char arrays. So I hope an NVIDIA employee can confirm the compiler bug and that it will get fixed in the next version.

You do not understand: the result is incorrect with 0xf0 in general, because it is not 0xff, but the compiler does not optimize it away. You should get rid of char variables; it is pointless, as the registers are 32 bit anyway. Just do not use unsigned char etc.
And you have an overflow in the code anyway, and byte = byte & 0xff does not fix it; it is unnecessary.
Btw, I suppose the compiler may generate correct code without byte = byte & 0xff; it confuses the compiler.

Btw, I suppose the compiler may generate correct code without byte = byte & 0xff; it confuses the compiler.

Without “byte = byte & 0xff” the output of the release build is wrong.

__global__ void Kernel(unsigned char* d_A)

{

	unsigned char byte;

	byte = d_A[0] + 0x80;

	d_A[0] =  byte >> 4;

}
//PTX release build (will show the wrong output 10h)

	.entry _Z6KernelPh (

		.param .u32 __cudaparm__Z6KernelPh_d_A)

	{

	.reg .u16 %rh<6>;

	.reg .u32 %r<3>;

	.loc	28	3	0

$LDWbegin__Z6KernelPh:

	.loc	28	7	0

	ld.param.u32 	%r1, [__cudaparm__Z6KernelPh_d_A];

	ld.global.u8 	%rh1, [%r1+0];

	add.u16 	%rh2, %rh1, 128;

	mov.u16 	%rh3, %rh2;

	shr.u16 	%rh4, %rh3, 4;

	st.global.u8 	[%r1+0], %rh4;

	.loc	28	8	0

	exit;

$LDWend__Z6KernelPh:

	} // _Z6KernelPh

the result is incorrect with 0xf0 in general

I agree that &=0xf0 is only a workaround for this specific case (byte >> 4).
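
To illustrate with a hypothetical example (not from the demo kernel): for any other shift amount the 0xf0 mask changes the result, because it also clears low bits that the shift would keep.

#include <stdio.h>

int main(void)
{
	// Counterexample for a shift by 2 instead of 4:
	// the 0xf0 mask discards bits 2 and 3, which "byte >> 2" should keep.
	unsigned char byte = 0x0C;
	printf("%02Xh\n", byte >> 2);		// 03h
	printf("%02Xh\n", (byte & 0xf0) >> 2);	// 00h - masking changed the result
	return 0;
}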

Just do not use unsigned char etc.

I agree that not using types with fewer than 32 bits (char, short, …) will avoid the error, but as long as they are officially supported by NVIDIA, the compiler should produce correct results.
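
For reference, a sketch of that workaround applied to the demo kernel (a rewrite for illustration, not code from the original post): the arithmetic is done in a 32-bit temporary and truncated explicitly, so the result no longer depends on the compiler narrowing the unsigned char correctly.

__global__ void Kernel(unsigned char* d_A)
{
	// 32-bit temporary with an explicit truncation to 8 bits
	unsigned int tmp = (d_A[0] + 0x80u) & 0xFFu;	// 0x100 & 0xFF = 0x00
	d_A[0] = (unsigned char)(tmp >> 4);		// stores 00h
}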

Yes, this looks like a compiler bug, you are right.