Chars not overflowing correctly: potential compiler bug?

I recently came across unexpected behavior where unsigned chars do not overflow as expected. I reduced it to a simple test case in which the kernel computes the “average” of three numbers by adding them together (100+100+100) and dividing by 3. I put average in quotes because the sum of 300 exceeds what an unsigned char can hold and should wrap around to 44, and the division should then reduce that value to 14. What happens instead is that the result is 100, suggesting that something is being automatically promoted to a larger type and carrying 300 internally.
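For reference, the same arithmetic on the host behaves the way I expected. The snippet below is just an illustration of the expected wrap-around (plain C, assuming the usual 8-bit unsigned char), not part of the repro:

[codebox]
#include <stdio.h>

int main(void) {
    unsigned char sum = 0;
    for (int i = 0; i < 3; i++) {
        sum += 100;        /* third addition produces 300, which wraps to 44 */
    }
    sum /= 3;              /* 44 / 3 == 14 with integer division */
    printf("%d\n", sum);   /* prints 14 */
    return 0;
}
[/codebox]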

Here is the code:

[codebox]

#include <stdio.h>
#include <stdlib.h>

static const int NN = 3;

__global__ void overflowKernel(unsigned char *values, int n) {
    unsigned char sum = 0;
    for (int i = 0; i < NN; i++) {
        sum += values[i];
    }
    // values[1] = sum;
    sum /= NN;
    values[0] = sum;
}

void overflowTest() {
    int n = NN;

    unsigned char *d_values;
    cudaMalloc((void **)&d_values, n * sizeof(unsigned char));

    unsigned char *h_values = (unsigned char *)malloc(n * sizeof(unsigned char));
    for (int i = 0; i < n; i++) {
        h_values[i] = 100;
    }

    cudaMemcpy(d_values, h_values, n * sizeof(unsigned char), cudaMemcpyHostToDevice);
    overflowKernel<<<1, 1>>>(d_values, n);
    cudaMemcpy(h_values, d_values, n * sizeof(unsigned char), cudaMemcpyDeviceToHost);

    printf("values[0] = %d\n", h_values[0]);   // expect 14, get 100
    printf("values[1] = %d\n", h_values[1]);

    free(h_values);
    cudaFree(d_values);
}

[/codebox]

This problem goes away (meaning the answer 14 is returned) if the constant NN is replaced by the variable n in either the loop control or the division, which suggests it has to do with the unrolling and optimization that is going on. Also, if the intermediate value is stored before the division (the commented-out line above), it holds the correct value of 44 and the output is the correct 14. In emulation mode the correct value of 14 is always returned.
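To make those two workarounds concrete, here is a sketch (paraphrased from the description above, not copied verbatim from my code) of the kernel variants that returned 14:

[codebox]
// Variant 1: loop and divide with the runtime variable n instead of the constant NN
__global__ void overflowKernelVarN(unsigned char *values, int n) {
    unsigned char sum = 0;
    for (int i = 0; i < n; i++) {
        sum += values[i];
    }
    sum /= n;            // using n here (or in the loop control) produced the expected 14
    values[0] = sum;
}

// Variant 2: store the intermediate sum before the division
__global__ void overflowKernelStoreSum(unsigned char *values, int n) {
    unsigned char sum = 0;
    for (int i = 0; i < NN; i++) {
        sum += values[i];
    }
    values[1] = sum;     // sum holds the correct 44 here, and the output becomes 14
    sum /= NN;
    values[0] = sum;
}
[/codebox]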

Comparing the PTX might make it clear exactly what is happening here. (Sounds like a compiler bug regardless…)

Strictly speaking, the result of an overflow in an expression is undefined in C, so this would not be a bug. You shouldn’t really rely on the modulo wrap-around of 2’s complement arithmetic, even though it’s the most common behaviour on most conventional CPUs.

Thank you for pointing this out, I didn’t know. I learn something every day. It’s one of those dusty corners of the C specification that I never had to get familiar with.

Interestingly, according to Wikipedia and my reading of section 3.1.2.5 of the spec, signed integer overflow is indeed undefined. But apparently for unsigned, “a result that cannot be represented by the resulting unsigned integer type is reduced modulo the number that is one greater than the largest value that can be represented by the resulting unsigned integer type.”
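A small host-side illustration of the difference (just an example, assuming the usual 8-bit unsigned char):

[codebox]
#include <limits.h>
#include <stdio.h>

int main(void) {
    unsigned char c = (unsigned char)300;   /* reduced modulo 256 -> 44, well defined */
    unsigned int  u = UINT_MAX + 1u;        /* unsigned wrap-around to 0, well defined */
    /* int s = INT_MAX + 1; */              /* signed overflow: undefined behavior */
    printf("%u %u\n", (unsigned)c, u);      /* prints "44 0" */
    return 0;
}
[/codebox]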

It seems weird to me that one should be specified while the other is not, but apparently that’s how it is.

I am not at all sure if this is related, but I’ve been hit by a bug with unsigned chars being promoted to unsigned ints. It happens when I fetch from a texture of unsigned chars. I can’t tell yet whether the unsigned chars are being sign-extended or whether some other trash is showing up. It happens whether I run optimized or unoptimized code, but it does not show up in device emulation. Masking with 255 makes the code work, but only if the masking is done after the assignment to the unsigned int. Also, a simple test case does not fail; I need a complex case with multiple textures to make this happen.
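Just to illustrate the pattern (a made-up, sanitized sketch with placeholder names using the texture reference API, not the actual failing code):

[codebox]
texture<unsigned char, 1, cudaReadModeElementType> byteTex;

__global__ void fetchKernel(unsigned int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        unsigned int v = tex1Dfetch(byteTex, i);  // fetch an unsigned char, promoted on assignment
        v &= 255;                                 // masking after the assignment works around the garbage;
                                                  // masking earlier did not help
        out[i] = v;
    }
}
[/codebox]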

I’m still looking into it, but there is a possibility that this is a related bug. I can’t find the problem in the ptx, so it is possible that it occurs when the ptx is turned into real instructions. Of course, it is also quite possible that I’m just missing the problem in the ptx.

For IP reasons I can’t post the failing code until I can get a failing case that is sufficiently sanitized.

Probably the reason is that (historically, at least) not all machines use(d) 2’s complement for signed integers. So signed integers cannot be guaranteed to overflow in a predictable way, whereas unsigned integers can.