Possible bug with unsigned 64 bit int modulo

In my code I have been using fragments such as the following:

uint64 ret;

	//return  ret  % iRange[config.maxLen];

	return ret - (ret / iRange[config.maxLen]) * iRange[config.maxLen];

Where uint64 is an unsigned 64-bit long int. It seems as though when I use the modulo operator I sometimes get erratic results. For example, for a % b = c, I will sometimes get a c which is larger than b, something I know to be impossible. The second line in the code fragment above does work as a modulo operation and in theory should give the same result. I do not believe I have been dealing with numbers which could be interpreted as negative.


a = 0x0000000171b7a70c

b = 0x1a

a % b =  0x00000001d1b7a726

When I manually put the constants (the numbers above) into my code, it produces the correct results. However, I think this may be because the compiler calculates the result at compile time rather than at runtime.

Could anyone verify this?

The test case you posted passed in our test environment. Can you post a full repro case? There might be a compiler bug.


I got the same errors when using 64-bit ints. The error is hard to reproduce in a small code snippet, but I managed it:

#include <stdio.h>

__device__ __host__ unsigned int mod_error( unsigned int modulus )
{
	return ( unsigned int ) ( 9800140000000000000u % modulus ); // clearly divisible by 2 (9800140000000000000u is a 64-bit integer)
}

__global__ void mod_test( unsigned int modulus, unsigned int *result )
{
	// dummy code - error disappears when removed
	result[ 0 ] = ( result[ 0 ] * ( unsigned long long int ) 2 ) % modulus;

	if( threadIdx.x == 0 ) // error disappears when removed or changed to "if( threadIdx.x == 1 )"
		result[ 0 ] = mod_error( modulus );
}

int main()
{
	unsigned int modulus = 2u;
	unsigned int result_host;

	unsigned int *result_device;
	cudaMalloc( ( void ** ) &result_device, sizeof( unsigned int ) );
	mod_test <<< 1, 2 >>> ( modulus, result_device );
	cudaMemcpy( &result_host, result_device, sizeof( unsigned int ), cudaMemcpyDeviceToHost );

	printf( "CPU: %u\n", mod_error( modulus ) );
	printf( "GPU: %u\n", result_host );

	cudaFree( result_device );
	return 0;
}


This code calculates 9800140000000000000 % 2, which is clearly zero. On my GeForce GTX 260 the result is 421052418 instead. If one removes the dummy code, whose result is never used, the error disappears. Please try to figure out whether it is a compiler bug - I really need the 64-bit modulo …

I am running this on Windows Vista 32-bit, CUDA 2.2 and Forceware 185.85.

yep, repro’d this on a newer compiler/driver/Linux and am seeing the same behavior. smells like a compiler bug, will pass it to the appropriate people. thanks.

edit: it is definitely a compiler bug. :)

Is it possible to get a bugfix as a “normal user”? I reported some compiler bugs a few years ago when I was using GLSL. At that time I had to wait half a year until a new driver/compiler was released, with no chance to get a preview version. As my whole future development depends on the modulo operation on 64-bit ints, I am really stuck now. Emulating the behaviour with division ( a % b = a - ( a / b ) * b ) is not a very good option because of the performance drawback (I need to implement arithmetic where the result is reduced modulo a large prime after each operation).

there might be a workaround because (to my completely untrained eye at fixing compiler bugs) it looks like a bug in optimization which has some tunable parameters. I’ll let you know.

Is there any hope, that the bug is fixed within CUDA 2.3? As far as I know CUDA 2.3 will be released this month, so I’m looking forward to the next version.

For the time being you can probably replace A % B by ( A - B * ( A / B ) ), or use bitmasking if the modulus is a power of 2.


Hi, I already started to use the substitution. I hope that the performance loss is not too large. The modulus in my application is always a (dynamically chosen) prime number, so bitmasking is not an alternative.