Bug with integer division?

Hi,

I have found some strange behaviour in our CUDA code. Up to now I haven't succeeded in creating a small code snippet that reproduces the error, so I just want to ask whether someone can perhaps imagine where the problem may be located.

In short, our code performs an integer division:
c = a / b
where all variables are of type int.
This division works as expected in nearly all cases. However, we have found a situation where it gives the wrong result (753 / 251 yields 4 instead of 3).

This problem is really strange, and I think we can rule out race conditions at this part of the code, since the variables involved are either local or never change.

We could work around this problem by replacing the division with a floating-point reciprocal multiplication:
c = a * float(1.0f/b)
With this modification the result is correct (c=3).
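For what it's worth, the workaround can be modeled on the host (a minimal sketch; the helper name is mine, not from the original code):

```cpp
// Hypothetical helper mirroring the poster's workaround: multiply by the
// rounded float reciprocal, then truncate back to int (as the implicit
// int conversion in the original code does).
inline int div_via_float_rcp(int a, int b) {
    return (int)(a * (1.0f / b));
}
```

Here `div_via_float_rcp(753, 251)` does yield 3, but note the trick is not safe in general: `1.0f / b` carries only about 24 bits of precision, so for some operand pairs (e.g. a = b = 41, where the rounded reciprocal is slightly low) the truncated product comes out one below the true quotient.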

Our suspicion is that there may be a problem with integer division when the division is exact (no remainder), since we have not observed the problem in any other case. However, I can hardly imagine this being a compiler bug. My hope is that someone at least has an idea what the cause of this problem might be.

ADDITION:
We have now succeeded in reproducing this problem:
(Note: block_idx = 0, since 753 % 251 = 0.
Hence (block_id_2d - block_idx) / blks_x = 753 / 251 should be 3, but comes out as 4.)

#include <stdio.h>

__global__ void div_issue( int blocks_x, int* block_offset)
{
	int blks_x = blocks_x;

	//number of block 2d
	int block_id_2d = block_offset[0]; 

	//x-coordinate of block in absolute grid
	int block_idx = block_id_2d % blks_x;
	
	//y-coordinate of block in absolute grid	
	int block_idy = (block_id_2d - block_idx) / blks_x;

	printf("%d mod %d = %d \n", block_id_2d, blks_x, block_idx);
	printf("%d / %d = %d \n", block_id_2d - block_idx, blks_x, block_idy);

}

// workaround inside the kernel that gives the correct result (replacing the block_idy line):
//int block_idy = int(block_id_2d - block_idx) * float(1.0f/blks_x);

int main(int argc, char *argv[])
{
	int num_devices = 0;
	cudaGetDeviceCount(&num_devices);
	for (int i = 0; i < num_devices; i++)
	{
		cudaSetDevice(i);
		cudaDeviceProp prop;
		cudaGetDeviceProperties(&prop, i);
		printf("using device %s :\n\n", prop.name);
	
		int block_offset_host[1];
		block_offset_host[0] = 753;
		int* block_offset_dev;
		cudaMalloc(&block_offset_dev, sizeof(int));
		cudaMemcpy(block_offset_dev, block_offset_host, sizeof(int), cudaMemcpyHostToDevice);

		div_issue <<<1, 1 >>>( 251 , block_offset_dev);

		cudaDeviceSynchronize();
		cudaFree(block_offset_dev);

		printf("\n");
	}
}

“if the division is exact”

and d = a % b is reported as 0…?

I believe you really need to state your GPU hardware, CUDA toolkit revision, and driver revision to give us a solid basis for reproducibility.

Yes, the modulo = 0

(is the code executed by the) host or device?

I doubt whether it would be hardware related.
Thus, if even a debug build of a simple kernel cannot reproduce the error, it is likely related to the way the code is compiled.
In that case, I can think of a number of ways to debug or spot the point of departure/confusion.

On the GPU, 32-bit integer division is emulated. The emulation code is injected at the PTX level, and is also inlined for some architectures, meaning instructions from the emulation sequence can mix with surrounding code. This means the instructions of the emulation sequence participate in code transformations applied by PTXAS. So there is a possibility that the code could get mangled in some way. I think in the absence of source code it is also possible that the problem has been misdiagnosed and the issue lies elsewhere, not with the division. Use of a custom emulation will also cause code changes beyond the division itself, e.g. different instruction scheduling and register allocation.

PTXAS code transformations will differ by compiler version and GPU architecture, so for anybody to reproduce the problem would require a buildable/runnable piece of repro code and information on the compiler version and the compiler command line. I understand that, depending on how small you have managed to make the repro code, you may not want to post it here; but you will definitely need it to file a bug with NVIDIA.

In a quick test with CUDA 6.5 (GTX 9xx edition) and sm_20 using division of two ‘int’ variables with the indicated values I got the result 3.

There may be cases where you have enough information ahead of time (and that the compiler isn’t aware of) where it makes sense to manually compute the magic numbers you need on the host and pass them as params to the kernel.

You can find magic number generation code here:
http://www.hackersdelight.org/hdcode.htm

For example a python version that’s easy to play with:
http://www.hackersdelight.org/hdcodetxt/magicgu.py.txt

This is the same emulation technique that njuffa was referring to that the compiler does for you. But basically division is turned into a multiply and a shift operation.
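To make the multiply-and-shift idea concrete, here is a host-side sketch of the "round-up" method from Hacker's Delight for unsigned 32-bit division by an invariant divisor d > 1 (function names are mine, this is not necessarily the compiler's exact algorithm, and it leans on GCC/Clang's `__int128` in place of a mul.hi-style sequence):

```cpp
#include <cstdint>

// Round-up method: pick sh = ceil(log2(d)) and m = floor(2^(32+sh)/d) + 1.
// Then for every 32-bit unsigned n:  n / d == (uint32_t)((n * m) >> (32 + sh)).
// Assumes d > 1 (for d == 1 the multiplier would not fit the scheme).
void magicu(uint32_t d, uint64_t *m, int *s) {
    int sh = 0;
    while ((1ull << sh) < d) sh++;   // sh = ceil(log2(d))
    *m = (uint64_t)((((unsigned __int128)1) << (32 + sh)) / d) + 1;
    *s = 32 + sh;
}

// Division using the precomputed pair; the 128-bit intermediate stands in
// for the high-half multiply the GPU sequence actually uses.
uint32_t udiv_magic(uint32_t n, uint64_t m, int s) {
    return (uint32_t)(((unsigned __int128)n * m) >> s);
}
```

For d = 251 this yields m = 4380524414 with a shift of 40, and `udiv_magic(753, m, s)` returns 3.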

I was actually not referring to a division by constant which the CUDA compiler (like pretty much any modern compiler) transforms into a short sequence consisting mostly of multiplication and right shift.

I assumed that the division in question here is a division where both operands are compile-time variable. While PTX has integer division instructions, it is best to think of them as macros that get expanded into emulation sequences; these sequences then are translated into SASS by PTXAS. If the emulation sequence is inlined the resulting SASS instructions can also mingle with SASS instructions from surrounding code. Example:

__global__ void kernel (int a, int b, int *c)
{
    *c = a / b;
}

code for sm_20
        Function : _Z6kerneliiPi
.headerflags    @"EF_CUDA_SM20 EF_CUDA_PTX_SM(EF_CUDA_SM20)"
/*0000*/         MOV R1, c[0x1][0x100];
/*0008*/         I2I.S32.S32 R2, |c[0x0] [0x24]|;
/*0010*/         I2F.F32.U32.RP R0, R2;
/*0018*/         MUFU.RCP R0, R0;
/*0020*/         IADD32I R0, R0, 0xffffffe;
/*0028*/         F2I.FTZ.U32.F32.TRUNC R0, R0;
/*0030*/         IMUL.U32.U32 R3, R2, R0;
/*0038*/         I2I.S32.S32 R4, -R3;
/*0040*/         I2I.S32.S32 R3, |c[0x0] [0x20]|;
/*0048*/         IMAD.U32.U32.HI R4, R0, R4, R0;
/*0050*/         IMUL.U32.U32.HI R0, R4, R3;
/*0058*/         MOV R4, c[0x0][0x20];
/*0060*/         IMAD.U32.U32 R3, -R2, R0, R3;
/*0068*/         LOP.XOR R4, R4, c[0x0][0x24];
/*0070*/         ISETP.LE.U32.AND P0, PT, R2, R3, PT;
/*0078*/         ISETP.GE.AND P1, PT, R4, RZ, PT;
/*0080*/         LOP.PASS_B R4, RZ, ~c[0x0][0x24];
/*0088*/     @P0 ISUB R3, R3, R2;
/*0090*/     @P0 IADD R0, R0, 0x1;
/*0098*/         ISETP.GE.U32.AND P0, PT, R3, R2, PT;
/*00a0*/         MOV R2, c[0x0][0x28];
/*00a8*/         MOV R3, c[0x0][0x2c];
/*00b0*/     @P0 IADD R0, R0, 0x1;
/*00b8*/         ISETP.NE.AND P0, PT, RZ, c[0x0][0x24], PT;
/*00c0*/    @!P1 I2I.S32.S32 R0, -R0;
/*00c8*/         SEL R0, R4, R0, !P0;
/*00d0*/         ST.E [R2], R0;
/*00d8*/         EXIT;

That’s right, I was looking into integer division a while ago and forgot the compiler does more than just magic number math. Magic numbers ended up being all I needed.

From above it appears that the MUFU.RCP is doing most of the work in floating point, with a bunch of other code to handle edge cases (I think).

In my own code I can do a division and a modulus in 3-4 instructions (depending on the size of arguments):

// t =  rst / RS
// rs = rst % RS
XMAD t, magic_mul_RS, rst, RZ;
SHR.U32 t, t, magic_shift_RS;
VMAD.U16.U16 rs, -t, RS, rst;

The catch is RS (the divisor) above has to be known at kernel invocation time and its value and its magic numbers passed in as params.

If the computed value of magic_mul_RS could potentially be bigger than a 16-bit value, then I'd need to add an additional XMAD.PSL instruction. I also assume rst is a 16-bit value here. I use VMAD for the modulus since it supports negating the operands (which I don't think XMAD does).
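In plain C, the three-instruction sequence above corresponds to something like the following sketch (parameter names are mine; magic_mul/magic_shift are the host-precomputed constants, and rst and RS are assumed to fit in 16 bits as in the post):

```cpp
#include <cstdint>

// Model of the XMAD / SHR / VMAD sequence above.
void divmod_magic(uint32_t rst, uint32_t RS,
                  uint32_t magic_mul, uint32_t magic_shift,
                  uint32_t *t, uint32_t *rs) {
    // XMAD + SHR:  t = rst / RS  via multiply-and-shift
    *t = (uint32_t)(((uint64_t)rst * magic_mul) >> magic_shift);
    // VMAD with a negated operand:  rs = rst % RS
    *rs = rst - *t * RS;
}
```

For RS = 251, the round-up method gives magic_mul = floor(2^24/251) + 1 = 66842 and magic_shift = 24 (valid for all rst < 2^16). Note that 66842 needs 17 bits, i.e. this is exactly the case where the extra XMAD.PSL mentioned above would be required.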

[Argh, clunky forum software ate my post on the first try] Since MUFU.RCP produces a result that is only accurate to about 23 bits, which isn't sufficient for a general 32-bit division, the emulation code applies one iteration before doing the final rounding (the conditional operations at 0x88 and 0x90 in the code above). In addition, the emulation code above is for a signed division; the sequence for unsigned 32-bit division is a bit shorter.
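The overall shape of that emulation can be sketched in scalar C (a model only, not the actual SASS, which stays in integer registers after the initial reciprocal): start from the ~23-bit float reciprocal, form a candidate quotient, then repair it with remainder tests, analogous to the conditional @P0 ISUB/IADD fixups above:

```cpp
#include <cstdint>

// Scalar model of float-reciprocal-based unsigned 32-bit division.
uint32_t udiv_via_rcp(uint32_t a, uint32_t b) {
    if (b == 1) return a;                     // keep the float cast in range
    float rcp = 1.0f / (float)b;              // plays the role of MUFU.RCP
    uint32_t q = (uint32_t)((float)a * rcp);  // candidate quotient
    int64_t r = (int64_t)a - (int64_t)q * b;  // remainder of the estimate
    while (r < 0)  { q--; r += b; }           // estimate too high: step down
    while (r >= b) { q++; r -= b; }           // estimate too low: step up
    return q;
}
```

In the real sequence the error of the estimate is bounded to a step or two, so the fixup is branch-free predicated code rather than a loop.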

If memory serves, some compilers have been enhanced to the point where they perform the computation of magic multiplier and shift count on the fly. If they find, in a loop, a loop-invariant division with compile-time variable divisor, they check the estimated trip count of the loop. If it is high enough to amortize the overhead, they insert a code block prior to the loop that computes the magic multiplier and shift count, then use division-by-magic-multiplication inside the loop. Computing the multiplier and shift count efficiently yet rigorously correctly is reasonably hard, but it is a problem that only needs to be solved once.

If I recall correctly, similar techniques could be applied to floating-point division, provided the hardware implements an FMA instruction. I don’t know any compiler which does that yet, however.

Generally, the more is known about the ranges of dividend and divisor the more one can optimize integer divisions. But there is usually no workable way of making that information available to compilers, so programmers have to apply those optimizations at source level.

If CUDA is truly this crappy at integer math, then why does NVIDIA try to provide integer support at all? Why not leave it out completely?

Now it seems NVIDIA is creating unnecessary problems, headaches, and lost time for programmers, which is not a good idea in general.

In what sense do you consider the NVIDIA GPUs “crappy” at integer math?

Thanks a lot for the many reactions. Some additional info on my setup:

  • Cuda - 7.0 Runtime
  • VisualStudio 2012
  • GTX 980 Ti
  • Driver: 9.18.13.5306 WHQL

Furthermore, I should mention that this problem only occurs on our newest graphics card (GTX 980 Ti). Running the same binary on one of our older cards works perfectly fine.

Do your kernels specifically target Compute 5.x architecture or have they been built for older Compute models?

We can now reproduce the problem with a small example; I have added it to the entry post.
Please take a look or check it on your machine.

That is a nice repro program. Works perfectly fine (that is, without error) for me, but then I have a setup very different from yours. What is the exact nvcc command line used to compile this code?

As an added check, you could try adjusting the PTXAS optimization level. The default is -O3. So try -Xptxas -O2, then -Xptxas -O1, then -Xptxas -O0. If the issue disappears as you are ratcheting down the optimization level, that would be a pretty good indication for a PTXAS bug. In which case you would want to file a bug with NVIDIA using the bug reporting form linked from the CUDA registered developer website.

We’ve tried to repro this on CUDA 5.0 with a GeForce 960 Maxwell device. No luck so far.

EDIT: I incorrectly stated this was a Kepler part. Nope, it’s Maxwell.

In a question regarding the same issue on Stackoverflow, it seems that GTX 980 is failing while a Titan Black in the same machine is passing: http://stackoverflow.com/questions/31452510/what-causes-division-error-in-this-cuda-kernel. If there is a code generation issue at the core of this, it is likely to be limited to Maxwell parts, as PTXAS code generation is architecture specific. The problem may not even be in the actual instructions themselves; it could be something that affects the op-steering information that is added to each three-instruction block on Maxwell.

FWIW, I am getting the correct results on my setup:
using device GeForce GTX 770M :

753 mod 251 = 0
753 / 251 = 3

Running CUDA5.5 (770M is a Kepler chip).

Here is my output for GTX 980 and GTX Titan X CUDA 6.5 Windows 7 x64:

using device GeForce GTX TITAN X :

753 mod 251 = 0
753 / 251 = 3

using device GeForce GTX 980 :

753 mod 251 = 0
753 / 251 = 3