Bug with integer division?

I can reproduce this issue with Windows 8.1 64-bit, GeForce driver 353.38, Visual Studio 2013, CUDA 7.5 RC, both 64-bit and 32-bit builds, on ALL of my Maxwell cards, under all of the following code generation targets: compute_20,sm_20 / compute_30,sm_30 / compute_35,sm_35 / compute_50,sm_50 / compute_52,sm_52.

using device GeForce GTX TITAN X :

753 mod 251 = 0
753 / 251 = 4

using device GeForce GTX 980 Ti :

753 mod 251 = 0
753 / 251 = 4

using device GeForce GTX 980 Ti :

753 mod 251 = 0
753 / 251 = 4

using device GeForce GTX 980 Ti :

753 mod 251 = 0
753 / 251 = 4

Press any key to continue . . .
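
For reference, a minimal repro along these lines is enough to trigger the bug (the original source isn't shown in this thread, so the kernel and its formatting are reconstructed; the key point is a device-side 32-bit division and modulus on runtime operands):

#include <cstdio>

__global__ void div_test(int n, int d)
{
    // n and d arrive as runtime kernel parameters, so the compiler must emit
    // the emulated integer division sequence rather than fold the results.
    printf("%d mod %d = %d\n", n, d, n % d);
    printf("%d / %d = %d\n", n, d, n / d);
}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("using device %s :\n\n", prop.name);
    div_test<<<1, 1>>>(753, 251);  // expect 0 and 3; the buggy builds print 4
    cudaDeviceSynchronize();
    return 0;
}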

In my case, switching to CUDA 7.0 RESOLVES the issue.

Does the issue disappear when you reduce the PTXAS optimization level with -Xptxas -O{2|1|0}? In any event, it looks like enough due diligence has been applied, and it is time to file a bug report with NVIDIA.

-Xptxas -O2 : Issue persists
-Xptxas -O1 : Issue disappears
-Xptxas -O0 : Issue disappears
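
For anyone who wants to run the same experiment, the PTXAS optimization level is passed through nvcc like this (repro.cu is a placeholder name for the test file; substitute the -O level and the -gencode pair for your target):

nvcc -gencode arch=compute_52,code=sm_52 -Xptxas -O1 repro.cu -o repro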

That is good information to add to the bug report; it is a pretty clear indication that this is a problem with PTXAS optimizations on Maxwell.

Has anyone been able to reproduce using CUDA 6.5 or less?

Bug not happening on my machine with -O2 on my EVGA Maxwell cards.

I was able to reproduce this on a Titan X with 7.5 RC (Linux). Playing with the code some, removing the prints or the modulus makes the problem go away, so njuffa is probably on the right track.

"
Since MUFU.RCP produces a result that is only accurate to about 23 bits, which isn’t sufficient for a general 32-bit division, the emulation code applies on iteration before doing the final rounding (the conditional operations at 0x88 and 0x90 in the code above). In addition the emulation code above is for a signed division, the sequence for the unsigned 32-bit division is a bit shorter"

I read this as saying that 32-bit integers only have 23 bits of precision. Also, on some cards if not all, 32-bit integers are in reality handled as 32-bit floating point, giving reduced precision, so in a way CUDA does not seem to follow the C standard, where a 32-bit integer covers the range -2^31 to 2^31 - 1.

You are reading it incorrectly. 32-bit and 64-bit integers work in CUDA exactly as they are defined to work in C++. How various integer operations are implemented “under the hood” is completely orthogonal to standards compliance. In the case of 32-bit integer division, a fully accurate 32-bit result (modulo compiler bugs, apparently :-) is computed using the initial floating-point reciprocal from MUFU.RCP, with its 23 “good bits”, as a starting point.
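
To sketch the general idea (illustrative only, not NVIDIA's actual emulation sequence; the function name is made up, and the real code replaces the correction loops below with a Newton-Raphson refinement plus a single conditional fixup):

// Sketch: recover an exact 32-bit quotient starting from an approximate
// reciprocal (rcp.approx.f32 maps to MUFU.RCP, ~23 good bits).
// Assumes d != 0; not NVIDIA's actual code sequence.
__device__ unsigned int udiv_sketch(unsigned int n, unsigned int d)
{
    if (d == 1) return n;  // sidestep float->uint overflow for this edge case
    float df = (float)d;
    float r;
    asm("rcp.approx.f32 %0, %1;" : "=f"(r) : "f"(df));
    unsigned int q = (unsigned int)((float)n * r);    // candidate quotient
    long long rem = (long long)n - (long long)q * d;  // exact remainder of the candidate
    // A refined reciprocal would leave the candidate off by at most 1;
    // these loops are just the simplest way to make the sketch exact.
    while (rem < 0)             { q--; rem += d; }
    while (rem >= (long long)d) { q++; rem -= d; }
    return q;
}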

Internally there is an NVIDIA bug filed for this, apparently as a result of someone (perhaps the OP of this thread) filing a report. The issue appears to be present in the CUDA 7.5 RC compiler and the PTX JIT mechanism. It is expected to be fixed in the CUDA 7.5 production release, which is due out soon. I believe this is cross-posted here:

What causes division error in this CUDA kernel? - Stack Overflow

and essentially the same information is given there.

I highly doubt the accuracy/precision of 32-bit integers in CUDA… especially as on some devices they are not natively supported, but seem to be handled with floating-point numbers.

This division bug might be one of many… I have doubts about shl, shr, and other instructions too.

Extraordinary claims require extraordinary evidence.

Fact is, my ccminer and cudaminer applications performed very well using CUDA to compute hash functions for cryptocurrencies (in select cases we beat the best available AMD mining code). Shift operations were used extensively, on both 32-bit and 64-bit words.

The only bugs we encountered were performance related, i.e. we found inline assembly (PTX) code that performed shifts significantly faster than the code the compiler would generate for the << and >> operators on unsigned long long words. But I am sure this was fixed in a later version of the CUDA toolkit (we used version 5.0 back then).

Can you give an example of that? I last looked at this prior to the introduction of funnel shifts in hardware. As I recall, 64-bit shifts required eight or nine instructions for variable shift counts, and two or three instructions for compile-time constant shift counts. That is as efficient as it could be, given the requirements. In particular, shift counts saturate according to the PTX specification; they are not applied modulo 64 as with the x86 SHR and SHL instructions.

On architectures with a funnel shifter, the emulation of 64-bit shifts should be more efficient.
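
For example (a sketch, not code from any shipping toolkit; assumes sm_32 or later for the __funnelshift_l intrinsic, and a shift count below 64), a variable 64-bit left shift can be composed from 32-bit halves:

__device__ unsigned long long shl64(unsigned long long x, unsigned int s)
{
    unsigned int lo = (unsigned int)x;
    unsigned int hi = (unsigned int)(x >> 32);
    if (s < 32) {
        // Upper 32 bits of (hi:lo) << s come straight out of the funnel shifter.
        unsigned int new_hi = __funnelshift_l(lo, hi, s);
        return ((unsigned long long)new_hi << 32) | (lo << s);
    }
    // For s in [32, 63] the low word shifts entirely into the high word.
    return (unsigned long long)(lo << (s - 32)) << 32;
}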

Before I upgrade to CUDA 7.5, would someone with the new release version verify that this particular integer division bug does not occur on Maxwell (GTX 980 etc.)?

Thanks

Looks OK on { sm_21, sm_30, sm_35, sm_50 } x { 32-bit, 64-bit }.