There is a bug in extended-precision addition for compute capability 5.2 and below when compiling with the standard Visual Studio 2015/CUDA 8.0.26 toolchain. Some numbers, when added, don’t carry correctly (or rather, a -1 appears to be carried when a 0 should be, or a 0 when a 1 should be). When compiling in debug mode for 5.2, or for compute capability 6.0 or 6.1, the PTX works as intended.
I’m sure the compiler is breaking the add.u64 down into two 32-bit adds somewhere, but it must be treating them differently from manually written 32-bit adds.
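For reference, here is a minimal sketch of the kind of repro kernel under discussion (add_bug.cu itself is not reproduced in this thread; the input values below are made up, chosen so the low-word add produces a carry and the correct sum is 0x5824f2440ed810fa, matching the output shown further down):

#include <cstdio>
#include <cstdint>

__global__ void add_test(uint64_t a, uint64_t b, uint64_t *r32, uint64_t *r64)
{
    uint32_t lo, hi;
    // Extended-precision add via two chained 32-bit adds:
    // add.cc.u32 sets the carry flag, addc.u32 consumes it.
    asm("add.cc.u32 %0, %2, %4;\n\t"
        "addc.u32   %1, %3, %5;"
        : "=r"(lo), "=r"(hi)
        : "r"((uint32_t)a), "r"((uint32_t)(a >> 32)),
          "r"((uint32_t)b), "r"((uint32_t)(b >> 32)));
    *r32 = ((uint64_t)hi << 32) | lo;
    *r64 = a + b;   // single 64-bit add for comparison
}

int main(void)
{
    uint64_t h[2], *d;
    cudaMalloc((void**)&d, sizeof(h));
    add_test<<<1,1>>>(0x2812f1a187654321ULL, 0x301200a28772cdd9ULL, d, d + 1);
    cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
    printf("32-bit Addition: 0x%016llx\n", (unsigned long long)h[0]);
    printf("64-bit Addition: 0x%016llx\n", (unsigned long long)h[1]);
    cudaFree(d);
    return 0;
}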
Please note that this forum is not designed as a bug reporting channel. CUDA bug reports should be filed using the form linked from the registered developer website (https://developer.nvidia.com/).
That said, I am not able to reproduce the issue at the moment. There could be two reasons for this: (1) I have an sm_50 device here and compiled accordingly; (2) I am using the latest shipping version of CUDA 8, while you seem to be using an earlier version.
C:\Users\Norbert\My Programs>nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Mon_Jan__9_17:32:33_CST_2017
Cuda compilation tools, release 8.0, V8.0.60
C:\Users\Norbert\My Programs>nvcc -gencode=arch=compute_50,code=\"sm_50,compute_50\" -o add_bug.exe add_bug.cu
nvcc warning : nvcc support for Microsoft Visual Studio 2010 and earlier has been deprecated and is no longer being maintained
add_bug.cu
support for Microsoft Visual Studio 2010 has been deprecated!
Creating library add_bug.lib and object add_bug.exp
C:\Users\Norbert\My Programs>add_bug
32-bit Addition: 0x5824f2440ed810fa
64-bit Addition: 0x5824f2440ed810fa
The issue seems reproducible if I compile for an older architecture (I tried sm_30) and let the JIT compiler (driver 385.41) compile the PTX for my sm_50 GPU:
C:\Users\Norbert\My Programs>nvcc -gencode=arch=compute_30,code=\"sm_30,compute_30\" -o add_bug.exe add_bug.cu
nvcc warning : nvcc support for Microsoft Visual Studio 2010 and earlier has been deprecated and is no longer being maintained
add_bug.cu
support for Microsoft Visual Studio 2010 has been deprecated!
Creating library add_bug.lib and object add_bug.exp
C:\Users\Norbert\My Programs>add_bug
32-bit Addition: 0x5824f2430ed810fa
64-bit Addition: 0x5824f2440ed810fa
For potential workarounds, I would suggest lowering the PTXAS optimization level (default: -O3), e.g. first try -Xptxas -O2, then -Xptxas -O1.
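For example, a hypothetical invocation mirroring the build line above (-Xptxas forwards the option to the offline PTXAS):

C:\Users\Norbert\My Programs>nvcc -gencode=arch=compute_30,code=\"sm_30,compute_30\" -Xptxas -O2 -o add_bug.exe add_bug.cu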
Ahh, my apologies, I’ll submit them there from now on :)
I did want it public though, as I couldn’t find anything online when I was chasing the bug and thought I was going crazy (so if anyone else hits the same issue, they’ll find this thread).
I just tried it on a 980 Ti compiling for sm_52 and it ran perfectly fine. However, as you mentioned, if I compile it for sm_30 on the 980 Ti, the bug is still present.
So the problem may be with the PTXAS integrated into the driver, rather than with the PTXAS that is part of the offline compiler. These two versions should be similar, but they are rarely ever identical, due to different release schedules for driver packages (refreshed monthly) and CUDA packages (refreshed twice per year or thereabouts).
Thanks for the insight and your help! I installed the latest version of CUDA 9 (release 9.0, V9.0.176), and the bug still exists. Everything works in debug mode on all platforms, but release builds targeting the older architectures still fail.
I also tried a 1080 and a 1080 Ti; both exhibited the same problem.
The bug also occurs with /Ox, /O2, /O1, and /Od. Adding the -G parameter for debug makes the problem go away.
/Ox, /O2, /O1, and /Od are switches for the host compiler MSVC and only affect host code. -G prepares device code for debugging by turning off all optimizations, making the code very slow.
Since current indications are that the issue is with JIT compilation, I would suggest building a fat binary that incorporates SASS (machine code) for all target architectures that need to be supported. In that scenario JIT compilation is never used.
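A sketch of such a fat-binary build line (the architecture list here is only an example; enumerate whichever targets you need to support):

nvcc -gencode=arch=compute_30,code=sm_30 -gencode=arch=compute_50,code=sm_50 -gencode=arch=compute_52,code=sm_52 -gencode=arch=compute_61,code=sm_61 -o add_bug.exe add_bug.cu

Note that with no compute_XX (PTX) target embedded, the binary cannot run on architectures newer than those listed.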
The generated SASS code looks (to first order) roughly as expected, so only a detailed and time-consuming analysis is going to reveal why it does not always work. If I had to guess: incorrect reasoning about the carry bit when propagating constants.
If it helps, this issue was originally found when doing an extended-precision addition using non-constant values (it was operating on registers that had been set by other code doing other calculations).
Well, scrap my theory about constant propagation then. You would want to put a note in your bug report that the issue affects not just cases where the operands are immediate constants.
I was working with some extended-precision PTX code not too long ago but did not encounter any issues with carry propagation in the course of that work. PTXAS (an optimizing compiler) contains many architecture-specific code transformations, which likely explains why this issue crops up only in particular circumstances.
I am not able to avoid the bug with the latest software; please help.
$nvcc -arch=sm_61 -Xptxas -O0 addc_bug.cu
works, but is horribly slow (as does -G)
GPU GTX1080
Fedora 25 x86_64
latest CUDA
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176
latest driver
Driver Version: 384.98
(384.81 is included with CUDA; could that be a problem?)
Only NVIDIA can fix compiler bugs. File a bug report with them. The PTXAS in the driver only comes into play for JIT compilation, and since the GTX 1080 has compute capability 6.1, that shouldn’t happen if you compile with sm_61.
If you can’t even get it to work with -Xptxas -O1, try switching to the use of 64-bit operations as a workaround, as shown in the OP. Or try replacing your assembly code with C++ code, using the following set of macros (obviously this won’t be as fast as using addc and will increase register pressure due to the use of temporary variables, but at least it should allow you to compile your code with full optimizations).
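The macros themselves are not reproduced in this excerpt; as a rough sketch of the approach (hypothetical macro names, deriving the carry from unsigned comparisons so the generated PTX contains no addc):

#include <cstdint>

// Hypothetical names, not the original macros. ADDcc produces a sum and a
// carry-out; ADDC additionally consumes a carry-in. The trick: an unsigned
// add wrapped around exactly when the result is smaller than an addend.
#define ADDcc(r, cy, a, b) \
    do { uint32_t _a = (a); (r) = _a + (b); (cy) = ((r) < _a); } while (0)
#define ADDC(r, cy, a, b) \
    do { uint32_t _a = (a); uint32_t _s = _a + (b); uint32_t _c = (_s < _a); \
         (r) = _s + (cy); (cy) = _c | ((r) < _s); } while (0)

// Example: 64-bit add built from two 32-bit words.
__device__ uint64_t add64_via_macros(uint64_t x, uint64_t y)
{
    uint32_t lo, hi, cy;
    ADDcc(lo, cy, (uint32_t)x, (uint32_t)y);                   // low word
    ADDC (hi, cy, (uint32_t)(x >> 32), (uint32_t)(y >> 32));   // high word
    return ((uint64_t)hi << 32) | lo;
}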
ADDC is broken across the board on CUDA 9, so you’ll have to use CUDA 8 and compile on a platform-specific basis for now.
I’ve submitted a bug report, but the issue has yet to be fixed.
On a side note, I’ve since tested the same bug against CUDA 9 with a V100 using compute capability 7.0 on an Ubuntu 16.04 system, and still experienced the same issue.
Filing a bug was the right thing to do, although I think it is likely we’ll have to wait until CUDA 9.5 (or whatever the next version is going to be called) for the fix to materialize.
This is actually kind of sad, because if memory serves, this functionality has been broken before. As I recall it broke one of the better-known prime-number search programs a few years back. One would think NVIDIA added appropriate regression test coverage after that.
OT:
Unfortunately, only “-Xptxas -O0” (with fc25, gcc 6.4.1, and CUDA 8 or CUDA 9) avoids the bug in my other experiments :( Possibly going deeper into CUDA history is needed…
The CUDA 9.1 release notes mention this bug. It sounds like the failure is still present in 9.1 but can be worked around by using JIT. The release notes aren’t too clear, though.
Thanks for the great news: CUDA 9.1 corrected the ADDC bug for me. Unfortunately, even using the latest driver 387.34 and JIT did not help with my other experiments. Eagerly awaiting the R390 driver.