There is a bug in extended-precision addition for compute capability 5.2 and below when compiling with the standard Visual Studio 2015/CUDA 8.0.26 toolchain. Some numbers, when added, don’t carry correctly (or rather, a -1 appears to be carried when a 0 should be, or a 0 when a 1 should be). When compiling in debug mode for 5.2, or for compute capability 6.0 or 6.1, the PTX works as intended.
I’m sure the compiler is breaking the add.u64 down into two 32-bit adds somewhere, but it must be treating them differently from manually written 32-bit adds.
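For reference, here is a minimal sketch of the kind of repro kernel under discussion (add_bug.cu itself is not reproduced in this thread; the input values below are made up, chosen so the low-word add produces a carry and the correct sum is 0x5824f2440ed810fa, matching the output shown further down):

#include <cstdio>
#include <cstdint>

__global__ void add_test(uint64_t a, uint64_t b, uint64_t *r32, uint64_t *r64)
{
    uint32_t lo, hi;
    // Extended-precision add via two chained 32-bit adds:
    // add.cc.u32 sets the carry flag, addc.u32 consumes it.
    asm("add.cc.u32 %0, %2, %4;\n\t"
        "addc.u32   %1, %3, %5;"
        : "=r"(lo), "=r"(hi)
        : "r"((uint32_t)a), "r"((uint32_t)(a >> 32)),
          "r"((uint32_t)b), "r"((uint32_t)(b >> 32)));
    *r32 = ((uint64_t)hi << 32) | lo;
    *r64 = a + b;   // single 64-bit add for comparison
}

int main(void)
{
    uint64_t h[2], *d;
    cudaMalloc((void**)&d, sizeof(h));
    add_test<<<1,1>>>(0x2812f1a187654321ULL, 0x301200a28772cdd9ULL, d, d + 1);
    cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
    printf("32-bit Addition: 0x%016llx\n", (unsigned long long)h[0]);
    printf("64-bit Addition: 0x%016llx\n", (unsigned long long)h[1]);
    cudaFree(d);
    return 0;
}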
Please note that this forum is not designed as a bug reporting channel. CUDA bug reports should be filed using the form linked from the registered developer website (https://developer.nvidia.com/).
That said, I am not able to reproduce the issue at the moment. There could be two reasons for this: (1) I have an sm_50 device here and compiled accordingly; (2) I am using the latest shipping version of CUDA 8, while you seem to be using an earlier version.
C:\Users\Norbert\My Programs>nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Mon_Jan__9_17:32:33_CST_2017
Cuda compilation tools, release 8.0, V8.0.60
C:\Users\Norbert\My Programs>nvcc -gencode=arch=compute_50,code=\"sm_50,compute_50\" -o add_bug.exe add_bug.cu
nvcc warning : nvcc support for Microsoft Visual Studio 2010 and earlier has been deprecated and is no longer being maintained
add_bug.cu
support for Microsoft Visual Studio 2010 has been deprecated!
Creating library add_bug.lib and object add_bug.exp
C:\Users\Norbert\My Programs>add_bug
32-bit Addition: 0x5824f2440ed810fa
64-bit Addition: 0x5824f2440ed810fa
The issue seems reproducible if I compile for an older architecture (I tried sm_30) and let the JIT compiler (driver 385.41) compile the PTX for my sm_50 GPU:
C:\Users\Norbert\My Programs>nvcc -gencode=arch=compute_30,code=\"sm_30,compute_30\" -o add_bug.exe add_bug.cu
nvcc warning : nvcc support for Microsoft Visual Studio 2010 and earlier has been deprecated and is no longer being maintained
add_bug.cu
support for Microsoft Visual Studio 2010 has been deprecated!
Creating library add_bug.lib and object add_bug.exp
C:\Users\Norbert\My Programs>add_bug
32-bit Addition: 0x5824f2430ed810fa
64-bit Addition: 0x5824f2440ed810fa
For potential workarounds, I would suggest lowering the PTXAS optimization level (default: -O3), e.g. first try -Xptxas -O2, then -Xptxas -O1.
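For example, a hypothetical invocation mirroring the build line above (-Xptxas forwards the option to the offline PTXAS):

C:\Users\Norbert\My Programs>nvcc -gencode=arch=compute_30,code=\"sm_30,compute_30\" -Xptxas -O2 -o add_bug.exe add_bug.cu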
Ahh, my apologies, I’ll submit them there from now on :)
I did want it public though, as I couldn’t find anything online when I was chasing the bug and thought I was going crazy (so if anyone else hits the same issue, they’ll find this thread).
I just tried it on a 980 Ti compiling for sm_52 and it ran perfectly fine. However, as you mentioned, if I compile it for sm_30 on the 980 Ti, the bug is still present.
So the problem may be with the PTXAS integrated into the driver, rather than with the PTXAS that is part of the offline compiler. These two versions should be similar, but they are rarely ever identical, due to different release schedules for driver packages (refreshed monthly) and CUDA packages (refreshed twice per year or thereabouts).
Thanks for the insight and your help! I installed the latest version of CUDA 9 (release 9.0, V9.0.176), and the bug still exists. Everything works in debug mode on all platforms, but release builds targeting the older architectures still fail.
I also tried a 1080 and a 1080 Ti; both exhibited the same problem.
The bug also occurs with /Ox, /O2, /O1, and /Od. Adding the -G parameter for debug makes the problem go away.
/Ox, /O2, /O1, and /Od are switches for the host compiler MSVC and only affect host code. -G prepares device code for debugging by turning off all optimizations, making the code very slow.
Since current indications are that the issue is with JIT compilation, I would suggest building a fat binary that incorporates SASS (machine code) for all target architectures that need to be supported. In that scenario JIT compilation is never used.
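A sketch of such a fat-binary build line (the architecture list here is only an example; enumerate whichever targets you need to support):

nvcc -gencode=arch=compute_30,code=sm_30 -gencode=arch=compute_50,code=sm_50 -gencode=arch=compute_52,code=sm_52 -gencode=arch=compute_61,code=sm_61 -o add_bug.exe add_bug.cu

Note that with no compute_XX (PTX) target embedded, the binary cannot run on architectures newer than those listed.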
The generated SASS code looks (to first order) roughly as expected, so only a detailed and time-consuming analysis is going to reveal why it does not always work. If I had to guess: incorrect reasoning about the carry bit when propagating constants.
If it helps, this issue was originally found when doing an extended-precision addition using non-constant values (it was operating on registers that had been set by other code doing other calculations).
Well, scrap my theory about constant propagation then. You would want to put a note in your bug report that the issue affects not just cases where the operands are immediate constants.
I was working with some extended-precision PTX code not too long ago but did not encounter any issues with carry propagation in the course of that work. PTXAS (an optimizing compiler) contains many architecture-specific code transformations, which likely explains why this issue crops up only in particular circumstances.
I am not able to avoid the bug with the latest software; please help.
$nvcc -arch=sm_61 -Xptxas -O0 addc_bug.cu
works, but is horribly slow (as does -G)
GPU GTX1080
Fedora 25 x86_64
latest CUDA
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176
latest driver
Driver Version: 384.98
(384.81 is included with CUDA; could that be a problem?)
Only NVIDIA can fix compiler bugs. File a bug report with them. The PTXAS in the driver only comes into play for JIT compilation, and since the GTX 1080 has compute capability 6.1, that shouldn’t happen if you compile with sm_61.
If you can’t even get it to work with -Xptxas -O1, try switching to the use of 64-bit operations as a workaround, as shown in the OP. Or try replacing your assembly code with C++ code, using the following set of macros (obviously this won’t be as fast as using addc and will increase register pressure due to the use of temporary variables, but at least it should allow you to compile your code with full optimizations).
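The macros themselves are not reproduced in this excerpt; as a rough sketch of the approach (hypothetical macro names, deriving the carry from unsigned comparisons so the generated PTX contains no addc):

#include <cstdint>

// Hypothetical names, not the original macros. ADDcc produces a sum and a
// carry-out; ADDC additionally consumes a carry-in. The trick: an unsigned
// add wrapped around exactly when the result is smaller than an addend.
#define ADDcc(r, cy, a, b) \
    do { uint32_t _a = (a); (r) = _a + (b); (cy) = ((r) < _a); } while (0)
#define ADDC(r, cy, a, b) \
    do { uint32_t _a = (a); uint32_t _s = _a + (b); uint32_t _c = (_s < _a); \
         (r) = _s + (cy); (cy) = _c | ((r) < _s); } while (0)

// Example: 64-bit add built from two 32-bit words.
__device__ uint64_t add64_via_macros(uint64_t x, uint64_t y)
{
    uint32_t lo, hi, cy;
    ADDcc(lo, cy, (uint32_t)x, (uint32_t)y);                   // low word
    ADDC (hi, cy, (uint32_t)(x >> 32), (uint32_t)(y >> 32));   // high word
    return ((uint64_t)hi << 32) | lo;
}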
ADDC is broken across the board on CUDA 9, so you’ll have to use CUDA 8 and compile on a platform-specific basis for now.
I’ve submitted a bug report, but the issue has yet to be fixed.
On a side note, I’ve since tested the same bug against CUDA 9 with a V100 using compute capability 7.0 on an Ubuntu 16.04 system, and still experienced the same issue.
Filing a bug was the right thing to do, although I think it is likely we’ll have to wait until CUDA 9.5 (or whatever the next version is going to be called) for the fix to materialize.
This is actually kind of sad, because if memory serves, this functionality has been broken before. As I recall it broke one of the better-known prime-number search programs a few years back. One would think NVIDIA added appropriate regression test coverage after that.
OT:
Unfortunately, only “-Xptxas -O0” (with fc25, gcc 6.4.1, and CUDA 8 or CUDA 9) avoids the bug in my other experiments :( Possibly going deeper into CUDA history is needed…
The CUDA 9.1 release notes mention this bug. It sounds like the failure is still present in 9.1 but can be worked around by using JIT. The release notes aren’t too clear, though.
Thanks for the great news: CUDA 9.1 corrected the ADDC bug for me. Unfortunately, even using the latest driver 387.34 and JIT did not help with my other experiments. Eagerly awaiting the R390 driver.