Is this a bug of NVCC 5.5 on code generation/optimization?

OQ1 · April 24, 2014, 9:51am

I think I have found a code-generation bug in nvcc 5.5.
I tested nvcc 5.5 on x64 openSUSE 13.1 with the following (very simple) code. I installed the CUDA toolkit from NVIDIA’s CUDA repository for openSUSE.

//this is test.cu
__device__ int test_device() {
    __shared__ int z[1024];
    unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;
    return z[i];
}

__global__ void test(int output[]) {
    unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;
    output[i] = test_device();
}

int main(int argc, char *argv[])
{
    int *d_output;
    cudaMalloc(&d_output, sizeof(int) * 1024);

    int threadsPerBlock = 256;
    int blocksPerGrid = 1024 / threadsPerBlock;
    test<<<blocksPerGrid, threadsPerBlock>>>(d_output);
        
    cudaFree(d_output);
}

I then generated the PTX for it. The compile command was:

nvcc test.cu -ptx -o test.ptx

and I got a PTX file like this:

mov.u16 	%rh1, %ctaid.x;
mov.u16 	%rh2, %ntid.x;
mul.wide.u16 	%r1, %rh1, %rh2;
cvt.u32.u16 	%r2, %tid.x;
add.u32 	%r3, %r2, %r1;
cvt.u64.u32 	%rd1, %r3;
mul.wide.u32 	%rd2, %r3, 4;
mov.u64 	%rd3, __cuda_local_var_31148_33_non_const_z__0;
add.u64 	%rd4, %rd2, %rd3;
ld.shared.s32 	%r4, [%rd4+0];
ld.param.u64 	%rd5, [__cudaparm__Z4testPi_output];
add.u64 	%rd6, %rd5, %rd2;
st.global.s32 	[%rd6+0], %r4;

Note that in line 6 the integer in %r3 is converted to 64 bits and stored in %rd1, but %rd1 is never used after that!
There is also a problem in line 7, which multiplies %r3 by 4.
As a strength reduction,

shl.b64		%rd2, %rd1, 2

would be a better solution.
Do you get the same result with your nvcc compiler? Is this a bug in nvcc?

cbuchner1 · April 24, 2014, 10:10am

Do you get the same code when passing optimization flags such as -O2 or -O3?

Robert_Crovella · April 24, 2014, 3:11pm

When I compile your code, I do get the same PTX. However, the compiler emits a warning:

variable “z” is used before its value is set

If I modify your device function to initialize z, I no longer see this behavior.

OQ1 · April 25, 2014, 3:07am

It is not a matter of optimization flags; device code is optimized at -O3 by default.

Yes, my code triggers a compilation warning, but that doesn’t matter.
Even if I insert

z[i]=i;

between lines 4 and 5 to initialize the shared memory, the problem persists.

The problem is that, in the PTX code, %rd1 is generated but never used, which means the PTX instruction at line 6 is completely unnecessary. In addition, the multiplication at line 7 is computationally expensive and should be replaced by a left shift. This happens at every shared memory access and should be avoided.
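To make the claim concrete, here is a minimal host-side sketch in plain C++ (not taken from the thread) that models the two ways of turning the 32-bit index in %r3 into a 64-bit byte offset. The function names mul_wide_u32, cvt_then_shl and same_byte_offset are hypothetical helpers introduced only for illustration. The sketch shows that both forms compute the same value; it says nothing about which one is faster in the final machine code.

```cpp
#include <cstdint>

// Model of `mul.wide.u32 %rd2, %r3, 4`:
// multiply two 32-bit operands and keep the full 64-bit product.
// (Hypothetical helper name, for illustration only.)
static uint64_t mul_wide_u32(uint32_t a, uint32_t b) {
    return static_cast<uint64_t>(a) * static_cast<uint64_t>(b);
}

// Model of `cvt.u64.u32 %rd1, %r3` followed by `shl.b64 %rd2, %rd1, 2`:
// zero-extend the index to 64 bits, then shift left by two.
// (Hypothetical helper name, for illustration only.)
static uint64_t cvt_then_shl(uint32_t a) {
    uint64_t wide = a;  // zero-extension, as cvt.u64.u32 does
    return wide << 2;   // multiply by sizeof(int) == 4
}

// True when both forms give the same byte offset for this index.
static bool same_byte_offset(uint32_t index) {
    return mul_wide_u32(index, 4u) == cvt_then_shl(index);
}
```

Because the operand is zero-extended before the shift, the two forms agree for every 32-bit index, including values whose product no longer fits in 32 bits.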

njuffa · April 25, 2014, 3:17am

It appears you are building for an sm_10 target. Is that intentional? Code for sm_1x targets goes through the old Open64-based frontend, rather than the new LLVM-based frontend (NVVM) used for sm_20 and higher platforms.

Note that PTX is merely a hardware-independent intermediate representation. With the CUDA toolchain the PTX code is further compiled with ptxas into machine code (SASS), which is the only code relevant to performance. You can inspect SASS by using cuobjdump --dump-sass.

Note that despite the name “ptxas”, which may imply an assembler, ptxas is a compiler capable of loop unrolling, strength reduction, if-conversion, dead code elimination, etc., along with various platform-specific optimizations, instruction scheduling, and register allocation.
