Is this a code generation/optimization bug in NVCC 5.5?

It seems I have found a code-generation bug in nvcc 5.5.
I tested nvcc 5.5 on an x64 openSUSE 13.1 system with the following (very simple) code. I installed the CUDA toolkit from NVIDIA's CUDA repository for openSUSE.

// minimal repro case: read from uninitialized shared memory
__device__ int test_device() {
    __shared__ int z[1024];
    unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;
    return z[i];
}

__global__ void test(int output[]) {
    unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;
    output[i] = test_device();
}

int main(int argc, char *argv[]) {
    int *d_output;
    cudaMalloc(&d_output, sizeof(int) * 1024);

    int threadsPerBlock = 256;
    int blocksPerGrid = 1024 / threadsPerBlock;
    test<<<blocksPerGrid, threadsPerBlock>>>(d_output);

    cudaFree(d_output);
    return 0;
}

I generated its PTX code. The compile command is:

nvcc -ptx test.cu -o test.ptx

and I got a PTX file like this:

mov.u16 	%rh1, %ctaid.x;
mov.u16 	%rh2, %ntid.x;
mul.wide.u16 	%r1, %rh1, %rh2;
cvt.u32.u16 	%r2, %tid.x;
add.u32 	%r3, %r2, %r1;
cvt.u64.u32 	%rd1, %r3;
mul.wide.u32 	%rd2, %r3, 4;
mov.u64 	%rd3, __cuda_local_var_31148_33_non_const_z__0;
add.u64 	%rd4, %rd2, %rd3;
ld.shared.s32 	%r4, [%rd4+0];
ld.param.u64 	%rd5, [__cudaparm__Z4testPi_output];
add.u64 	%rd6, %rd5, %rd2;
st.global.s32 	[%rd6+0], %r4;

Note that on line 6, the integer in %r3 is converted to 64 bits and stored in %rd1. But after that, %rd1 is never used!
There is also a problem on line 7, which multiplies %r3 by 4.
With strength reduction:

shl.b64		%rd2, %rd1, 2;

is a better solution.
Do you get the same result with your nvcc compiler? Is this a bug in NVCC?

Do you get the same code when passing optimization flags such as -O2 or -O3?

When I compile your code, I do get the same PTX. However, the compiler spits out a warning:

variable “z” is used before its value is set

If I modify your device function to initialize z, I no longer see this warning.
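For reference, one way to initialize z so that the warning goes away (a hypothetical sketch; the actual modification used above was not shown):

```cuda
__device__ int test_device() {
    __shared__ int z[1024];
    unsigned int i = threadIdx.x;  // shared memory is per-block, so index within the block
    z[i] = (int)i;                 // give z a defined value before reading it
    __syncthreads();               // make all threads' writes visible block-wide
    return z[i];
}
```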

It's not a problem with optimization flags. Device code is optimized at the -O3 level by default.

Yes, my code triggers a compilation warning, but that doesn't matter.
Even if I insert a statement between lines 4 and 5 to initialize the shared memory, this problem still exists.

The problem is that in the assembly (PTX) code, %rd1 is generated but never used. This means the PTX instruction at line 6 is completely unnecessary. And the multiplication instruction at line 7 is computationally expensive and should be replaced by a left shift. This occurs at every shared memory access and should be avoided.

It appears you are building for an sm_10 target. Is that intentional? Code for sm_1x targets goes through the old Open64-based frontend, rather than the new LLVM-based frontend (NVVM) used for sm_20 and higher platforms.
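For comparison, the target architecture can be selected explicitly when generating PTX, which routes the build through the NVVM frontend (command sketch; file names are illustrative):

```shell
# Generate PTX for an sm_20 target; sm_2x and higher use the
# newer LLVM-based NVVM frontend instead of the Open64 frontend.
nvcc -arch=sm_20 -ptx test.cu -o test.ptx
```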

Note that PTX is merely a hardware-independent intermediate representation. With the CUDA toolchain, the PTX code is further compiled by ptxas into machine code (SASS), which is the only code relevant to performance. You can inspect the SASS by using cuobjdump --dump-sass.
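A typical way to get at the SASS (command sketch; file names are illustrative):

```shell
# Compile to a device binary first, then disassemble the machine code:
nvcc -arch=sm_20 -cubin -o test.cubin test.cu
cuobjdump --dump-sass test.cubin
```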

Note that despite the name "ptxas", which may suggest an assembler, ptxas is a full compiler capable of loop unrolling, strength reduction, if-conversion, dead code elimination, etc., along with various platform-specific optimizations, instruction scheduling, and register allocation.