It seems that I found a bug of nvcc 5.5 in code generation.
I tested nvcc 5.5 on a x64 openSUSE 13.1 with the following (very simple) code. I installed CUDA toolkit from nVidia’s CUDA repository for openSUSE.
//this is test.cu
__device__ int test_device() {
__shared__ int z[1024];
unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;
return z[i];
}
__global__ void test(int output[]) {
unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;
output[i] = test_device();
}
int main(int argc, char *argv[])
{
int *d_output;
cudaMalloc(&d_output, sizeof(int) * 1024);
int threadsPerBlock = 256;
int blocksPerGrid = 1024 / threadsPerBlock;
test<<<blocksPerGrid, threadsPerBlock>>>(d_output);
cudaFree(d_output);
}
I generated its PTX code. The compiling command is:
nvcc test.cu -ptx -o test.ptx
and I got a PTX file like this:
mov.u16 %rh1, %ctaid.x;
mov.u16 %rh2, %ntid.x;
mul.wide.u16 %r1, %rh1, %rh2;
cvt.u32.u16 %r2, %tid.x;
add.u32 %r3, %r2, %r1;
cvt.u64.u32 %rd1, %r3;
mul.wide.u32 %rd2, %r3, 4;
mov.u64 %rd3, __cuda_local_var_31148_33_non_const_z__0;
add.u64 %rd4, %rd2, %rd3;
ld.shared.s32 %r4, [%rd4+0];
ld.param.u64 %rd5, [__cudaparm__Z4testPi_output];
add.u64 %rd6, %rd5, %rd2;
st.global.s32 [%rd6+0], %r4;
Note that in line 6, integer in r3 was converted to 64-bit and stored in rd1. But after that rd1 was never used!
And there’s also problem in line 7, which multiplied r3 by 4.
But for strength reduction:
shl.b64 %rd2, %rd1, 2
is a better solution.
Do you get the same result with your nvcc compiler? Is this a bug of NVCC?