Hi,
compiling the following code snippet
extern "C" __global__ void
test_kernel(double * C, const double * A)
{
    const int tix = threadIdx.x;
    float tmp = (float)tix;
    const double *trackA = A + (int)tmp;
    *C = trackA[0];
}
saved as test.cu with
nvcc -arch sm_35 -v -Xptxas="-v" -keep -ftz=false -cubin test.cu
I get the following strange output
ptxas info : 11 bytes gmem
ptxas info : Compiling entry function 'test_kernel' for 'sm_35'
ptxas info : Function properties for test_kernel
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 6 registers, 336 bytes cmem[0]
Inspecting the PTX file, I find:
//
// Generated by NVIDIA NVVM Compiler
// Compiler built on Fri May 10 02:50:05 2013 (1368147005)
// Cuda compilation tools, release 5.5, V5.5.0
//
.version 3.2
.target sm_35
.address_size 64
.file 1 ".../test.cu", 1375867951, 197
.file 2 "/usr/local/cuda/bin/../targets/x86_64-linux/include/cuda_device_runtime_api.h", 1369186301, 7655
.file 3 "/usr/local/cuda/bin/../targets/x86_64-linux/include/device_functions.h", 1369186301, 185228
.global .align 1 .b8 $str[11] = {95, 95, 67, 85, 68, 65, 95, 70, 84, 90, 0};
i.e. there is a string variable ‘__CUDA_FTZ’ written to global memory, which accounts for the 11 bytes of gmem reported by ptxas. The problem seems to be the part “A + (int)tmp”, i.e. the float-to-int conversion used in the pointer offset. Setting the compiler option to -ftz=false or -ftz=true does not make any difference. I did not see this behaviour under CUDA 5.0. I also do not understand why cuda_device_runtime_api.h is included. Thanks for any help on this.
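For anyone who wants to check how I read the $str initializer in the PTX, here is a minimal host-side sketch that decodes the byte list into text. The names decode_b8 and kStrBytes are made up for this illustration and are not part of any CUDA header; the byte values are copied from the PTX listing.

```cpp
#include <string>
#include <vector>

// Decode a PTX .b8 initializer list (decimal byte values) into a string,
// stopping at the first 0 byte, which acts as the C-string terminator.
// decode_b8 is a hypothetical helper written only for this post.
std::string decode_b8(const std::vector<int>& bytes) {
    std::string out;
    for (int b : bytes) {
        if (b == 0) break;                    // terminator reached
        out.push_back(static_cast<char>(b));  // ASCII value -> character
    }
    return out;
}

// Initializer of $str[11], copied from the PTX listing in this post.
const std::vector<int> kStrBytes = {95, 95, 67, 85, 68, 65, 95, 70, 84, 90, 0};
```

Calling decode_b8(kStrBytes) gives the ten characters of __CUDA_FTZ, and kStrBytes.size() is 11 once the terminating zero is counted, which matches the array length $str[11] in the PTX.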