Strange exp() behavior: indirect modification of another variable with CUDA 6.5 on sm_20

Hello,
I created a simple test:

#include <stdio.h>
__device__ __noinline__ void test_func(double x) {
  double g = exp(x);
  double E=1.0; 
  if(x<0) E=0.0;
  printf("g=%f E=%f\n", g, E);
}
__global__ void kernel(int ix, double *p) {
  test_func(1);
  p[ix%3]=0;
}
int main(){
  double *p;
  cudaMalloc((void**)&p, sizeof(double)*1000);
  kernel<<<1,1>>>(0,p); cudaDeviceSynchronize();
  return 0;
}

After separate compilation and linking (this is important!) with CUDA 6.5, running on a Tesla C2050 prints an unexpected value for E:

/common/cuda-6.5/bin/nvcc -arch=sm_20 -o test.o -dc test.cu
/common/cuda-6.5/bin/nvcc -arch=sm_20 -o test.x test.o
./test.x
g=2.718282 E=inf

This defect doesn’t occur with CUDA 5.0, or on the Kepler and Maxwell GPUs I tested.
If anyone has a Fermi-architecture card and CUDA 6.5 or 7, please run this test. I would like to know whether this is really an nvcc bug.

From the description, this has the appearance of a compiler bug. To narrow this down a bit more, what happens if you change the compilation stage to

/common/cuda-6.5/bin/nvcc -arch=sm_20 -o test.o -Xptxas -O0 -dc test.cu

This turns off optimizations in the compiler backend (which compiles PTX to SASS). The compiler backend contains many architecture-specific transformations, which could explain why you observe a difference when building for Fermi platforms rather than Kepler or Maxwell platforms. If the problem goes away with -Xptxas -O0, that would also seem to exclude a link-time error.

For the record, when I compile the code with CUDA 7.5, and then run the resulting executable on an sm_50 class GPU, it returns the correct result, meaning E=1.0.

Thanks for the suggestion. The bug is probably caused by optimization: -O0 and -O1 give the correct result, but -O2 and -O3 don’t.

Seems to be a bug in CUDA 6.5 that was fixed in CUDA 7.0 and 7.5:

$ /usr/local/cuda-6.5/bin/nvcc -arch=sm_20 -o t965.o -dc t965.cu
$ /usr/local/cuda-6.5/bin/nvcc -arch=sm_20 -o t965 t965.o
$ ./t965
g=2.718282 E=inf
$ /usr/local/cuda-7.0/bin/nvcc -arch=sm_20 -o t965.o -dc t965.cu
$ /usr/local/cuda-7.0/bin/nvcc -arch=sm_20 -o t965 t965.o
$ ./t965
g=2.718282 E=1.000000
$ /usr/local/cuda-7.5/bin/nvcc -arch=sm_20 -o t965.o -dc t965.cu
$ /usr/local/cuda-7.5/bin/nvcc -arch=sm_20 -o t965 t965.o
$ ./t965
g=2.718282 E=1.000000
$

Tested on Fedora 20 with a Quadro 5000 (cc 2.0 Fermi device) and driver 352.39.
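
For anyone comparing several toolkits or optimization levels, it may be easier to have the test check E itself instead of reading the printed line. The following is only a sketch of that idea, not the original reproducer: the names test_func_checked, kernel_checked, expected_E, run_selfcheck and the extra out buffer are hypothetical additions. Note also the assumption that the bug still shows up in this form. Storing E to memory may change the generated code, so a passing result here does not prove the original test passes.

```cuda
#include <cstdio>
#include <cmath>
#include <cuda_runtime.h>

// Reference for the branch in test_func: E is 1.0 unless x is negative.
__host__ __device__ inline double expected_E(double x) {
  return (x < 0) ? 0.0 : 1.0;
}

// Same body as test_func in the original test, but g and E are also
// stored to memory so the host can inspect them.
// 'out' is a hypothetical 2-element device buffer, not part of the original.
__device__ __noinline__ void test_func_checked(double x, double *out) {
  double g = exp(x);
  double E = 1.0;
  if (x < 0) E = 0.0;
  printf("g=%f E=%f\n", g, E);
  out[0] = g;
  out[1] = E;
}

__global__ void kernel_checked(int ix, double *p, double *out) {
  test_func_checked(1, out);
  p[ix % 3] = 0;
}

// Returns 0 if E came back as expected, 1 if E is wrong, 2 on a CUDA error.
int run_selfcheck() {
  double *p = 0, *out = 0;
  double host_out[2] = {0.0, 0.0};
  if (cudaMalloc((void**)&p, sizeof(double) * 1000) != cudaSuccess) return 2;
  if (cudaMalloc((void**)&out, sizeof(double) * 2) != cudaSuccess) {
    cudaFree(p);
    return 2;
  }
  kernel_checked<<<1, 1>>>(0, p, out);
  cudaError_t err = cudaGetLastError();           // launch errors
  if (err == cudaSuccess) err = cudaDeviceSynchronize();  // execution errors
  if (err == cudaSuccess)
    err = cudaMemcpy(host_out, out, sizeof(host_out), cudaMemcpyDeviceToHost);
  cudaFree(out);
  cudaFree(p);
  if (err != cudaSuccess) {
    fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
    return 2;
  }
  return (host_out[1] == expected_E(1.0)) ? 0 : 1;
}
```

A main() that simply returns run_selfcheck() is enough to drive it. Build it with the same two-step -dc compile and link as above, and the exit status then distinguishes a wrong E (1) from a CUDA runtime failure (2).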