Program gives wrong answer except with emulation


I’m sure I’m doing something stupid here, but I have a simple program that works just fine in C++, and the kernel works fine in emulation mode. However, it gives me totally different (and wrong) numbers when I run it on the GPU. I’m only using one thread for simplicity’s sake, and it’s obviously not optimized for speed. The project is attached. Could someone give me an idea of what’s going wrong here? Thanks.

dim3 grid(1, 1, 1);
dim3 onethread(1, 1, 1);
EDCalc<<<grid, onethread>>>(cudagenomearray, cudaedspacingarray, cudarhoarray, cudank, cudadistarray, ptsperiter, boxes, 1.0f/rough);


__global__ void
EDCalc(float* genome, float* edspacing, float* rhoarray, float2* dnk, float* distarray, int ptsperthread, int refllayers, float roughness)
{
    for (int i = 0; i < ptsperthread; i++)
    {
        dnk[i].x = 0.0f;
        dnk[i].y = 0.0f;
        float temp = 0;
        for (int k = 0; k < refllayers; k++)
        {
            float dist = (edspacing[i] - distarray[k]) * roughness;
            if (dist > 6.0f)
            {
                temp += (rhoarray[k]) * (2.0f);
            }
            else if (dist > -6.0f)
            {
                temp += (rhoarray[k]) * (1.0f + erff(dist));
            }
        }
        dnk[i].x = temp;
    }
}




I’m using SDK 2.0 beta on Vista with a GeForce 8600 GT.
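For checking the GPU output, the loop in the kernel can be mirrored on the host. The following is only a sketch: the names `edcalc_reference` and `max_abs_diff` are made up for illustration, and it assumes `rhoarray` has at least as many entries as `distarray`, as the kernel's indexing implies.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Host-side mirror of the EDCalc loop (hypothetical helper, not from the project).
// Returns what the kernel writes to dnk[i].x for each point i.
std::vector<float> edcalc_reference(const std::vector<float>& edspacing,
                                    const std::vector<float>& rhoarray,
                                    const std::vector<float>& distarray,
                                    float roughness) {
    std::vector<float> out(edspacing.size(), 0.0f);
    for (std::size_t i = 0; i < edspacing.size(); ++i) {
        float temp = 0.0f;
        for (std::size_t k = 0; k < distarray.size(); ++k) {
            const float dist = (edspacing[i] - distarray[k]) * roughness;
            if (dist > 6.0f) {
                temp += rhoarray[k] * 2.0f;
            } else if (dist > -6.0f) {
                // std::erf has a float overload, so this stays in single precision.
                temp += rhoarray[k] * (1.0f + std::erf(dist));
            }
            // dist <= -6.0f contributes nothing, as in the kernel.
        }
        out[i] = temp;
    }
    return out;
}

// Largest absolute difference over the common prefix of two result arrays
// (hypothetical helper for comparing host and device output).
float max_abs_diff(const std::vector<float>& a, const std::vector<float>& b) {
    float worst = 0.0f;
    const std::size_t n = std::min(a.size(), b.size());
    for (std::size_t i = 0; i < n; ++i) {
        worst = std::max(worst, std::fabs(a[i] - b[i]));
    }
    return worst;
}
```

After copying `dnk` back from the device, passing its `.x` values and the reference output to `max_abs_diff` gives a single number that shows how far the two results are apart.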

The basic programming model is that threads issue in groups of 32 called warps. I’ve never actually tried it, but it would not be surprising if the driver promoted the number of threads to the closest multiple of 32. That would be consistent with how the programming guide and the occupancy calculator determine the maximum number of registers.

Hi, thanks for the response. Shouldn’t the emulator have the same response then?

Are you aware that the GPU currently supports only single precision? This may cause differences in results.

BTW, CUDA is not supported on the 6600 GT; do you mean the 8600?

Haha, yes, I mean the 8600. Nice catch. The precision shouldn’t matter here. I’m using floats in emulation and in the C++ part as well. Not only that, I’m pretty sure it’s not an underflow, because the final output file reports all of the nk’s divided by the last value (for some reason, this calculation is always reported like this). These ratios are as expected. I didn’t even realize anything was wrong until I plugged it into the next step. So whatever I (or CUDA) am doing wrong, it is consistently wrong. Thanks for any help.

Even if you use float in host code, the FPU uses extended (80-bit) precision for intermediate calculations.

You can use _set_controlfp() to instruct the FPU to use single precision. Give it a try and check whether your results still differ much between the GPU and emulation mode.
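To see why the precision of intermediates can matter, here is a minimal, portable sketch (the function names are illustrative only, and it does not touch the FPU control word). It adds a value that is below half an ulp of 1.0f, once with the sum rounded to float and once carried in double.

```cpp
// Sum rounded to a 32-bit float. The volatile store forces the rounding even
// if the FPU carries extra precision in its registers.
float add_as_float(float a, float b) {
    volatile float r = a + b;
    return r;
}

// Same operands, but the sum is carried in double precision.
double add_as_double(float a, float b) {
    return static_cast<double>(a) + static_cast<double>(b);
}
```

With `a = 1.0f` and `b = 1e-8f`, the float version returns exactly 1.0f because the small term is rounded away, while the double version keeps it. Accumulating many such terms is one way host and device results can drift apart.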

Thanks for the response. I added a call to set the floating-point control word right before my call to the kernel in emulation mode, and I still get the correct answer in emulation and the wrong answer from the GPU. Any other ideas? Thanks.

It could be a problem where you don’t have all arrays copied to the GPU device/constant memory. This would not show up in emulation mode because kernels can access host memory there.

Well, I figured it out. In my kernel, I accidentally had one of my variables declared as a double instead of a float. While this was accidental, I was under the impression that CUDA simply translated doubles to floats. Instead, using a double gives a completely wrong answer. Very odd. Thanks for all of the help.
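The thread does not show which variable was the double, so the following is only a sketch of one way to catch this kind of slip at compile time. It assumes a C++11 compiler and uses made-up function names. A `static_assert` on `decltype` pins the type of an expression, so an accidental double (for example from a literal without the `f` suffix) fails the build instead of silently changing the arithmetic.

```cpp
#include <type_traits>

// Product kept in single precision: 2.0f is a float literal.
inline float scale_in_float(float rho) {
    static_assert(std::is_same<decltype(rho * 2.0f), float>::value,
                  "expected single-precision arithmetic");
    return rho * 2.0f;
}

// Product promoted to double: 2.0 without the suffix is a double literal,
// so rho is converted to double before the multiply.
inline double scale_in_double(float rho) {
    static_assert(std::is_same<decltype(rho * 2.0), double>::value,
                  "an unsuffixed literal promotes the product to double");
    return rho * 2.0;
}
```

Placing a check like the first one next to the accumulator in the kernel would be a cheap guard, assuming the compiler in use accepts `static_assert`.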