Working in emulation but not in device mode

Hi,

I’m developing an image processing application under CUDA.
The program works correctly in device emulation mode (make emu=1, with or without dbg=1) and produces the correct image. When it runs on the device, however, every pixel of the resulting image is “NaN”.
This happens regardless of the number of threads (I tried from 1 to 128).

Do you have any idea what the problem could be?

I work with a GeForce 9800 GTX under Fedora 9. The host is an AMD 64 3200+.

Thank you in advance.
Federico

There are a LOT of possibilities. Do you check for errors after your kernel call? (CUT_CHECK_ERROR macro, use dbg=1)
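
For anyone who wants to do the same kind of check without the SDK macro, here is a minimal sketch that uses only runtime API calls. The kernel fillKernel and the helper launchChecked are hypothetical names made up for this example, and cudaDeviceSynchronize is the name used by current toolkits (older releases called it cudaThreadSynchronize).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for the real image-processing kernel.
__global__ void fillKernel(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 1.0f;
}

// Hypothetical helper: launches the kernel and returns the first
// error it sees, or cudaSuccess if everything went through.
cudaError_t launchChecked(float *dOut, int n, int threadsPerBlock)
{
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    fillKernel<<<blocks, threadsPerBlock>>>(dOut, n);

    // Errors in the launch itself (e.g. an invalid configuration).
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) return err;

    // Errors raised while the kernel was running are reported
    // once the host waits for it to finish.
    return cudaDeviceSynchronize();
}

int main()
{
    const int n = 1024;
    float *dOut = NULL;

    cudaError_t err = cudaMalloc((void **)&dOut, n * sizeof(float));
    if (err == cudaSuccess) err = launchChecked(dOut, n, 128);

    printf("result: %s\n", cudaGetErrorString(err));
    cudaFree(dOut);
    return err == cudaSuccess ? 0 : 1;
}
```

Note that a check like this only reports failed calls. It cannot tell you that a call succeeded with the wrong arguments.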

Yes, I do, and it is always OK.

I also always get “cudaSuccess” as the return value from cudaMallocPitch, cudaMemset, and cudaMemcpy2D.

Thank you.

Well, then I would first try writing a constant value to your pixels, then a value dependent on threadIdx. If that all works, you probably have an error in your calculation code.
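
A rough sketch of what those two tests could look like is below. It assumes a flat float buffer of n pixels, and the names writeConstant, writeThreadIdx and countMismatches are invented for this example and do not come from the original program.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Step 1: every pixel gets the same known value.
__global__ void writeConstant(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 42.0f;
}

// Step 2: every pixel gets a value that depends on its thread index.
__global__ void writeThreadIdx(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = (float)threadIdx.x;
}

// Runs one of the two test kernels and counts the pixels that differ
// from the expected pattern. Returns -1 if any CUDA call fails.
int countMismatches(int n, int threads, bool useThreadIdx)
{
    float *hOut = (float *)malloc(n * sizeof(float));
    if (hOut == NULL) return -1;

    float *dOut = NULL;
    if (cudaMalloc((void **)&dOut, n * sizeof(float)) != cudaSuccess) {
        free(hOut);
        return -1;
    }

    int blocks = (n + threads - 1) / threads;
    if (useThreadIdx) writeThreadIdx<<<blocks, threads>>>(dOut, n);
    else              writeConstant<<<blocks, threads>>>(dOut, n);

    // cudaMemcpy waits for the kernel and also reports its errors.
    cudaError_t err = cudaMemcpy(hOut, dOut, n * sizeof(float),
                                 cudaMemcpyDeviceToHost);

    int bad = (err == cudaSuccess) ? 0 : -1;
    for (int i = 0; bad >= 0 && i < n; ++i) {
        // Global index i belongs to thread (i % threads) of its block.
        float expected = useThreadIdx ? (float)(i % threads) : 42.0f;
        if (hOut[i] != expected) ++bad;
    }

    cudaFree(dOut);
    free(hOut);
    return bad;
}

int main()
{
    printf("constant test, mismatches:  %d\n", countMismatches(1000, 128, false));
    printf("threadIdx test, mismatches: %d\n", countMismatches(1000, 128, true));
    return 0;
}
```

If both tests come back clean, the allocation, launch and copy-back path is probably fine and the calculation code is the next suspect.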

After some tests, the problem seems to be not in the computation but already in the initialization of the data.

I have now reduced the program to one that only allocates matrices and sets them to 0 (float), with the following code:

...

CUDA_SAFE_CALL( CudaErr = cudaMallocPitch( (void **)&VolPtr, &VolPitch, NVoxelsZ*sizeof(float), NVoxelsY*NVoxelsX) );

CUDA_SAFE_CALL( CudaErr = cudaMemset2D(VolPtr, VolPitch, (float)0.0, NVoxelsZ, NVoxelsX*NVoxelsY) );

float *TransVol;

TransVol = (float *) malloc(NVoxelsX*NVoxelsY*VolPitch);

CUDA_SAFE_CALL( CudaErr = cudaMemcpy( (void*)TransVol, (void*)VolPtr, NVoxelsX*NVoxelsY*VolPitch, cudaMemcpyDeviceToHost) );

...

In emulation mode, TransVol contains only 0.0’s, as expected. In device mode (release or debug) it contains only NaN’s. I really have no idea why this happens!

At the moment, NVoxelsX = NVoxelsY = 500 and NVoxelsZ = 1. The returned value for VolPitch is 64.

I solved the NaN problem!

The cause was an incorrect use of the cudaMemcpy call. With the last line corrected as follows:

CUDA_SAFE_CALL( CudaErr = cudaMemcpy2D( TransVol, VolPitch, VolPtr, VolPitch, NVoxelsZ, NVoxelsX*NVoxelsY, cudaMemcpyDeviceToHost) );

it works.
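
A note for anyone landing here later: according to the runtime API documentation, the width arguments of cudaMallocPitch, cudaMemset2D and cudaMemcpy2D are all given in bytes, not in elements. The calls above pass NVoxelsZ as the width, which for float data covers only the first byte of each row, so they may only appear to work. Below is a minimal sketch of the same round trip with byte widths and a tightly packed host buffer. The helper name countNonZeroAfterRoundTrip and the -1.0f sentinel are my own additions for illustration. The variable names and sizes are taken from the post.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Allocates a pitched volume, zeroes it, copies it back into a tightly
// packed host buffer and returns the number of non-zero elements
// (-1 if any step fails). All widths are passed in bytes.
long countNonZeroAfterRoundTrip(size_t nx, size_t ny, size_t nz)
{
    const size_t rowBytes = nz * sizeof(float);  // width of one row in bytes
    const size_t rows     = nx * ny;             // number of rows

    float *VolPtr = NULL;
    size_t VolPitch = 0;
    if (cudaMallocPitch((void **)&VolPtr, &VolPitch, rowBytes, rows) != cudaSuccess)
        return -1;

    float *TransVol = (float *)malloc(rowBytes * rows);
    if (TransVol == NULL) { cudaFree(VolPtr); return -1; }

    // Sentinel, so that zeros in the result really come from the device.
    for (size_t i = 0; i < rows * nz; ++i) TransVol[i] = -1.0f;

    // cudaMemset2D sets bytes; an all-zero bit pattern is 0.0f.
    cudaError_t err = cudaMemset2D(VolPtr, VolPitch, 0, rowBytes, rows);

    // Host rows are packed (pitch == rowBytes); device rows use VolPitch.
    if (err == cudaSuccess)
        err = cudaMemcpy2D(TransVol, rowBytes, VolPtr, VolPitch,
                           rowBytes, rows, cudaMemcpyDeviceToHost);

    long nonZero = (err == cudaSuccess) ? 0 : -1;
    for (size_t i = 0; nonZero >= 0 && i < rows * nz; ++i)
        if (TransVol[i] != 0.0f) ++nonZero;

    free(TransVol);
    cudaFree(VolPtr);
    return nonZero;
}

int main()
{
    // Sizes from the post above: NVoxelsX = NVoxelsY = 500, NVoxelsZ = 1.
    long nonZero = countNonZeroAfterRoundTrip(500, 500, 1);
    printf("non-zero elements: %ld\n", nonZero);
    return nonZero == 0 ? 0 : 1;
}
```

Because the host buffer is packed, TransVol can be indexed as an ordinary contiguous float array afterwards. If the host copy keeps the device pitch instead, row r starts at byte offset r * VolPitch and has to be addressed through a char pointer.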