Different Outputs with -deviceemu mode

Dear Experts,

I am quite new to CUDA and I am trying to implement a simple moving-average FIR filter. In order to verify the correctness of my code, I have designed two versions of the FIR filter: one runs on the host and the other on the GPU. When I run my code I get the following output:

Host Output: [ 0.000000 0.000000 0.000000 0.000000 0.333000 0.666000 0.999000 0.999000 0.999000 0.666000 0.333000 0.000000 0.000000 0.000000 0.000000]

Gpu Output: [ nan nan nan 0.000000 0.333000 0.666000 0.999000 0.999000 0.999000 0.666000 0.333000 0.000000 0.000000 0.000000 0.000000]

But when I compile my code in device emulation mode, I get the following output:

Host Output: [ 0.000000 0.000000 0.000000 0.000000 0.333000 0.666000 0.999000 0.999000 0.999000 0.666000 0.333000 0.000000 0.000000 0.000000 0.000000]

Gpu Output: [ 0.000000 0.000000 0.000000 0.000000 0.333000 0.666000 0.999000 0.999000 0.999000 0.666000 0.333000 0.000000 0.000000 0.000000 0.000000]

which is exactly the same as the host output. My questions are as follows:

1- Why do I get nan outputs in the first case? What could be a reason for this?

2- The statement int idx = blockIdx.x * blockDim.x + threadIdx.x; can be used to traverse an array, e.g. data. If I want to compute data[0] + data[1] + data[2], can I just use data[idx] + data[idx+1] + data[idx+2]?

Regards,

Sanwar

You’ll have to post some code for an answer. There could be a number of reasons. One good starting point might be to run in emulation using valgrind.

I believe this is quite a common way of doing things, provided you take care not to fall off the end of the data array.
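
Something along these lines is what I mean; the names (movingAverage3, data, output, n) are just placeholders for illustration, not your actual code:

// Minimal 3-tap moving-average sketch; the bounds check keeps idx+1 and
// idx+2 from reading past the end of the array.
__global__ void movingAverage3(const float *data, float *output, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Only threads whose whole 3-element window fits inside the array write.
    if (idx + 2 < n)
        output[idx] = (data[idx] + data[idx + 1] + data[idx + 2]) / 3.0f;
}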

Hi,
I usually get vast differences between host and device output when using single-precision floating-point numbers. I hope I'm right in saying that floating-point arithmetic on an x86 CPU uses 80 bits internally, instead of only 32 bits like the GPU.
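
As a rough illustration (a toy host-side loop, nothing to do with your filter, and showing 32-bit vs. 64-bit accumulation rather than the 80-bit x87 case), the precision of the intermediate arithmetic alone already changes the result:

#include <stdio.h>

int main(void)
{
    float  sumf = 0.0f;
    double sumd = 0.0;

    /* Accumulate the same value at two different precisions. */
    for (int i = 0; i < 10000000; ++i) {
        sumf += 0.1f;   /* 32-bit accumulation */
        sumd += 0.1;    /* 64-bit accumulation */
    }

    printf("float sum:  %f\n", sumf);
    printf("double sum: %f\n", sumd);
    return 0;
}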

Regards

Navier

Did you access a pointer anywhere in the kernel code that points to host memory? That results in undefined behavior when executed on the device, but is perfectly fine in emulation mode. Also, if your card doesn't have compute capability 1.3, double precision is not supported in device code; but again, emulation mode doesn't care.
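
A minimal sketch of the first point (all names here are made up for illustration, not taken from your code): passing a host pointer straight to the kernel "works" in emulation mode because everything runs on the CPU, but on the real device the kernel reads garbage, often printed back as nan. Allocating device memory and copying the data first behaves the same in both modes.

#include <cuda_runtime.h>
#include <stdio.h>

// Trivial placeholder kernel, just to show the pointer issue.
__global__ void copyKernel(const float *in, float *out, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        out[idx] = in[idx];
}

int main(void)
{
    const int N = 15;
    float h_in[N], h_out[N];
    for (int i = 0; i < N; ++i) h_in[i] = (float)i;

    // Wrong: passing host pointers straight into device code.
    // In emulation mode this appears to work; on the device it does not.
    // copyKernel<<<1, N>>>(h_in, h_out, N);

    // Right: allocate device memory and copy the input over first.
    float *d_in, *d_out;
    cudaMalloc((void **)&d_in,  N * sizeof(float));
    cudaMalloc((void **)&d_out, N * sizeof(float));
    cudaMemcpy(d_in, h_in, N * sizeof(float), cudaMemcpyHostToDevice);

    copyKernel<<<1, N>>>(d_in, d_out, N);

    cudaMemcpy(h_out, d_out, N * sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < N; ++i) printf("%f ", h_out[i]);
    printf("\n");

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}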