Inconsistent behaviour of 8800 GTX hardware: inconsistent output for the same code

I have built an application in Visual Studio 2005. The application reads a Y image from a file and passes it to the 8800 GTX for processing. I copy the Y image to device memory using the cudaMemcpy API of the CUDA library. I have written three kernels to calculate two different parameters. One kernel (let's call it kernel1) calculates one parameter, and the other two kernels (kernel2 and kernel3) are called sequentially to calculate the other parameter.

Kernel1 takes around 0.03 ms. Kernel2+kernel3 take 37 ms when called after kernel1, but only 0.1 ms when called without calling kernel1 first. The inputs to kernel1 and to kernel2+kernel3 are the same.

Please advise how I can get the expected performance (0.03 ms for kernel1 and 0.1 ms for kernel2+kernel3) even when I call the kernels sequentially.

While debugging this issue I came across one more problem. When I run the same code multiple times, even without recompiling it, the results are not consistent from run to run. When the same code is run in device emulation mode, it produces correct results every time.

Please let me know how I can resolve this issue.

I know it's a very long post, but I have tried to provide as much information as possible so that I can get a quick answer. :D

For proper timings in CUDA you should:

  • make sure you call cudaThreadSynchronize() before starting the timer and again before stopping it. Kernel launches are asynchronous, so without this the timings make no sense
  • run the kernel once outside the timing loop, because the first time you run a kernel CUDA performs some initialisation
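
The two points above can be sketched roughly as follows. This is only an illustration under some assumptions: dummyKernel and timeKernelMs are hypothetical names standing in for your own kernels and timing code, and the sketch uses cudaDeviceSynchronize(), the name that replaced cudaThreadSynchronize() in later CUDA releases. On an old toolkit, substitute cudaThreadSynchronize() and the timer of your choice.

```cuda
#include <chrono>
#include <cuda_runtime.h>

// Hypothetical stand-in for the poster's kernels: doubles each element.
__global__ void dummyKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Returns the average wall-clock time of one launch, in milliseconds.
// d_data must point to n floats in device memory.
double timeKernelMs(float *d_data, int n, int iterations)
{
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // Warm-up launch outside the timed region: absorbs one-time initialisation.
    dummyKernel<<<blocks, threads>>>(d_data, n);

    // Launches are asynchronous, so drain all pending GPU work
    // before starting the timer.
    cudaDeviceSynchronize();
    auto start = std::chrono::steady_clock::now();

    for (int i = 0; i < iterations; ++i)
        dummyKernel<<<blocks, threads>>>(d_data, n);

    // Wait for the timed launches to finish before stopping the timer.
    cudaDeviceSynchronize();
    auto stop = std::chrono::steady_clock::now();

    std::chrono::duration<double, std::milli> elapsed = stop - start;
    return elapsed.count() / iterations;
}
```

Timing kernel1 and kernel2+kernel3 in separate regions like this, each bracketed by a synchronize call, should show which kernel really accounts for the 37 ms.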

There may be several reasons for the inconsistent results:

  • Your kernel accesses uninitialized/unallocated device memory (as a result of incorrect indexing into something, for example)

  • You have shared memory conflicts, i.e. race conditions between threads (undetectable in device emulation because on the host the threads are executed sequentially)

  • Your kernel has some synchronization issues.
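
To illustrate the last two causes, here is a minimal sketch of a kernel that is only correct because of its __syncthreads() barrier, together with host-side error checks that can help reveal bad memory accesses. The names reverseTile and reverseTiles are hypothetical and not taken from the original code, and cudaDeviceSynchronize() is the later name for cudaThreadSynchronize().

```cuda
#include <cstddef>
#include <cuda_runtime.h>

#define BLOCK 256

// Hypothetical kernel: reverses each BLOCK-sized tile of `in` into `out`.
__global__ void reverseTile(const int *in, int *out)
{
    __shared__ int tile[BLOCK];
    int base = blockIdx.x * BLOCK;

    tile[threadIdx.x] = in[base + threadIdx.x];

    // Without this barrier a thread may read a slot that another thread
    // has not written yet, giving results that vary from run to run.
    __syncthreads();

    out[base + threadIdx.x] = tile[BLOCK - 1 - threadIdx.x];
}

// Host wrapper; n must be a multiple of BLOCK. Returns true on success.
bool reverseTiles(const int *h_in, int *h_out, int n)
{
    int *d_in = nullptr;
    int *d_out = nullptr;
    size_t bytes = static_cast<size_t>(n) * sizeof(int);

    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return false; }

    bool ok = cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        reverseTile<<<n / BLOCK, BLOCK>>>(d_in, d_out);
        // Reports an invalid launch configuration.
        cudaError_t launchErr = cudaGetLastError();
        // Waits for the kernel and reports faults raised while it ran.
        cudaError_t runErr = cudaDeviceSynchronize();
        ok = (launchErr == cudaSuccess) && (runErr == cudaSuccess);
    }
    if (ok)
        ok = cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;

    cudaFree(d_in);
    cudaFree(d_out);
    return ok;
}
```

If removing a barrier like this one from your own kernels changes nothing in emulation mode but changes the output on the 8800 GTX, a missing barrier elsewhere is a likely suspect.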

Are you using cudaThreadSynchronize() anywhere? I guess not.

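If not, then from what you’ve said it follows that kernel1 actually takes approx. 37 ms: a kernel launch returns immediately, so that time is charged to whichever later call waits for kernel1 to finish. Your measurements give no clues about how long kernel2 and kernel3 take.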