I have built an application in Visual Studio 2005. The application reads a Y image from a file and passes it to an 8800 GTX for processing. I copy the Y image to device memory using the cudaMemcpy API of the CUDA library. I have written three kernels to calculate two different parameters. One kernel (let's call it kernel1) calculates the first parameter, and the other two kernels (kernel2 and kernel3) are called sequentially to calculate the second parameter.
The time taken by kernel1 is around 0.03 milliseconds. The time taken by kernel2+kernel3 (when called after kernel1) is 37 milliseconds. But when kernel2+kernel3 is called without calling kernel1 first, it takes only 0.1 milliseconds. The inputs to kernel1 and to kernel2+kernel3 are the same.
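A minimal sketch of the call sequence described above, to make the timing question concrete. It is not the actual code: the kernel bodies, the 352x288 image size, the launch configuration, the buffer names, and the CHECK macro are all placeholders assumed for illustration. The timing uses CUDA events, and waits on the last event before reading the elapsed times. Kernel launches are asynchronous, so a host-side timer read without synchronization may charge one kernel's run time to a later call.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical error-checking helper (not from the original code).
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(1);                                                  \
        }                                                             \
    } while (0)

// Placeholder kernels: the real kernel bodies are not shown in the post.
__global__ void kernel1(const unsigned char* y, float* out1, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out1[i] = y[i] * 0.5f;
}

__global__ void kernel2(const unsigned char* y, float* tmp, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tmp[i] = y[i] + 1.0f;
}

__global__ void kernel3(const float* tmp, float* out2, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out2[i] = tmp[i] * 2.0f;
}

int main() {
    const int width = 352, height = 288;  // assumed image size
    const int n = width * height;

    // Stand-in for the Y image read from a file.
    unsigned char* hY = (unsigned char*)malloc(n);
    for (int i = 0; i < n; ++i) hY[i] = (unsigned char)(i % 256);

    unsigned char* dY;
    float *dOut1, *dTmp, *dOut2;
    CHECK(cudaMalloc((void**)&dY, n));
    CHECK(cudaMalloc((void**)&dOut1, n * sizeof(float)));
    CHECK(cudaMalloc((void**)&dTmp, n * sizeof(float)));
    CHECK(cudaMalloc((void**)&dOut2, n * sizeof(float)));
    CHECK(cudaMemcpy(dY, hY, n, cudaMemcpyHostToDevice));

    cudaEvent_t t0, t1, t2;
    CHECK(cudaEventCreate(&t0));
    CHECK(cudaEventCreate(&t1));
    CHECK(cudaEventCreate(&t2));

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    CHECK(cudaEventRecord(t0, 0));
    kernel1<<<blocks, threads>>>(dY, dOut1, n);
    CHECK(cudaGetLastError());
    CHECK(cudaEventRecord(t1, 0));
    kernel2<<<blocks, threads>>>(dY, dTmp, n);
    kernel3<<<blocks, threads>>>(dTmp, dOut2, n);
    CHECK(cudaGetLastError());
    CHECK(cudaEventRecord(t2, 0));

    // Wait until all work recorded before t2 has finished on the GPU.
    CHECK(cudaEventSynchronize(t2));

    float ms1 = 0.0f, ms23 = 0.0f;
    CHECK(cudaEventElapsedTime(&ms1, t0, t1));
    CHECK(cudaEventElapsedTime(&ms23, t1, t2));
    printf("kernel1: %.3f ms, kernel2+kernel3: %.3f ms\n", ms1, ms23);

    CHECK(cudaEventDestroy(t0));
    CHECK(cudaEventDestroy(t1));
    CHECK(cudaEventDestroy(t2));
    CHECK(cudaFree(dY));
    CHECK(cudaFree(dOut1));
    CHECK(cudaFree(dTmp));
    CHECK(cudaFree(dOut2));
    free(hY);
    return 0;
}
```

With this layout, each elapsed time covers only the kernels recorded between its two events, which should show whether the 37 milliseconds really belongs to kernel2+kernel3 or is kernel1's work being counted late.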
Please advise how I can get the expected performance (0.03 milliseconds for kernel1 and 0.1 milliseconds for kernel2+kernel3) even when I call the kernels sequentially.
While debugging this issue I came across one more problem. When I run the same code multiple times, even without recompiling it, the results are not consistent from run to run. The same code, when run in device emulation mode, produces correct results every time.
Please let me know how I can resolve this issue.
I know it's a very long post, but I have tried to provide as much information as possible so that I can get a quick answer.