I tested the following kind of program in MATLAB.
tic; fft2(img_d); toc   % img_d is a gpuArray on GPU
.......
% Set the block and grid sizes of my kernel
........
% invoke my kernel
tic
img_d = feval(myKernelName, .......);
toc
% do fft2 on gpu again
tic; fft2(img_d); toc   % img_d is a gpuArray on GPU
The result is strange.
The first fft2 on the GPU takes:
Elapsed time is 0.000557 seconds.
but after invoking my kernel, the second fft2 on the GPU takes:
Elapsed time is 0.074028 seconds.
If I add the following line right after the line that invokes the kernel:
img = gather(img_d)
and then measure the time for the second fft2 on the GPU,
the time looks right again:
Elapsed time is 0.000392 seconds.
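Since gather changes the numbers, one thing I could check is whether tic/toc is only timing the launch of the GPU work and not its completion. This is an assumption on my part, not something I have confirmed. A minimal sketch of how the measurement could be made explicit, assuming img_d is already a gpuArray; out_d, dev and t are just names I picked for illustration:

```matlab
% Sketch: synchronise with the GPU before starting and before stopping the timer,
% so that any work still pending on the device is not attributed to fft2.
dev = gpuDevice;      % handle to the currently selected GPU

wait(dev);            % make sure earlier GPU work (e.g. my kernel) has finished
tic;
out_d = fft2(img_d);
wait(dev);            % block until fft2 itself has finished
toc

% Alternative: gputimeit synchronises internally and repeats the measurement.
t = gputimeit(@() fft2(img_d));
fprintf('fft2 on GPU: %.6f s\n', t);
```

If the two fft2 timings agree when measured this way, the extra 0.07 s would more likely be my kernel still running when the second tic starts, and not fft2 getting slower.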
What is wrong with my kernel?