I am new to CUDA programming and am using it to reduce the execution time of my image correction algorithm. I launch a kernel with a grid of 64 blocks and 32 threads per block. For device memory allocation I used cudaMalloc, and I copied the image to the device.
When I profiled the program, the kernel execution time was 1.775 ms.
To improve this further, I started using cudaMallocHost for allocation and cudaFreeHost to free the memory. The profiling result surprised me: the kernel execution time went up from 1.775 ms to 4.806 ms. I do not understand why it increased.
Can anyone help me with this problem? I expected the time to decrease, since memory allocated with cudaMallocHost is pinned and can be transferred directly to the device.
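For reference, here is a minimal sketch of the kind of comparison I am describing. It is not my actual code: the kernel `correct`, the helper `run`, the 1920x1080 image size, and the use of CUDA events for timing are all assumptions made for illustration. It reports the host-to-device copy time and the kernel time separately, once for a pageable host buffer (malloc) and once for a pinned one (cudaMallocHost).

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical stand-in for the image correction kernel.
__global__ void correct(unsigned char *img, int n)
{
    // Grid-stride loop so that 64 blocks x 32 threads cover the whole image.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        img[i] = 255 - img[i];
}

// Copies the image to the device, runs the kernel, and prints the copy time
// and the kernel time separately (hypothetical helper).
static void run(const char *label, const unsigned char *h_img, int n)
{
    unsigned char *d_img = nullptr;
    if (cudaMalloc(&d_img, n) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        exit(1);
    }

    cudaEvent_t t0, t1, t2;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    cudaEventCreate(&t2);

    cudaEventRecord(t0);
    cudaMemcpy(d_img, h_img, n, cudaMemcpyHostToDevice);
    cudaEventRecord(t1);
    correct<<<64, 32>>>(d_img, n);  // launch configuration from the question
    cudaEventRecord(t2);
    cudaEventSynchronize(t2);

    float copy_ms = 0.0f, kernel_ms = 0.0f;
    cudaEventElapsedTime(&copy_ms, t0, t1);
    cudaEventElapsedTime(&kernel_ms, t1, t2);
    printf("%s: copy %.3f ms, kernel %.3f ms\n", label, copy_ms, kernel_ms);

    cudaEventDestroy(t0);
    cudaEventDestroy(t1);
    cudaEventDestroy(t2);
    cudaFree(d_img);
}

int main()
{
    const int n = 1920 * 1080;  // assumed image size in bytes

    // Pageable host buffer.
    unsigned char *pageable = (unsigned char *)malloc(n);
    // Pinned (page-locked) host buffer.
    unsigned char *pinned = nullptr;
    if (pageable == nullptr || cudaMallocHost(&pinned, n) != cudaSuccess) {
        fprintf(stderr, "host allocation failed\n");
        return 1;
    }
    memset(pageable, 128, n);
    memset(pinned, 128, n);

    // The first launch includes one-time start-up cost, so discard it.
    run("warm-up ", pageable, n);
    run("pageable", pageable, n);
    run("pinned  ", pinned, n);

    cudaFreeHost(pinned);
    free(pageable);
    return 0;
}
```

This should build with nvcc (for example `nvcc sketch.cu -o sketch`, where the file name is arbitrary). In this sketch the kernel always reads from device memory allocated with cudaMalloc, so only the copy time should depend on whether the host buffer is pinned.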