cudaMallocHost increases kernel execution time


I am new to CUDA programming and am using it to reduce the execution time of my image correction algorithm. I launch a kernel with a grid of 64 blocks and 32 threads per block. For memory allocation on the device I used cudaMalloc and copied the image to the device.

On profiling the .cu file, my kernel execution time was 1.775 ms.

I wanted to improve it further, so I switched to cudaMallocHost, with cudaFreeHost to free the memory. The profiling result was surprising: the kernel execution time increased from 1.775 ms to 4.806 ms. I do not understand why it has increased.

Can anyone please help me with this problem? I expected the time to decrease, since cudaMallocHost allocates pinned memory that can be transferred directly to the device.

Wouldn’t the kernel operating on unified memory now cause some kind of on-demand paging that effectively slows down the kernel while paging is performed?

EDIT: cudaMallocHost allocates pinned memory, not unified memory. So disregard the above statement please.

Your statements don’t give a clear picture. A code example is usually better.

Normally when we are copying data from host to device, there is a host side allocation, perhaps performed with malloc, and a device side allocation, perhaps performed with cudaMalloc.

If what you are saying is that you replaced the host side allocation (malloc) with cudaMallocHost, that should not make the kernel run slower, assuming you still have the device allocation with cudaMalloc and you are copying data from host to device.
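As a minimal sketch of that first case (pinned host buffer, device buffer still allocated with cudaMalloc), something like the following could be used. The `brighten` kernel and the `run_with_pinned_staging` helper are hypothetical stand-ins, since the original code was not posted; only the 64 x 32 launch configuration comes from the question.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for the image correction kernel: brightens each pixel.
__global__ void brighten(unsigned char* img, int n)
{
    // Grid-stride loop so 64 blocks x 32 threads cover an image of any size.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x) {
        int v = img[i] + 10;
        img[i] = (unsigned char)(v > 255 ? 255 : v);
    }
}

// Returns true if the host -> device -> kernel -> host round trip succeeded.
bool run_with_pinned_staging(int n)
{
    unsigned char* h_img = nullptr;  // pinned host buffer (replaces malloc)
    unsigned char* d_img = nullptr;  // device buffer (still cudaMalloc)

    if (cudaMallocHost((void**)&h_img, n) != cudaSuccess) return false;
    if (cudaMalloc((void**)&d_img, n) != cudaSuccess) {
        cudaFreeHost(h_img);
        return false;
    }

    for (int i = 0; i < n; ++i) h_img[i] = 100;  // dummy image data

    // The kernel only ever sees the device pointer; the pinned buffer is
    // used purely as the source and destination of the copies.
    cudaMemcpy(d_img, h_img, n, cudaMemcpyHostToDevice);
    brighten<<<64, 32>>>(d_img, n);
    cudaError_t err = cudaMemcpy(h_img, d_img, n, cudaMemcpyDeviceToHost);

    bool ok = (err == cudaSuccess) && (cudaGetLastError() == cudaSuccess);
    for (int i = 0; i < n && ok; ++i) ok = (h_img[i] == 110);

    cudaFree(d_img);
    cudaFreeHost(h_img);
    return ok;
}
```

In this arrangement the kernel itself is unchanged, so only the copy time should be affected by the switch to pinned memory.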

OTOH if you replaced the device side allocation (cudaMalloc) with cudaMallocHost, and let your kernel operate out of that pinned allocation, that will definitely make your kernel run slower, in most cases.
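To check whether this second case applies, one option is to time the same kernel once on a cudaMalloc pointer and once on a pinned host pointer. The sketch below is an assumption-laden illustration, not the original code: `brighten`, `time_kernel_ms` and `compare_kernel_time` are hypothetical names, and it uses cudaHostAlloc with cudaHostAllocMapped plus cudaHostGetDevicePointer to make the host mapping explicit (on systems with unified virtual addressing, a cudaMallocHost pointer can typically be passed to a kernel in the same way).

```cuda
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical stand-in for the image correction kernel.
__global__ void brighten(unsigned char* img, int n)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x) {
        int v = img[i] + 10;
        img[i] = (unsigned char)(v > 255 ? 255 : v);
    }
}

// Times a single kernel launch on a device-accessible pointer, in ms.
static float time_kernel_ms(unsigned char* img, int n)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    brighten<<<64, 32>>>(img, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

// Measures kernel time on device memory and on mapped pinned host memory.
bool compare_kernel_time(int n, float* device_ms, float* pinned_ms)
{
    unsigned char* d_img = nullptr;      // device allocation
    unsigned char* h_img = nullptr;      // pinned, mapped host allocation
    unsigned char* h_img_dev = nullptr;  // device-side alias of h_img

    if (cudaMalloc((void**)&d_img, n) != cudaSuccess) return false;
    if (cudaHostAlloc((void**)&h_img, n, cudaHostAllocMapped) != cudaSuccess) {
        cudaFree(d_img);
        return false;
    }
    bool ok = cudaHostGetDevicePointer((void**)&h_img_dev, h_img, 0) == cudaSuccess;

    if (ok) {
        cudaMemset(d_img, 100, n);
        memset(h_img, 100, n);
        time_kernel_ms(d_img, n);  // warm-up launch, result discarded
        *device_ms = time_kernel_ms(d_img, n);
        // Here every load and store of the kernel crosses the PCIe bus.
        *pinned_ms = time_kernel_ms(h_img_dev, n);
        ok = (cudaDeviceSynchronize() == cudaSuccess);
    }

    cudaFreeHost(h_img);
    cudaFree(d_img);
    return ok;
}
```

Calling `compare_kernel_time` with the image size and printing the two values shows how much of the difference comes from where the kernel's data lives; the actual numbers depend on the GPU and the bus.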

I’m not a specialist in this area, but is it possible that the pinned allocation made the copy asynchronous, so that both copying and executing were accounted for as kernel execution time, while the first version clearly separated copying from execution?
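One way to rule this out is to time the copy and the kernel separately with CUDA events, so the two cannot be lumped together. The following is only a sketch under assumptions: `brighten` and `time_copy_and_kernel` are hypothetical names, and cudaMemcpyAsync on the default stream is used so that the copy, the kernel and the three events are all ordered on the same stream.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for the image correction kernel.
__global__ void brighten(unsigned char* img, int n)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x) {
        int v = img[i] + 10;
        img[i] = (unsigned char)(v > 255 ? 255 : v);
    }
}

// Reports host-to-device copy time and kernel time as separate values (ms).
bool time_copy_and_kernel(int n, float* copy_ms, float* kernel_ms)
{
    unsigned char* h_img = nullptr;  // pinned host buffer
    unsigned char* d_img = nullptr;  // device buffer
    if (cudaMallocHost((void**)&h_img, n) != cudaSuccess) return false;
    if (cudaMalloc((void**)&d_img, n) != cudaSuccess) {
        cudaFreeHost(h_img);
        return false;
    }
    for (int i = 0; i < n; ++i) h_img[i] = 100;  // dummy image data

    cudaEvent_t t0, t1, t2;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    cudaEventCreate(&t2);

    // Everything below is issued to the default stream, so the events
    // bracket the copy (t0..t1) and the kernel (t1..t2) in order.
    cudaEventRecord(t0);
    cudaMemcpyAsync(d_img, h_img, n, cudaMemcpyHostToDevice, 0);
    cudaEventRecord(t1);
    brighten<<<64, 32>>>(d_img, n);
    cudaEventRecord(t2);

    bool ok = (cudaEventSynchronize(t2) == cudaSuccess);
    if (ok) {
        cudaEventElapsedTime(copy_ms, t0, t1);
        cudaEventElapsedTime(kernel_ms, t1, t2);
    }

    cudaEventDestroy(t0);
    cudaEventDestroy(t1);
    cudaEventDestroy(t2);
    cudaFree(d_img);
    cudaFreeHost(h_img);
    return ok;
}
```

If the kernel time measured this way stays near the original value, the extra time reported by the profiler is being attributed to something other than the kernel body.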

Of course, full profile data would make an answer simpler :) For now we are just guessing.