Is cudaMalloc time-consuming?

Hi, it is obvious that cudaMemcpy, which copies data from the host to the device, is very time-consuming. But what about allocating the memory and initialising it on the device itself? Is cudaMalloc also very time-consuming?


You cannot allocate device memory from within the device code (kernel).

Device memory can only be allocated from the host side. But it's much faster than cudaMemcpy, at least for the work I am doing (I allocate and memcpy large arrays, ~500 MB).

Yes. I am not allocating within the kernel. I am allocating the memory from the host and initialising it on the device, instead of initialising it on the host and copying it over.
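A minimal sketch of the pattern described here, assuming the Runtime API: the buffer is allocated from the host with cudaMalloc, and a kernel writes the initial values so that no host-to-device cudaMemcpy is needed. The kernel name `fill`, the array size, and the fill value are hypothetical and chosen only for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread writes one element, so the array is
// initialised on the device and never has to be copied from the host.
__global__ void fill(float *data, size_t n, float value)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = value;
}

int main()
{
    const size_t n = 1 << 20;  // 1M floats, an arbitrary size for illustration
    float *d_data = nullptr;

    // The allocation itself is issued from the host...
    cudaError_t err = cudaMalloc(&d_data, n * sizeof(float));
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // ...but the initial values are written by the GPU.
    const int threads = 256;
    const int blocks = (int)((n + threads - 1) / threads);
    fill<<<blocks, threads>>>(d_data, n, 1.0f);

    // Catch launch errors, then wait for the kernel and catch execution errors.
    err = cudaGetLastError();
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "fill kernel failed: %s\n", cudaGetErrorString(err));
        cudaFree(d_data);
        return 1;
    }

    cudaFree(d_data);
    return 0;
}
```

For a constant byte pattern such as all zeros, cudaMemset would probably do the same job without a custom kernel.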

Oh, pardon me for misunderstanding you.

I do the same thing. I just allocate the memory from the host side, then work on it on the device and copy the result back to the host.

cudaMalloc is much faster than cudaMemcpy, as far as I know.

You could have spent 10 minutes profiling it for yourself, but…

I can’t speak for the Runtime API (never used it, never will), but in the Driver API the equivalent calls (cuMemAlloc and cuMemAllocPitch) do not take much time at all, anywhere between 5 and 10 microseconds, and this doesn’t appear to vary with the size of the allocation (or maybe I’m just allocating very small blocks of memory compared to most people?).

So no, allocating memory is not costly at all on the host side, presumably because it’s asynchronous. I assume it does add minor overhead to kernel launch times though, since the allocation has to be finished by the time the kernel runs.

Unfortunately, as with most of the CUDA documentation, the memory allocation functions don’t come with any useful information about the details of their functionality, so that’s all I can tell you for now.
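For anyone who wants to check the cuMemAlloc figure on their own hardware, here is a rough sketch using the Driver API and a host-side clock. The 1 MB block size, the repetition count, and the `check` helper are assumptions made for illustration, and the numbers will depend on the driver version and GPU. It should build with something like `nvcc file.cu -lcuda`.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda.h>

// Hypothetical helper: abort with a message if a Driver API call fails.
static void check(CUresult r, const char *what)
{
    if (r != CUDA_SUCCESS) {
        fprintf(stderr, "%s failed (CUresult %d)\n", what, (int)r);
        exit(1);
    }
}

int main()
{
    check(cuInit(0), "cuInit");
    CUdevice dev;
    check(cuDeviceGet(&dev, 0), "cuDeviceGet");

    // Use the device's primary context and make it current on this thread.
    CUcontext ctx;
    check(cuDevicePrimaryCtxRetain(&ctx, dev), "cuDevicePrimaryCtxRetain");
    check(cuCtxSetCurrent(ctx), "cuCtxSetCurrent");

    // Warm-up allocation so one-time setup cost is not included in the timing.
    CUdeviceptr warm;
    check(cuMemAlloc(&warm, 1 << 20), "cuMemAlloc (warm-up)");
    check(cuMemFree(warm), "cuMemFree (warm-up)");

    const int reps = 100;          // arbitrary repetition count
    const size_t bytes = 1 << 20;  // 1 MB per allocation, arbitrary
    CUdeviceptr ptrs[reps];

    // Time only the allocation calls, as seen from the host.
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < reps; ++i)
        check(cuMemAlloc(&ptrs[i], bytes), "cuMemAlloc");
    auto t1 = std::chrono::steady_clock::now();

    double us = std::chrono::duration<double, std::micro>(t1 - t0).count() / reps;
    printf("average cuMemAlloc time for %zu bytes: %.2f us\n", bytes, us);

    for (int i = 0; i < reps; ++i)
        check(cuMemFree(ptrs[i]), "cuMemFree");
    check(cuDevicePrimaryCtxRelease(dev), "cuDevicePrimaryCtxRelease");
    return 0;
}
```

Changing `bytes` and rerunning is a simple way to see whether the per-call time really stays flat as the allocation grows.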

I can confirm that the runtime of cudaMalloc is almost independent of the size allocated. cudaMemcpy’s runtime increases linearly with the size copied (about 1 ms/MB for me). For very small allocations (~2 KB and less), however, cudaMalloc seems to use a different mechanism which takes very little time (about the same amount of time as the cudaMemcpy of that small amount of data).
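A rough way to reproduce this comparison with the Runtime API is sketched below. The list of sizes is an assumption chosen to straddle the ~2 KB mark, timings are taken with a host-side clock, and the copy uses ordinary pageable host memory, so absolute numbers will differ from system to system.

```cuda
#include <chrono>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

using Clock = std::chrono::steady_clock;

// Milliseconds elapsed since t0, measured on the host.
static double ms_since(Clock::time_point t0)
{
    return std::chrono::duration<double, std::milli>(Clock::now() - t0).count();
}

int main()
{
    // Force context creation up front so it is not counted in the first cudaMalloc.
    if (cudaFree(0) != cudaSuccess) {
        fprintf(stderr, "no usable CUDA device\n");
        return 1;
    }

    // Illustrative sizes only: 1 KB, 2 KB, 64 KB, 1 MB, 16 MB, 64 MB.
    const size_t sizes[] = {1u << 10, 2u << 10, 64u << 10,
                            1u << 20, 16u << 20, 64u << 20};

    for (size_t bytes : sizes) {
        std::vector<char> host(bytes, 1);
        void *dev = nullptr;

        // Time the allocation on its own.
        auto t0 = Clock::now();
        cudaError_t err = cudaMalloc(&dev, bytes);
        double alloc_ms = ms_since(t0);
        if (err != cudaSuccess) {
            fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
            return 1;
        }

        // Time the host-to-device copy; synchronise so the transfer has
        // really finished before the clock is read.
        t0 = Clock::now();
        err = cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);
        if (err == cudaSuccess)
            err = cudaDeviceSynchronize();
        double copy_ms = ms_since(t0);
        if (err != cudaSuccess) {
            fprintf(stderr, "cudaMemcpy failed: %s\n", cudaGetErrorString(err));
            cudaFree(dev);
            return 1;
        }

        printf("%10zu bytes: cudaMalloc %.3f ms, cudaMemcpy %.3f ms\n",
               bytes, alloc_ms, copy_ms);
        cudaFree(dev);
    }
    return 0;
}
```

Each size is measured only once here, so the output is noisy; averaging several runs per size would give steadier numbers.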