Using async memcopy without using cudaMallocHost/cudaHostAlloc?

Hi all,

I’m trying to use the asynchronous CUDA memory copy API (cudaMemcpyAsync).
As described in the CUDA reference manual, the host memory used by an asynchronous copy needs to be page-locked through cudaMallocHost or cudaHostAlloc.
However, I need to avoid calling these functions, because the host memory I’m using for the async copy comes from an ordinary malloc() call.
So I manually page-locked this pre-allocated host memory with the mlock() system call,
but the driver does not seem to recognize that the memory has been page-locked,
and the asynchronous memory copy call returns errors.
I guess the driver only treats memory as page-locked when it was allocated with cudaMallocHost or cudaHostAlloc.
Is there a way to use a memory chunk allocated with malloc() (not cudaMallocHost/cudaHostAlloc) for asynchronous copies?
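For reference, the pattern described above looks roughly like this (a sketch only; the buffer size and error handling are illustrative, and depending on the driver version the call may fail or silently fall back to a synchronous copy):

```cuda
#include <cuda_runtime.h>
#include <sys/mman.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    size_t size = 1 << 20;                        /* 1 MiB, illustrative */
    float *host_buf = (float *)malloc(size);      /* ordinary pageable memory */
    mlock(host_buf, size);                        /* locks pages against swap-out,
                                                     but the CUDA driver knows
                                                     nothing about it */

    float *dev_buf;
    cudaMalloc((void **)&dev_buf, size);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    /* The driver did not pin this memory itself, so the copy is not
       treated as page-locked: it may fail or degrade to synchronous. */
    cudaError_t err = cudaMemcpyAsync(dev_buf, host_buf, size,
                                      cudaMemcpyHostToDevice, stream);
    if (err != cudaSuccess)
        fprintf(stderr, "cudaMemcpyAsync: %s\n", cudaGetErrorString(err));

    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    cudaFree(dev_buf);
    munlock(host_buf, size);
    free(host_buf);
    return 0;
}
```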


Sorry, but this is not going to work. As you said, the driver needs to pin the memory itself; otherwise, how could it make sure the memory mapping hasn’t been modified while it’s being used? Unfortunately, I fear there is just no way to avoid explicit calls to cudaMallocHost yet. Please note that calling mlock() just adds a reference to the refcount on each page in virtual memory. I think (but I’m not 100% sure) that its semantics are that the page will not be swapped out, but it does not guarantee that the page’s physical address stays constant (for instance, on a page migration to another NUMA node, the physical page changes even though the virtual address is not swapped out).

This is really a tricky problem that does not look like it will be solved anytime soon (it would be great if I’m wrong!). Just a small question: is it really impossible for you to use cudaMallocHost? Even in Fortran there are some tricks for this, for instance.
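For comparison, the supported path lets the driver do the pinning, so it can record the physical pages backing the buffer. A minimal sketch (sizes illustrative, error checks omitted for brevity):

```cuda
#include <cuda_runtime.h>

int main(void)
{
    size_t size = 1 << 20;                        /* 1 MiB, illustrative */
    float *host_buf;
    /* Driver-pinned allocation: the driver tracks the physical pages,
       so cudaMemcpyAsync can genuinely overlap with other work. */
    cudaMallocHost((void **)&host_buf, size);

    float *dev_buf;
    cudaMalloc((void **)&dev_buf, size);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaMemcpyAsync(dev_buf, host_buf, size,
                    cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFree(dev_buf);
    cudaFreeHost(host_buf);                       /* not free() */
    return 0;
}
```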


Thanks for the reply.

The reason I’m trying not to use cudaMallocHost is its huge overhead compared to malloc(),

and I don’t need to page-lock all the memory I allocate.

I guess I’ll have to find another way to handle this problem.



If the only problem is overhead, perhaps you can preallocate one large pinned buffer and manage it by hand, if your allocation scheme is simple enough. Of course, that’s easier said than done …