selfmade cudeMallocHost()?

Hi all,

I have a question regarding the cudaMallocHost() function. How ist this function implemented by Nvidia? Is there a possibillity to rewrite this function with own C or C++ code?
My CUDA application is a kind of add-on to a very big application. I can’t include the CUDA libs in the big application but I want the benefits from the PINNED mode. So the question is, is it possible to rewrite the cudaMallocHost() function with own code to be able to allocate memory in the PINNED mode? Than I could pass the pointer to this memory area easily to my CUDA add-on and benefit from the increased memCpy() performance.

Any suggestions or ideas?

Thanks in advance. Best regrads,

I don’t know how NVIDIA implemented this. On Linux, you can lock down any memory page using mlock(), see the man page. You can lock any page with kernels 2.6.9 or later. For kernels 2.6.8 or earlier, the locking process needs to have the CAP_IPC_LOCK privilege.


We are aware that not all CUDA apps can allocate the memory they’d like to transfer to/from the GPU. The current API also does not enable a single buffer to be DMA’able by more than one GPU.

Fixing this is on the to-do list.

If a system has 2 (or more) GPUs, it is sometimes necessary to transfer data from one GPU to another. In the absence of the ability to transfer directly from one GPU to the next (peer-to-peer), the operation should be achievable by:

  1. DMA data from GPU1 to host buffer (pinned).

  2. DMA data from host buffer to GPU2.

The same host buffer is used by more than one GPU. The above comment seems to imply that this cannot be done. Could you please clarify if that’s the case? If not, under what situation will the buffer not be DMA’able by more than one GPU?


Correct: when peer-to-peer memcpy is emulated by transferring through host memory, the app can achieve DMA bandwidth for one GPU, not both.

Thanks for your response.

I’m looking for a way to allocate memory in PINNED-mode on a windows system without using the CUDA API. Does someone have an idea how this could be done? Any suggestions (perhaps from Nvidia)?

Thanks in advance and best regards,

I’m not sure how a user mode process can guarantee that a given memory range will remain page-locked, but that alone will not deliver fast memcpy performance - cudaMallocHost/cuMemAllocHost also map the memory for DMA access by the GPU. We don’t do this at memcpy time because that would be too slow.

For existing memory ranges, a new CUDA API is needed. (and in the works)

So if I need to transfer data between two GPUs, what is the fastest method to do this? How fast would this method achieve?


There are too many dependencies on CPU, chipset, and transfer size to know for sure.

DMA is slightly faster for CPU->GPU, so tentatively I would say in order to copy GPU1->GPU2, allocate a CUDA host buffer for GPU2 with cuMemAllocHost/cudaMallocHost. Copying GPU1->CPU then would go at normal speeds, but the copy CPU->GPU2 will go fast.

Hi nwilt,

I know this thread is old, hopefully you’ll still see this. I saw that back in June/July 2007 you indicated that a new API was in the works that would let users specify existing memory to be page-locked and prep’d for DMA access (rather than returning a pointer to newly allocated memory). This is of interest for me, as I’m integrating into another application which performs memory allocations, and I would like to make one of those pre-allocated memory blocks essentially “pinned” (e.g., not just mlock()'d, but fully pinned so the data can be DMA’d also) - and avoid the cost of manually copying the data from the original buffer location into pinned memory (or alternately, to avoid using pinned memory at all). The new API that you mentioned seemed that it would address this very issue.

However, there have been at least one (maybe 2) new CUDA releases since then (at least 1.1), and I haven’t seen a reference to this new API in the documentation. Is it still in the works? Is it possible to get a beta or preliminary code snippets? Alternately, is it possible to get the source code for cudaMallocHost? If I had that, I could easily modify to accept an input argument for a memory pointer rather than malloc’ing new host memory, and then re-link this into the library.

Thanks in advance!