Self-made cudaMallocHost()?

Hi all,

I have a question regarding the cudaMallocHost() function. How is this function implemented by NVIDIA? Is there any possibility of rewriting it in my own C or C++ code?
My CUDA application is a kind of add-on to a very big application. I can’t include the CUDA libs in the big application, but I want the benefits of pinned memory. So the question is: is it possible to reimplement cudaMallocHost() with my own code so that I can allocate memory in pinned mode? Then I could easily pass a pointer to this memory area to my CUDA add-on and benefit from the increased memcpy() performance.

Any suggestions or ideas?

Thanks in advance. Best regards,
Christoph

I don’t know how NVIDIA implemented this. On Linux, you can lock down any memory page using mlock(); see its man page. With kernels 2.6.9 or later, you can lock any page; with kernels 2.6.8 or earlier, the locking process needs the CAP_IPC_LOCK privilege.
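
A minimal sketch of that approach (my own illustration, not NVIDIA’s actual implementation):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    size_t size = 1 << 20;                      /* 1 MB */
    long   page = sysconf(_SC_PAGESIZE);
    void  *buf;

    /* Page-align the allocation; mlock() operates on whole pages. */
    if (posix_memalign(&buf, (size_t)page, size) != 0) {
        perror("posix_memalign");
        return 1;
    }

    /* Lock the range into physical RAM so it can't be paged out.
       This may fail if RLIMIT_MEMLOCK is too low. */
    if (mlock(buf, size) != 0) {
        perror("mlock");
        return 1;
    }

    memset(buf, 0, size);                       /* ... use the buffer ... */

    munlock(buf, size);
    free(buf);
    return 0;
}
```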

Peter

We are aware that not all CUDA apps can allocate the memory they’d like to transfer to/from the GPU. The current API also does not enable a single buffer to be DMA’able by more than one GPU.

Fixing this is on the to-do list.

If a system has 2 (or more) GPUs, it is sometimes necessary to transfer data from one GPU to another. In the absence of the ability to transfer directly from one GPU to the next (peer-to-peer), the operation should be achievable by:

  1. DMA data from GPU1 to host buffer (pinned).

  2. DMA data from host buffer to GPU2.

The same host buffer is used by more than one GPU. The above comment seems to imply that this cannot be done. Could you please clarify whether that’s the case? If it can be done, under what circumstances would the buffer not be DMA’able by more than one GPU?

Thanks.

Correct: when peer-to-peer memcpy is emulated by transferring through host memory, the app can achieve DMA bandwidth for one GPU, not both.

Thanks for your response.

I’m looking for a way to allocate memory in pinned mode on a Windows system without using the CUDA API. Does anyone have an idea of how this could be done? Any suggestions (perhaps from NVIDIA)?
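
The closest analogue to mlock() I have found so far is Win32’s VirtualLock(), along these lines, though I don’t know whether locking alone is enough:

```c
#include <windows.h>
#include <stdio.h>

int main(void)
{
    SIZE_T size = 1 << 20;                      /* 1 MB */

    /* Raise the working-set quota first; by default a process may
       only lock a small number of pages. */
    if (!SetProcessWorkingSetSize(GetCurrentProcess(),
                                  size + (4 << 20),
                                  size + (8 << 20))) {
        fprintf(stderr, "SetProcessWorkingSetSize: %lu\n", GetLastError());
        return 1;
    }

    /* VirtualAlloc hands back page-aligned memory. */
    void *buf = VirtualAlloc(NULL, size, MEM_COMMIT | MEM_RESERVE,
                             PAGE_READWRITE);
    if (buf == NULL) {
        fprintf(stderr, "VirtualAlloc: %lu\n", GetLastError());
        return 1;
    }

    /* Lock the pages into physical memory. */
    if (!VirtualLock(buf, size)) {
        fprintf(stderr, "VirtualLock: %lu\n", GetLastError());
        return 1;
    }

    /* ... use buf ... */

    VirtualUnlock(buf, size);
    VirtualFree(buf, 0, MEM_RELEASE);
    return 0;
}
```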

Thanks in advance and best regards,
Christoph

I’m not sure how a user-mode process can guarantee that a given memory range will remain page-locked, but even that alone would not deliver fast memcpy performance: cudaMallocHost()/cuMemAllocHost() also map the memory for DMA access by the GPU. We don’t do this mapping at memcpy time because that would be too slow.

For existing memory ranges, a new CUDA API is needed (and is in the works).
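
One way to see the difference in practice is to time the same host-to-device copy from a pageable malloc() buffer and from a cudaMallocHost() buffer; a rough sketch (error checking omitted):

```c
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

/* Time one host->device copy of 'size' bytes from 'src'. */
static float time_copy(void *dst, const void *src, size_t size)
{
    cudaEvent_t start, stop;
    float ms;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    cudaMemcpy(dst, src, size, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main(void)
{
    size_t size = 64 << 20;                 /* 64 MB */
    void *d_buf, *pageable, *pinned;

    cudaMalloc(&d_buf, size);
    pageable = malloc(size);                /* ordinary pageable memory */
    cudaMallocHost(&pinned, size);          /* page-locked + DMA-mapped */

    printf("pageable: %.1f ms\n", time_copy(d_buf, pageable, size));
    printf("pinned:   %.1f ms\n", time_copy(d_buf, pinned, size));

    free(pageable);
    cudaFreeHost(pinned);
    cudaFree(d_buf);
    return 0;
}
```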

So if I need to transfer data between two GPUs, what is the fastest way to do it? What transfer rate could I expect?

Thanks!

There are too many dependencies on CPU, chipset, and transfer size to know for sure.

DMA is slightly faster for CPU->GPU, so tentatively I would say that to copy GPU1->GPU2, you should allocate a CUDA host buffer for GPU2 with cuMemAllocHost()/cudaMallocHost(). The GPU1->CPU copy would then run at normal (pageable) speed, but the CPU->GPU2 copy will be fast.
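
In outline, the staged copy would look something like this. This is only a sketch: with the current runtime each GPU’s context is bound to a single host thread, so in a real program the two cudaSetDevice() halves would be issued from separate threads sharing the staging buffer.

```c
#include <cuda_runtime.h>

/* Staged GPU1 -> GPU2 copy through a pinned host buffer.
   Device 0 stands in for GPU1 and device 1 for GPU2. */
void copy_gpu1_to_gpu2(void *d_src_gpu1, void *d_dst_gpu2, size_t size)
{
    void *host_buf;

    /* Allocate the staging buffer pinned in GPU2's context, as
       suggested above, so the upload to GPU2 runs at DMA speed. */
    cudaSetDevice(1);
    cudaMallocHost(&host_buf, size);

    /* Step 1: GPU1 -> host (pageable speed from GPU1's context). */
    cudaSetDevice(0);
    cudaMemcpy(host_buf, d_src_gpu1, size, cudaMemcpyDeviceToHost);

    /* Step 2: host -> GPU2 (fast; the buffer is pinned here). */
    cudaSetDevice(1);
    cudaMemcpy(d_dst_gpu2, host_buf, size, cudaMemcpyHostToDevice);

    cudaFreeHost(host_buf);
}
```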

Hi nwilt,

I know this thread is old; hopefully you’ll still see this. Back in June/July 2007 you indicated that a new API was in the works that would let users specify existing memory to be page-locked and prepared for DMA access (rather than returning a pointer to newly allocated memory). This is of interest to me: I’m integrating with another application that performs its own memory allocations, and I would like to make one of those pre-allocated memory blocks fully pinned (i.e., not just mlock()'d, but pinned so the data can also be DMA’d), avoiding the cost of manually copying the data from the original buffer into pinned memory (or, alternately, avoiding pinned memory altogether). The new API you mentioned seemed to address this very issue.

However, there has been at least one (maybe two) new CUDA release since then (at least 1.1), and I haven’t seen a reference to this new API in the documentation. Is it still in the works? Is it possible to get a beta or preliminary code snippets? Alternately, is it possible to get the source code for cudaMallocHost()? If I had that, I could easily modify it to accept an input argument with an existing memory pointer rather than malloc’ing new host memory, and then re-link it into the libcudart.so library.
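
To make the request concrete, what I’m hoping for is something along these lines (cudaHostRegister()/cudaHostUnregister() are placeholder names; they are not in the CUDA 1.1 API, although later CUDA releases did eventually ship calls by exactly these names):

```c
#include <cuda_runtime.h>

/* Sketch of the hoped-for API: pin an existing, externally allocated
   buffer in place so the GPU can DMA it directly, instead of copying
   it into a cudaMallocHost() buffer first. The buffer may need to be
   page-aligned and a whole number of pages. */
int pin_existing_buffer(void *app_buf, size_t size)
{
    /* Page-lock and DMA-map the application's buffer in place.
       (Placeholder call; not available in CUDA 1.1.) */
    if (cudaHostRegister(app_buf, size, cudaHostRegisterDefault) != cudaSuccess)
        return -1;

    /* ... cudaMemcpy() to/from app_buf now runs at pinned speed ... */

    cudaHostUnregister(app_buf);
    return 0;
}
```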

Thanks in advance!