malloc() + cuMemHostRegister() faster than cuMemAllocHost()

Markus_Wagner · October 9, 2013, 9:25am

Hey guys,

I am working on high-performance software using CUDA and am currently optimizing the CPU part. I need a relatively large array of pinned memory into which the kernel output is copied after execution. Allocating it with cuMemAllocHost takes quite a long time, which is a problem for my flow.

Therefore, I tried the following:
1.) Don’t use pinned memory at all and allocate with plain malloc(). The allocation is more than 100 times faster on the CPU, but the GPU->CPU transfer then takes a lot longer, which is of course consistent with the CUDA documentation. I could use that to balance CPU and GPU runtimes, but it is not a very good solution.
2.) Don’t use cuMemAllocHost, but pin the memory with malloc() + cuMemHostRegister(). To my surprise, this is a factor of 2 faster than cuMemAllocHost! The GPU->CPU transfer time is unchanged, so I would assume that this is the optimal solution for my use case.

Can somebody explain to me why malloc() + cuMemHostRegister() is faster? Are there any side effects to this strategy that I have not yet considered?
