malloc() + cuMemHostRegister() faster than cuMemAllocHost()

Hey guys,

I am working on high-performance software using CUDA and am currently optimizing the CPU part. I need a relatively large array of pinned memory into which the kernel output is copied after execution. Allocating it with cuMemAllocHost takes quite a long time, which is problematic for my flow.

Therefore, I tried the following:
1.) Don’t use pinned memory; allocate the buffer with malloc() only. The allocation is more than 100 times faster on the CPU, but the GPU->CPU transfer then takes a lot longer, which is of course consistent with the CUDA documentation. I could use that to balance CPU and GPU runtimes, but it is not a very good solution.
2.) Don’t use cuMemAllocHost; use malloc() + cuMemHostRegister() to pin the memory instead. To my surprise, this is a factor of 2 faster than cuMemAllocHost! The GPU->CPU transfer time is unchanged, so I would assume this is the optimal solution for my use case.
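
For reference, here is a minimal sketch of how the two pinned variants can be timed side by side. It is not my production code: the 1 GiB buffer size, the CHECK macro and the ms_since helper are placeholders made up for illustration, and I am assuming device 0 and the default flags (0) for cuMemHostRegister. The context is set up before the timers start so that driver initialization is not counted against either variant.

```cuda
// Host-only driver API code; build e.g. with: nvcc pin_compare.cu -lcuda
#include <cuda.h>
#include <chrono>
#include <cstdio>
#include <cstdlib>

// Hypothetical helper: abort with a readable message if a driver call fails.
#define CHECK(call)                                                  \
    do {                                                             \
        CUresult res_ = (call);                                      \
        if (res_ != CUDA_SUCCESS) {                                  \
            const char *msg_ = nullptr;                              \
            cuGetErrorString(res_, &msg_);                           \
            std::fprintf(stderr, "%s failed: %s\n", #call,           \
                         msg_ ? msg_ : "unknown error");             \
            std::exit(EXIT_FAILURE);                                 \
        }                                                            \
    } while (0)

// Hypothetical helper: wall-clock milliseconds elapsed since t0.
static double ms_since(std::chrono::steady_clock::time_point t0) {
    return std::chrono::duration<double, std::milli>(
               std::chrono::steady_clock::now() - t0).count();
}

int main() {
    const size_t bytes = size_t(1) << 30;  // assumed buffer size: 1 GiB

    // Initialize the driver and make a context current before timing.
    CUdevice dev;
    CUcontext ctx;
    CHECK(cuInit(0));
    CHECK(cuDeviceGet(&dev, 0));
    CHECK(cuDevicePrimaryCtxRetain(&ctx, dev));
    CHECK(cuCtxSetCurrent(ctx));

    // Variant A: let the driver allocate and pin in one call.
    auto t0 = std::chrono::steady_clock::now();
    void *pinned = nullptr;
    CHECK(cuMemAllocHost(&pinned, bytes));
    std::printf("cuMemAllocHost:               %.2f ms\n", ms_since(t0));
    CHECK(cuMemFreeHost(pinned));

    // Variant B: allocate pageable memory, then pin it afterwards.
    t0 = std::chrono::steady_clock::now();
    void *buf = std::malloc(bytes);
    if (buf == nullptr) {
        std::fprintf(stderr, "malloc failed\n");
        return EXIT_FAILURE;
    }
    CHECK(cuMemHostRegister(buf, bytes, 0));
    std::printf("malloc + cuMemHostRegister:   %.2f ms\n", ms_since(t0));

    // Registered memory must be unregistered before it is freed.
    CHECK(cuMemHostUnregister(buf));
    std::free(buf);

    CHECK(cuDevicePrimaryCtxRelease(dev));
    return 0;
}
```

The absolute numbers will depend on the system, the driver version and the buffer size, so this only shows how the comparison is set up, not what result to expect.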

Can somebody explain to me why malloc() + cuMemHostRegister() is faster? Are there any side effects to this strategy that I have not yet considered?