Does page-locked memory registered with cudaHostRegister have lower effective access performance than page-locked memory allocated with cudaMallocHost?
There isn’t any difference between the two that I am aware of.
But in my test case, one buffer is allocated with malloc+cudaHostRegister and the other with cudaMallocHost, and I then use cudaMemcpy to move data between device and host. I found that the malloc+cudaHostRegister variant is slower than cudaMallocHost by about 2.9%. Does cudaHostRegister need special attention when using it?
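In outline, the test looks like this (a simplified sketch; the buffer size and iteration count here are just placeholders, not my exact values):

```c
#include <cuda_runtime.h>
#include <stdlib.h>

#define SIZE (64 * 1024 * 1024)  // placeholder transfer size
#define ITERATIONS 10000         // placeholder repeat count

int main(void) {
    void *d_buf;
    cudaMalloc(&d_buf, SIZE);

    // Variant 1: pageable allocation, pinned afterwards
    void *h_reg = malloc(SIZE);
    cudaHostRegister(h_reg, SIZE, cudaHostRegisterDefault);

    // Variant 2: allocated as pinned memory from the start
    void *h_pin;
    cudaMallocHost(&h_pin, SIZE);

    // Timed regions: repeat each copy many times and compare the totals
    for (int i = 0; i < ITERATIONS; i++)
        cudaMemcpy(d_buf, h_reg, SIZE, cudaMemcpyHostToDevice);
    for (int i = 0; i < ITERATIONS; i++)
        cudaMemcpy(d_buf, h_pin, SIZE, cudaMemcpyHostToDevice);

    cudaHostUnregister(h_reg);
    free(h_reg);
    cudaFreeHost(h_pin);
    cudaFree(d_buf);
    return 0;
}
```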
Thanks!
Unless the difference of 2.9% is the average of hundreds of different instances of each variant (with different sizes, addresses, and platforms), I would claim that one cannot draw any conclusion from this number, even more so as this difference is just beyond the noise level of ±2%.
Actually, I tried allocating different sizes and then calling cudaMemcpy between 10000 and 30000 times; all the results show the difference, on Red Hat Linux with a Tesla A100. I didn’t test on other platforms.
Since it was not mentioned explicitly, what was actually timed? The execution time of cudaMemcpy() calls? The speed of memory copies is subject to interactions between the mapping of memory pages and multiple layers of the memory subsystem. While averaging thousands of cudaMemcpy() calls will give us a solid idea of the copy performance, it does not provide us with a good idea of whether and how the copy performance depends on allocation method. For that we would have to examine hundreds of different allocation cases.
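For what it’s worth, a generic way to time just the copies (a sketch, not necessarily what was done here) is to bracket the loop with CUDA events:

```c
// Sketch: time a batch of copies with CUDA events; d_buf, h_buf, SIZE,
// and ITER are placeholders for the actual buffers and parameters.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
for (int i = 0; i < ITER; i++)
    cudaMemcpy(d_buf, h_buf, SIZE, cudaMemcpyHostToDevice);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // per-copy average = ms / ITER
```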
It is entirely possible (and likely, I would think) that separate calls to malloc() and cudaHostRegister() result in a different page mapping compared to calling cudaMallocHost(). A reasonable null hypothesis would be that this could result in either allocation method providing a slight performance advantage to subsequent memory copies roughly half the time.
While I do not have a good alternative hypothesis (not knowing the details of the underlying OS calls), it is certainly possible that some systematic difference exists between the two methods, and that this consistently favors the use of cudaMallocHost() for better cudaMemcpy() performance. As far as I am concerned, the data provided in this thread is insufficient to draw such a conclusion. However, if it is conclusive from your perspective, you could simply always use cudaMallocHost().
Thanks for the reply.
You are right, the measured time is the total execution time of cudaMemcpy. For other reasons I must use the malloc+cudaHostRegister approach, so I am trying to find out why malloc+cudaHostRegister is slower, and then improve its performance.
Let’s assume the performance difference of 3% for copy operations is real (not just an artifact of a particular configuration) and reproducible across all your intended target platforms.
My expectation would be that the resulting difference in application-level performance is below noise level, as copying data is just part of what the application does. Beyond that, I would expect the impact of copy performance to be even smaller for asynchronous copies, due to overlap with host activity and with copies in the other direction.
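To illustrate the kind of overlap I mean (a generic sketch; the function and buffer names are illustrative, not taken from your code):

```c
#include <cuda_runtime.h>

// Generic sketch: copies in opposite directions on separate streams can be
// in flight at the same time, and the host can do other work meanwhile.
// Both host buffers must be pinned for the copies to be truly asynchronous.
void overlapped_copies(void *d_in, const void *h_in,
                       void *h_out, const void *d_out, size_t n)
{
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    cudaMemcpyAsync(d_in, h_in, n, cudaMemcpyHostToDevice, s0);
    cudaMemcpyAsync(h_out, d_out, n, cudaMemcpyDeviceToHost, s1);

    // ... host-side work proceeds here while both copies are in flight ...

    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
}
```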
I replaced malloc() with posix_memalign(), using the system page size as the alignment, then used cudaHostRegister to register the memory as page-locked. Now the execution time of cudaMemcpy is the same as with cudaMallocHost, but I still don’t know the reason. What do you think about this case?
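Concretely, the change was along these lines (error checking omitted; the helper name is just for illustration):

```c
#include <cuda_runtime.h>
#include <stdlib.h>
#include <unistd.h>

// Allocate a page-aligned host buffer, then pin it with cudaHostRegister.
void *alloc_aligned_pinned(size_t size)
{
    long pagesize = sysconf(_SC_PAGESIZE);           // system page size, e.g. 4096
    void *h_buf = NULL;
    posix_memalign(&h_buf, (size_t)pagesize, size);  // replaces plain malloc()
    cudaHostRegister(h_buf, size, cudaHostRegisterDefault);
    return h_buf;
}
```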
Thanks!
I do believe that malloc() does not guarantee alignment with a system page boundary, but I don’t know much about posix_memalign() (I also don’t know what cudaMallocHost()/cudaHostAlloc() does in this regard). It’s hard to imagine that this would make a difference except for very specific transfer sizes and transfer starting points. If there is an issue here and posix_memalign() is working satisfactorily for you, perhaps you should use that.