Does page-locked memory registered with cudaHostRegister have lower effective access performance than page-locked memory allocated with cudaMallocHost?
There isn’t any difference between the two that I am aware of.
But in my test case, one buffer is allocated with malloc+cudaHostRegister and the other with cudaMallocHost, and I then use cudaMemcpy to move data between device and host. I found that the malloc+cudaHostRegister variant is slower than cudaMallocHost by about 2.9%. Does cudaHostRegister need special attention when using it?
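In outline, the test looks like this (a simplified sketch; the buffer size and iteration count here are just placeholders, not my exact values):

```c
#include <cuda_runtime.h>
#include <stdlib.h>

#define SIZE (64 * 1024 * 1024)  // placeholder transfer size
#define ITERATIONS 10000         // placeholder repeat count

int main(void) {
    void *d_buf;
    cudaMalloc(&d_buf, SIZE);

    // Variant 1: pageable allocation, pinned afterwards
    void *h_reg = malloc(SIZE);
    cudaHostRegister(h_reg, SIZE, cudaHostRegisterDefault);

    // Variant 2: allocated as pinned memory from the start
    void *h_pin;
    cudaMallocHost(&h_pin, SIZE);

    // Timed regions: repeat each copy many times and compare the totals
    for (int i = 0; i < ITERATIONS; i++)
        cudaMemcpy(d_buf, h_reg, SIZE, cudaMemcpyHostToDevice);
    for (int i = 0; i < ITERATIONS; i++)
        cudaMemcpy(d_buf, h_pin, SIZE, cudaMemcpyHostToDevice);

    cudaHostUnregister(h_reg);
    free(h_reg);
    cudaFreeHost(h_pin);
    cudaFree(d_buf);
    return 0;
}
```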
Thanks!
Unless the difference of 2.9% is the average of hundreds of different instances of each variant (with different sizes, addresses, and platforms), I would claim that one cannot draw any conclusion from this number, even more so as this difference is just beyond the noise level of ±2%.
Actually, I tried allocating different sizes and then calling cudaMemcpy between 10000 and 30000 times; all the results show the difference, on Red Hat Linux with a Tesla A100. I didn’t test on other platforms.
Since it was not mentioned explicitly, what was actually timed? The execution time of cudaMemcpy() calls? The speed of memory copies is subject to interactions between the mapping of memory pages and multiple layers of the memory subsystem. While averaging thousands of cudaMemcpy() calls will give us a solid idea of the copy performance, it does not provide us with a good idea of whether and how the copy performance depends on allocation method. For that we would have to examine hundreds of different allocation cases.
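For what it’s worth, a generic way to time just the copies (a sketch, not necessarily what was done here) is to bracket the loop with CUDA events:

```c
// Sketch: time a batch of copies with CUDA events; d_buf, h_buf, SIZE,
// and ITER are placeholders for the actual buffers and parameters.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
for (int i = 0; i < ITER; i++)
    cudaMemcpy(d_buf, h_buf, SIZE, cudaMemcpyHostToDevice);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // per-copy average = ms / ITER
```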
It is entirely possible (and likely, I would think) that separate calls to malloc() and cudaHostRegister() result in a different page mapping compared to calling cudaMallocHost(). A reasonable null hypothesis would be that this could result in either allocation method providing a slight performance advantage to subsequent memory copies roughly half the time.
While I do not have a good alternative hypothesis (not knowing the details of the underlying OS calls), it is certainly possible that some systematic difference exists between the two methods, and that this consistently favors the use of cudaMallocHost() for better cudaMemcpy() performance. As far as I am concerned, the data provided in this thread is insufficient to draw such a conclusion. However, if it is conclusive from your perspective, you could simply always use cudaMallocHost().
Thanks for the reply.
You are right, the measured time is the total execution time of cudaMemcpy. For other reasons I must use the malloc+cudaHostRegister approach, so I am trying to find out why malloc+cudaHostRegister is slower, and then improve its performance.
Let’s assume the performance difference of 3% for copy operations is real (not just an artifact of a particular configuration) and reproducible across all your intended target platforms.
My expectation would be that the resulting difference in application-level performance is below noise level, as copying data is just part of what the application does. Beyond that, I would expect the impact of copy performance to be even smaller for asynchronous copies, due to overlap with host activity and with copies in the other direction.
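To illustrate the kind of overlap I mean (a generic sketch; the function and buffer names are illustrative, not taken from your code):

```c
#include <cuda_runtime.h>

// Generic sketch: copies in opposite directions on separate streams can be
// in flight at the same time, and the host can do other work meanwhile.
// Both host buffers must be pinned for the copies to be truly asynchronous.
void overlapped_copies(void *d_in, const void *h_in,
                       void *h_out, const void *d_out, size_t n)
{
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    cudaMemcpyAsync(d_in, h_in, n, cudaMemcpyHostToDevice, s0);
    cudaMemcpyAsync(h_out, d_out, n, cudaMemcpyDeviceToHost, s1);

    // ... host-side work proceeds here while both copies are in flight ...

    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
}
```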
I replaced malloc() with posix_memalign(), using the system page size as the alignment, then used cudaHostRegister to register the memory as page-locked. Now the execution time of cudaMemcpy is the same as with cudaMallocHost, but I still don’t know the reason. What do you think about this case?
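Concretely, the change was along these lines (error checking omitted; the helper name is just for illustration):

```c
#include <cuda_runtime.h>
#include <stdlib.h>
#include <unistd.h>

// Allocate a page-aligned host buffer, then pin it with cudaHostRegister.
void *alloc_aligned_pinned(size_t size)
{
    long pagesize = sysconf(_SC_PAGESIZE);           // system page size, e.g. 4096
    void *h_buf = NULL;
    posix_memalign(&h_buf, (size_t)pagesize, size);  // replaces plain malloc()
    cudaHostRegister(h_buf, size, cudaHostRegisterDefault);
    return h_buf;
}
```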
Thanks!
I do believe that malloc() does not guarantee alignment with a system page boundary, but I don’t know much about posix_memalign() (I also don’t know what cudaMallocHost()/cudaHostAlloc() does in this regard). It’s hard to imagine that this would make a difference except for very specific transfer sizes and transfer starting points. If there is an issue here and posix_memalign() is working satisfactorily for you, perhaps you should use that.