I’ve heard (and seen, indirectly) that cudaMalloc() is much slower than C malloc() or host-side C++ memory management such as std::vector<T> allocation, reservation, and resizing. I’m not sure whether it’s just a matter of per-call latency, or whether large allocations on the device are intrinsically more expensive than allocations of similar size in host memory. What about cudaHostAlloc(), a function I use a lot in my code base for allocating page-locked host memory for optimal GPU transfers? I can test, but should I expect its performance to be competitive with malloc()?

As a corollary, I’ve heard opinions that memory allocated by cudaHostAlloc() should be treated as a scarce resource, even though in principle it is limited only by the total amount of CPU-accessible RAM in the server or workstation. If that RAM is fragmented all over the place and cudaHostAlloc() has to account for this, I can see how having large amounts of such memory could get unwieldy, but in my studies thus far, even when allocating many GB through cudaHostAlloc(), I have not encountered anything that looks like a problem exacerbated by high-volume allocation.
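For reference, here is the sort of comparison that motivates my use of cudaHostAlloc() in the first place: pinned versus pageable host-to-device copies. A minimal sketch, assuming a single device; the 256 MiB buffer size and the lack of error checking are arbitrary simplifications:

```c++
// Minimal sketch: compare host-to-device copy time from pageable (malloc)
// versus pinned (cudaHostAlloc) host memory. Buffer size is arbitrary.
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

static float timed_copy(void* dst, const void* src, size_t bytes) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    const size_t bytes = 256ull << 20;  // 256 MiB, arbitrary

    void* d_buf = nullptr;
    cudaMalloc(&d_buf, bytes);

    void* pageable = malloc(bytes);
    void* pinned = nullptr;
    cudaHostAlloc(&pinned, bytes, cudaHostAllocDefault);

    // Touch both buffers so the OS actually backs them with pages.
    memset(pageable, 1, bytes);
    memset(pinned, 1, bytes);

    printf("pageable H2D: %.2f ms\n", timed_copy(d_buf, pageable, bytes));
    printf("pinned   H2D: %.2f ms\n", timed_copy(d_buf, pinned, bytes));

    free(pageable);
    cudaFreeHost(pinned);
    cudaFree(d_buf);
    return 0;
}
```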
These opinions are perhaps driven by this comment in the “Best Practices Guide”:
“Pinned memory should not be overused. Excessive use can reduce overall system performance because pinned memory is a scarce resource, but how much is too much is difficult to know in advance. Furthermore, the pinning of system memory is a heavyweight operation compared to most normal system memory allocations, so as with all optimizations, test the application and the systems it runs on for optimal performance parameters.”
Yeah, that helps everything jibe. I’ll look into steering away from automatically allocating pinned memory for the largest or most common data structures, which will be possible with the changes I’m working on. I’ll do some tests on the latency and performance of cudaHostAlloc() versus malloc(), but if anyone knows something about this, please chime in!
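Here is a rough sketch of the test I have in mind, in case anyone wants to poke holes in it before I run it. Timings are wall-clock and noisy; the size range and the warm-up cudaFree(0) call (to keep one-time context creation out of the first cudaMalloc() measurement) are my own choices:

```c++
// Rough sketch: time malloc, cudaMalloc, and cudaHostAlloc across a range
// of sizes. Numbers are wall-clock and noisy; sizes are arbitrary.
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

static double ms_since(std::chrono::steady_clock::time_point t0) {
    return std::chrono::duration<double, std::milli>(
        std::chrono::steady_clock::now() - t0).count();
}

int main() {
    cudaFree(0);  // warm-up: force context creation before timing anything

    for (size_t mib = 1; mib <= 1024; mib *= 4) {
        const size_t bytes = mib << 20;

        auto t0 = std::chrono::steady_clock::now();
        void* h = malloc(bytes);
        memset(h, 0, bytes);  // fault the pages in; malloc alone is lazy
        double t_malloc = ms_since(t0);

        t0 = std::chrono::steady_clock::now();
        void* d = nullptr;
        cudaMalloc(&d, bytes);
        double t_cudamalloc = ms_since(t0);

        t0 = std::chrono::steady_clock::now();
        void* p = nullptr;
        cudaHostAlloc(&p, bytes, cudaHostAllocDefault);
        double t_hostalloc = ms_since(t0);

        printf("%5zu MiB  malloc+touch %8.3f ms  cudaMalloc %8.3f ms  "
               "cudaHostAlloc %8.3f ms\n",
               mib, t_malloc, t_cudamalloc, t_hostalloc);

        free(h);
        cudaFree(d);
        cudaFreeHost(p);
    }
    return 0;
}
```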
cudaHostAlloc() is really just a thin wrapper around operating system API calls. How fast or slow it is therefore depends on code outside NVIDIA’s control, i.e. we are at the mercy of the OS. My expectation would be that its speed decreases with increasing allocation size and with a decreasing amount of free memory (both of which make it harder to gather up the number of contiguous pages needed).
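As an aside, one way to see that the pinning itself is the heavyweight, OS-level part, separable from allocation, is that it can be applied to memory you already own via cudaHostRegister(). A minimal sketch; the size is arbitrary, and the page alignment is chosen defensively since some platforms expect registered ranges to be aligned:

```c++
// Sketch: page-lock an existing, ordinary allocation after the fact with
// cudaHostRegister. Size is arbitrary; alignment is a defensive choice.
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 64ull << 20;  // 64 MiB, a multiple of the page size

    // Page-aligned allocation (C++17 aligned_alloc).
    void* h = aligned_alloc(4096, bytes);
    memset(h, 0, bytes);  // fault the pages in before pinning

    // Ask the OS (via the driver) to lock these existing pages in RAM.
    cudaError_t err = cudaHostRegister(h, bytes, cudaHostRegisterDefault);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaHostRegister failed: %s\n",
                cudaGetErrorString(err));
        free(h);
        return 1;
    }

    // ... use h as pinned memory for async copies here ...

    cudaHostUnregister(h);  // unpin, then free normally
    free(h);
    return 0;
}
```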
This is starting to make more sense, then. The scarcity comes down to how hard it can be to assemble a contiguous block of page-locked memory. That had crossed my mind earlier, but I wasn’t seeing any issues with what I was doing at the time, so I just kept going merrily about my business. Now that I know what to look for, though, I can be more judicious about memory usage.
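For anyone else landing here: the pattern I am leaning toward is a single, modestly sized pinned staging buffer allocated once and reused for all uploads, rather than pinning every large structure. A sketch with hypothetical names and an arbitrary chunk size:

```c++
// Hypothetical sketch: one modest pinned staging buffer allocated up front
// and reused for all host-to-device transfers, instead of pinning every
// large data structure. PinnedStager and its chunk size are my own choices.
#include <algorithm>
#include <cstring>
#include <cuda_runtime.h>

struct PinnedStager {
    void* staging = nullptr;
    size_t capacity = 0;

    explicit PinnedStager(size_t cap) : capacity(cap) {
        cudaHostAlloc(&staging, capacity, cudaHostAllocDefault);
    }
    ~PinnedStager() { cudaFreeHost(staging); }

    // Copy 'bytes' from pageable 'src' to device 'dst' in staging-sized chunks.
    void upload(void* dst, const void* src, size_t bytes, cudaStream_t s) {
        const char* in = static_cast<const char*>(src);
        char* out = static_cast<char*>(dst);
        while (bytes > 0) {
            size_t chunk = std::min(bytes, capacity);
            memcpy(staging, in, chunk);                  // pageable -> pinned
            cudaMemcpyAsync(out, staging, chunk,
                            cudaMemcpyHostToDevice, s);  // pinned -> device
            cudaStreamSynchronize(s);  // staging is reused, so wait per chunk
            in += chunk;
            out += chunk;
            bytes -= chunk;
        }
    }
};
```

A double-buffered variant with two staging buffers and per-chunk events could overlap the memcpy into staging with the transfer out of it, at the cost of a second pinned buffer.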