We are iteratively processing a large amount of data on the GPU and sending it back to the host chunk by chunk. The chunk size is fixed and constant. We use streams to overlap computation with memory transfer, and for that we use pinned memory allocated with cudaHostAlloc and the cudaHostAllocDefault flag. To give a sense of scale, the data is on the order of hundreds of GB and each chunk is on the order of hundreds of MB. Since we cannot allocate the entire data set in pinned memory, we iteratively copy each chunk from the pinned buffer to pageable memory.
In terms of code, we perform the following steps (a simplified sketch is given after the list):
For each step n:
Process a chunk on the GPU (computeStream)
In parallel, copy the previously processed chunk from device to host pinned memory with cudaMemcpyAsync (transferStream)
Copy the chunk from pinned memory to pageable memory using std::memcpy, or cudaMemcpyAsync with cudaMemcpyHostToHost (transferStream)
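In simplified form the structure looks like this (the kernel, identifiers, and sizes below are illustrative placeholders, not our actual code, and the fine-grained synchronization that lets the two streams overlap is omitted for brevity):

#include <cuda_runtime.h>
#include <cstring>
#include <vector>

// Stand-in for the real processing kernel.
__global__ void processChunk(double* data, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0;
}

int main() {
    const size_t chunkElems = 16 << 20;                     // ~128 MB of doubles per chunk
    const size_t nChunks    = 8;                            // kept small for the sketch

    double* d_chunk  = nullptr;
    double* h_pinned = nullptr;
    cudaMalloc(&d_chunk, chunkElems * sizeof(double));
    cudaHostAlloc(&h_pinned, chunkElems * sizeof(double), cudaHostAllocDefault);
    std::vector<double> h_pageable(chunkElems * nChunks);   // large pageable destination

    cudaStream_t computeStream, transferStream;
    cudaStreamCreate(&computeStream);
    cudaStreamCreate(&transferStream);

    for (size_t ichunk = 1; ichunk <= nChunks; ++ichunk) {
        // 1) process a chunk on the GPU
        processChunk<<<(chunkElems + 255) / 256, 256, 0, computeStream>>>(d_chunk, chunkElems);
        cudaStreamSynchronize(computeStream);

        // 2) copy the processed chunk device -> host pinned buffer
        cudaMemcpyAsync(h_pinned, d_chunk, chunkElems * sizeof(double),
                        cudaMemcpyDeviceToHost, transferStream);
        cudaStreamSynchronize(transferStream);

        // 3) drain the pinned buffer into the pageable array (the step in question)
        std::memcpy(h_pageable.data() + (ichunk - 1) * chunkElems, h_pinned,
                    chunkElems * sizeof(double));
    }

    cudaStreamDestroy(computeStream);
    cudaStreamDestroy(transferStream);
    cudaFreeHost(h_pinned);
    cudaFree(d_chunk);
    return 0;
}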
The goal is to “hide” the transfer time so that processing can continue while the data is being transferred. The bottleneck appears to be the throughput of the copy from pinned to pageable memory.
Copying a chunk from device to host pinned memory does overlap with computation as intended. However, while copying from pinned to pageable memory the host is blocked, and that copy takes a long time and varies a lot compared to the device-to-host transfer time. From our observations, the pinned-to-pageable throughput varies from 41.8 Gb/s down to 3.95 Gb/s. To give you a sense of what is going on, below is a screenshot of the NVIDIA profiler output:
Does anyone know why the host memory throughput varies so widely? What kind of throughput should we expect? Is there a workaround?
Thanks for the help!
The question “how can I copy data fast” immediately raises a red flag in my mind. Pure data movement, i.e. without computation, is generally wasteful; zero-copy interfaces are desirable. Can you avoid copying the data, maybe by double buffering, or by operating on the pinned memory as a ring buffer (data is filled in by the GPU at one end and retrieved for downstream processing at the other end)?
Is the host a multi-socket system, or something equivalent such as a CPU constructed from chiplets with an internal high-speed interconnect? Are memory and CPU affinity for the application controlled appropriately, i.e. does the CPU “talk” to the “near” GPU and “near” system memory? Is there other host activity besides the app that keeps the system memory controller(s) busy?
3.95 GB/sec (I assume it is GB not Gb) seems like abysmal system memory throughput for chunk sizes in the MB range on a modern system. What kind of memory subsystem does this system have? These days I would assume multi-channel DDR4. What function specifically is invoked to copy the data?
The data movement is necessary in our case: the data are processed on the GPU, but we need to stage the processed data through pinned memory so that we can overlap computation with memory transfer. As you suggest, we are thinking of using two pinned buffers, with a dedicated thread handling the host-side copy while another thread launches the CUDA kernels; a rough sketch of that idea follows.
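Roughly what we have in mind (all names are illustrative; d_chunk, processChunk, and the buffer sizes are assumed to be set up as in the loop sketch above):

#include <cuda_runtime.h>
#include <cstring>
#include <future>

// Ping-pong idea: two pinned buffers, the GPU fills one while a separate
// host thread drains the other into pageable memory.
void drainWithDoubleBuffer(double* d_chunk, double* h_pinned[2],
                           double* h_pageable, size_t chunkElems,
                           size_t nChunks, cudaStream_t stream) {
    std::future<void> copyDone[2];                       // drain status per pinned buffer
    for (size_t ichunk = 0; ichunk < nChunks; ++ichunk) {
        int buf = ichunk % 2;
        if (copyDone[buf].valid()) copyDone[buf].wait(); // is this pinned buffer free again?

        // ... launch processChunk for this chunk on 'stream' ...
        cudaMemcpyAsync(h_pinned[buf], d_chunk, chunkElems * sizeof(double),
                        cudaMemcpyDeviceToHost, stream);
        cudaStreamSynchronize(stream);

        // Drain pinned -> pageable on a worker thread; the GPU can already
        // start on the next chunk and fill the other pinned buffer.
        double* dst = h_pageable + ichunk * chunkElems;
        double* src = h_pinned[buf];
        copyDone[buf] = std::async(std::launch::async, [=] {
            std::memcpy(dst, src, chunkElems * sizeof(double));
        });
    }
    for (auto& f : copyDone) if (f.valid()) f.wait();
}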
We have a multi-socket system with 16 DDR4 DIMMs of 32 GB each. All the computation and transfers are done locally, and I do not think there is any host activity keeping the system memory controllers busy. Is there a better way to check this besides “top” or “htop”?
The throughput is indeed in GB/s, as you correctly assumed. The strange thing is that for roughly the first hundred steps, the pinned-to-pageable throughput is very high and behaves normally. After that, performance drops drastically and never recovers to what it was before.
For the pinned-to-pageable copy we tried multiple functions: std::memcpy, std::memmove, mempcpy, and cudaMemcpyAsync with cudaMemcpyHostToHost.
Given that you are using a multi-socket system, are you using numactl to bind CPUs and memory? Your description “all the computation and transfer are done locally” seems to suggest it.
The only mechanism I can think of that would drop the copy throughput by a factor of ten is some form of thrashing. Just to make sure: Have you confirmed that no swapping of memory to disk occurs? I am not sure why you would see the kind of temporal cliff you describe. Does this performance cliff always occur after exactly 100 steps?
What are the actual CPUs (vendor, model) used in this system?
Can you reveal any details of the software function used to perform the bulk copy between the pinned buffer and pageable memory?
It does not look like the memory is being swapped as the application is running, and the performance loss does not occur at a predictable step of the processing flow.
The CPU is the following: Intel(R) Xeon(R) Gold 6126 CPU @ 2.60GHz
For the memory copy, we are simply doing the following:
std::memcpy(pageableMem+(ichunk-1)*chunkSize, pinMem, chunkSize*sizeof(double));
*headscratch* Nothing jumps out at me right now. It has been a long time since I last investigated copy performance in detail. Back then, for large bulk copies, it was all about prefetching, non-temporal stores, TLB priming, and so on. I would assume that by now standard libraries have all of that magic built in, and then some.
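For illustration only, an explicit non-temporal-store copy might look like the sketch below (this assumes 32-byte-aligned buffers, a length that is a multiple of four doubles, and AVX enabled at compile time, e.g. -mavx; a modern memcpy typically switches to something equivalent for large sizes on its own):

#include <immintrin.h>
#include <cstddef>

// Copy n doubles with non-temporal (streaming) stores that bypass the cache.
// Assumes dst and src are 32-byte aligned and n is a multiple of 4.
void copy_nt(double* dst, const double* src, std::size_t n) {
    for (std::size_t i = 0; i < n; i += 4) {
        __m256d v = _mm256_load_pd(src + i);   // aligned 256-bit load
        _mm256_stream_pd(dst + i, v);          // non-temporal 256-bit store
    }
    _mm_sfence();                              // order the streaming stores
}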
Are you running on bare metal or inside a virtual machine of some sort?
I assume you have already double checked your measurement methodology to make sure that the time reported for the bulk copies really is spent copying, and does not include mis-attributed time spent elsewhere?
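For example, timing the copy call in isolation should show whether the reported time is really spent inside the copy (a sketch, with names and the reporting format purely illustrative):

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <cstring>

// Time a single pinned -> pageable copy and report the effective bandwidth.
// dst, src, and bytes correspond to the arguments of the memcpy in question.
void timed_copy(void* dst, const void* src, std::size_t bytes) {
    auto t0 = std::chrono::steady_clock::now();
    std::memcpy(dst, src, bytes);
    auto t1 = std::chrono::steady_clock::now();
    double sec = std::chrono::duration<double>(t1 - t0).count();
    std::printf("copied %zu bytes in %.3f ms (%.2f GB/s)\n",
                bytes, sec * 1e3, bytes / sec / 1e9);
}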
In the end, we simply allocated the entire data set to be processed in pinned memory. It is not the most portable solution, but it fixed the memory bandwidth problem, since it avoids the host-to-host copy altogether. A sketch of the final approach is below.
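In essence, the final version does something like this (sketch; all sizes and names are illustrative and much smaller than our real data set):

#include <cuda_runtime.h>

int main() {
    const size_t chunkElems = 16 << 20;                 // ~128 MB of doubles per chunk
    const size_t nChunks    = 8;                        // kept small for the sketch
    const size_t totalElems = chunkElems * nChunks;

    double* d_chunk = nullptr;
    double* h_all   = nullptr;                          // the entire result, pinned
    cudaMalloc(&d_chunk, chunkElems * sizeof(double));
    cudaHostAlloc(&h_all, totalElems * sizeof(double), cudaHostAllocDefault);

    cudaStream_t computeStream, transferStream;
    cudaStreamCreate(&computeStream);
    cudaStreamCreate(&transferStream);

    for (size_t ichunk = 0; ichunk < nChunks; ++ichunk) {
        // ... launch the processing kernel for this chunk on computeStream ...
        cudaStreamSynchronize(computeStream);

        // Copy the processed chunk straight to its final pinned location;
        // no pinned -> pageable copy is needed anymore.
        cudaMemcpyAsync(h_all + ichunk * chunkElems, d_chunk,
                        chunkElems * sizeof(double),
                        cudaMemcpyDeviceToHost, transferStream);
    }
    cudaStreamSynchronize(transferStream);

    cudaStreamDestroy(computeStream);
    cudaStreamDestroy(transferStream);
    cudaFreeHost(h_all);
    cudaFree(d_chunk);
    return 0;
}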