We are iteratively processing a large amount of data on the GPU and sending it back to the host chunk by chunk. The chunk size is fixed and constant. We use streams to overlap computation with memory transfers, and to do so the staging buffers are pinned memory allocated with cudaHostAlloc using the cudaHostAllocDefault flag. To give a sense of scale, the full data set is on the order of hundreds of GB and each chunk is on the order of hundreds of MB. Since we cannot allocate the entire data set as pinned memory, we iteratively offload each chunk from pinned to pageable memory.
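To make the setup concrete, here is a minimal sketch of how a pinned staging buffer of this kind is allocated (illustrative only: the chunk size, element type, and names such as pinnedChunk are placeholders, not our real code):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    // Illustrative chunk size only; in our application it is fixed at a few hundred MB.
    const size_t chunkBytes = 256ull << 20;

    // Pinned staging buffer, allocated once with cudaHostAllocDefault and reused per chunk.
    float* pinnedChunk = nullptr;
    cudaError_t err = cudaHostAlloc(reinterpret_cast<void**>(&pinnedChunk),
                                    chunkBytes, cudaHostAllocDefault);
    if (err != cudaSuccess) {
        std::printf("cudaHostAlloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // ... device-to-host copies land in pinnedChunk and are then drained into pageable memory ...

    cudaFreeHost(pinnedChunk);
    return 0;
}
```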
In terms of code, we are performing the following steps (a minimal sketch of the pipeline follows the list):
- Process a chunk on the GPU (computeStream)
- In parallel, copy the processed chunk from device to host pinned memory with cudaMemcpyAsync (transferStream)
- Copy the chunk from pinned memory to pageable memory using std::memcpy, or cudaMemcpyAsync with cudaMemcpyHostToHost (transferStream)
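The following is a minimal, self-contained sketch of this pipeline, not our production code: the processChunk kernel, the sizes, the chunk count, and the double-buffering scheme are illustrative placeholders, and error checking is omitted for brevity.

```cpp
#include <cuda_runtime.h>
#include <cstring>
#include <vector>

// Hypothetical kernel standing in for the real per-chunk processing.
__global__ void processChunk(float* data, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    constexpr size_t kElems  = 64ull << 20;          // illustrative chunk size (~256 MB of floats)
    constexpr size_t kBytes  = kElems * sizeof(float);
    constexpr int    kChunks = 8;                    // stand-in for the real chunk count

    cudaStream_t computeStream, transferStream;
    cudaStreamCreate(&computeStream);
    cudaStreamCreate(&transferStream);

    // Double buffers so the next chunk can be computed while the previous one is copied out.
    float*      dChunk[2];
    float*      pinnedChunk[2];
    cudaEvent_t computed[2], copied[2];
    for (int b = 0; b < 2; ++b) {
        cudaMalloc(reinterpret_cast<void**>(&dChunk[b]), kBytes);
        cudaHostAlloc(reinterpret_cast<void**>(&pinnedChunk[b]), kBytes, cudaHostAllocDefault);
        cudaEventCreate(&computed[b]);
        cudaEventCreate(&copied[b]);
        cudaEventRecord(copied[b]);                  // mark both buffers as initially free
    }

    std::vector<float> pageable(kElems * kChunks);   // pageable destination (hundreds of GB in reality)

    for (int c = 0; c < kChunks; ++c) {
        const int b = c & 1;

        // Step 1: process chunk c on computeStream, reusing device buffer b once its
        // previous device-to-host copy has drained.
        cudaStreamWaitEvent(computeStream, copied[b], 0);
        processChunk<<<(unsigned)((kElems + 255) / 256), 256, 0, computeStream>>>(dChunk[b], kElems);
        cudaEventRecord(computed[b], computeStream);

        // Step 2: copy chunk c device -> pinned on transferStream; this overlaps with the
        // computation of the next chunk issued in the following iteration.
        cudaStreamWaitEvent(transferStream, computed[b], 0);
        cudaMemcpyAsync(pinnedChunk[b], dChunk[b], kBytes,
                        cudaMemcpyDeviceToHost, transferStream);
        cudaEventRecord(copied[b], transferStream);

        // Step 3: while the GPU works on chunk c, drain chunk c-1 from pinned to pageable
        // memory on the host. This is the copy whose throughput we see varying so much.
        if (c > 0) {
            const int prev = (c - 1) & 1;
            cudaEventSynchronize(copied[prev]);
            std::memcpy(pageable.data() + (size_t)(c - 1) * kElems, pinnedChunk[prev], kBytes);
        }
    }

    // Drain the final chunk.
    const int last = (kChunks - 1) & 1;
    cudaEventSynchronize(copied[last]);
    std::memcpy(pageable.data() + (size_t)(kChunks - 1) * kElems, pinnedChunk[last], kBytes);

    cudaDeviceSynchronize();
    for (int b = 0; b < 2; ++b) {
        cudaEventDestroy(computed[b]);
        cudaEventDestroy(copied[b]);
        cudaFreeHost(pinnedChunk[b]);
        cudaFree(dChunk[b]);
    }
    cudaStreamDestroy(computeStream);
    cudaStreamDestroy(transferStream);
    return 0;
}
```

In the real code the pageable destination holds hundreds of GB and is filled incrementally as chunks arrive; the sketch keeps it as a modest std::vector only so the example runs on an ordinary machine.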
The goal is to “hide” the transfer time so that processing can continue while the data is being transferred. The bottleneck appears to be the throughput of the copy from pinned to pageable memory.
The device-to-host copy into pinned memory overlaps with computation as we wanted. However, during the copy from pinned to pageable memory the host is blocked, and the copy time varies a lot and is long compared to the device-to-host transfer time. From our observations, the pinned-to-pageable throughput varies from 41.8 GB/s down to 3.95 GB/s. To give a sense of what is going on, below is a screenshot of the NVIDIA profiler output:
Does anyone know why the host memory throughput varies so widely? What throughput should we expect? Is there a workaround?
Thanks for the help!