I will throw out the question first.
How much data is copied to the staging area at a time when cudaMemcpy() is used with pageable memory?
To my understanding, when pageable memory is used, the CUDA runtime will copy the pages from the pageable memory to pinned memory for staging, and DMA will kick in to transfer the data from host to device. So it has an additional step compared with a pinned-memory copy.
Suppose I want to copy 1 GB of data. Does CUDA copy the full 1 GB directly to the staging buffer, or is my data broken down into several pieces, e.g. at page granularity? I think the granularity of moving data from pageable memory to the pinned buffer is essential for determining the DMA performance.
Thanks in advance.
Think about it carefully, then use a profiler to inspect the actual behavior.
In order for a contiguous cudaMemcpy operation to happen, it needs a pointer to the data to be transferred (i.e. all of it). If the profiler doesn’t show you multiple transfers (it never has, in my experience) for a single cudaMemcpy operation, then it’s a safe bet that all of the data gets copied to pinned memory before the actual host->device transfer occurs.
My comments do not apply to noncontiguous e.g. cudaMemcpy2D case.
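To see this for yourself, a minimal sketch like the following (compile with nvcc and run under the profiler, e.g. nvprof or Nsight Systems) compares a single large pageable copy against a pinned copy; the buffer size and use of cudaEvent timing are just one way to set this up, not a prescribed method:

```cuda
// Hedged sketch: time a single large host->device cudaMemcpy from pageable
// vs. pinned memory. Run under a profiler to see whether the pageable copy
// shows up as one transfer or many.
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    const size_t bytes = 1ULL << 30;                 // 1 GB
    char *pageable = (char *)malloc(bytes);
    memset(pageable, 1, bytes);                      // make sure pages are resident
    char *pinned;   cudaMallocHost(&pinned, bytes);
    char *dev;      cudaMalloc(&dev, bytes);

    cudaMemcpy(dev, pinned, bytes, cudaMemcpyHostToDevice);  // warm-up

    cudaEvent_t start, stop;
    cudaEventCreate(&start); cudaEventCreate(&stop);
    float ms;

    cudaEventRecord(start);
    cudaMemcpy(dev, pageable, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop); cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("pageable: %.2f GB/s\n", bytes / ms / 1e6);

    cudaEventRecord(start);
    cudaMemcpy(dev, pinned, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop); cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("pinned:   %.2f GB/s\n", bytes / ms / 1e6);

    cudaFree(dev); cudaFreeHost(pinned); free(pageable);
    return 0;
}
```

The gap between the two printed bandwidth numbers is roughly the cost of the extra staging step.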
Thank you txbob,
What if data is non-contiguous in the memory?
Does it access the gather/scatter list (or page table) to get all the data first and kick off DMA?
Or does it move data chunk-by-chunk, in which each chunk is tracked by a single PTE or scatter list entry?
Thank you so much for your help.
Interesting. I seem to vaguely recall that old versions of the Programming Guide stated a limit (like 64 MB or so) below which cudaMemcpy() might run asynchronously once the data is copied to the pinned staging area, which would imply a staging area of that size. But the current CUDA Runtime Documentation states:
“For transfers from pageable host memory to device memory, a stream sync is performed before the copy is initiated. The function will return once the pageable buffer has been copied to the staging memory for DMA transfer to device memory, but the DMA to final destination may not have completed.”
This suggests Txbob is right (which he always is anyway), even though it would still be compatible with the previous behaviour/documentation. I hadn’t noticed this change (presuming my memory is correct and there has been a change at all).
Thanks for your input. Based on your previous information, I would describe the behavior of cudaMemcpy() as follows. Please advise.
Count bytes of data are copied to the staging area first. The count depends on the size of the contiguous data in memory, but is bounded by MAX_COUNT. Right?
How do I find MAX_COUNT? Is there any documentation?
The DMA transfer is overlapped with the staging operation on the host. However, since all the DMA transfers belong to the same stream, the ordering among the chunks is enforced. Am I right?
Thank you so much for your input.
Turns out my memory was somewhat correct. Even the current CUDA C Programming Guide contains the passage
"The following device operations are asynchronous with respect to the host:
- Kernel launches;
- Memory copies within a single device's memory;
- Memory copies from host to device of a memory block of 64 KB or less;
- Memory copies performed by functions that are suffixed with Async;
- Memory set function calls."
So that would support your description, with a count of 65,536 (not 67,108,864; funny how exponential growth has made remembering units so much harder).
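One way to probe the documented 64 KB threshold is to time how long the cudaMemcpy() call itself takes to return, versus how long the transfer takes to actually finish. This is a rough, hedged sketch (the timing method and buffer sizes are my own choices, and results will vary by driver and hardware):

```cuda
// Hedged sketch: for copies at or below 64 KB, the call may return before
// the transfer completes; for larger copies, the call should block until
// at least the staging step is done.
#include <cuda_runtime.h>
#include <stdio.h>
#include <time.h>

static double wallclock_ms(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e3 + ts.tv_nsec / 1e6;
}

static void probe(char *dev, char *host, size_t bytes) {
    double t0 = wallclock_ms();
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
    double t1 = wallclock_ms();          // call has returned
    cudaDeviceSynchronize();
    double t2 = wallclock_ms();          // transfer has definitely finished
    printf("%7zu bytes: call %.3f ms, remaining after return %.3f ms\n",
           bytes, t1 - t0, t2 - t1);
}

int main(void) {
    static char host[128 * 1024];
    char *dev; cudaMalloc(&dev, sizeof(host));
    cudaMemcpy(dev, host, sizeof(host), cudaMemcpyHostToDevice);  // warm-up
    probe(dev, host, 64 * 1024);     // at the documented async limit
    probe(dev, host, 128 * 1024);    // above it
    cudaFree(dev);
    return 0;
}
```

If the small copy shows nonzero time remaining after the call returns while the large one does not, that is consistent with the quoted passage from the Programming Guide.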
Then it means that for every 64 KB of data or less, a PTE is referenced. Not only does it take time to access the memory, it also costs PCIe bandwidth to transfer the descriptors. Wow, this is very expensive!
Since current OSes support large page sizes (e.g., 2 MB), I don’t understand why NVIDIA puts this constraint there. Any comment? Your input has been extremely helpful.
I don’t understand your comment about PCIe bandwidth for transferring descriptors. Copy to the staging area occurs on the host, so no PTEs are passed over PCIe.
Sorry, my mistake. I think the PCIe packets should only contain the memory address on the device, right?
Also, is there any special reason for only allowing 64KB of transfer from the host memory to DMA staging area?
I would assume NVIDIA has tested different buffer sizes and found that the speedup beyond 64 KB does not warrant the memory-footprint tradeoff. After all, cudaMemcpy() achieves decent throughput.
This is all speculation though. If you are really interested, you could write your own version and benchmark it with different buffer sizes.
Note that the 64 KB might also mean that the driver is using two buffers of 32 KB each, to overlap the host-buffer and host-device copies. Or it could mean two separate 64 KB buffers without logic to combine them if no staging is needed. This really is undocumented terrain.
I really appreciate your information. It is crystal clear, although we are just speculating.
It is entirely possible that the setup described below has changed with CUDA 8.0, which I have not used yet.
The 64KB limit applies to host->device transfers only, and applies to data transport through the GPU push buffer, the same buffer used to transport commands down to the GPU. So basically a copy command is sent down to the GPU along with the data. That is a win for small transfers, but obviously the push buffer is not designed for large bulk transfers.
This has nothing to do with the pinned host-side buffer set up by the CUDA driver to facilitate DMA transfers between the GPU and pageable host memory. I don’t know exactly how big that buffer is (it may differ by driver version or system parameters), but the indications are (or at least were) that its size is on the order of a few MB in size. Larger transfers are moved through that buffer in chunks. Note that maximum PCIe throughput is typically achieved at transfer sizes of 4 MB - 8 MB, so breaking up large transfers into chunks has negligible impact on host/device DMA throughput.
However, use of the buffer means that each transfer involves a DMA copy between the device and the driver’s buffer, plus a system memory copy between the pinned driver buffer and user application memory. Systems with high system memory throughput (I think up to 60 GB/sec are possible with Intel based systems at present*) will show definite benefits in this scenario, as the measured throughput for transfers from/to pageable host memory will be much closer to the throughput from/to pinned host memory.
[*] Measured as opposed to theoretical; e.g., the Xeon E5-1650 v4 is listed with a theoretical memory throughput of up to 76.8 GB/sec: http://ark.intel.com/products/92994/Intel-Xeon-Processor-E5-1650-v4-15M-Cache-3_60-GHz