I measured cudaMemcpy throughput for two kinds of host memory, pinned and pageable, while gradually increasing the transfer size. Below is the graph of throughput (GB/s) vs. transfer size (KB).
I have two questions.
First, why is the achievable throughput low when transferring small amounts of data (with both pageable and pinned memory)?
In my experience, transfers to/from pageable memory reliably reach over 6 GB/s once the size is around 1 MB. Below that, the throughput is lower.
–> I know that a PCIe packet with a small payload has about (2b/130b) = 1.5% overhead. However, I am not sure whether this is a major factor in the low throughput. If anyone knows the correct answer, or knows how I can prove that this is the main factor, please let me know.
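One way I could check this, as a rough sketch rather than a definitive test: a fixed 1.5% encoding overhead would lower the throughput by the same fraction at every size, so it should not bend the curve by itself. A fixed cost per cudaMemcpy call would. The code below assumes the simple model t(n) = t0 + n/B, estimates t0 from a 4-byte copy and B from a 16 MiB copy, and prints measured vs. modelled throughput for pinned memory. The names `CHECK`, `model_gbps` and `time_copy_us`, the sizes, and the repetition counts are my own illustrative choices.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                 \
    do {                                                            \
        cudaError_t err_ = (call);                                  \
        if (err_ != cudaSuccess) {                                  \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__, \
                         cudaGetErrorString(err_));                 \
            std::exit(1);                                           \
        }                                                           \
    } while (0)

// Throughput in GB/s (1 GB = 1e9 bytes) predicted by t(n) = t0 + n / B.
// t0_us: assumed fixed cost per call in microseconds; bw_gbps: peak rate B.
double model_gbps(double bytes, double t0_us, double bw_gbps) {
    double t_us = t0_us + bytes / (bw_gbps * 1e3);  // 1 GB/s = 1e3 bytes/us
    return bytes / (t_us * 1e3);
}

// Average time of one host-to-device cudaMemcpy of `bytes`, in microseconds.
double time_copy_us(void* dst, const void* src, size_t bytes, int reps) {
    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));
    CHECK(cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice));  // warm-up
    CHECK(cudaEventRecord(start));
    for (int i = 0; i < reps; ++i)
        CHECK(cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice));
    CHECK(cudaEventRecord(stop));
    CHECK(cudaEventSynchronize(stop));
    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    return ms * 1e3 / reps;
}

int main() {
    const size_t max_bytes = 16u << 20;  // 16 MiB
    void *h_pinned = nullptr, *d_buf = nullptr;
    CHECK(cudaMallocHost(&h_pinned, max_bytes));
    CHECK(cudaMalloc(&d_buf, max_bytes));

    // Estimate t0 from a tiny copy and B from the largest copy.
    double t0_us = time_copy_us(d_buf, h_pinned, 4, 1000);
    double big_us = time_copy_us(d_buf, h_pinned, max_bytes, 20);
    double bw_gbps = max_bytes / ((big_us - t0_us) * 1e3);

    std::printf("t0 = %.2f us, B = %.2f GB/s\n", t0_us, bw_gbps);
    std::printf("%10s %12s %12s\n", "size(KB)", "measured", "model");
    for (size_t n = 1024; n <= max_bytes; n *= 4) {
        double us = time_copy_us(d_buf, h_pinned, n, 200);
        std::printf("%10zu %12.2f %12.2f\n", n / 1024, n / (us * 1e3),
                    model_gbps((double)n, t0_us, bw_gbps));
    }
    CHECK(cudaFreeHost(h_pinned));
    CHECK(cudaFree(d_buf));
    return 0;
}
```

If the measured column follows the model column, the low throughput at small sizes would mostly be the fixed per-call cost t0 rather than the per-packet overhead. If it does not, the model is too simple and something else is involved.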
Second, the throughput with pageable memory is only about 1/2 of the throughput with pinned memory. Why is this happening?
My guess is that a transfer to/from pageable memory goes through the following steps:
(1) Pinned memory allocation
(2) copy from pageable memory to pinned memory
(3) copy from pinned memory to device memory
I think these extra steps are the reason. On the other hand, if the pinned memory is allocated from the beginning, only step (3) is needed.
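The guess above could be tested with something like the following sketch, which performs steps (2) and (3) by hand and compares the result with a direct cudaMemcpy from pageable memory. It assumes the steps run strictly one after the other; if the driver actually pipelines the staging copy in chunks, the real pageable transfer could be faster than this manual version. The helper names `CHECK`, `gbps` and `seconds_per_call`, the 64 MiB size, and the repetition count are illustrative choices, not anything taken from the CUDA documentation.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                 \
    do {                                                            \
        cudaError_t err_ = (call);                                  \
        if (err_ != cudaSuccess) {                                  \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__, \
                         cudaGetErrorString(err_));                 \
            std::exit(1);                                           \
        }                                                           \
    } while (0)

// GB/s (1 GB = 1e9 bytes) for `bytes` moved in `seconds`.
double gbps(double bytes, double seconds) { return bytes / seconds / 1e9; }

// Wall-clock seconds per call of `fn`, averaged over `reps` calls.
template <typename F>
double seconds_per_call(F fn, int reps) {
    fn();  // warm-up
    CHECK(cudaDeviceSynchronize());
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < reps; ++i) fn();
    CHECK(cudaDeviceSynchronize());
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count() / reps;
}

int main() {
    const size_t bytes = 64u << 20;  // 64 MiB, large enough to hide call overhead
    const int reps = 20;
    void* h_pageable = std::malloc(bytes);
    void *h_pinned = nullptr, *d_buf = nullptr;
    if (!h_pageable) return 1;
    CHECK(cudaMallocHost(&h_pinned, bytes));
    CHECK(cudaMalloc(&d_buf, bytes));
    std::memset(h_pageable, 1, bytes);  // touch the pages so they are resident
    std::memset(h_pinned, 1, bytes);

    // Step (3) only: pinned host memory -> device.
    double t_pinned = seconds_per_call([&] {
        CHECK(cudaMemcpy(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice));
    }, reps);
    // What the driver does for pageable memory (internals not visible here).
    double t_pageable = seconds_per_call([&] {
        CHECK(cudaMemcpy(d_buf, h_pageable, bytes, cudaMemcpyHostToDevice));
    }, reps);
    // Step (2) only: pageable -> pinned, bounded by host memory bandwidth.
    double t_host = seconds_per_call([&] {
        std::memcpy(h_pinned, h_pageable, bytes);
    }, reps);
    // Steps (2) + (3) done by hand, one after the other.
    double t_manual = seconds_per_call([&] {
        std::memcpy(h_pinned, h_pageable, bytes);
        CHECK(cudaMemcpy(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice));
    }, reps);

    std::printf("pinned   H2D      : %6.2f GB/s\n", gbps(bytes, t_pinned));
    std::printf("pageable H2D      : %6.2f GB/s\n", gbps(bytes, t_pageable));
    std::printf("host memcpy       : %6.2f GB/s\n", gbps(bytes, t_host));
    std::printf("manual (2)+(3)    : %6.2f GB/s\n", gbps(bytes, t_manual));

    CHECK(cudaFreeHost(h_pinned));
    CHECK(cudaFree(d_buf));
    std::free(h_pageable);
    return 0;
}
```

If the "pageable H2D" and "manual (2)+(3)" rates come out close to each other, and the manual time is close to the sum of the host memcpy time and the pinned transfer time, that would support the guess that step (2) is what costs the extra time.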
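In my environment, the host memory bandwidth is about 17 GB/s, which is close to the PCIe bandwidth (15.8 GB/s). If steps (2) and (3) run one after the other, the combined rate would be about 1/(1/17 + 1/15.8) ≈ 8.2 GB/s, roughly half of the PCIe bandwidth. That would explain a performance difference of about 2 times.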
If my guess is correct, is the time spent in (2) related to the host memory bandwidth?
I look forward to your answers. Thank you.