I have read a lot of guides about CUDA and I understand that it is better to group transfers from host to device instead of issuing many small transfers. Why is that? I can't understand the exact reason. Is the PCI Express bus responsible for this? E.g., why is it better to transfer 40 KB with one cudaMemcpy instead of using 10 cudaMemcpys of 4 KB each?
I would guess the fixed overhead has more to do with the operating system and driver than with the PCI-Express bus. I've never tried doing a host-to-device transfer with zero bytes in it to see how long it takes, so I'm not sure how large the fixed overhead ends up being. (Not even sure if anything happens when you do a zero-length transfer…)
Then why is the fixed overhead of the operating system bigger for smaller transfers? Where is the bottleneck?
Anyone who knows something about this?
I still can't find anyone who can explain to me why large transfers are better than small ones. I believe others have the same question too. Where can I find something about this?
Thanks in advance…
The transfers are performed by DMA engines on the GPU, which have to be programmed by the CPU for each individual transfer. As seibert already said, this creates a constant overhead per transfer, independent of the transfer size.
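A simple cost model makes this concrete. If every transfer pays a fixed setup cost $t_0$ (driver call, programming the DMA engine) plus the time to move the payload at bandwidth $B$, the numbers below follow. ($t_0 \approx 10\,\mu s$ and $B \approx 6$ GB/s are illustrative assumptions, not measured values.)

```latex
T(n) = t_0 + \frac{n}{B}

\text{10 transfers of 4 KB: } 10\,(t_0 + \tfrac{4096}{B}) \approx 10\,(10 + 0.7)\,\mu s \approx 107\,\mu s

\text{1 transfer of 40 KB: } t_0 + \tfrac{40960}{B} \approx (10 + 6.8)\,\mu s \approx 17\,\mu s
```

For small payloads the fixed cost $t_0$ dominates, so splitting a transfer into 10 pieces pays it 10 times; with these assumed numbers the batched copy is roughly 6x faster even though the same bytes move over the bus.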
If you have to transfer many small, scattered results (and can't change to a more efficient memory layout), consider writing a kernel that copies the results from the device into zero-copy memory on the host.
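A minimal sketch of that idea, assuming the results sit at known scattered indices in device memory; `cudaHostAlloc` with `cudaHostAllocMapped` and `cudaHostGetDevicePointer` are the standard zero-copy APIs, but the kernel and helper names here are hypothetical:

```cuda
#include <cuda_runtime.h>

// Gather scattered device results directly into mapped (zero-copy) host memory.
// The kernel's writes travel over PCIe as they happen, so no separate
// cudaMemcpy (and no per-transfer setup overhead) is needed afterwards.
__global__ void gatherToHost(const float *devResults, const int *indices,
                             float *hostOut, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        hostOut[i] = devResults[indices[i]];  // one coalesced write per result
}

void copyScatteredResults(const float *devResults, const int *devIndices, int n)
{
    float *hostOut, *devOut;
    // Pinned host memory, mapped into the device address space.
    cudaHostAlloc(&hostOut, n * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer(&devOut, hostOut, 0);

    int block = 256;
    gatherToHost<<<(n + block - 1) / block, block>>>(devResults, devIndices,
                                                     devOut, n);
    cudaDeviceSynchronize();  // hostOut is valid only after this returns

    // ... use hostOut ...
    cudaFreeHost(hostOut);
}
```

Note that on devices without unified addressing you need `cudaSetDeviceFlags(cudaDeviceMapHost)` before any other CUDA call for the mapping to work, and the kernel must finish (synchronize) before the host reads the buffer.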