Group transfers from host to device


I have read a lot of guides about CUDA, and I understand that it is better to group transfers from host to device instead of issuing many small transfers. Why is that? I can’t understand the exact reason. Is the PCI Express bus responsible for this? E.g., why is it better to transfer 40k of data with one cudaMemcpy instead of using 10 cudaMemcpy calls of 4k each?

I would guess the fixed overhead has more to do with the operating system and driver than with the PCI-Express bus. I’ve never tried doing a host-to-device transfer with zero bytes in it to see how long it takes, so I’m not sure how large the fixed overhead ends up being. (I’m not even sure if anything happens when you do a zero-length transfer…)

Then why is the fixed overhead of the operating system bigger for smaller transfers? Where is the bottleneck?

Does anyone know something about this?

I still can’t find anyone who can explain to me why large transfers are better than small ones. I believe others have the same question too. Where can I find something about this?

Thanks in advance…

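The transfers are performed by DMA engines on the GPU, which have to be programmed by the CPU for each individual transfer. As seibert already said, this creates some constant overhead per transfer. The overhead is not larger for small transfers; it is the same for every transfer, so ten small copies pay it ten times while one large copy pays it once.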
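To see the per-transfer cost on your own machine, something like the following can be timed. This is only a minimal sketch: it reads the "4k" and "40k" from the question as 4 KB and 40 KB, uses pinned host memory, and the helper name `time_copies` and the repetition count are made up for illustration. The absolute numbers depend heavily on the GPU, driver and OS.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            std::fprintf(stderr, "CUDA error at %s:%d: %s\n",           \
                         __FILE__, __LINE__, cudaGetErrorString(err_)); \
            std::exit(EXIT_FAILURE);                                    \
        }                                                               \
    } while (0)

// Hypothetical helper: copy `total` bytes host-to-device in pieces of
// `chunk` bytes, `reps` times over, and return the average wall-clock
// time per repetition in microseconds. `total` must be a multiple of `chunk`.
static double time_copies(char* d_buf, const char* h_buf,
                          size_t total, size_t chunk, int reps) {
    CHECK(cudaDeviceSynchronize());
    auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; ++r) {
        for (size_t off = 0; off < total; off += chunk) {
            CHECK(cudaMemcpy(d_buf + off, h_buf + off, chunk,
                             cudaMemcpyHostToDevice));
        }
    }
    CHECK(cudaDeviceSynchronize());
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::micro>(t1 - t0).count() / reps;
}

int main() {
    const size_t chunk = 4 * 1024;    // assumed: "4k" means 4 KB
    const size_t total = 10 * chunk;  // 10 x 4 KB = 40 KB
    const int    reps  = 1000;        // arbitrary; averages out timer noise

    char* h_buf = nullptr;
    char* d_buf = nullptr;
    CHECK(cudaMallocHost((void**)&h_buf, total));  // pinned host memory
    CHECK(cudaMalloc((void**)&d_buf, total));
    std::memset(h_buf, 1, total);

    // Warm-up copy so that one-time initialisation is not measured.
    CHECK(cudaMemcpy(d_buf, h_buf, total, cudaMemcpyHostToDevice));

    double us_small = time_copies(d_buf, h_buf, total, chunk, reps);
    double us_large = time_copies(d_buf, h_buf, total, total, reps);

    std::printf("10 x 4 KB copies: %.2f us per 40 KB\n", us_small);
    std::printf(" 1 x 40 KB copy : %.2f us per 40 KB\n", us_large);

    CHECK(cudaFree(d_buf));
    CHECK(cudaFreeHost(h_buf));
    return 0;
}
```

Both runs move exactly the same 40 KB, so any difference between the two printed times comes from the nine extra cudaMemcpy calls.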

If you have to transfer many small, scattered results (and can’t change to a more efficient memory layout), consider writing a kernel that copies the results from the device to zero-copy memory on the host.
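A rough sketch of that idea, assuming the scattered results can be described by an index array: each thread reads one scattered element from device memory and writes it into a dense buffer in mapped (zero-copy) host memory. The kernel name `gather_to_host`, the sizes and the scatter pattern are invented for illustration only.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            std::fprintf(stderr, "CUDA error at %s:%d: %s\n",           \
                         __FILE__, __LINE__, cudaGetErrorString(err_)); \
            std::exit(EXIT_FAILURE);                                    \
        }                                                               \
    } while (0)

// Each thread copies one scattered device element into a dense output
// buffer that lives in mapped host memory.
__global__ void gather_to_host(const float* d_data, const int* d_idx,
                               float* out_mapped, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out_mapped[i] = d_data[d_idx[i]];
}

int main() {
    const int data_len = 1 << 20;  // size of the device array (illustrative)
    const int n        = 1024;     // number of scattered results (illustrative)
    const int stride   = 997;      // made-up scatter pattern

    // Allow mapped pinned allocations; call before other CUDA work.
    CHECK(cudaSetDeviceFlags(cudaDeviceMapHost));

    // Device data: element j holds the value j, so results are easy to check.
    std::vector<float> h_data(data_len);
    for (int j = 0; j < data_len; ++j) h_data[j] = static_cast<float>(j);
    std::vector<int> h_idx(n);
    for (int i = 0; i < n; ++i) h_idx[i] = (i * stride) % data_len;

    float* d_data = nullptr;
    int*   d_idx  = nullptr;
    CHECK(cudaMalloc((void**)&d_data, data_len * sizeof(float)));
    CHECK(cudaMalloc((void**)&d_idx, n * sizeof(int)));
    CHECK(cudaMemcpy(d_data, h_data.data(), data_len * sizeof(float),
                     cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(d_idx, h_idx.data(), n * sizeof(int),
                     cudaMemcpyHostToDevice));

    // Zero-copy buffer: pinned host memory that the GPU can write directly.
    float* h_out = nullptr;
    float* d_out = nullptr;  // device-side alias of h_out
    CHECK(cudaHostAlloc((void**)&h_out, n * sizeof(float), cudaHostAllocMapped));
    CHECK(cudaHostGetDevicePointer((void**)&d_out, h_out, 0));

    // One launch gathers all n scattered values; no cudaMemcpy back is needed.
    const int threads = 256;
    const int blocks  = (n + threads - 1) / threads;
    gather_to_host<<<blocks, threads>>>(d_data, d_idx, d_out, n);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());  // results are visible on the host after this

    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (h_out[i] != static_cast<float>(h_idx[i])) ++mismatches;
    std::printf("mismatches: %d\n", mismatches);

    CHECK(cudaFreeHost(h_out));
    CHECK(cudaFree(d_idx));
    CHECK(cudaFree(d_data));
    return mismatches == 0 ? 0 : 1;
}
```

Each output element is written exactly once, and the 1024 scattered reads are serviced by a single kernel launch rather than 1024 separate copies.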

Two questions:

  1. How many bytes counts as a small transfer?
  2. According to the CUDA C Best Practices Guide, I think zero-copy memory is best read or written only once. How can I use it for a lot of transfers? Won’t my performance decrease?

  1. A few charts of host<->device bandwidth vs. block size similar to this one have been posted to the forums. A quick search should bring them up.
     I usually use transfers on the order of megabytes, and the chart I linked to also seems to support this.

  2. Of course all data is supposed to be transferred only once. But with your own kernel you are free to transfer any number of blocks (or any arbitrarily scattered memory layout) in a single invocation.
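
On point 1, a simple cost model gives a feel for what "small" means. It is only a sketch: it assumes every transfer costs a constant setup time plus a part proportional to its size, and the symbols below are introduced here, not taken from any guide.

```latex
% t_0 : fixed cost per transfer (driver call + DMA setup), assumed constant
% B   : peak bus bandwidth,  n : bytes moved,  k : number of transfers
\begin{align}
T(n)                 &= t_0 + \frac{n}{B}
    && \text{one transfer of } n \text{ bytes} \\
T_{\text{split}}(n,k) &= k\left(t_0 + \frac{n}{kB}\right) = k\,t_0 + \frac{n}{B}
    && \text{the same } n \text{ bytes in } k \text{ pieces} \\
B_{\text{eff}}(n)    &= \frac{n}{T(n)} = B\,\frac{n}{n + t_0 B}
    && \text{effective bandwidth of one transfer}
\end{align}
```

The effective bandwidth reaches half of the peak at n = t_0 B. With purely illustrative values of t_0 = 10 µs and B = 6 GB/s (placeholders, not measurements), that is 60 KB: a 4 KB transfer would then get about 6% of the peak and a 1 MB transfer about 94%, which fits the advice to transfer on the order of megabytes.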