Improving data transfer performance from host to device

I have code that frequently has to transfer a huge amount of data from host to device for computation on the device. Because of this transfer overhead, the computation is much slower than expected. What are the ways to make this transfer more efficient? Can we use CUDA streams to copy the data asynchronously?

If you have PCI-E 3.0 with a full x16 link, you can get at least 12 GB/s across the bus in each direction using ‘pinned’ (page-locked) host memory.

You do need a decent PC for that to work, as the motherboard and DRAM type make a difference.

Run the CUDA-Z utility and it will show you both the CPU-GPU pinned memory speed and the pageable speed, and there should be a large difference between the two. For example, my reading for pinned is 12 GB/s, while my reading for pageable is 5 GB/s.
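If you would rather measure the pinned versus pageable difference from your own code, here is a minimal sketch. The 64 MiB buffer size, the repetition count, and the helper name h2dBandwidth are my own choices for illustration, not anything CUDA-Z uses, and the numbers you get will depend on your system.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                              \
    do {                                                         \
        cudaError_t err_ = (call);                               \
        if (err_ != cudaSuccess) {                               \
            fprintf(stderr, "%s failed: %s\n", #call,            \
                    cudaGetErrorString(err_));                   \
            exit(EXIT_FAILURE);                                  \
        }                                                        \
    } while (0)

// Hypothetical helper: time `reps` host-to-device copies, return GB/s.
static double h2dBandwidth(void *dst, const void *src, size_t bytes, int reps)
{
    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));
    CHECK(cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice)); // warm-up
    CHECK(cudaEventRecord(start));
    for (int i = 0; i < reps; ++i)
        CHECK(cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice));
    CHECK(cudaEventRecord(stop));
    CHECK(cudaEventSynchronize(stop));
    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    return (double)bytes * reps / (ms * 1.0e6); // bytes per ms -> GB/s
}

int main()
{
    const size_t bytes = 64u << 20; // 64 MiB per copy (arbitrary choice)
    const int reps = 10;

    void *dev = nullptr;
    CHECK(cudaMalloc(&dev, bytes));

    // Ordinary pageable host memory from malloc.
    void *pageable = malloc(bytes);
    if (!pageable) { fprintf(stderr, "malloc failed\n"); return EXIT_FAILURE; }
    memset(pageable, 1, bytes); // touch the pages so they are resident

    // Pinned (page-locked) host memory from the CUDA runtime.
    void *pinned = nullptr;
    CHECK(cudaMallocHost(&pinned, bytes));
    memset(pinned, 1, bytes);

    printf("pageable: %.2f GB/s\n", h2dBandwidth(dev, pageable, bytes, reps));
    printf("pinned:   %.2f GB/s\n", h2dBandwidth(dev, pinned, bytes, reps));

    CHECK(cudaFreeHost(pinned));
    free(pageable);
    CHECK(cudaFree(dev));
    return 0;
}
```

Something like `nvcc bw.cu -o bw` should build it (the file name is just an example). Keep in mind that pinning very large buffers takes physical memory away from the rest of the system, so pin only what you actually transfer.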

If you have a GPU with 2 copy engines (GTX 980, Tesla line) then you can run concurrent transfers in different directions.
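To find out how many copy engines your own GPU reports, you can query the asyncEngineCount field of cudaDeviceProp. A minimal sketch:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        fprintf(stderr, "no CUDA device found\n");
        return 1;
    }
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, d) != cudaSuccess)
            continue;
        // 0: no copy/kernel overlap
        // 1: a copy can overlap with kernel execution
        // 2: copies in both directions can overlap with kernel execution
        printf("GPU %d (%s): asyncEngineCount = %d\n",
               d, prop.name, prop.asyncEngineCount);
    }
    return 0;
}
```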

Yes, streams can help as well in certain situations, but the underlying hardware must be able to support it.

Google this stuff; there is a lot of literature on this topic.

[CudaaduC posted while I was still typing so there is some overlap with my post]

It would help to know what kind of GPU(s) you are using and what your host system looks like. What kind of PCIe transfer rates are you currently seeing, at what transfer block size?

The first thing you might want to check is whether the PCIe transfers operate at the best possible speed supported by your hardware. Ideally you would want your GPU in a PCIe gen3 x16 slot that provides >= 10 GB/sec throughput for large transfers. Some systems operate some slots at x4, or drop from x16 to x8 configuration when more than one GPU is being used. nvidia-smi shows what interface configuration is currently in use, but you must check while the transfers are ongoing, because when the PCIe interface is idle it operates at reduced performance to save power.
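If you want to read the link configuration from inside a program rather than from nvidia-smi, NVML exposes the same information. The following is only a sketch under several assumptions of mine: that NVML device index 0 is the same physical GPU as CUDA device 0 (the two orderings can differ on multi-GPU systems), that 20 queued copies of 256 MiB keep the link busy long enough for the query, and that you link against the NVML library.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <nvml.h>

int main()
{
    const size_t bytes = 256u << 20; // 256 MiB per copy (arbitrary choice)
    void *pinned = nullptr, *dev = nullptr;
    cudaStream_t stream;

    if (cudaMallocHost(&pinned, bytes) != cudaSuccess ||
        cudaMalloc(&dev, bytes) != cudaSuccess ||
        cudaStreamCreate(&stream) != cudaSuccess) {
        fprintf(stderr, "CUDA setup failed\n");
        return 1;
    }

    // Assumption: NVML index 0 refers to the GPU used by the CUDA calls above.
    nvmlDevice_t nvmlDev;
    if (nvmlInit() != NVML_SUCCESS ||
        nvmlDeviceGetHandleByIndex(0, &nvmlDev) != NVML_SUCCESS) {
        fprintf(stderr, "NVML setup failed\n");
        return 1;
    }

    // Queue asynchronous copies so the link is active while we query it.
    for (int i = 0; i < 20; ++i)
        cudaMemcpyAsync(dev, pinned, bytes, cudaMemcpyHostToDevice, stream);

    unsigned int gen = 0, width = 0, maxGen = 0, maxWidth = 0;
    nvmlDeviceGetCurrPcieLinkGeneration(nvmlDev, &gen);
    nvmlDeviceGetCurrPcieLinkWidth(nvmlDev, &width);
    nvmlDeviceGetMaxPcieLinkGeneration(nvmlDev, &maxGen);
    nvmlDeviceGetMaxPcieLinkWidth(nvmlDev, &maxWidth);
    printf("current link: gen%u x%u (max: gen%u x%u)\n",
           gen, width, maxGen, maxWidth);

    cudaStreamSynchronize(stream); // wait for the queued copies to finish
    nvmlShutdown();
    cudaStreamDestroy(stream);
    cudaFree(dev);
    cudaFreeHost(pinned);
    return 0;
}
```

A build line along the lines of `nvcc pcie_link.cu -o pcie_link -lnvidia-ml` should work (file name is an example). If the current values come out lower than the maximum, check the slot wiring and whether the copies were really still in flight when the query ran.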

If your system is a multi-socket system, ensure via CPU and memory affinity settings that each GPU always “talks” to the “near” CPU.

In general, the PCIe transfer rate increases with the size of the transfer; you may need blocks up to 16 MB in size to reach full throughput. Of course, the nature of your use case may limit the largest practical transfer size.
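A quick way to see where throughput levels off on your system is to sweep the transfer size. This is a rough sketch; the size range (4 KiB to 64 MiB, growing by a factor of 4) and the repetition count are arbitrary picks of mine, and it uses pinned host memory throughout.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                              \
    do {                                                         \
        cudaError_t err_ = (call);                               \
        if (err_ != cudaSuccess) {                               \
            fprintf(stderr, "%s failed: %s\n", #call,            \
                    cudaGetErrorString(err_));                   \
            exit(EXIT_FAILURE);                                  \
        }                                                        \
    } while (0)

int main()
{
    const size_t maxBytes = 64u << 20; // largest transfer tested: 64 MiB
    const int reps = 20;               // copies timed per size

    void *pinned = nullptr, *dev = nullptr;
    CHECK(cudaMallocHost(&pinned, maxBytes));
    CHECK(cudaMalloc(&dev, maxBytes));
    memset(pinned, 0, maxBytes);

    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    // Sweep 4 KiB, 16 KiB, ..., 64 MiB.
    for (size_t bytes = 4u << 10; bytes <= maxBytes; bytes *= 4) {
        CHECK(cudaMemcpy(dev, pinned, bytes, cudaMemcpyHostToDevice)); // warm-up
        CHECK(cudaEventRecord(start));
        for (int i = 0; i < reps; ++i)
            CHECK(cudaMemcpy(dev, pinned, bytes, cudaMemcpyHostToDevice));
        CHECK(cudaEventRecord(stop));
        CHECK(cudaEventSynchronize(stop));
        float ms = 0.0f;
        CHECK(cudaEventElapsedTime(&ms, start, stop));
        // bytes per millisecond divided by 1e6 gives GB/s
        printf("%10zu bytes: %6.2f GB/s\n", bytes,
               (double)bytes * reps / (ms * 1.0e6));
    }

    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaFree(dev));
    CHECK(cudaFreeHost(pinned));
    return 0;
}
```

If your application naturally produces many small buffers, batching them into one larger staging buffer before the copy is usually worth trying.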

CUDA streams allow you to overlap kernel computation with host<->device copies. If you have a GPU with dual copy engines like a Tesla, you will also be able to perform simultaneous copies to and from the GPU, as PCIe provides full duplex operation. How perfect the overlap is depends on the actual time needed for copies versus the time needed for kernels. I have certainly encountered real-life applications where copy operations were hidden almost perfectly behind kernel execution times, making kernel execution the performance-dominant factor.
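As an illustration of the overlap pattern, here is a minimal sketch that splits the data into chunks and gives each chunk its own stream, so that the copy for one chunk can run while the kernel for another is executing. The kernel scale, the chunk size, and the number of streams are placeholders I made up; your real kernel and sizes go in their place. Note that the host buffer must be pinned for cudaMemcpyAsync to be truly asynchronous.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                              \
    do {                                                         \
        cudaError_t err_ = (call);                               \
        if (err_ != cudaSuccess) {                               \
            fprintf(stderr, "%s failed: %s\n", #call,            \
                    cudaGetErrorString(err_));                   \
            exit(EXIT_FAILURE);                                  \
        }                                                        \
    } while (0)

// Placeholder kernel: multiply every element by a constant.
__global__ void scale(float *x, int n, float a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        x[i] *= a;
}

int main()
{
    const int nStreams = 4;       // arbitrary choice
    const int chunk = 1 << 22;    // floats per chunk (16 MiB)
    const int n = nStreams * chunk;
    const size_t chunkBytes = (size_t)chunk * sizeof(float);

    float *h = nullptr, *d = nullptr;
    CHECK(cudaMallocHost((void **)&h, (size_t)n * sizeof(float))); // pinned
    CHECK(cudaMalloc((void **)&d, (size_t)n * sizeof(float)));
    for (int i = 0; i < n; ++i)
        h[i] = 1.0f;

    cudaStream_t streams[nStreams];
    for (int k = 0; k < nStreams; ++k)
        CHECK(cudaStreamCreate(&streams[k]));

    // Each stream handles one chunk: copy in, compute, copy out.
    // Work in different streams is allowed to overlap.
    for (int k = 0; k < nStreams; ++k) {
        const int off = k * chunk;
        CHECK(cudaMemcpyAsync(d + off, h + off, chunkBytes,
                              cudaMemcpyHostToDevice, streams[k]));
        scale<<<(chunk + 255) / 256, 256, 0, streams[k]>>>(d + off, chunk, 2.0f);
        CHECK(cudaGetLastError());
        CHECK(cudaMemcpyAsync(h + off, d + off, chunkBytes,
                              cudaMemcpyDeviceToHost, streams[k]));
    }
    CHECK(cudaDeviceSynchronize());

    printf("h[0] = %.1f, h[n-1] = %.1f\n", h[0], h[n - 1]);

    for (int k = 0; k < nStreams; ++k)
        CHECK(cudaStreamDestroy(streams[k]));
    CHECK(cudaFree(d));
    CHECK(cudaFreeHost(h));
    return 0;
}
```

Whether this actually hides the copies depends on the kernel-to-copy time ratio mentioned above; a profiler timeline is the easiest way to confirm that the copies and kernels really run concurrently.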