cudaHostAlloc vs. cudaMemcpy tradeoff

Hi,

cudaHostAlloc takes 350 ms, and cudaMemcpy runs at 6 GB/sec in this case,

whereas if I use malloc, cudaMemcpy runs at only 1.8 GB/sec, which is very slow.

Please suggest a way to make cudaMemcpy faster when using malloc…

With page-locked (pinned) host memory, the transfer speed for large copies is limited by the speed with which the GPU’s DMA engine(s) can transfer data across the PCIe interface. The value of 6 GB/sec for large transfers is what you would be able to achieve with a PCIe gen2 interface.
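As a rough sanity check on the 6 GB/sec figure, here is the raw link arithmetic, assuming an x16 slot (the link width is my assumption, it is not stated in the question). PCIe gen2 signals at 5 GT/sec per lane with 8b/10b encoding:

```latex
\[
  5\ \tfrac{\text{GT}}{\text{sec}} \times \frac{8}{10} \times \frac{1\ \text{byte}}{8\ \text{bits}} \times 16\ \text{lanes}
  = 8\ \tfrac{\text{GB}}{\text{sec}} \quad \text{per direction}
\]
```

Packet headers and other protocol overhead come out of that 8 GB/sec, which is why about 6 GB/sec is what large pinned transfers reach in practice.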

When the host memory is pageable, the driver uses an internal locked host memory buffer for DMA transfers from and to the GPU. The data needs to be copied from the user application memory to that buffer (or vice versa), which is a host-side system memory to system memory copy. If you achieve only 1.8 GB/sec for copies from pageable memory, this is a pretty good indication that your system memory throughput is low.

You can measure your system memory throughput with the STREAM benchmark. What kind of system memory throughput does it report? What kind of a system is this? You would want an Ivy Bridge-based or Haswell-based system for best performance. The performance of your host’s system memory can also be influenced by the speed grade of the memory used, the channel configuration, and various BIOS settings, so you might want to look into those. Also, in a multi-socket system you would want to make sure to set up processor and memory affinity appropriately so the application runs on the “near” CPU and uses the “near” memory relative to the GPU.
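If you want a quick first number before setting up STREAM, a single-threaded memcpy timing like the sketch below approximates the kind of host-side copy the driver has to do for pageable transfers. This is not the STREAM benchmark: the function name `host_copy_gb_per_sec` is made up here, it runs on one thread only, and it counts only the bytes copied (STREAM's Copy kernel counts bytes read plus bytes written), so the results are not directly comparable.

```cpp
#include <chrono>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper: best observed memcpy rate in GB/sec over `repeats`
// runs, counting `bytes` once per copy. Returns 0.0 for empty input.
double host_copy_gb_per_sec(size_t bytes, int repeats) {
    if (bytes == 0 || repeats <= 0) return 0.0;
    // Filling the vectors touches every page before the timed loop.
    std::vector<char> src(bytes, 1);
    std::vector<char> dst(bytes, 0);
    double best = 0.0;
    for (int i = 0; i < repeats; ++i) {
        auto t0 = std::chrono::steady_clock::now();
        std::memcpy(dst.data(), src.data(), bytes);
        auto t1 = std::chrono::steady_clock::now();
        // Read one byte back through a volatile so the copy cannot be elided.
        volatile char sink = dst[bytes / 2];
        (void)sink;
        double sec = std::chrono::duration<double>(t1 - t0).count();
        if (sec > 0.0) {
            double rate = (bytes / 1e9) / sec;
            if (rate > best) best = rate;
        }
    }
    return best;
}
```

Use a buffer much larger than the last-level cache (for example `host_copy_gb_per_sec(512u << 20, 5)`), otherwise you measure cache bandwidth rather than system memory. On a multi-socket machine, run it pinned to the socket nearest the GPU to see the effect of affinity.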

As for cudaHostAlloc(), my understanding is that it is a thin wrapper around the relevant host operating system API calls. So its performance is primarily a function of the host’s OS and the performance of the host platform. As with any dynamic allocation, you want to allocate infrequently and mostly reuse existing allocations. 350 ms seems high. Is cudaHostAlloc() the very first CUDA API call made by the app? In that case it would trigger the one-time initialization cost of the CUDA context. You would want to call cudaFree(0) prior to any timed CUDA API calls, so the initialization cost is incurred at the cudaFree() call.