Hi,
cudaHostAlloc() takes 350 ms, and cudaMemcpy runs at 6 GB/s in that case, whereas with plain malloc the cudaMemcpy achieves only 1.8 GB/s, which is very slow.
Please suggest a way to make cudaMemcpy faster when using malloc.
With host memory pinned (page-locked), the transfer speed for large copies is limited by the speed with which the GPU’s DMA engine(s) can move data across the PCIe interface. The 6 GB/sec you see for large transfers is about what one would expect to achieve with a PCIe gen2 interface.
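For reference, here is a minimal sketch of how one might measure the pinned-memory transfer rate with CUDA events (buffer size and device are arbitrary assumptions, and error checking is omitted for brevity):

```
#include <cstdio>
#include <cuda_runtime.h>

int main(void)
{
    const size_t bytes = 256 * 1024 * 1024;   // 256 MiB transfer (arbitrary)
    void *h_pinned, *d_buf;

    cudaFree(0);                              // force context creation up front
    cudaHostAlloc(&h_pinned, bytes, cudaHostAllocDefault);
    cudaMalloc(&d_buf, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cudaMemcpy(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("pinned H2D: %.2f GB/s\n", bytes / (ms * 1.0e6));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFreeHost(h_pinned);
    cudaFree(d_buf);
    return 0;
}
```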
When the host memory is pageable, the driver uses an internal pinned host memory buffer for DMA transfers to and from the GPU. The data first needs to be copied from the user application’s memory to that buffer (or vice versa), which is a host-side system-memory-to-system-memory copy. If you achieve only 1.8 GB/sec for copies from pageable memory, that is a pretty good indication that your system memory throughput is low.
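To get a rough feel for that host-side copy cost, one could time a plain memcpy between two large malloc’d buffers; this is only a crude stand-in for a real benchmark, with arbitrary sizes:

```
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int main(void)
{
    const size_t bytes = 256 * 1024 * 1024;   // 256 MiB (arbitrary)
    char *src = (char *)malloc(bytes);
    char *dst = (char *)malloc(bytes);
    memset(src, 1, bytes);                    // touch pages so they are mapped
    memset(dst, 0, bytes);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    memcpy(dst, src, bytes);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("host memcpy: %.2f GB/s\n", bytes / sec / 1e9);

    free(src);
    free(dst);
    return 0;
}
```

Note that memcpy both reads and writes every byte, so the actual memory traffic is at least twice the quoted figure.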
You can measure your system memory throughput with the STREAM benchmark. What kind of throughput does it report, and what kind of system is this? You would want an Ivy Bridge-based or Haswell-based system for best performance. The performance of the host’s system memory is also influenced by the speed grade of the memory used, the channel configuration, and various BIOS settings, so you might want to check into that. Also, on a multi-socket system you would want to set up processor and memory affinity appropriately, so the application runs on the “near” CPU and uses the “near” memory relative to the GPU.
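On Linux, processor and memory affinity can be set with numactl, for example (node 0 and the application name are just placeholders here; check your actual topology, e.g. with “nvidia-smi topo -m”):

```
# bind both CPU execution and memory allocation to NUMA node 0,
# assuming that is the node nearest the GPU
numactl --cpunodebind=0 --membind=0 ./my_cuda_app
```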
As for cudaHostAlloc(), my understanding is that it is a thin wrapper around the relevant host operating system API calls, so its performance is primarily a function of the host’s OS and the performance of the host platform. As with any dynamic allocation, you want to allocate infrequently and mostly re-use existing allocations. 350 ms seems high: is cudaHostAlloc() the very first CUDA API call made by the app? In that case it would trigger the one-time initialization cost of the CUDA context. You would want to call cudaFree(0) prior to any timed CUDA API calls, so the initialization cost is incurred at the cudaFree() call.
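A sketch of that measurement pattern, using a host-side timer since cudaHostAlloc() is a host API call (Linux clock_gettime; the allocation size is arbitrary):

```
#include <stdio.h>
#include <time.h>
#include <cuda_runtime.h>

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    double t0 = now_sec();
    cudaFree(0);                 // absorbs the one-time context creation cost
    printf("context init:  %.1f ms\n", (now_sec() - t0) * 1e3);

    void *p;
    t0 = now_sec();
    cudaHostAlloc(&p, 256 * 1024 * 1024, cudaHostAllocDefault);
    printf("cudaHostAlloc: %.1f ms\n", (now_sec() - t0) * 1e3);

    cudaFreeHost(p);
    return 0;
}
```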