Low Memcpy Throughput

Hello,
I am writing a kernel and I am using the nvvp profiler to estimate the performance of my project. I use the function “cudaMalloc()” to allocate the data in device memory, and I transfer the data from CPU to GPU and back with “cudaMemcpy()”. The data I transfer are 1D arrays of 196608 integers (768 kB).
The nvvp profiler outputs the message:

“Low Memcpy Throughput (1.066GB/s avg, for memcpys accounting for 100% of all memcpy time)
The memory copies are not fully using the available host to device bandwidth”.

I know there are other ways to transfer data (using Pinned Memory, Zero Copy Memory, Unified Memory, Unified Virtual Addressing etc).
So is the only thing I have to do to change the way I transfer data to global memory, or are there also other techniques to improve Memcpy Throughput?

Thank you!

PCIe is a packetized transport, so throughput will differ based on transfer size, with smaller transfer sizes leading to lower throughput.
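
To see the size dependence on your own machine, something like the following rough sketch could be used. It times host-to-device cudaMemcpy() calls from pinned memory at doubling transfer sizes. The size range (4 KiB to 32 MiB), the repeat count, and the helper names CHECK and gbPerSec are my own choices for illustration, not anything prescribed by CUDA, and the numbers you get will depend on your platform.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails (illustrative helper).
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Convert bytes moved and elapsed milliseconds to GB/s (1 GB = 1e9 bytes).
static double gbPerSec(size_t bytes, float ms) {
    return (double)bytes / 1e9 / ((double)ms / 1e3);
}

int main() {
    const size_t maxBytes = (size_t)32 << 20;  // 32 MiB upper bound (assumption)
    const int reps = 20;                       // copies averaged per size (assumption)

    void *hostBuf = nullptr, *devBuf = nullptr;
    CHECK(cudaMallocHost(&hostBuf, maxBytes)); // pinned (page-locked) host memory
    CHECK(cudaMalloc(&devBuf, maxBytes));

    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    printf("%12s  %10s\n", "bytes", "GB/s");
    for (size_t bytes = 4096; bytes <= maxBytes; bytes *= 2) {
        // One untimed warm-up copy, then time `reps` back-to-back copies.
        CHECK(cudaMemcpy(devBuf, hostBuf, bytes, cudaMemcpyHostToDevice));
        CHECK(cudaEventRecord(start));
        for (int i = 0; i < reps; ++i)
            CHECK(cudaMemcpy(devBuf, hostBuf, bytes, cudaMemcpyHostToDevice));
        CHECK(cudaEventRecord(stop));
        CHECK(cudaEventSynchronize(stop));

        float ms = 0.0f;
        CHECK(cudaEventElapsedTime(&ms, start, stop));
        printf("%12zu  %10.3f\n", bytes, gbPerSec(bytes * reps, ms));
    }

    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaFree(devBuf));
    CHECK(cudaFreeHost(hostBuf));
    return 0;
}
```

If the reported rate keeps climbing as the size grows, the per-transfer overhead is what limits the small copies.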

What GPU is this? What is the system platform? When you run the bandwidthTest app that comes with CUDA, what is the maximum throughput reported? You can run with --shmoo to get data on the transfer rate at various transfer sizes.

My leading working hypothesis is that your GPU is not plugged into a PCIe gen 3 x16 slot. A close second is the hypothesis that you are using transfers from/to pageable memory, and that your host’s system memory has low throughput (in which case you most definitely would want to switch to the use of pinned host memory).
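
A quick way to check the second hypothesis is to time the same copy from a pageable buffer and from a pinned buffer. Below is a minimal sketch, assuming the 196608-integer array from the question. The repeat count and the helper names CHECK, gbPerSec and timeH2D are made up for illustration. The buffers come from malloc() and cudaMallocHost() respectively.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails (illustrative helper).
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Convert bytes moved and elapsed milliseconds to GB/s (1 GB = 1e9 bytes).
static double gbPerSec(size_t bytes, float ms) {
    return (double)bytes / 1e9 / ((double)ms / 1e3);
}

// Time `reps` host-to-device copies of `bytes` bytes; return the average GB/s.
static double timeH2D(void *dst, const void *src, size_t bytes, int reps) {
    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    CHECK(cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice)); // untimed warm-up
    CHECK(cudaEventRecord(start));
    for (int i = 0; i < reps; ++i)
        CHECK(cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice));
    CHECK(cudaEventRecord(stop));
    CHECK(cudaEventSynchronize(stop));

    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    return gbPerSec(bytes * reps, ms);
}

int main() {
    const size_t n = 196608;               // element count from the question
    const size_t bytes = n * sizeof(int);  // 786432 bytes = 768 KiB
    const int reps = 100;                  // arbitrary repeat count (assumption)

    int *dev = nullptr, *pinned = nullptr;
    CHECK(cudaMalloc((void **)&dev, bytes));
    CHECK(cudaMallocHost((void **)&pinned, bytes)); // page-locked host buffer

    int *pageable = (int *)malloc(bytes);           // ordinary pageable buffer
    if (pageable == nullptr) {
        fprintf(stderr, "malloc failed\n");
        return EXIT_FAILURE;
    }
    memset(pageable, 0, bytes);  // touch the pages so they are resident
    memset(pinned, 0, bytes);

    printf("pageable: %.3f GB/s\n", timeH2D(dev, pageable, bytes, reps));
    printf("pinned:   %.3f GB/s\n", timeH2D(dev, pinned, bytes, reps));

    free(pageable);
    CHECK(cudaFreeHost(pinned));
    CHECK(cudaFree(dev));
    return 0;
}
```

If the pinned figure is much higher than the pageable one, the host-side staging copy is the bottleneck. If both are low, the PCIe link itself is the more likely suspect.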

Here is the output from the bandwidthTest app when run with a PCIe gen 2 x16 interface. Using pinned memory, the throughput at maximum transfer size is about 6.5 GB/sec. With a PCIe gen 3 x16 interface it should be about 12 GB/sec.

[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: Quadro K2200
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     6548.1

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     6525.0

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     49548.2

Result = PASS