I am experiencing a dramatic reduction of throughput in a device-to-host data transfer. My (simplified) code context is as follows:
First I use cudaMemcpyAsync to take advantage of streams (the host memory involved was allocated beforehand with cudaHostAlloc), overlapping these four data transfers with four kernels.
After that, I keep executing other kernels (five of them) until all the device work is done, and at the end I perform four data transfers with cudaMemcpy D2H; the host memory involved in this case is not pinned. This is where I run into the problem.
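In outline, the structure is something like this (a simplified sketch only; kernel names, buffer sizes and launch configurations are placeholders, not the real code):

// Simplified sketch of the structure described above.
#include <cstdlib>
#include <cuda_runtime.h>

#define N_STREAMS 4
#define BUF_BYTES ((size_t)64 * 1024 * 1024)   // placeholder buffer size

__global__ void stageOneKernel(float *d, size_t n) { /* placeholder body */ }
__global__ void stageTwoKernel(float *d, size_t n) { /* placeholder body */ }

int main(void)
{
    cudaStream_t streams[N_STREAMS];
    float *h_pinned[N_STREAMS], *h_pageable[N_STREAMS], *d_buf[N_STREAMS];
    size_t n = BUF_BYTES / sizeof(float);

    for (int i = 0; i < N_STREAMS; ++i) {
        cudaStreamCreate(&streams[i]);
        cudaHostAlloc((void **)&h_pinned[i], BUF_BYTES, cudaHostAllocDefault);
        cudaMalloc((void **)&d_buf[i], BUF_BYTES);
        h_pageable[i] = (float *)malloc(BUF_BYTES);   // not pinned
    }

    // (1) four kernels overlapped with four async D2H copies (pinned memory)
    for (int i = 0; i < N_STREAMS; ++i) {
        stageOneKernel<<<256, 256, 0, streams[i]>>>(d_buf[i], n);
        cudaMemcpyAsync(h_pinned[i], d_buf[i], BUF_BYTES,
                        cudaMemcpyDeviceToHost, streams[i]);
    }

    // (2) more kernels (five in the real code) until the device work is done
    for (int i = 0; i < N_STREAMS; ++i)
        stageTwoKernel<<<256, 256, 0, streams[i]>>>(d_buf[i], n);

    // (3) final four cudaMemcpy D2H into pageable memory -- the slow part
    for (int i = 0; i < N_STREAMS; ++i)
        cudaMemcpy(h_pageable[i], d_buf[i], BUF_BYTES,
                   cudaMemcpyDeviceToHost);

    // cleanup omitted
    return 0;
}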
The first transfer is very slow (it drops from 5-6 GB/s to 1.4 GB/s), and the following three increase their throughput to 2.2, 2.7 and 5.8 GB/s (this last one is very fast).
The problem is that whatever advantage I gain from using streams, I lose to this drop in transfer performance, and I do not know why it is happening.
I am using an NVIDIA GTX 480 card with Ubuntu Linux 10.04. Thanks in advance.
What timing methodology are you using to compute the transfer rate? You may be inadvertently including kernel execution time in your measurement of the transfer time.
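For example, something along these lines (an illustrative sketch only; h_dst and d_src are placeholders for your buffers) measures just the copy itself, by draining outstanding GPU work first:

#include <cuda_runtime.h>

// Measures only the D2H copy: synchronize first so pending kernel work
// is not attributed to the transfer. Error checking omitted.
static float measure_d2h_gbps(void *h_dst, const void *d_src, size_t bytes)
{
    cudaDeviceSynchronize();              // flush outstanding kernels/copies

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cudaMemcpy(h_dst, d_src, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return (float)((bytes / (ms * 1.0e-3)) / 1.0e9);   // GB/s
}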
“the host memory involved in this case is not pinned”
but isn’t it expected to see much lower throughput on transfers involving unpinned memory? That memory is pageable, so significant delays will occur when the operating system kernel has to page it in first.
“The first transfer is very slow (from 5~6 GB/s to 1.4 GB/s) and the following three increase their throughput to 2.2, 2.7 and 5.8 GB/s (this last one is very fast)”
That first transfer is very slow even for pageable memory, and I do not think that is the explanation, because the following transfers (also to pageable memory) improve their throughput. Moreover, I reordered the four data transfers and the problem still occurs with the transfer that was originally first.
Are the buffers so large that using pinned host memory buffers is not an option?
If so, try creating two smaller pinned staging buffers (buffers A and B) per CUDA stream. You get the full speed for the D2H copies, but you will have to copy in smaller chunks, e.g. 128 MB chunks.
initiate an async D2H copy to buffer A
synchronize CUDA stream
initiate an async D2H copy to buffer B
perform an H2H memcpy from buffer A into the final (unpinned) destination address
synchronize CUDA stream
perform an H2H memcpy from buffer B into the final (unpinned) destination address
repeat until the whole transfer is done
I am not sure if it is still possible to nicely overlap two or more CUDA streams doing this kind of transfer. The double-buffered transfer logic could also be driven from CUDA stream callbacks (e.g. cudaStreamAddCallback) to keep the complexity of the double buffering away from the main program code.
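Roughly, a single-stream version of the double-buffered loop above could look like this (an untested sketch; the 128 MB chunk size and all names are placeholders, and error checking is omitted):

#include <cstring>
#include <cuda_runtime.h>

#define CHUNK_BYTES ((size_t)128 * 1024 * 1024)   // placeholder chunk size

// Copies `bytes` from device memory to a pageable host buffer via two
// pinned staging buffers, overlapping the H2H memcpy of one chunk with
// the D2H copy of the next.
static void d2h_via_staging(char *dst_pageable, const char *d_src,
                            size_t bytes, cudaStream_t stream)
{
    char *stage[2];                               // pinned staging buffers A and B
    cudaHostAlloc((void **)&stage[0], CHUNK_BYTES, cudaHostAllocDefault);
    cudaHostAlloc((void **)&stage[1], CHUNK_BYTES, cudaHostAllocDefault);

    size_t copied = 0;     // bytes already queued for D2H
    size_t drained = 0;    // bytes already memcpy'd into the pageable destination
    int buf = 0;           // staging buffer receiving the current chunk

    // prime the pipeline: first chunk goes into buffer A
    size_t len = bytes < CHUNK_BYTES ? bytes : CHUNK_BYTES;
    cudaMemcpyAsync(stage[buf], d_src, len, cudaMemcpyDeviceToHost, stream);
    copied = len;

    while (drained < bytes) {
        cudaStreamSynchronize(stream);            // chunk in stage[buf] has arrived

        int ready = buf;                          // buffer we can drain now
        size_t ready_off = drained;
        size_t ready_len = copied - drained;

        // start the next D2H copy into the other buffer, if anything is left
        if (copied < bytes) {
            buf ^= 1;
            len = (bytes - copied < CHUNK_BYTES) ? bytes - copied : CHUNK_BYTES;
            cudaMemcpyAsync(stage[buf], d_src + copied, len,
                            cudaMemcpyDeviceToHost, stream);
            copied += len;
        }

        // this H2H memcpy overlaps with the D2H copy that was just queued
        memcpy(dst_pageable + ready_off, stage[ready], ready_len);
        drained = ready_off + ready_len;
    }

    cudaFreeHost(stage[0]);
    cudaFreeHost(stage[1]);
}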