Drive PX 2: Improve the performance of cudaMemcpy HtoD

Hi All,
I am currently stuck on a host-to-device memory transfer performance bottleneck in a computer vision (CV) algorithm.

My scenario:
The CV algorithm works on a single-channel image of double data type with a resolution of 1152x640. A considerable part of the algorithm has been moved to CUDA. Kernel performance is satisfactory, but the transfer of the image (a 2D double array) from host to device for the kernel operations is the second biggest hotspot in the algorithm. A trimmed nvprof profile report is as follows:

Start - Duration - Size - Throughput - Device - Context - Stream - Name
777.36ms - 5.5329ms - 5.1534MB - 931.41MB/s - GP106 (0) - 1 - 14 - [CUDA memcpy HtoD]
778.48ms - 5.4373ms - 5.0000MB - 919.57MB/s - GP106 (0) - 1 - 17 - [CUDA memcpy HtoD]
865.49ms - 3.2149ms - 5.0000MB - 1.5188GB/s - GP106 (0) - 1 - 17 - [CUDA memcpy HtoD]
886.60ms - 3.3863ms - 5.1534MB - 1.4862GB/s - GP106 (0) - 1 - 14 - [CUDA memcpy HtoD]
1.11558s - 3.2151ms - 5.0000MB - 1.5187GB/s - GP106 (0) - 1 - 17 - [CUDA memcpy HtoD]
1.13646s - 3.3834ms - 5.1534MB - 1.4875GB/s - GP106 (0) - 1 - 14 - [CUDA memcpy HtoD]
1.27230s - 3.2139ms - 5.0000MB - 1.5193GB/s - GP106 (0) - 1 - 17 - [CUDA memcpy HtoD]
1.29281s - 3.3815ms - 5.1534MB - 1.4883GB/s - GP106 (0) - 1 - 14 - [CUDA memcpy HtoD]

I am copying around 5 MB of data from host to device for every frame processed, at a bandwidth of about 1.5 GB/s.
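For reference, the per-copy bandwidth can also be checked outside the application with a small standalone test along these lines (a minimal sketch, not taken from my code; the buffer size and variable names are only illustrative):

    // Minimal standalone sketch: time one pinned ~5 MB HtoD copy with CUDA events.
    #include <cuda_runtime.h>
    #include <cstdio>

    int main() {
        const size_t bytes = 5 * 1024 * 1024;   // ~5 MB, roughly one frame

        double *h_buf = nullptr, *d_buf = nullptr;
        cudaMallocHost((void **)&h_buf, bytes); // pinned host memory
        cudaMalloc((void **)&d_buf, bytes);

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        // Time a single HtoD copy
        cudaEventRecord(start);
        cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("HtoD: %.3f ms, %.2f GB/s\n", ms, bytes / (ms * 1.0e6));

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(d_buf);
        cudaFreeHost(h_buf);
        return 0;
    }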

Hardware details:
Board Series : Drive PX 2
Board Configuration : AutoChauffeur
GPU type used for CUDA : Discrete GPU
Device used for CUDA : GP106 (id: 0)

Code implementation details (a minimal sketch of this setup follows the list):
Device memory allocation : cudaMalloc()
Host memory allocation : cudaMallocHost()
Copy mechanism : cudaMemcpy2DAsync() with a non-default stream
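To make the setup concrete, here is a stripped-down sketch of the allocation and copy path described above (the error-check macro and variable names are illustrative; the real application does the CV work around this):

    #include <cuda_runtime.h>
    #include <cstdio>
    #include <cstdlib>

    #define CHECK(call)                                                      \
        do {                                                                 \
            cudaError_t err_ = (call);                                       \
            if (err_ != cudaSuccess) {                                       \
                fprintf(stderr, "CUDA error %s at %s:%d\n",                  \
                        cudaGetErrorString(err_), __FILE__, __LINE__);       \
                exit(EXIT_FAILURE);                                          \
            }                                                                \
        } while (0)

    int main() {
        const int width = 1152, height = 640;
        const size_t rowBytes = width * sizeof(double);

        // Host frame buffer: pinned allocation via cudaMallocHost()
        double *h_img = nullptr;
        CHECK(cudaMallocHost((void **)&h_img, rowBytes * height));

        // Device frame buffer: cudaMalloc(), so source and destination
        // pitch are both equal to the row width in bytes
        double *d_img = nullptr;
        CHECK(cudaMalloc((void **)&d_img, rowBytes * height));

        // Non-default stream used for the asynchronous 2D copy
        cudaStream_t stream;
        CHECK(cudaStreamCreate(&stream));

        // Per-frame HtoD transfer, as seen in the nvprof trace
        CHECK(cudaMemcpy2DAsync(d_img, rowBytes,   // dst, dst pitch
                                h_img, rowBytes,   // src, src pitch
                                rowBytes, height,  // row width in bytes, rows
                                cudaMemcpyHostToDevice, stream));
        CHECK(cudaStreamSynchronize(stream));      // kernels would run here instead

        CHECK(cudaStreamDestroy(stream));
        CHECK(cudaFree(d_img));
        CHECK(cudaFreeHost(h_img));
        return 0;
    }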

I would like some insights on how to improve the bandwidth and the overall performance of this copy.
Thanks in advance.