Drive PX 2: Improve the performance of cudamemcpy HtoD

Hi All,
Currently I am stuck on a host-to-device memory transfer performance bottleneck in a computer vision (CV) algorithm.

My scenario:
The CV algorithm works on a single-channel image of double data type with a resolution of 1152x640. A considerable part of the algorithm has been moved to CUDA. Kernel performance is satisfactory, but the memory transfer of the image (a 2D double array) from host to device for the kernel operations is the second biggest optimization hotspot in the algorithm. The trimmed nvprof profile report is as follows:

Start - Duration - Size - Throughput - Device - Context - Stream - Name
777.36ms - 5.5329ms - 5.1534MB - 931.41MB/s - GP106 (0) - 1 - 14 - [CUDA memcpy HtoD]
778.48ms - 5.4373ms - 5.0000MB - 919.57MB/s - GP106 (0) - 1 - 17 - [CUDA memcpy HtoD]
865.49ms - 3.2149ms - 5.0000MB - 1.5188GB/s - GP106 (0) - 1 - 17 - [CUDA memcpy HtoD]
886.60ms - 3.3863ms - 5.1534MB - 1.4862GB/s - GP106 (0) - 1 - 14 - [CUDA memcpy HtoD]
1.11558s - 3.2151ms - 5.0000MB - 1.5187GB/s - GP106 (0) - 1 - 17 - [CUDA memcpy HtoD]
1.13646s - 3.3834ms - 5.1534MB - 1.4875GB/s - GP106 (0) - 1 - 14 - [CUDA memcpy HtoD]
1.27230s - 3.2139ms - 5.0000MB - 1.5193GB/s - GP106 (0) - 1 - 17 - [CUDA memcpy HtoD]
1.29281s - 3.3815ms - 5.1534MB - 1.4883GB/s - GP106 (0) - 1 - 14 - [CUDA memcpy HtoD]

I am copying around 5 MB of data from host to device for every frame processed, at a bandwidth of about 1.5 GB/s.

Hardware details:
Board Series : Drive PX 2
Board Configuration : AutoChauffeur
GPU type used for CUDA : Discrete GPU
Device used for CUDA : GP106 (id: 0)

Code Implementation details:
Device memory allocation : cudaMalloc()
Host memory allocation : cudaMallocHost()
Copy mechanism : cudaMemcpy2DAsync() with a non-default stream

I would like some insights into improving the bandwidth and performance of this copy.
Thanks in advance.

You’re probably better off asking Drive-related questions in the Drive forum:

https://devtalk.nvidia.com/default/board/182/drive-platforms/

You might get better and faster responses in the sub-forum dedicated to the Drive hardware platform:

https://devtalk.nvidia.com/default/board/198/drive-hardware/

I know nothing about the PX2. The throughput numbers look low for a PCIe interface. I thought the PX2 might use a unified physical memory like the Tegra systems, in which case it would be best to avoid the copy entirely.

If you cannot avoid the copy and cannot compress the copied data, the best strategy is to copy data in the biggest chunks possible, as overall throughput improves with transfer size due to unavoidable fixed overhead. For a 2D array this means that it should be stored as one contiguous block of data, so you can use a 1D copy to transfer it. The sort of strided copies performed by cudaMemcpy2D() can be quite slow, often just one tenth the throughput of a contiguous copy.

Drive PX2 includes both Tegra SoC processors (which have embedded GPUs, just like Jetson) and separate discrete GPUs (GP106 in this case). There is a PCIe interface between the Tegra processors and the discrete GPUs. AFAIK it is not a x16 Gen3 PCIe interface.

https://devtalk.nvidia.com/default/topic/1017006/general/using-all-processors-on-autochauffeur-at-the-same-time/post/5180610/#5180610

The connection path appears to be a x4 link per dGPU. The PEX 8724 is a Gen3 switch, and the ports on the dGPUs (GP106) are definitely Gen3 ports, so I’m guessing that the Tegra end of the x4 link may only be Gen2-capable. Gen2 x4 would line up with the reported 1.5 GB/s rate. Alternatively, it may be Gen3 but limited due to the use of non-pinned buffers for the transfer, although the OP indicates use of cudaMallocHost().

Are you sure that you need the numerical precision delivered by the ‘double’ datatype (64-bit floating-point)? In my experience, in computer vision that is very rarely the case. The precision delivered by 32-bit floats (‘float’) or maybe even 16-bit floats (‘half’) might be enough for your computations. That would mean that only half (or a quarter) of the data has to be transferred.

Thanks Bob, Juffa and Hannes for your replies.

@Bob,
I have cloned the ticket in the Drive Platforms forum at https://devtalk.nvidia.com/default/topic/1030571/general/drive-px-2-improve-the-performance-of-cudamemcpy-htod/

When I tested with bandwidthTest from the CUDA sample applications, I got the following result:

[CUDA Bandwidth Test] - Starting…
Running on…
Device 0: GP106
Quick Mode
Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 1598.0
Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 1670.0
Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 64197.8
Result = PASS
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

@Juffa,

  1. I am required to use the discrete GPU to run my CV algorithm, which is a small part of our application. I believe the discrete GPU has no unified memory support: when I tried to allocate unified memory, the Drive PX cursed me with error code 71 [Not Supported].
  2. I have also tried cudaMemcpy() instead of cudaMemcpy2D(), but there is no difference in the performance numbers, and the profile report shows the same bandwidth results.

@Hannes,
In our application, the modules downstream of mine require the data in double precision to achieve the maximum accuracy level, so I can’t change the data type.

The output from bandwidthTest appears to confirm what txbob hypothesized in #4: The interface to the dGPU is a PCIe gen2 x4 or equivalent, and thus limited to 1600 MB/sec throughput. In other words, your app already performs close to the hardware limit as far as the copies are concerned.

Note that the GP106 has low computational throughput for double-precision computation. I see peak throughput of about 110 GFLOPS on a GP106-based Quadro P2000. My best guess is that the GP106 in the PX2 is probably slower due to power limitations. So if double precision is required for this use case, you shouldn’t expect blazingly fast performance for that portion either.

Given that this is a computer vision application, I wonder what kind of sensor would deliver images with more resolution than can be represented in single precision. Any pointers appreciated, if you are allowed to reveal it.