Performance of Drive PX2 in comparison to Titan Xp | need help

We are running a trial with TensorFlow-GPU 1.9, TensorFlow object detection, Python 3, and CUDA 9.0 on Ubuntu 16.04.

  1. The test gave good results with Faster R-CNN, about 16 FPS, on a system with an NVIDIA Titan Xp.

  2. The same test was run on the Drive PX2, but performance is poor: it gives less than 2 FPS.

We need help improving performance on the Drive PX2.

Dear RatheeshR,
Could you share the Titan Xp configuration details (the output of the CUDA deviceQuery sample)? Also, you can use TensorRT to further optimize your TensorFlow model on the Drive PX2.

Titan Xp configuration details (output of the CUDA deviceQuery sample):
(Python program not started)

sudo ./deviceQuery
./deviceQuery Starting…

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: “TITAN Xp”
CUDA Driver Version / Runtime Version 9.0 / 9.0
CUDA Capability Major/Minor version number: 6.1
Total amount of global memory: 12190 MBytes (12782075904 bytes)
(30) Multiprocessors, (128) CUDA Cores/MP: 3840 CUDA Cores
GPU Max Clock rate: 1582 MHz (1.58 GHz)
Memory Clock rate: 5705 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 3145728 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.0, CUDA Runtime Version = 9.0, NumDevs = 1
Result = PASS

PX2 Tegra A configuration details (output of the CUDA deviceQuery sample):

./deviceQuery Starting…

CUDA Device Query (Runtime API) version (CUDART static linking)

nvrm_gpu: Bug 200215060 workaround enabled.

Detected 2 CUDA Capable device(s)

Device 0: “Graphics Device”
CUDA Driver Version / Runtime Version 9.0 / 9.0
CUDA Capability Major/Minor version number: 6.1
Total amount of global memory: 3840 MBytes (4026466304 bytes)
( 9) Multiprocessors, (128) CUDA Cores/MP: 1152 CUDA Cores
GPU Max Clock rate: 1290 MHz (1.29 GHz)
Memory Clock rate: 3003 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 1048576 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 4 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 1: “NVIDIA Tegra X2”
CUDA Driver Version / Runtime Version 9.0 / 9.0
CUDA Capability Major/Minor version number: 6.2
Total amount of global memory: 6668 MBytes (6991458304 bytes)
( 2) Multiprocessors, (128) CUDA Cores/MP: 256 CUDA Cores
GPU Max Clock rate: 1275 MHz (1.27 GHz)
Memory Clock rate: 1600 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 524288 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: Yes
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 0 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Peer access from Graphics Device (GPU0) -> NVIDIA Tegra X2 (GPU1) : No
Peer access from NVIDIA Tegra X2 (GPU1) -> Graphics Device (GPU0) : No

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.0, CUDA Runtime Version = 9.0, NumDevs = 2
Result = PASS


Please share the steps for using TensorRT (I am new to TensorRT).

Dear RatheeshR,
The comparison of a single dGPU on the Drive PX2 with a Titan Xp is not a fair one. Your deviceQuery output shows that the Titan Xp has more SMs (30 versus 9) and a higher GPU clock (1582 MHz versus 1290 MHz) than a single dGPU on the Drive PX2.
Note that a single Titan Xp has about 12 TFLOPS of computational power, whereas the whole Drive PX2 has ~8 TFLOPS (this includes two iGPUs and two dGPUs).
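
As a rough cross-check of these figures, the theoretical FP32 peak can be estimated from the deviceQuery numbers posted above as CUDA cores x clock x 2 (one fused multiply-add per core per cycle). This is only a back-of-the-envelope sketch: the function name is made up for illustration, and it assumes the reported maximum clock is sustained, which real workloads rarely achieve.

```python
def peak_fp32_tflops(cuda_cores, clock_mhz):
    """Theoretical FP32 peak in TFLOPS, assuming each CUDA core retires
    one fused multiply-add (2 floating-point operations) per clock."""
    return cuda_cores * clock_mhz * 1e6 * 2 / 1e12

# (CUDA cores, GPU max clock in MHz) copied from the deviceQuery output
# in this thread. The dictionary keys are just labels.
devices = {
    "TITAN Xp": (3840, 1582),
    "PX2 dGPU (Graphics Device)": (1152, 1290),
    "PX2 iGPU (NVIDIA Tegra X2)": (256, 1275),
}

peaks = {name: peak_fp32_tflops(*spec) for name, spec in devices.items()}
for name, tflops in peaks.items():
    print("{}: {:.2f} TFLOPS".format(name, tflops))

# How much larger the Titan Xp peak is than one PX2 dGPU.
ratio = peaks["TITAN Xp"] / peaks["PX2 dGPU (Graphics Device)"]
print("Titan Xp / single PX2 dGPU: {:.1f}x".format(ratio))
```

By this estimate the Titan Xp has roughly four times the peak throughput of a single PX2 dGPU. The gap you measured (16 FPS versus under 2 FPS) is larger than that, so raw compute is probably not the only factor; the narrower, slower memory interface and CPU-side pre- and post-processing may also contribute.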

For details on TensorRT: Please look at https://docs.nvidia.com/deeplearning/sdk/tensorrt-developer-guide/index.html
For the TensorFlow workflow: https://docs.nvidia.com/deeplearning/sdk/tensorrt-developer-guide/index.html#working_tf
Faster RCNN sample: https://docs.nvidia.com/deeplearning/sdk/tensorrt-sample-support-guide/index.html#fasterrcnn_sample
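
Since deviceQuery on the PX2 reports two CUDA devices, it may also be worth confirming that TensorFlow runs on the dGPU and not on the much smaller Tegra X2 iGPU. One way, sketched below, is to set the standard CUDA_VISIBLE_DEVICES environment variable before TensorFlow is imported. The helper name is hypothetical, and index 0 is assumed to be the dGPU only because that is how deviceQuery listed it above; please verify on your own board.

```python
import os

def pin_cuda_device(index, env=None):
    """Expose only one CUDA device index to this process.

    Has to run before TensorFlow (or any other CUDA library) is
    imported, because the variable is read when CUDA initializes.
    """
    env = os.environ if env is None else env
    env["CUDA_VISIBLE_DEVICES"] = str(index)
    return env["CUDA_VISIBLE_DEVICES"]

# Assumption: index 0 is the dGPU ("Graphics Device"), matching the
# deviceQuery listing posted earlier in this thread.
pin_cuda_device(0)

# import tensorflow as tf  # import only after the variable is set
```

With the variable set, the process sees only the selected device, so the object detection graph cannot be placed on the iGPU.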

Thank you
