Drive PX 2 inference performance


I’ve been migrating our Smart Vehicle CNN inference from a standard PC with a 1080 Ti GPU card to a Drive PX 2 AutoChauffeur.

I’ve observed a significant increase in inference time (the Drive PX 2 is about 3x slower than the 1080 Ti).

Further investigation using giexec with the GoogLeNet model from the TensorRT 2.1 package (/usr/src/tensorrt/data/googlenet) seems to confirm these observations:

./giexec --model=…/data/googlenet/googlenet.caffemodel --deploy=…/data/googlenet/googlenet.prototxt --output=prob --batch=16 --device=0

On the PC with the 1080 Ti the average time is 7.84 ms; on the Drive PX 2 it is 31.8 ms.
If I enable --half2 mode and use --device=1, giexec reports an inference time of 60.58 ms.

Batch size 1 gives 1.22 ms for the 1080 Ti, 3.26 ms for Drive PX 2 device 0, and 10.52 ms for Drive PX 2 device 1 with half2 mode.
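
To put these timings on a common footing, they can be converted into throughput (images per second) and slowdown ratios. The following is a minimal Python sketch that uses only the numbers reported above; the dictionary layout, the labels, and the function name are illustrative choices, not part of giexec.

```python
# Average giexec times in milliseconds, copied from the measurements above.
# Keys are (platform label, batch size); the labels are illustrative only.
timings_ms = {
    ("1080 Ti", 16): 7.84,
    ("Drive PX 2 device 0", 16): 31.8,
    ("Drive PX 2 device 1, half2", 16): 60.58,
    ("1080 Ti", 1): 1.22,
    ("Drive PX 2 device 0", 1): 3.26,
    ("Drive PX 2 device 1, half2", 1): 10.52,
}


def images_per_second(batch_size, avg_ms):
    """Throughput implied by one batch taking avg_ms milliseconds."""
    return batch_size / (avg_ms / 1000.0)


for batch in (16, 1):
    baseline_ms = timings_ms[("1080 Ti", batch)]
    for (platform, b), ms in timings_ms.items():
        if b != batch:
            continue
        print(f"batch {batch:2d} | {platform:28s} | "
              f"{images_per_second(batch, ms):8.1f} img/s | "
              f"{ms / baseline_ms:5.2f}x the 1080 Ti time")
```

By these numbers, Drive PX 2 device 0 takes roughly 4x as long as the 1080 Ti at batch size 16 and roughly 2.7x as long at batch size 1.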

I have several questions:

  1. Is this the expected result?
  2. Where can I find information about how many CUDA cores each Tegra GPU and the additional dGPU have?
  3. Can I directly compare the CUDA cores on the Drive PX 2 Tegra to the CUDA cores on a 1080 Ti graphics card?
  4. Which factor (cores, GFLOPS?) can be used to compare the Tegra dGPU to the 1080 Ti GPU?
  5. Is it possible to use multiple Tegras or a GPU-dGPU combination to speed up inference?


Dear KoMoR,

Could you please check the DPX2 iGPU and dGPU with the commands below?

$gedit ~/.bashrc
export PATH=/usr/local/cuda-8.0/bin/:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/targets/aarch64-linux/lib:$LD_LIBRARY_PATH

$ source ~/.bashrc
$ nvcc --version

$ /usr/local/cuda-8.0/bin/ ~/

$ cd ~/NVIDIA_CUDA8.0_Samples/1_Utilities/deviceQuery
$ make
$ ./deviceQuery

To use the iGPU, please set the following:

export CUDA_VISIBLE_DEVICES=1

You can use GFlops as a measure to compare the performance of any two GPUs.
Can you please set the following environment variable and see if it improves the performance?

export CUDA_FORCE_PTX_JIT=1

Hope that helps


After setting export CUDA_FORCE_PTX_JIT=1, giexec starts but does not begin processing (it seems to hang).

Could you tell me how I can obtain the GFlops for the Drive PX 2 GPU cores? Is there a manual or datasheet describing the architecture of the GPU and dGPU in the Drive PX 2 (technical, not marketing material)?

I ran deviceQuery after setting CUDA_VISIBLE_DEVICES to 1:

./deviceQuery Starting…

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: “GP10B”
CUDA Driver Version / Runtime Version 8.0 / 8.0
CUDA Capability Major/Minor version number: 6.2
Total amount of global memory: 6660 MBytes (6983639040 bytes)
( 2) Multiprocessors, (128) CUDA Cores/MP: 256 CUDA Cores
GPU Max Clock rate: 1275 MHz (1.27 GHz)
Memory Clock rate: 1600 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 524288 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: Yes
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 0 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = GP10B
Result = PASS
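
For comparing devices, the relevant fields (multiprocessor count, cores per multiprocessor, and maximum clock) can be pulled out of this output programmatically. Below is a small Python sketch; the function name and the regular expressions are my own assumptions, written against the two lines as they appear above, and may need adjusting for other deviceQuery versions.

```python
import re

# Two lines copied from the deviceQuery output above.
sample_output = """\
( 2) Multiprocessors, (128) CUDA Cores/MP: 256 CUDA Cores
GPU Max Clock rate: 1275 MHz (1.27 GHz)
"""


def parse_device_query(text):
    """Return SM count, cores per SM, and max clock in MHz from deviceQuery text."""
    sm = re.search(
        r"\(\s*(\d+)\)\s*Multiprocessors,\s*\(\s*(\d+)\)\s*CUDA Cores/MP", text
    )
    clock = re.search(r"GPU Max Clock rate:\s*(\d+)\s*MHz", text)
    if sm is None or clock is None:
        raise ValueError("expected deviceQuery fields not found")
    return {
        "sm_count": int(sm.group(1)),
        "cores_per_sm": int(sm.group(2)),
        "max_clock_mhz": int(clock.group(1)),
    }


info = parse_device_query(sample_output)
print(info)
print("total CUDA cores:", info["sm_count"] * info["cores_per_sm"])
```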

Thanks for your assistance

Dear KoMoR,
You can calculate the Flops of a GPU as follows:
Flops = number of SMs * number of cores per SM * GPU clock speed * operations per cycle.

You can assume 2 operations per cycle (one fused multiply-add counts as two floating-point operations).
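
As a worked example of this formula, here is a short Python sketch. The GP10B values come from the deviceQuery output posted earlier in the thread. The GTX 1080 Ti values (28 SMs, 128 cores per SM, 1582 MHz boost clock) are an assumption taken from its public specifications, not from this thread, and the function name is illustrative.

```python
def peak_gflops(sm_count, cores_per_sm, clock_mhz, ops_per_cycle=2):
    """Theoretical peak GFLOPS = SMs * cores per SM * clock (GHz) * ops per cycle."""
    return sm_count * cores_per_sm * (clock_mhz / 1000.0) * ops_per_cycle


# GP10B (Drive PX 2 iGPU): values from the deviceQuery output in this thread.
igpu_gflops = peak_gflops(sm_count=2, cores_per_sm=128, clock_mhz=1275)

# GTX 1080 Ti: ASSUMED from public specifications, not measured here.
gtx_gflops = peak_gflops(sm_count=28, cores_per_sm=128, clock_mhz=1582)

print(f"GP10B peak:       {igpu_gflops:9.1f} GFLOPS")
print(f"GTX 1080 Ti peak: {gtx_gflops:9.1f} GFLOPS")
print(f"ratio:            {gtx_gflops / igpu_gflops:9.1f}x")
```

This gives a theoretical single-precision peak only. Measured inference time also depends on other factors such as memory bandwidth, so the ratio is a rough guide rather than a prediction.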

Hope this helps.