Drive PX 2 inference performance

Hello

I’ve been migrating our Smart Vehicle CNN inference from a standard PC with a GTX 1080 Ti GPU to the Drive PX 2 AutoChauffeur.

I’ve observed a significant increase in inference time (the Drive PX 2 is about 3x slower than the 1080 Ti).

Further investigation using giexec with GoogLeNet from the TensorRT 2.1 package (/usr/src/tensorrt/data/googlenet) seems to confirm these observations:

./giexec --model=…/data/googlenet/googlenet.caffemodel --deploy=…/data/googlenet/googlenet.prototxt --output=prob --batch=16 --device=0

On the PC with the 1080 Ti the average time is 7.84 ms; on the Drive PX 2 it is 31.8 ms.
If I enable --half2 mode and use --device=1, giexec reports an inference time of 60.58 ms.
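(For reference, that half2 run corresponds to a command roughly like the following, reusing the same model paths as above:
./giexec --model=…/data/googlenet/googlenet.caffemodel --deploy=…/data/googlenet/googlenet.prototxt --output=prob --batch=16 --device=1 --half2 )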

Batch size 1 gives 1.22 ms on the 1080 Ti, 3.26 ms on the Drive PX 2 (device 0), and 10.52 ms on the Drive PX 2 (device 1, half2 mode).

I have several questions:

  1. Is this the expected result?
  2. Could you please point me to information about how many CUDA cores the Tegra GPU and the additional dGPU each have?
  3. Can I directly compare the CUDA cores of the Drive PX 2 Tegra to the CUDA cores of the 1080 Ti graphics card?
  4. Which factor (cores, GFLOPS?) can be used to compare the Tegra dGPU to the 1080 Ti GPU?
  5. Is it possible to use multiple Tegras, or an iGPU + dGPU combination, to speed up inference?

Thanks

Dear KoMoR,

Could you please check the DPX2 iGPU and dGPU with the commands below?

==============================
$ gedit ~/.bashrc
(add the following two lines to the file)
export PATH=/usr/local/cuda-8.0/bin/:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/targets/aarch64-linux/lib:$LD_LIBRARY_PATH

$ source ~/.bashrc
$ nvcc --version

$ /usr/local/cuda-8.0/bin/cuda-install-samples-8.0.sh ~/

$ cd ~/NVIDIA_CUDA-8.0_Samples/1_Utilities/deviceQuery
$ make
$ ./deviceQuery

To use the iGPU, please set the following:
$ export CUDA_VISIBLE_DEVICES=1
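
If it is more convenient, the same device information can also be read programmatically through the CUDA runtime API instead of building the deviceQuery sample. A minimal sketch (the file name listgpus.cu is just an example; compile with nvcc listgpus.cu -o listgpus and run it with and without CUDA_VISIBLE_DEVICES set):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);  // respects CUDA_VISIBLE_DEVICES
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        // multiProcessorCount = number of SMs; clockRate is reported in kHz
        printf("Device %d: %s (CC %d.%d), %d SMs, %d kHz GPU clock, integrated=%d\n",
               i, prop.name, prop.major, prop.minor,
               prop.multiProcessorCount, prop.clockRate, prop.integrated);
    }
    return 0;
}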

You can use GFLOPS as a measure to compare the performance of any two GPUs.
Could you also set the following environment variable and see if it improves the performance?
export CUDA_FORCE_PTX_JIT=1

Hope that helps

Hello

After setting export CUDA_FORCE_PTX_JIT=1, giexec starts but never begins processing (it seems to hang).

Could you tell me how I can obtain the GFLOPS figure for the Drive PX 2 GPU cores? Is there any manual/datasheet describing the architecture of the GPU and dGPU in the Drive PX 2 (technical, not marketing, material)?

I’ve run deviceQuery after setting CUDA_VISIBLE_DEVICES to 1:

./deviceQuery Starting…

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: “GP10B”
CUDA Driver Version / Runtime Version 8.0 / 8.0
CUDA Capability Major/Minor version number: 6.2
Total amount of global memory: 6660 MBytes (6983639040 bytes)
( 2) Multiprocessors, (128) CUDA Cores/MP: 256 CUDA Cores
GPU Max Clock rate: 1275 MHz (1.27 GHz)
Memory Clock rate: 1600 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 524288 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: Yes
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 0 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = GP10B
Result = PASS

Thanks for your assistance

Dear KoMoR,
You can calculate the FLOPS for a GPU as follows:
FLOPS = number of SMs * number of cores per SM * GPU clock speed * operations per cycle

You can assume 2 operations per cycle (a fused multiply-add counts as two floating-point operations).
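
For example, plugging the deviceQuery numbers above for the iGPU into this formula (2 SMs, 128 CUDA cores per SM, 1275 MHz max clock) gives a rough FP32 peak estimate:

2 * 128 * 1.275 GHz * 2 ops/cycle ≈ 652.8 GFLOPS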

Hope this helps.