PyTorch Tiny YOLOv3 performance results

Hi,
I’m running a simple detection on an image using PyTorch 0.4, which was compiled on the TX2. A single-image detection takes ~1.5 seconds. This is the command being executed:

python3 detect.py yolov3-tiny.cfg yolov3-tiny.weights data/dog.jpg data/coco.names
layer     filters    size              input                output
    0 conv     16  3 x 3 / 1   416 x 416 x   3   ->   416 x 416 x  16
    1 max          2 x 2 / 2   416 x 416 x  16   ->   208 x 208 x  16
    2 conv     32  3 x 3 / 1   208 x 208 x  16   ->   208 x 208 x  32
    3 max          2 x 2 / 2   208 x 208 x  32   ->   104 x 104 x  32
    4 conv     64  3 x 3 / 1   104 x 104 x  32   ->   104 x 104 x  64
    5 max          2 x 2 / 2   104 x 104 x  64   ->    52 x  52 x  64
    6 conv    128  3 x 3 / 1    52 x  52 x  64   ->    52 x  52 x 128
    7 max          2 x 2 / 2    52 x  52 x 128   ->    26 x  26 x 128
    8 conv    256  3 x 3 / 1    26 x  26 x 128   ->    26 x  26 x 256
    9 max          2 x 2 / 2    26 x  26 x 256   ->    13 x  13 x 256
   10 conv    512  3 x 3 / 1    13 x  13 x 256   ->    13 x  13 x 512
   11 max          2 x 2 / 1    13 x  13 x 512   ->    13 x  13 x 512
   12 conv   1024  3 x 3 / 1    13 x  13 x 512   ->    13 x  13 x1024
   13 conv    256  1 x 1 / 1    13 x  13 x1024   ->    13 x  13 x 256
   14 conv    512  3 x 3 / 1    13 x  13 x 256   ->    13 x  13 x 512
   15 conv    255  1 x 1 / 1    13 x  13 x 512   ->    13 x  13 x 255
   16 detection
   17 route  13
   18 conv    128  1 x 1 / 1    13 x  13 x 256   ->    13 x  13 x 128
   19 upsample           * 2    13 x  13 x 128   ->    26 x  26 x 128
   20 route  19 8
   21 conv    256  3 x 3 / 1    26 x  26 x 384   ->    26 x  26 x 256
   22 conv    255  1 x 1 / 1    26 x  26 x 256   ->    26 x  26 x 255
   23 detection
Loading weights from ran/yolov3-tiny.weights... Done!
data/dog.jpg: Predicted in 1.477206 seconds.
4 box(es) is(are) found
car: 0.926820
car: 0.771720
dog: 0.999927
bicycle: 0.999965
save plot results to predictions.jpg

tegrastats output

RAM 3402/7846MB (lfb 10x4MB) CPU [10%@2034,off,off,10%@2034,69%@2035,15%@2034] EMC_FREQ 2%@1600 GR3D_FREQ 0%@1122 APE 150 BCPU@33C MCPU@33C GPU@31C PLL@33C Tboard@27C Tdiode@28.5C PMIC@100C thermal@32.2C VDD_IN 4159/3324 VDD_CPU 1068/339 VDD_GPU 152/152 VDD_SOC 763/690 VDD_WIFI 19/19 VDD_DDR 900/859
RAM 3636/7846MB (lfb 10x4MB) CPU [1%@2035,off,off,0%@2035,26%@2035,72%@2034] EMC_FREQ 3%@1600 GR3D_FREQ 4%@1122 APE 150 BCPU@33C MCPU@33C GPU@31C PLL@33C Tboard@27C Tdiode@28.75C PMIC@100C thermal@31.9C VDD_IN 4273/3365 VDD_CPU 992/368 VDD_GPU 228/155 VDD_SOC 763/693 VDD_WIFI 19/19 VDD_DDR 979/864
RAM 3904/7846MB (lfb 10x4MB) CPU [0%@2033,off,off,0%@2035,0%@2034,100%@2036] EMC_FREQ 3%@1600 GR3D_FREQ 6%@1122 APE 150 BCPU@33C MCPU@33C GPU@31C PLL@33C Tboard@27C Tdiode@28.75C PMIC@100C thermal@32.4C VDD_IN 4312/3404 VDD_CPU 992/394 VDD_GPU 228/158 VDD_SOC 763/696 VDD_WIFI 19/19 VDD_DDR 996/869
RAM 4101/7846MB (lfb 10x4MB) CPU [0%@2033,off,off,1%@2036,0%@2035,99%@2034] EMC_FREQ 4%@1600 GR3D_FREQ 10%@1122 APE 150 BCPU@33C MCPU@33C GPU@31.5C PLL@33C Tboard@27C Tdiode@28.75C PMIC@100C thermal@32.4C VDD_IN 4312/3441 VDD_CPU 992/417 VDD_GPU 228/161 VDD_SOC 763/699 VDD_WIFI 19/19 VDD_DDR 1017/875
RAM 4397/7846MB (lfb 10x4MB) CPU [0%@1998,off,off,0%@1999,0%@1998,99%@2000] EMC_FREQ 4%@1600 GR3D_FREQ 2%@1122 APE 150 BCPU@33C MCPU@33C GPU@31C PLL@33C Tboard@27C Tdiode@28.75C PMIC@100C thermal@32.4C VDD_IN 4350/3475 VDD_CPU 992/440 VDD_GPU 228/163 VDD_SOC 763/701 VDD_WIFI 19/19 VDD_DDR 1015/881
RAM 3367/7846MB (lfb 10x4MB) CPU [17%@1998,off,off,0%@1997,0%@1999,58%@1998] EMC_FREQ 5%@1600 GR3D_FREQ 0%@1122 APE 150 BCPU@32.5C MCPU@32.5C GPU@31C PLL@32.5C Tboard@27C Tdiode@28.75C PMIC@100C thermal@32.2C VDD_IN 3930/3492 VDD_CPU 687/449 VDD_GPU 228/166 VDD_SOC 763/703 VDD_WIFI 19/19 VDD_DDR 996/885
RAM 3367/7846MB (lfb 10x4MB) CPU [0%@2035,off,off,0%@2034,0%@2035,0%@2035] EMC_FREQ 3%@1600 GR3D_FREQ 0%@1122 APE 150 BCPU@32.5C MCPU@32.5C GPU@31C PLL@32.5C Tboard@27C Tdiode@28.5C PMIC@100C thermal@31.9C VDD_IN 3358/3487 VDD_CPU 305/444 VDD_GPU 152/165 VDD_SOC 687/703 VDD_WIFI 19/19 VDD_DDR 864/884
RAM 3367/7846MB (lfb 10x4MB) CPU [0%@2035,off,off,0%@2036,0%@2035,0%@2033] EMC_FREQ 2%@1600 GR3D_FREQ 0%@1122 APE 150 BCPU@32.5C MCPU@32.5C GPU@31.5C PLL@32.5C Tboard@27C Tdiode@28.5C PMIC@100C thermal@31.9C VDD_IN 3283/3480 VDD_CPU 305/439 VDD_GPU 152/165 VDD_SOC 687/702 VDD_WIFI 19/19 VDD_DDR 862/883

Note that jetson_clocks was executed to leverage full GPU power.
CUDA 9 is available and appears to be in use.
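One caveat on my measurement: the ~1.5 s may include one-time costs (CUDA context creation, cuDNN autotuning, memory allocation) if it covers the first forward pass. Here is a sketch of how the steady-state forward time could be isolated; the `time_inference` helper and its parameters are illustrative, not part of detect.py:

```python
import time

def time_inference(fn, warmup=3, runs=10):
    # Warm-up runs: the first forward pass pays one-time costs
    # (CUDA context creation, cuDNN autotuning, allocator growth).
    for _ in range(warmup):
        fn()
    # For GPU runs, torch.cuda.synchronize() should be called before
    # each clock read, since CUDA kernel launches are asynchronous.
    start = time.time()
    for _ in range(runs):
        fn()
    return (time.time() - start) / runs
```

Wrapping the model's forward pass in a lambda and passing it to this helper would give a per-run average that excludes load and warm-up time.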

Questions

  1. Can you tell whether the above results are reasonable?
  2. Is there any additional tuning that can be made to improve performance?
  3. I’d appreciate a performance reference for running YOLO models on the TX1 or TX2.

Thanks a lot!
Tal

Hi Tal,

Thanks for your questions.

1. The results look odd, since the GPU utilization is quite low (GR3D_FREQ stays at 0–10%@1122 in your tegrastats output).
We will check this model on a TX2 and get back to you with more information.
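One common cause of this pattern (one CPU core pegged at ~100% while the GPU sits near idle) is that the model or the input tensor never actually lands on the GPU. A quick check, assuming PyTorch is installed; `model` and `img` below are placeholders for whatever your detect script builds:

```python
import torch

# If this prints False, PyTorch was built or installed without CUDA
# support and the detector silently runs on the CPU.
print("CUDA available:", torch.cuda.is_available())

# Placeholder names: both the network and the input must be moved to
# the GPU, otherwise the forward pass runs entirely on the CPU.
# model = model.cuda()
# img = img.cuda()
```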

2. You can also set the maximum power mode with nvpmodel before running jetson_clocks:

sudo nvpmodel -m 0
sudo jetson_clocks.sh
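To confirm the setting took effect, you can query the current mode and watch utilization while the detector runs (both commands assume a stock JetPack install; mode 0 is MAXN on the TX2):

```shell
# Query the current power mode; expect mode 0 (MAXN) after the step above.
sudo nvpmodel -q
# Watch CPU/GPU utilization live while the detector runs.
sudo tegrastats
```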

3. If a C++ interface is acceptable, here is our YOLO sample for Jetson:
https://github.com/vat-nvidia/deepstream-plugins

Thanks