Hi,
I’m running single-image detection with PyTorch 0.4, which I compiled on a TX2. Detection on a single image takes ~1.5 seconds. This is the command I run, followed by its output:
python3 detect.py yolov3-tiny.cfg yolov3-tiny.weights data/dog.jpg data/coco.names
layer filters size input output
0 conv 16 3 x 3 / 1 416 x 416 x 3 -> 416 x 416 x 16
1 max 2 x 2 / 2 416 x 416 x 16 -> 208 x 208 x 16
2 conv 32 3 x 3 / 1 208 x 208 x 16 -> 208 x 208 x 32
3 max 2 x 2 / 2 208 x 208 x 32 -> 104 x 104 x 32
4 conv 64 3 x 3 / 1 104 x 104 x 32 -> 104 x 104 x 64
5 max 2 x 2 / 2 104 x 104 x 64 -> 52 x 52 x 64
6 conv 128 3 x 3 / 1 52 x 52 x 64 -> 52 x 52 x 128
7 max 2 x 2 / 2 52 x 52 x 128 -> 26 x 26 x 128
8 conv 256 3 x 3 / 1 26 x 26 x 128 -> 26 x 26 x 256
9 max 2 x 2 / 2 26 x 26 x 256 -> 13 x 13 x 256
10 conv 512 3 x 3 / 1 13 x 13 x 256 -> 13 x 13 x 512
11 max 2 x 2 / 1 13 x 13 x 512 -> 13 x 13 x 512
12 conv 1024 3 x 3 / 1 13 x 13 x 512 -> 13 x 13 x1024
13 conv 256 1 x 1 / 1 13 x 13 x1024 -> 13 x 13 x 256
14 conv 512 3 x 3 / 1 13 x 13 x 256 -> 13 x 13 x 512
15 conv 255 1 x 1 / 1 13 x 13 x 512 -> 13 x 13 x 255
16 detection
17 route 13
18 conv 128 1 x 1 / 1 13 x 13 x 256 -> 13 x 13 x 128
19 upsample * 2 13 x 13 x 128 -> 26 x 26 x 128
20 route 19 8
21 conv 256 3 x 3 / 1 26 x 26 x 384 -> 26 x 26 x 256
22 conv 255 1 x 1 / 1 26 x 26 x 256 -> 26 x 26 x 255
23 detection
Loading weights from ran/yolov3-tiny.weights... Done!
data/dog.jpg: Predicted in 1.477206 seconds.
4 box(es) is(are) found
car: 0.926820
car: 0.771720
dog: 0.999927
bicycle: 0.999965
save plot results to predictions.jpg
tegrastats output during the run:
RAM 3402/7846MB (lfb 10x4MB) CPU [10%@2034,off,off,10%@2034,69%@2035,15%@2034] EMC_FREQ 2%@1600 GR3D_FREQ 0%@1122 APE 150 BCPU@33C MCPU@33C GPU@31C PLL@33C Tboard@27C Tdiode@28.5C PMIC@100C thermal@32.2C VDD_IN 4159/3324 VDD_CPU 1068/339 VDD_GPU 152/152 VDD_SOC 763/690 VDD_WIFI 19/19 VDD_DDR 900/859
RAM 3636/7846MB (lfb 10x4MB) CPU [1%@2035,off,off,0%@2035,26%@2035,72%@2034] EMC_FREQ 3%@1600 GR3D_FREQ 4%@1122 APE 150 BCPU@33C MCPU@33C GPU@31C PLL@33C Tboard@27C Tdiode@28.75C PMIC@100C thermal@31.9C VDD_IN 4273/3365 VDD_CPU 992/368 VDD_GPU 228/155 VDD_SOC 763/693 VDD_WIFI 19/19 VDD_DDR 979/864
RAM 3904/7846MB (lfb 10x4MB) CPU [0%@2033,off,off,0%@2035,0%@2034,100%@2036] EMC_FREQ 3%@1600 GR3D_FREQ 6%@1122 APE 150 BCPU@33C MCPU@33C GPU@31C PLL@33C Tboard@27C Tdiode@28.75C PMIC@100C thermal@32.4C VDD_IN 4312/3404 VDD_CPU 992/394 VDD_GPU 228/158 VDD_SOC 763/696 VDD_WIFI 19/19 VDD_DDR 996/869
RAM 4101/7846MB (lfb 10x4MB) CPU [0%@2033,off,off,1%@2036,0%@2035,99%@2034] EMC_FREQ 4%@1600 GR3D_FREQ 10%@1122 APE 150 BCPU@33C MCPU@33C GPU@31.5C PLL@33C Tboard@27C Tdiode@28.75C PMIC@100C thermal@32.4C VDD_IN 4312/3441 VDD_CPU 992/417 VDD_GPU 228/161 VDD_SOC 763/699 VDD_WIFI 19/19 VDD_DDR 1017/875
RAM 4397/7846MB (lfb 10x4MB) CPU [0%@1998,off,off,0%@1999,0%@1998,99%@2000] EMC_FREQ 4%@1600 GR3D_FREQ 2%@1122 APE 150 BCPU@33C MCPU@33C GPU@31C PLL@33C Tboard@27C Tdiode@28.75C PMIC@100C thermal@32.4C VDD_IN 4350/3475 VDD_CPU 992/440 VDD_GPU 228/163 VDD_SOC 763/701 VDD_WIFI 19/19 VDD_DDR 1015/881
RAM 3367/7846MB (lfb 10x4MB) CPU [17%@1998,off,off,0%@1997,0%@1999,58%@1998] EMC_FREQ 5%@1600 GR3D_FREQ 0%@1122 APE 150 BCPU@32.5C MCPU@32.5C GPU@31C PLL@32.5C Tboard@27C Tdiode@28.75C PMIC@100C thermal@32.2C VDD_IN 3930/3492 VDD_CPU 687/449 VDD_GPU 228/166 VDD_SOC 763/703 VDD_WIFI 19/19 VDD_DDR 996/885
RAM 3367/7846MB (lfb 10x4MB) CPU [0%@2035,off,off,0%@2034,0%@2035,0%@2035] EMC_FREQ 3%@1600 GR3D_FREQ 0%@1122 APE 150 BCPU@32.5C MCPU@32.5C GPU@31C PLL@32.5C Tboard@27C Tdiode@28.5C PMIC@100C thermal@31.9C VDD_IN 3358/3487 VDD_CPU 305/444 VDD_GPU 152/165 VDD_SOC 687/703 VDD_WIFI 19/19 VDD_DDR 864/884
RAM 3367/7846MB (lfb 10x4MB) CPU [0%@2035,off,off,0%@2036,0%@2035,0%@2033] EMC_FREQ 2%@1600 GR3D_FREQ 0%@1122 APE 150 BCPU@32.5C MCPU@32.5C GPU@31.5C PLL@32.5C Tboard@27C Tdiode@28.5C PMIC@100C thermal@31.9C VDD_IN 3283/3480 VDD_CPU 305/439 VDD_GPU 152/165 VDD_SOC 687/702 VDD_WIFI 19/19 VDD_DDR 862/883
Note that the jetson_clocks script was run beforehand so that the GPU can run at full clocks.
CUDA 9 is available and appears to be in use.
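To make the tegrastats lines easier to read, here is a minimal sketch that pulls out RAM use, per-core CPU load and GPU load (the GR3D_FREQ percentage). The function names are my own, the regular expressions assume the line format shown in this post, and SAMPLE holds two of the lines above trimmed to the fields being parsed.

```python
import re

# Two tegrastats lines from this post, trimmed to the fields parsed below.
SAMPLE = """\
RAM 3402/7846MB (lfb 10x4MB) CPU [10%@2034,off,off,10%@2034,69%@2035,15%@2034] EMC_FREQ 2%@1600 GR3D_FREQ 0%@1122
RAM 4101/7846MB (lfb 10x4MB) CPU [0%@2033,off,off,1%@2036,0%@2035,99%@2034] EMC_FREQ 4%@1600 GR3D_FREQ 10%@1122
"""


def parse_line(line):
    """Return (ram_used_mb, per-core CPU loads, GPU load %) or None if fields are missing."""
    ram = re.search(r"RAM (\d+)/(\d+)MB", line)
    cpu = re.search(r"CPU \[([^\]]*)\]", line)
    gpu = re.search(r"GR3D_FREQ (\d+)%", line)
    if not (ram and cpu and gpu):
        return None
    # Cores reported as "off" carry no load figure, so they are skipped.
    loads = [int(core.split("%")[0]) for core in cpu.group(1).split(",") if core != "off"]
    return int(ram.group(1)), loads, int(gpu.group(1))


def summarize(text):
    """Report the peak RAM use, peak single-core load and peak GPU load in a log."""
    rows = [row for row in map(parse_line, text.splitlines()) if row]
    return {
        "peak_ram_mb": max(row[0] for row in rows),
        "peak_core_load": max(max(row[1]) for row in rows),
        "peak_gpu_load": max(row[2] for row in rows),
    }


summary = summarize(SAMPLE)
print(summary)
```

Over the full log above, GR3D_FREQ never goes past 10% while one CPU core sits at about 100%, so a large part of the 1.5 seconds may be CPU-side work rather than GPU compute. That is only my reading of the numbers, not something I have profiled.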
Questions:
- Can you tell whether the above results are reasonable?
- Is there any additional tuning I can do to improve performance?
- I’d appreciate a performance reference for running a YOLO model on TX1 or TX2.
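As a rough baseline for the comparison, the arithmetic cost of the network can be estimated from the layer table above. This is a minimal sketch under my own assumptions: it counts only the convolution rows, charges 2 FLOPs per multiply-accumulate, and ignores bias, batch norm, activations, pooling and upsampling. It gives a theoretical operation count, not a latency prediction.

```python
# One tuple per conv row of the layer table:
# (filters, kernel_size, input_channels, output_height, output_width)
CONV_LAYERS = [
    (16, 3, 3, 416, 416),     # layer 0
    (32, 3, 16, 208, 208),    # layer 2
    (64, 3, 32, 104, 104),    # layer 4
    (128, 3, 64, 52, 52),     # layer 6
    (256, 3, 128, 26, 26),    # layer 8
    (512, 3, 256, 13, 13),    # layer 10
    (1024, 3, 512, 13, 13),   # layer 12
    (256, 1, 1024, 13, 13),   # layer 13
    (512, 3, 256, 13, 13),    # layer 14
    (255, 1, 512, 13, 13),    # layer 15
    (128, 1, 256, 13, 13),    # layer 18
    (256, 3, 384, 26, 26),    # layer 21
    (255, 1, 256, 26, 26),    # layer 22
]


def conv_macs(filters, kernel, in_channels, out_h, out_w):
    """Multiply-accumulates for one convolution layer."""
    return filters * kernel * kernel * in_channels * out_h * out_w


total_macs = sum(conv_macs(*layer) for layer in CONV_LAYERS)
total_gflops = 2 * total_macs / 1e9  # assumption: 1 MAC = 2 FLOPs

print(f"{total_gflops:.2f} GFLOPs per 416x416 image")  # prints 5.56 GFLOPs per 416x416 image
```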
Thanks a lot!
Tal