Why can't I get 40 FPS with TLT YOLOv3 ResNet18 FP16 at 320x320?

• Hardware Platform (Jetson / GPU): Jetson Nano B01
• DeepStream Version: 5.0.1
• Issue Type: question

Hi, I’m referring to this doc: https://docs.nvidia.com/metropolis/deepstream/dev-guide/text/DS_Performance.html

According to it, YOLOv3 – ResNet18 in FP16 reaches 11 FPS at 960x544 on Jetson Nano. I’ve verified this on my Jetson Nano: I converted the provided models and indeed got ~11 FPS.

But when I train YOLOv3 ResNet18 with the Transfer Learning Toolkit at 320x320 and convert it to FP16 on the Jetson Nano, I get only 20 FPS (and 11 FPS in FP32). Why?

  1. How does inference resolution affect model performance?
  2. What is the maximum performance of YOLOv3 ResNet18 on Jetson Nano at 320x320?
  3. How can I reach that maximum performance?

A lower resolution should give an almost proportional increase in inference FPS, but it may need a higher batch size.

Could you try a higher batch?

No, I haven’t tried a higher batch, because the referenced model doesn’t run with batching.

I think I need to try pruning. Can you confirm whether the models in the referenced doc are pruned? If yes, which threshold and which method were used?

Could you please try higher batch and check the total fps?

The model is from TLT. In TLT, pruning is driven by the threshold you set, so the models in the referenced doc may well have been pruned.
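For reference, pruning a trained TLT model is done with the tlt-prune tool. A minimal sketch, assuming TLT 2.0; the model filenames and $KEY are placeholders for your own paths and NGC key:

```shell
# Prune a trained YOLOv3 ResNet18 model; -pth is the pruning threshold
# (higher threshold = more aggressive pruning = faster but less accurate).
tlt-prune -m yolov3_resnet18.tlt \
          -o yolov3_resnet18_pruned.tlt \
          -eq union \
          -pth 0.1 \
          -k $KEY
```

The pruned model normally needs to be retrained before export to recover accuracy.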

OK, I’ve tried a higher batch.

I chose trtexec to measure performance, with the following arguments:
--fp16 --batch=X --useSpinWait
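The full command looked roughly like this (the engine filename is a placeholder for the converted model; --batch must not exceed the max batch size the engine was built with):

```shell
# Measure inference latency of a serialized TensorRT engine on the Nano.
# trtexec ships with TensorRT under /usr/src/tensorrt/bin on JetPack.
/usr/src/tensorrt/bin/trtexec \
    --loadEngine=yolov3_resnet18.engine \
    --fp16 \
    --batch=1 \
    --useSpinWait
```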

Here are the results:
Model converted with max batch=1: 51.5673 ms
Model converted with max batch=2, run with --batch=1: 53.2138 ms
Model converted with max batch=2, run with --batch=2: 98.3157 ms
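Converting these per-batch latencies to throughput (FPS = batch × 1000 / latency in ms) shows why batching barely helps here; a quick check on the numbers above:

```python
def fps(batch, latency_ms):
    """Throughput in frames per second from a per-batch latency."""
    return batch * 1000.0 / latency_ms

print(round(fps(1, 51.5673), 1))  # max batch=1            -> ~19.4 FPS
print(round(fps(1, 53.2138), 1))  # max batch=2, --batch=1 -> ~18.8 FPS
print(round(fps(2, 98.3157), 1))  # max batch=2, --batch=2 -> ~20.3 FPS
```

So batch 2 gives ~20.3 FPS total versus ~19.4 FPS at batch 1 — nowhere near the 40 FPS target.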

We’ve also tried pruning, and got the following results:
Model converted with max batch=1, pruning threshold -pth=0.1: 24.7978 ms
Model converted with max batch=1, pruning threshold -pth=0.2: 22.2143 ms
Model converted with max batch=1, pruning threshold -pth=0.3: 18.4939 ms
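In throughput terms (1000 / latency in ms, batch 1), pruning gets much closer to the goal; comparing against the 51.5673 ms unpruned baseline measured earlier:

```python
BASELINE_MS = 51.5673  # unpruned model, max batch=1, from the earlier run

def fps(latency_ms):
    """Batch-1 throughput in frames per second."""
    return 1000.0 / latency_ms

for pth, ms in [(0.1, 24.7978), (0.2, 22.2143), (0.3, 18.4939)]:
    print(f"-pth={pth}: {fps(ms):.1f} FPS ({BASELINE_MS / ms:.2f}x faster)")
```

The -pth=0.1 model already sits at ~40 FPS, and -pth=0.3 reaches ~54 FPS — if the accuracy issue below can be resolved.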

Despite the speed-up, we’ve run into a model output issue after converting the pruned model to TensorRT: the model produces random bboxes with random classes at random coordinates.

Could you (@nvidia) give recommendations for accelerating YOLOv3 – ResNet18 in FP16 on Jetson Nano, in line with the referenced doc?

Is it possible to share your model and your perf measurement steps?

I just checked the YOLOv3 network structure: it uses an NMS layer, which is a TRT plugin, and that should consume most of the inference time.
You can add --dumpProfile to the trtexec command. I think the NMS layer's time does not shrink much from 960x544 to 320x320, which is why you don't see the expected FPS.
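A per-layer profile would confirm where the time goes. A sketch of the suggested command (the engine filename is a placeholder):

```shell
# Print per-layer timing so the NMS plugin's share of the total
# latency is visible in the profile output.
/usr/src/tensorrt/bin/trtexec \
    --loadEngine=yolov3_resnet18_320.engine \
    --fp16 \
    --batch=1 \
    --useSpinWait \
    --dumpProfile
```

If the NMS plugin's time stays roughly constant while the convolution layers shrink with resolution, it becomes the dominant cost at 320x320, which would explain the flat FPS.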