I'm running SSD-Mobilenet-v2 with jetson-inference using the my-detection.py script from the examples folder. I'm getting around 100 FPS, but the results announced in different benchmarks on the web are closer to 800 FPS with SSD-Mobilenet-v1. Is it because I'm using a Python script? How can I run SSD-Mobilenet-v1 with detectnet?
Note that the official benchmarks use INT8 precision, GPU + 2x DLAs, and batching - whereas jetson-inference uses FP16, batch size 1, on the GPU only.
If you want to run SSD-Mobilenet-v1, first download it using the Model Downloader tool. Then launch the detectnet program with the --network=ssd-mobilenet-v1 flag.
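For example, from Python the equivalent would be roughly the following - a minimal sketch along the lines of my-detection.py (the camera and display URIs are just placeholders, and it assumes the model was already downloaded with the Model Downloader):

```python
#!/usr/bin/env python3
# Minimal sketch: run SSD-Mobilenet-v1 from Python, same idea as
# "detectnet --network=ssd-mobilenet-v1" on the command line.
import jetson.inference
import jetson.utils

net = jetson.inference.detectNet("ssd-mobilenet-v1", threshold=0.5)
camera = jetson.utils.videoSource("csi://0")        # placeholder: file/RTSP URIs also work
display = jetson.utils.videoOutput("display://0")   # placeholder output

while display.IsStreaming():
    img = camera.Capture()
    detections = net.Detect(img)
    display.Render(img)
    display.SetStatus("{:.0f} FPS".format(net.GetNetworkFPS()))
```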
Running on the DLAs isn't exposed in the jetson-inference Python API, and I haven't tested DLA on object detection in jetson-inference. There is, however, this app from jetson-inference that runs GPU + 2x DLA on an image classification model from C++:
Thank you @dusty_nv, I've tried jetson-benchmarks and I could get around 900 FPS!
But I can't figure out how to set the input frames and get the output of the inference, so how can I run SSD-Mobilenet-v1 using GPU + DLA and get the inference results (detections and coordinates)? Also, can I use TensorFlow models with DLA and GPU, or only ONNX?
You can use DLA as long as the layers are supported on DLA (see the compatibility matrix). If the model has unsupported layers, you can enable GPU fallback so that those layers run on the GPU instead of the DLA.
There are SSD samples under /usr/src/tensorrt/samples/ and /usr/src/tensorrt/samples/python that show the input/output data format of SSD.
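For illustration, here is a hedged sketch of how DLA placement and GPU fallback are typically enabled through the plain TensorRT Python API when building an engine yourself (the network/parser setup is omitted; this is not how jetson-inference does it internally):

```python
import tensorrt as trt

# Sketch only: shows the builder-config options for DLA + GPU fallback.
# Parsing the actual network (UFF/ONNX) into 'builder' is omitted here.
logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
config = builder.create_builder_config()

config.set_flag(trt.BuilderFlag.FP16)            # DLA needs FP16 or INT8
config.default_device_type = trt.DeviceType.DLA  # place supported layers on the DLA
config.DLA_core = 0                              # DLA_0 (use 1 for DLA_1)
config.set_flag(trt.BuilderFlag.GPU_FALLBACK)    # unsupported layers fall back to GPU
```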
How can I activate DLA with jetson-inference and change the precision?
=> I tried net = jetson.inference.detectNet("pednet", threshold=0.5, device="DLA_0", precision="FP16") with no success… (Exception: jetson.inference -- detectNet.init() failed to parse args tuple)
I could finally use the DLA by changing the device in detectNet.h (device=DEVICE_DLA_0) and recompiling jetson-inference, and this is what I get:
[TRT] native precisions detected for DLA_0: FP32, FP16, INT8
[TRT] selecting fastest native precision for DLA_0: INT8
[TRT] attempting to open engine cache file /usr/local/bin/networks/ped-100/snapshot_iter_70800.caffemodel.1.1.DLA_0.INT8.engine
[TRT] cache file not found, profiling network model on device DLA_0
[TRT] device DLA_0, loading /usr/local/bin/networks/ped-100/deploy.prototxt /usr/local/bin/networks/ped-100/snapshot_iter_70800.caffemodel
[TRT] retrieved Output tensor “coverage”: 1x32x64
[TRT] retrieved Output tensor “bboxes”: 4x32x64
[TRT] device DLA_0, configuring CUDA engine
[TRT] retrieved Input tensor “data”: 3x512x1024
[TRT] warning: device DLA_0 using INT8 precision with RANDOM calibration
[TRT] device DLA_0, building FP16: OFF
[TRT] device DLA_0, building INT8: ON
[TRT] device DLA_0, building CUDA engine (this may take a few minutes the first time a network is loaded)
[TRT] Network built for DLA requires kENTROPY_CALIBRATION_2 calibrator.
[TRT] Network validation failed.
[TRT] device DLA_0, failed to build CUDA engine
[TRT] device DLA_0, failed to load networks/ped-100/snapshot_iter_70800.caffemodel
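For context: the warning above shows the INT8 engine being calibrated with random data, and TensorRT rejects that calibrator type for DLA, which requires an entropy-2 calibrator. Purely to illustrate what that interface looks like (random data and a made-up class name, not real calibration), a minimal TensorRT Python sketch of an IInt8EntropyCalibrator2 would be something like this; switching the DLA precision to FP16 avoids INT8 calibration entirely:

```python
import numpy as np
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit  # creates a CUDA context

class RandomEntropyCalibrator2(trt.IInt8EntropyCalibrator2):
    """Entropy-2 INT8 calibrator fed with random data (interface demo only;
    real calibration needs representative images)."""
    def __init__(self, shape=(1, 3, 512, 1024), num_batches=8, cache="calib.cache"):
        trt.IInt8EntropyCalibrator2.__init__(self)
        self.shape = shape
        self.num_batches = num_batches
        self.cache = cache
        self.count = 0
        self.dev_mem = cuda.mem_alloc(int(np.prod(shape)) * 4)  # float32 buffer

    def get_batch_size(self):
        return self.shape[0]

    def get_batch(self, names):
        if self.count >= self.num_batches:
            return None  # signals that calibration is done
        data = np.random.rand(*self.shape).astype(np.float32)
        cuda.memcpy_htod(self.dev_mem, data)
        self.count += 1
        return [int(self.dev_mem)]

    def read_calibration_cache(self):
        try:
            with open(self.cache, "rb") as f:
                return f.read()
        except FileNotFoundError:
            return None

    def write_calibration_cache(self, cache_bytes):
        with open(self.cache, "wb") as f:
            f.write(cache_bytes)

# usage sketch: config.int8_calibrator = RandomEntropyCalibrator2()
```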
YES, it worked with FP16 (I haven't checked the results in the output, but TRT seems to be running without any trouble) - thanks @dusty_nv!
Now I'm trying to modify the jetson-inference lib to be able to launch 3 processes (GPU + DLA_0 + DLA_1) in parallel, just by adding parameters (device and precision) to jetson.inference.detectNet (I'm using Python, btw). Has anyone already done that? => I mainly need to modify PyDetectNet_Init in PyDetectNet.cpp, located in the python/bindings dir, right?
I've changed the PyDetectNet_Init function, but I get an error message (/PyDetectNet.cpp:536:3: error: ‘precisionType’ is not a member of ‘tensorNet’):
Any idea how to solve it?
Also, I was thinking of using multiprocessing to run the 3 inferencers (GPU + 2x DLA) instead of net->CreateStream() - what do you think?
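Here is roughly what I have in mind - a sketch that assumes the custom device and precision keyword arguments have been added to PyDetectNet_Init (they do not exist in stock jetson-inference), with frames passed between processes as numpy arrays:

```python
import multiprocessing as mp

def worker(device, precision, in_q, out_q):
    # Import inside the process so each worker gets its own CUDA context.
    import jetson.inference
    import jetson.utils
    # NOTE: 'device' and 'precision' are the custom kwargs added to
    # PyDetectNet_Init -- not part of stock jetson-inference.
    net = jetson.inference.detectNet("pednet", threshold=0.5,
                                     device=device, precision=precision)
    while True:
        frame = in_q.get()          # numpy array (CUDA images don't pickle)
        if frame is None:           # poison pill -> shut down
            break
        img = jetson.utils.cudaFromNumpy(frame)
        dets = net.Detect(img)
        out_q.put([(d.ClassID, d.Confidence, d.Left, d.Top, d.Right, d.Bottom)
                   for d in dets])

if __name__ == "__main__":
    in_q, out_q = mp.Queue(), mp.Queue()
    configs = [("GPU", "INT8"), ("DLA_0", "FP16"), ("DLA_1", "FP16")]
    workers = [mp.Process(target=worker, args=(dev, prec, in_q, out_q))
               for dev, prec in configs]
    for p in workers:
        p.start()
    # ...put frames (numpy arrays) on in_q, read detections from out_q,
    # then put one None per worker to stop them and join().
```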
@dusty_nv thanks for your help - after a lot of time spent understanding and testing, it works now :)
I get around 36 FPS with pednet (512x1024 frames), which is equivalent to around 200 FPS at 300x300 (~19 MPx/s). Not so bad, but far from the 850 FPS I got with SSD-Mobilenet-v1 in jetson-benchmarks! It seems the GPU is capable of 28 FPS (14.7 MPx/s) and the DLAs of about ~4 FPS (2 MPx/s) when all are running together.
My configuration:
GPU: Pednet INT8
DLA_0: Pednet FP16
DLA_1: Pednet FP16
A Python 3 script using multiprocessing, cropping the big camera frames into multiple 512x1024 thumbnails that are sent to the inferencers via Python queues (roughly as in the sketch below). What do you think of that result?
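The cropping itself is just something like this sketch (the tile size matches the network input; each tile's offset is kept so detections can be mapped back to full-frame coordinates):

```python
import numpy as np

def crop_tiles(frame, tile_h=512, tile_w=1024):
    """Split a large HxWxC frame into 512x1024 tiles, keeping each tile's
    top-left offset so detections can be mapped back to the full frame."""
    h, w = frame.shape[:2]
    tiles = []
    for y in range(0, h - tile_h + 1, tile_h):
        for x in range(0, w - tile_w + 1, tile_w):
            tiles.append(((x, y), frame[y:y + tile_h, x:x + tile_w].copy()))
    return tiles

# each (offset, tile) goes onto the multiprocessing queue; the worker adds
# the offset back to the Left/Top/Right/Bottom coordinates it returns
```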
I could get around 90 FPS (8.1 MPx/s) with SSD-Mobilenet on the GPU, not more (compared to 25 FPS with pednet => 13.1 MPx/s). I'm still trying to improve the results - do you know if there is a caffemodel of SSD-Mobilenet trained for pedestrians only (and with a higher input size as well)?
Hi @Pelepicier, sorry for the delay - the SSD-Mobilenet models used in jetson-inference aren’t caffemodels, they are TensorFlow (UFF) and ONNX (PyTorch). The ONNX models are the ones that are re-trainable in the tutorial. So you could re-train it only on pedestrians and check the performance after an epoch.