Slow framerate with segnet-camera on FCN-alexnet

Hey all. I have a custom-trained FCN-Alexnet with 5 classes, and I am running it with segnet-camera from dusty-nv’s jetson-inference repo. The net seems to work well, but I am only getting about 2 FPS through it. Is this normal, or do I have something misconfigured?

I tried dropping the input resolution to 640x360 with no noticeable effect. I am displaying the output via HDMI to a monitor (not using an X server).

Also, is it normal for segnet-camera to get stuck (for about 5 minutes) on “[GIE] building CUDA engine” every time I run a new net?

Just tried running the Cityscapes net, and it runs at <1 FPS.

Did you re-train the network at the lower resolution? The input resolution should have a noticeable impact on speed. The segmentation nets are among the most demanding. Have you tried running ~/ after startup?

Yes, the first time you run a new network it is normal for TensorRT to take a while to perform optimizations and profiling on the network (this is what it’s doing during “building CUDA engine”). The more complex the network, the longer it takes. After the CUDA engine is built the first time, it is saved in a .tensorcache file, so subsequent loads should be much faster.

Another thing to look into if you are optimizing performance: the segmentation overlay in the example code isn’t currently optimized, so you may want to disable it when measuring the real performance. I’m referring to this and this part of the code.

Also, what performance do you get from segnet-console? segnet-console includes a call to net->EnableProfiler(), which prints out the per-layer times. This will give you a better measure of the runtime the network itself is consuming.

The net was originally trained on 640x360 input images; I just changed the camera resolution to match. And yes, I tried running jetson_clocks. It helped a bit, but it was still pretty slow. Is 1 FPS what you would expect from the Cityscapes net?

If you are using the 2048×1024 Cityscapes (HD) model, you should be able to get a few FPS out of it with TensorRT. With Caffe, it only reaches 0.25 FPS.
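To put the resolution point in numbers, here is a rough sketch comparing the two input sizes mentioned in this thread. It assumes that the cost of a fully convolutional net grows roughly in proportion to the number of input pixels, which is only an approximation; the function name is made up for illustration.

```python
def pixel_ratio(w_a, h_a, w_b, h_b):
    """Return how many times more pixels resolution A has than resolution B."""
    return (w_a * h_a) / (w_b * h_b)


# Cityscapes HD input versus the 640x360 input of the custom net.
ratio = pixel_ratio(2048, 1024, 640, 360)
print(f"2048x1024 has about {ratio:.1f}x the pixels of 640x360")
```

So, under that assumption, the HD model should be expected to run several times slower than a 640x360 model of the same architecture.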

Segmentation nets (like FCN-Alexnet) are good candidates for pruning, which can speed them up significantly (example).

Oooh nice!!

Here is the output with the included cityscape net:

[GIE]  layer shift - 11.461568 ms
[GIE]  layer conv1 + relu1 input reformatter 0 - 28.558399 ms
[GIE]  layer conv1 + relu1 - 147.546341 ms
[GIE]  layer pool1 - 11.208960 ms
[GIE]  layer norm1 - 3.560672 ms
[GIE]  layer conv2 + relu2 - 191.828384 ms
[GIE]  layer pool2 - 7.474016 ms
[GIE]  layer norm2 - 2.365472 ms
[GIE]  layer conv3 + relu3 - 77.937309 ms
[GIE]  layer conv4 + relu4 - 60.680447 ms
[GIE]  layer conv5 + relu5 - 40.529728 ms
[GIE]  layer pool5 - 1.888672 ms
[GIE]  layer fc6 + relu6 - 1059.710327 ms
[GIE]  layer fc7 + relu7 - 505.547668 ms
[GIE]  layer score_fr_21classes - 4.333216 ms
[GIE]  layer score_fr_21classes output reformatter 0 - 0.135104 ms
[GIE]  layer network time - 2154.766113 ms
[GIE]  segNet::Overlay -- s_w 58  s_h 26  s_c 21  s_x 0.028320  s_y 0.025391
[GIE]  segNet::Overlay -- ignoring class 'void' id=-1
segnet-console:  finished processing overlay  (1519659937767)
segnet-console:  completed saving 'test.png

And my net:

[GIE]  layer shift - 1.293184 ms
[GIE]  layer conv1 + relu1 input reformatter 0 - 3.188544 ms
[GIE]  layer conv1 + relu1 - 16.046848 ms
[GIE]  layer pool1 - 1.459680 ms
[GIE]  layer norm1 - 0.408256 ms
[GIE]  layer conv2 + relu2 - 21.276417 ms
[GIE]  layer pool2 - 1.405184 ms
[GIE]  layer norm2 - 0.494528 ms
[GIE]  layer conv3 + relu3 - 11.010240 ms
[GIE]  layer conv4 + relu4 - 8.643264 ms
[GIE]  layer conv5 + relu5 - 5.798432 ms
[GIE]  layer pool5 - 0.369216 ms
[GIE]  layer fc6 + relu6 - 88.395615 ms
[GIE]  layer fc7 + relu7 - 42.390625 ms
[GIE]  layer score_fr - 0.776352 ms
[GIE]  layer score_fr output reformatter 0 - 0.034912 ms
[GIE]  layer network time - 202.991287 ms
[GIE]  segNet::Overlay -- s_w 14  s_h 6  s_c 21  s_x 0.021875  s_y 0.016667
[GIE]  segNet::Overlay -- ignoring class 'void' id=-1

Strange: segnet-console reports that my net runs in about 203 ms (roughly 4.9 FPS), yet segnet-camera runs it much slower (about 1 FPS).
Any ideas?
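For anyone who wants to redo this arithmetic, here is a minimal sketch that parses the profiler lines from the custom net's output above, sums the layer times, and converts the total to FPS. The regex and helper name are my own and only assume the "[GIE]  layer NAME - TIME ms" format shown in the logs.

```python
import re

# Profiler lines copied from the custom net's segnet-console output above.
LOG = """\
[GIE]  layer shift - 1.293184 ms
[GIE]  layer conv1 + relu1 input reformatter 0 - 3.188544 ms
[GIE]  layer conv1 + relu1 - 16.046848 ms
[GIE]  layer pool1 - 1.459680 ms
[GIE]  layer norm1 - 0.408256 ms
[GIE]  layer conv2 + relu2 - 21.276417 ms
[GIE]  layer pool2 - 1.405184 ms
[GIE]  layer norm2 - 0.494528 ms
[GIE]  layer conv3 + relu3 - 11.010240 ms
[GIE]  layer conv4 + relu4 - 8.643264 ms
[GIE]  layer conv5 + relu5 - 5.798432 ms
[GIE]  layer pool5 - 0.369216 ms
[GIE]  layer fc6 + relu6 - 88.395615 ms
[GIE]  layer fc7 + relu7 - 42.390625 ms
[GIE]  layer score_fr - 0.776352 ms
[GIE]  layer score_fr output reformatter 0 - 0.034912 ms
[GIE]  layer network time - 202.991287 ms
"""

LAYER_RE = re.compile(r"^\[GIE\]\s+layer (.+) - ([0-9.]+) ms$")


def parse_layer_times(log_text):
    """Map each layer name to its reported time in milliseconds."""
    times = {}
    for line in log_text.splitlines():
        match = LAYER_RE.match(line.strip())
        if match:
            times[match.group(1)] = float(match.group(2))
    return times


times = parse_layer_times(LOG)
total_ms = times.pop("network time")  # the summary line, not a real layer
fps = 1000.0 / total_ms
slowest = sorted(times.items(), key=lambda kv: kv[1], reverse=True)[:3]

print(f"network time: {total_ms:.1f} ms -> {fps:.2f} FPS")
for name, ms in slowest:
    print(f"{name}: {ms:.1f} ms ({100.0 * ms / total_ms:.0f}% of total)")
```

The network alone works out to a little under 5 FPS, with fc6 and fc7 together accounting for roughly two thirds of the time. Anything slower than that in segnet-camera has to come from work outside the network itself.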

Just commented out those for loops you referenced in segNet.cpp and it runs a lot closer to 5 FPS…

Those loops do the overlay, which is only for human visualization. If you are doing machine vision for navigation or similar purposes, it is often preferable to have the robot operate directly on the low-res grid output of the segmentation network, which for all intents and purposes is already blobbed, making it easier to determine a free-space vector. Performing the overlay consumes resources to upscale the grid, and the result would then effectively have to be scaled back down and blobbed again so the machine has data low-res enough to interpret. So for anything other than human viewing, the overlay can probably be eliminated altogether.
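As an illustration of working directly on the low-res grid, here is a minimal sketch that turns a grid of class IDs into a steering offset. Everything in it is an assumption made for the example: the FREE_SPACE class ID, the function name, and the idea that the grid is already available as a list of rows. It does not show how to get the grid out of segNet.

```python
# Hypothetical class ID for drivable free space; the real ID depends on the
# labels the network was trained with.
FREE_SPACE = 1


def freespace_heading(grid, free_id=FREE_SPACE):
    """Return a steering offset in [-1, 1] from a low-res class grid.

    grid is a list of rows, each a list of class IDs.
    -1 means the free space is at the far left, +1 at the far right.
    Returns None when no cell is classified as free space.
    """
    width = len(grid[0])
    # Count free cells in each column of the grid.
    col_free = [sum(1 for row in grid if row[x] == free_id) for x in range(width)]
    total = sum(col_free)
    if total == 0:
        return None
    # Free-space-weighted mean column position, using cell centers.
    centroid = sum((x + 0.5) * n for x, n in enumerate(col_free)) / total
    # Map the range [0, width] onto [-1, 1].
    return 2.0 * centroid / width - 1.0


# A 14x6 grid, the size reported for the custom net (s_w 14, s_h 6).
# Free space fills the right-hand columns of the bottom three rows.
grid = [[0] * 14 for _ in range(6)]
for y in range(3, 6):
    for x in range(8, 14):
        grid[y][x] = FREE_SPACE

heading = freespace_heading(grid)
print(f"steering offset: {heading:+.2f}")
```

With only 14x6 cells to scan, this kind of computation is trivial compared with upscaling the result to full camera resolution for display.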

Alternatively, you could optimize the overlay function with CUDA so that it has less of a performance impact, if you still need it.

Sounds good. Thank you!