Did you re-train the network at the lower resolution? The input resolution should have a noticeable impact on speed. The segmentation nets are among the most demanding; have you tried running ~/jetson_clocks.sh after startup to max out the clocks?
The first time you run a new network, yes, it is normal for TensorRT to take a while to perform optimizations and profiling on the network (this is what it's doing during "building CUDA engine"). The more complex the network, the longer it takes. After the CUDA engine is built the first time, it is saved to a .tensorcache file, so thereafter loading should be much faster.
Another thing you may want to look into if you are optimizing performance: the segmentation overlay in the example code isn't optimized currently, so you may want to disable it when testing the real performance. See this and this part of the code.
Also, what performance do you get from segnet-console? segnet-console includes a call to net->EnableProfiler(), which prints out the per-layer times; this will give you a better measure of the runtime the network itself is consuming.