Low performance on Jetson Xavier NX using the DeepStream Python apps (MaskRCNN)

• Hardware Xavier NX
• Network Type Mask_rcnn
• TAO version 4.0.1
• Jetpack 5.1
• Deepstream 6.2

Hello, I am using the DeepStream Python apps on a Jetson Xavier NX with the default PeopleSegNet model, and it gives me about 5 fps running the sample_qHD.mp4 demo video. After successfully deploying the default model, I trained a custom MaskRCNN model using the TAO instance segmentation Jupyter notebook and exported a <model_name>.etlt model file. For clarification, apart from the dataset I have not made any code modifications to the notebook, and the training machine is an AWS instance.
It is worth mentioning that I reduced the number of epochs in order to generate a model file faster.
After deploying this model on the NX with the DeepStream Python application, I hit a huge performance bottleneck, getting 0-0.2 fps. I would like to mention that I have tried both the FP32 and FP16 network modes in the DeepStream configuration files.
After following the TAO documentation (TAO Toolkit v4.0.1 (Latest Release) - NVIDIA Docs) and the technical blog on instance segmentation (https://developer.nvidia.com/blog/training-instance-segmentation-models-using-maskrcnn-on-the-transfer-learning-toolkit/), it seems that I have followed the proper steps, but the performance issue persists. Do you have any advice on how to reach the fps figures shown in the technical blog's Figure 4?
I am attaching the TAO and DeepStream spec files that reproduce the issue described above.

deepstream_configuration.txt

[property]
gpu-id=0
net-scale-factor=0.017507
offsets=123.675;116.280;103.53
model-color-format=0
labelfile-path=models/labels.txt
tlt-encoded-model=models/<model_name>.etlt
tlt-model-key=nvidia_tlt
model-engine-file=models/<model_name>.etlt_b1_gpu0_fp16.engine
#int8-calib-file=models/tests/maskrcnn.cal
infer-dims=3;832;1344
uff-input-blob-name=Input
batch-size=1

# 0=FP32, 1=INT8, 2=FP16 mode
# tried both 0 (FP32) and 2 (FP16)
network-mode=2
num-detected-classes=2
interval=0
gie-unique-id=1
network-type=3
output-blob-names=generate_detections;mask_fcn_logits/BiasAdd
parse-bbox-instance-mask-func-name=NvDsInferParseCustomMrcnnTLTV2
custom-lib-path=/opt/nvidia/deepstream/deepstream/lib/libnvds_infercustomparser.so
#no cluster
cluster-mode=4
output-instance-mask=1

[class-attrs-all]
pre-cluster-threshold=0.8

training_spec_file.txt
seed: 747
use_amp: True
warmup_steps: 1000
checkpoint: "/workspace/tao-experiments/mask_rcnn/pretrained_resnet50/pretrained_instance_segmentation_vresnet50/resnet50.hdf5"
learning_rate_steps: "[10000]"
learning_rate_decay_levels: "[0.1]"
total_steps: 25000
train_batch_size: 2
eval_batch_size: 4
num_steps_per_eval: 25000
momentum: 0.9
l2_weight_decay: 0.00004
warmup_learning_rate: 0.0001
init_learning_rate: 0.005
num_examples_per_epoch: 100

data_config{
image_size: "(832, 1344)"
augment_input_data: True
eval_samples: 100
training_file_pattern: "/workspace/tao-experiments/data/train*.tfrecord"
validation_file_pattern: "/workspace/tao-experiments/data/val*.tfrecord"
val_json_file: "/workspace/tao-experiments/data/raw-data/annotations/instances_val2017.json"

# dataset specific parameters
num_classes: 2
skip_crowd_during_training: True

}

maskrcnn_config {
nlayers: 50
arch: "resnet"
freeze_bn: True
freeze_blocks: "[0,1]"
gt_mask_size: 112

# Region Proposal Network
rpn_positive_overlap: 0.7
rpn_negative_overlap: 0.3
rpn_batch_size_per_im: 256
rpn_fg_fraction: 0.5
rpn_min_size: 0.

# Proposal layer.
batch_size_per_im: 512
fg_fraction: 0.25
fg_thresh: 0.5
bg_thresh_hi: 0.5
bg_thresh_lo: 0.

# Faster-RCNN heads.
fast_rcnn_mlp_head_dim: 1024
bbox_reg_weights: "(10., 10., 5., 5.)"

# Mask-RCNN heads.
include_mask: True
mrcnn_resolution: 28

# training
train_rpn_pre_nms_topn: 2000
train_rpn_post_nms_topn: 1000
train_rpn_nms_threshold: 0.7

# evaluation
test_detections_per_image: 100
test_nms: 0.5
test_rpn_pre_nms_topn: 1000
test_rpn_post_nms_topn: 1000
test_rpn_nms_thresh: 0.7

# model architecture
min_level: 2
max_level: 6
num_scales: 1
aspect_ratios: "[(1.0, 1.0), (1.4, 0.7), (0.7, 1.4)]"
anchor_scale: 8

# localization loss
rpn_box_loss_weight: 1.0
fast_rcnn_box_loss_weight: 1.0
mrcnn_weight_loss_mask: 1.0

}

retraining_spec_file.txt is the same as the training spec file.

Thanks in advance

You can use the trtexec command to test the performance without DeepStream first.

trtexec --loadEngine=your_engine_file --fp16
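
If the clocks are not already locked, the numbers can vary a lot between runs on Jetson. Assuming the standard JetPack tools, something like the following should stabilize the measurement (the nvpmodel mode ID that gives maximum performance varies by board):

sudo nvpmodel -m 0
sudo jetson_clocks
trtexec --loadEngine=your_engine_file --fp16 --useSpinWait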

Thank you for your prompt reply.
I am posting the profiling results from the custom model and from PeopleSegNet for comparison, using the trtexec command line.

custom_model.engine

[06/08/2023-10:31:55] [I] === Model Options ===
[06/08/2023-10:31:55] [I] Format: *
[06/08/2023-10:31:55] [I] Model:
[06/08/2023-10:31:55] [I] Output:
[06/08/2023-10:31:55] [I] === Build Options ===
[06/08/2023-10:31:55] [I] Max batch: 1
[06/08/2023-10:31:55] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[06/08/2023-10:31:55] [I] minTiming: 1
[06/08/2023-10:31:55] [I] avgTiming: 8
[06/08/2023-10:31:55] [I] Precision: FP32+FP16
[06/08/2023-10:31:55] [I] LayerPrecisions:
[06/08/2023-10:31:55] [I] Calibration:
[06/08/2023-10:31:55] [I] Refit: Disabled
[06/08/2023-10:31:55] [I] Sparsity: Disabled
[06/08/2023-10:31:55] [I] Safe mode: Disabled
[06/08/2023-10:31:55] [I] DirectIO mode: Disabled
[06/08/2023-10:31:55] [I] Restricted mode: Disabled
[06/08/2023-10:31:55] [I] Build only: Disabled
[06/08/2023-10:31:55] [I] Save engine:
[06/08/2023-10:31:55] [I] Load engine: models/tests2/model_pruned_25000.etlt_b1_gpu0_fp16.engine
[06/08/2023-10:31:55] [I] Profiling verbosity: 0
[06/08/2023-10:31:55] [I] Tactic sources: Using default tactic sources
[06/08/2023-10:31:55] [I] timingCacheMode: local
[06/08/2023-10:31:55] [I] timingCacheFile:
[06/08/2023-10:31:55] [I] Heuristic: Disabled
[06/08/2023-10:31:55] [I] Preview Features: Use default preview flags.
[06/08/2023-10:31:55] [I] Input(s)s format: fp32:CHW
[06/08/2023-10:31:55] [I] Output(s)s format: fp32:CHW
[06/08/2023-10:31:55] [I] Input build shapes: model
[06/08/2023-10:31:55] [I] Input calibration shapes: model
[06/08/2023-10:31:55] [I] === System Options ===
[06/08/2023-10:31:55] [I] Device: 0
[06/08/2023-10:31:55] [I] DLACore:
[06/08/2023-10:31:55] [I] Plugins:
[06/08/2023-10:31:55] [I] === Inference Options ===
[06/08/2023-10:31:55] [I] Batch: 1
[06/08/2023-10:31:55] [I] Input inference shapes: model
[06/08/2023-10:31:55] [I] Iterations: 10
[06/08/2023-10:31:55] [I] Duration: 3s (+ 200ms warm up)
[06/08/2023-10:31:55] [I] Sleep time: 0ms
[06/08/2023-10:31:55] [I] Idle time: 0ms
[06/08/2023-10:31:55] [I] Streams: 1
[06/08/2023-10:31:55] [I] ExposeDMA: Disabled
[06/08/2023-10:31:55] [I] Data transfers: Enabled
[06/08/2023-10:31:55] [I] Spin-wait: Disabled
[06/08/2023-10:31:55] [I] Multithreading: Disabled
[06/08/2023-10:31:55] [I] CUDA Graph: Disabled
[06/08/2023-10:31:55] [I] Separate profiling: Disabled
[06/08/2023-10:31:55] [I] Time Deserialize: Disabled
[06/08/2023-10:31:55] [I] Time Refit: Disabled
[06/08/2023-10:31:55] [I] NVTX verbosity: 0
[06/08/2023-10:31:55] [I] Persistent Cache Ratio: 0
[06/08/2023-10:31:55] [I] Inputs:
[06/08/2023-10:31:55] [I] === Reporting Options ===
[06/08/2023-10:31:55] [I] Verbose: Enabled
[06/08/2023-10:31:55] [I] Averages: 10 inferences
[06/08/2023-10:31:55] [I] Percentiles: 90,95,99
[06/08/2023-10:31:55] [I] Dump refittable layers:Disabled
[06/08/2023-10:31:55] [I] Dump output: Disabled
[06/08/2023-10:31:55] [I] Profile: Disabled
[06/08/2023-10:31:55] [I] Export timing to JSON file:
[06/08/2023-10:31:55] [I] Export output to JSON file:
[06/08/2023-10:31:55] [I] Export profile to JSON file:
[06/08/2023-10:31:55] [I]
[06/08/2023-10:31:55] [I] === Device Information ===
[06/08/2023-10:31:55] [I] Selected Device: Xavier
[06/08/2023-10:31:55] [I] Compute Capability: 7.2
[06/08/2023-10:31:55] [I] SMs: 6
[06/08/2023-10:31:55] [I] Compute Clock Rate: 1.109 GHz
[06/08/2023-10:31:55] [I] Device Global Memory: 6857 MiB
[06/08/2023-10:31:55] [I] Shared Memory per SM: 96 KiB
[06/08/2023-10:31:55] [I] Memory Bus Width: 256 bits (ECC disabled)
[06/08/2023-10:31:55] [I] Memory Clock Rate: 1.109 GHz
[06/08/2023-10:31:55] [I]
[06/08/2023-10:31:55] [I] TensorRT version: 8.5.2
[06/08/2023-10:31:55] [V] [TRT] Registered plugin creator - ::BatchedNMSDynamic_TRT version 1
[06/08/2023-10:31:55] [V] [TRT] Registered plugin creator - ::BatchedNMS_TRT version 1
[06/08/2023-10:31:55] [V] [TRT] Registered plugin creator - ::BatchTilePlugin_TRT version 1
[06/08/2023-10:31:55] [V] [TRT] Registered plugin creator - ::Clip_TRT version 1
[06/08/2023-10:31:55] [V] [TRT] Registered plugin creator - ::CoordConvAC version 1
[06/08/2023-10:31:55] [V] [TRT] Registered plugin creator - ::CropAndResizeDynamic version 1
[06/08/2023-10:31:55] [V] [TRT] Registered plugin creator - ::CropAndResize version 1
[06/08/2023-10:31:55] [V] [TRT] Registered plugin creator - ::DecodeBbox3DPlugin version 1
[06/08/2023-10:31:55] [V] [TRT] Registered plugin creator - ::DetectionLayer_TRT version 1
[06/08/2023-10:31:55] [V] [TRT] Registered plugin creator - ::EfficientNMS_Explicit_TF_TRT version 1
[06/08/2023-10:31:55] [V] [TRT] Registered plugin creator - ::EfficientNMS_Implicit_TF_TRT version 1
[06/08/2023-10:31:55] [V] [TRT] Registered plugin creator - ::EfficientNMS_ONNX_TRT version 1
[06/08/2023-10:31:55] [V] [TRT] Registered plugin creator - ::EfficientNMS_TRT version 1
[06/08/2023-10:31:55] [V] [TRT] Registered plugin creator - ::FlattenConcat_TRT version 1
[06/08/2023-10:31:55] [V] [TRT] Registered plugin creator - ::GenerateDetection_TRT version 1
[06/08/2023-10:31:55] [V] [TRT] Registered plugin creator - ::GridAnchor_TRT version 1
[06/08/2023-10:31:55] [V] [TRT] Registered plugin creator - ::GridAnchorRect_TRT version 1
[06/08/2023-10:31:55] [V] [TRT] Registered plugin creator - ::GroupNorm version 1
[06/08/2023-10:31:55] [V] [TRT] Registered plugin creator - ::InstanceNormalization_TRT version 1
[06/08/2023-10:31:55] [V] [TRT] Registered plugin creator - ::InstanceNormalization_TRT version 2
[06/08/2023-10:31:55] [V] [TRT] Registered plugin creator - ::LayerNorm version 1
[06/08/2023-10:31:55] [V] [TRT] Registered plugin creator - ::LReLU_TRT version 1
[06/08/2023-10:31:55] [V] [TRT] Registered plugin creator - ::MultilevelCropAndResize_TRT version 1
[06/08/2023-10:31:55] [V] [TRT] Registered plugin creator - ::MultilevelProposeROI_TRT version 1
[06/08/2023-10:31:55] [V] [TRT] Registered plugin creator - ::MultiscaleDeformableAttnPlugin_TRT version 1
[06/08/2023-10:31:55] [V] [TRT] Registered plugin creator - ::NMSDynamic_TRT version 1
[06/08/2023-10:31:55] [V] [TRT] Registered plugin creator - ::NMS_TRT version 1
[06/08/2023-10:31:55] [V] [TRT] Registered plugin creator - ::Normalize_TRT version 1
[06/08/2023-10:31:55] [V] [TRT] Registered plugin creator - ::PillarScatterPlugin version 1
[06/08/2023-10:31:55] [V] [TRT] Registered plugin creator - ::PriorBox_TRT version 1
[06/08/2023-10:31:55] [V] [TRT] Registered plugin creator - ::ProposalDynamic version 1
[06/08/2023-10:31:55] [V] [TRT] Registered plugin creator - ::ProposalLayer_TRT version 1
[06/08/2023-10:31:55] [V] [TRT] Registered plugin creator - ::Proposal version 1
[06/08/2023-10:31:55] [V] [TRT] Registered plugin creator - ::PyramidROIAlign_TRT version 1
[06/08/2023-10:31:55] [V] [TRT] Registered plugin creator - ::Region_TRT version 1
[06/08/2023-10:31:55] [V] [TRT] Registered plugin creator - ::Reorg_TRT version 1
[06/08/2023-10:31:55] [V] [TRT] Registered plugin creator - ::ResizeNearest_TRT version 1
[06/08/2023-10:31:55] [V] [TRT] Registered plugin creator - ::ROIAlign_TRT version 1
[06/08/2023-10:31:55] [V] [TRT] Registered plugin creator - ::RPROI_TRT version 1
[06/08/2023-10:31:55] [V] [TRT] Registered plugin creator - ::ScatterND version 1
[06/08/2023-10:31:55] [V] [TRT] Registered plugin creator - ::SeqLen2Spatial version 1
[06/08/2023-10:31:55] [V] [TRT] Registered plugin creator - ::SpecialSlice_TRT version 1
[06/08/2023-10:31:55] [V] [TRT] Registered plugin creator - ::SplitGeLU version 1
[06/08/2023-10:31:55] [V] [TRT] Registered plugin creator - ::Split version 1
[06/08/2023-10:31:55] [V] [TRT] Registered plugin creator - ::VoxelGeneratorPlugin version 1
[06/08/2023-10:31:55] [I] Engine loaded in 0.0105688 sec.
[06/08/2023-10:31:57] [I] [TRT] Loaded engine size: 2 MiB
[06/08/2023-10:31:57] [W] [TRT] Using an engine plan file across different models of devices is not recommended and is likely to affect performance or even cause errors.
[06/08/2023-10:31:57] [V] [TRT] Trying to load shared library libcublas.so.11
[06/08/2023-10:31:57] [V] [TRT] Loaded shared library libcublas.so.11
[06/08/2023-10:31:59] [V] [TRT] Using cublas as plugin tactic source
[06/08/2023-10:31:59] [V] [TRT] Trying to load shared library libcublasLt.so.11
[06/08/2023-10:31:59] [V] [TRT] Loaded shared library libcublasLt.so.11
[06/08/2023-10:31:59] [V] [TRT] Using cublasLt as core library tactic source
[06/08/2023-10:31:59] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +261, GPU +245, now: CPU 506, GPU 3491 (MiB)
[06/08/2023-10:31:59] [V] [TRT] Trying to load shared library libcudnn.so.8
[06/08/2023-10:31:59] [V] [TRT] Loaded shared library libcudnn.so.8
[06/08/2023-10:31:59] [V] [TRT] Using cuDNN as plugin tactic source
[06/08/2023-10:32:00] [V] [TRT] Using cuDNN as core library tactic source
[06/08/2023-10:32:00] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +82, GPU +78, now: CPU 588, GPU 3569 (MiB)
[06/08/2023-10:32:00] [V] [TRT] Deserialization required 2784179 microseconds.
[06/08/2023-10:32:00] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +1, now: CPU 0, GPU 1 (MiB)
[06/08/2023-10:32:00] [I] Engine deserialized in 4.41861 sec.
[06/08/2023-10:32:00] [V] [TRT] Trying to load shared library libcublas.so.11
[06/08/2023-10:32:00] [V] [TRT] Loaded shared library libcublas.so.11
[06/08/2023-10:32:00] [V] [TRT] Using cublas as plugin tactic source
[06/08/2023-10:32:00] [V] [TRT] Trying to load shared library libcublasLt.so.11
[06/08/2023-10:32:00] [V] [TRT] Loaded shared library libcublasLt.so.11
[06/08/2023-10:32:00] [V] [TRT] Using cublasLt as core library tactic source
[06/08/2023-10:32:00] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 599, GPU 3579 (MiB)
[06/08/2023-10:32:00] [V] [TRT] Trying to load shared library libcudnn.so.8
[06/08/2023-10:32:00] [V] [TRT] Loaded shared library libcudnn.so.8
[06/08/2023-10:32:00] [V] [TRT] Using cuDNN as plugin tactic source
[06/08/2023-10:32:00] [V] [TRT] Using cuDNN as core library tactic source
[06/08/2023-10:32:00] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +0, now: CPU 599, GPU 3579 (MiB)
[06/08/2023-10:32:00] [V] [TRT] Total per-runner device persistent memory is 326144
[06/08/2023-10:32:00] [V] [TRT] Total per-runner host persistent memory is 260096
[06/08/2023-10:32:00] [V] [TRT] Allocated activation device memory of size 138439680
[06/08/2023-10:32:00] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +133, now: CPU 0, GPU 134 (MiB)
[06/08/2023-10:32:00] [I] Setting persistentCacheLimit to 0 bytes.
[06/08/2023-10:32:00] [I] Using random values for input Input
[06/08/2023-10:32:00] [I] Created input binding for Input with dimensions 3x832x1344
[06/08/2023-10:32:00] [I] Using random values for output generate_detections
[06/08/2023-10:32:00] [I] Created output binding for generate_detections with dimensions 100x6
[06/08/2023-10:32:00] [I] Using random values for output mask_fcn_logits/BiasAdd
[06/08/2023-10:32:00] [I] Created output binding for mask_fcn_logits/BiasAdd with dimensions 100x2x28x28
[06/08/2023-10:32:00] [I] Starting inference
[06/08/2023-10:32:04] [I] Warmup completed 1 queries over 200 ms
[06/08/2023-10:32:04] [I] Timing trace has 33 queries over 3.43372 s
[06/08/2023-10:32:04] [I]
[06/08/2023-10:32:04] [I] === Trace details ===
[06/08/2023-10:32:04] [I] Trace averages of 10 runs:
[06/08/2023-10:32:04] [I] Average on 10 runs - GPU latency: 98.4642 ms - Host latency: 99.801 ms (enqueue 5.24156 ms)
[06/08/2023-10:32:04] [I] Average on 10 runs - GPU latency: 97.2185 ms - Host latency: 98.1436 ms (enqueue 4.80054 ms)
[06/08/2023-10:32:04] [I] Average on 10 runs - GPU latency: 97.1906 ms - Host latency: 98.0993 ms (enqueue 4.36616 ms)
[06/08/2023-10:32:04] [I]
[06/08/2023-10:32:04] [I] === Performance summary ===
[06/08/2023-10:32:04] [I] Throughput: 9.61058 qps
[06/08/2023-10:32:04] [I] Latency: min = 97.6519 ms, max = 114.77 ms, mean = 98.6165 ms, median = 98.1255 ms, percentile(90%) = 98.1974 ms, percentile(95%) = 98.2034 ms, percentile(99%) = 114.77 ms
[06/08/2023-10:32:04] [I] Enqueue Time: min = 3.64551 ms, max = 6.41525 ms, mean = 4.72037 ms, median = 4.64966 ms, percentile(90%) = 5.66675 ms, percentile(95%) = 6.01855 ms, percentile(99%) = 6.41525 ms
[06/08/2023-10:32:04] [I] H2D Latency: min = 0.743896 ms, max = 4.81674 ms, mean = 0.967748 ms, median = 0.852783 ms, percentile(90%) = 0.876465 ms, percentile(95%) = 0.888229 ms, percentile(99%) = 4.81674 ms
[06/08/2023-10:32:04] [I] GPU Compute Time: min = 96.7417 ms, max = 109.876 ms, mean = 97.5714 ms, median = 97.1916 ms, percentile(90%) = 97.2577 ms, percentile(95%) = 97.3153 ms, percentile(99%) = 109.876 ms
[06/08/2023-10:32:04] [I] D2H Latency: min = 0.0302734 ms, max = 0.0866699 ms, mean = 0.0773695 ms, median = 0.0789795 ms, percentile(90%) = 0.0834961 ms, percentile(95%) = 0.0852051 ms, percentile(99%) = 0.0866699 ms
[06/08/2023-10:32:04] [I] Total Host Walltime: 3.43372 s
[06/08/2023-10:32:04] [I] Total GPU Compute Time: 3.21986 s
[06/08/2023-10:32:04] [W] * GPU compute time is unstable, with coefficient of variance = 2.23119%.
[06/08/2023-10:32:04] [W] If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[06/08/2023-10:32:04] [I] Explanations of the performance metrics are printed in the verbose logs.
[06/08/2023-10:32:04] [V]

peoplesegnet_fp16.engine
[06/08/2023-10:39:04] [I] === Model Options ===
[06/08/2023-10:39:04] [I] Format: *
[06/08/2023-10:39:04] [I] Model:
[06/08/2023-10:39:04] [I] Output:
[06/08/2023-10:39:04] [I] === Build Options ===
[06/08/2023-10:39:04] [I] Max batch: 1
[06/08/2023-10:39:04] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[06/08/2023-10:39:04] [I] minTiming: 1
[06/08/2023-10:39:04] [I] avgTiming: 8
[06/08/2023-10:39:04] [I] Precision: FP32+FP16
[06/08/2023-10:39:04] [I] LayerPrecisions:
[06/08/2023-10:39:04] [I] Calibration:
[06/08/2023-10:39:04] [I] Refit: Disabled
[06/08/2023-10:39:04] [I] Sparsity: Disabled
[06/08/2023-10:39:04] [I] Safe mode: Disabled
[06/08/2023-10:39:04] [I] DirectIO mode: Disabled
[06/08/2023-10:39:04] [I] Restricted mode: Disabled
[06/08/2023-10:39:04] [I] Build only: Disabled
[06/08/2023-10:39:04] [I] Save engine:
[06/08/2023-10:39:04] [I] Load engine: models/peoplesegnet_resnet50.etlt_b1_gpu0_fp16.engine
[06/08/2023-10:39:04] [I] Profiling verbosity: 0
[06/08/2023-10:39:04] [I] Tactic sources: Using default tactic sources
[06/08/2023-10:39:04] [I] timingCacheMode: local
[06/08/2023-10:39:04] [I] timingCacheFile:
[06/08/2023-10:39:04] [I] Heuristic: Disabled
[06/08/2023-10:39:04] [I] Preview Features: Use default preview flags.
[06/08/2023-10:39:04] [I] Input(s)s format: fp32:CHW
[06/08/2023-10:39:04] [I] Output(s)s format: fp32:CHW
[06/08/2023-10:39:04] [I] Input build shapes: model
[06/08/2023-10:39:04] [I] Input calibration shapes: model
[06/08/2023-10:39:04] [I] === System Options ===
[06/08/2023-10:39:04] [I] Device: 0
[06/08/2023-10:39:04] [I] DLACore:
[06/08/2023-10:39:04] [I] Plugins:
[06/08/2023-10:39:04] [I] === Inference Options ===
[06/08/2023-10:39:04] [I] Batch: 1
[06/08/2023-10:39:04] [I] Input inference shapes: model
[06/08/2023-10:39:04] [I] Iterations: 10
[06/08/2023-10:39:04] [I] Duration: 3s (+ 200ms warm up)
[06/08/2023-10:39:04] [I] Sleep time: 0ms
[06/08/2023-10:39:04] [I] Idle time: 0ms
[06/08/2023-10:39:04] [I] Streams: 1
[06/08/2023-10:39:04] [I] ExposeDMA: Disabled
[06/08/2023-10:39:04] [I] Data transfers: Enabled
[06/08/2023-10:39:04] [I] Spin-wait: Disabled
[06/08/2023-10:39:04] [I] Multithreading: Disabled
[06/08/2023-10:39:04] [I] CUDA Graph: Disabled
[06/08/2023-10:39:04] [I] Separate profiling: Disabled
[06/08/2023-10:39:04] [I] Time Deserialize: Disabled
[06/08/2023-10:39:04] [I] Time Refit: Disabled
[06/08/2023-10:39:04] [I] NVTX verbosity: 0
[06/08/2023-10:39:04] [I] Persistent Cache Ratio: 0
[06/08/2023-10:39:04] [I] Inputs:
[06/08/2023-10:39:04] [I] === Reporting Options ===
[06/08/2023-10:39:04] [I] Verbose: Enabled
[06/08/2023-10:39:04] [I] Averages: 10 inferences
[06/08/2023-10:39:04] [I] Percentiles: 90,95,99
[06/08/2023-10:39:04] [I] Dump refittable layers:Disabled
[06/08/2023-10:39:04] [I] Dump output: Disabled
[06/08/2023-10:39:04] [I] Profile: Disabled
[06/08/2023-10:39:04] [I] Export timing to JSON file:
[06/08/2023-10:39:04] [I] Export output to JSON file:
[06/08/2023-10:39:04] [I] Export profile to JSON file:
[06/08/2023-10:39:04] [I]
[06/08/2023-10:39:04] [I] === Device Information ===
[06/08/2023-10:39:04] [I] Selected Device: Xavier
[06/08/2023-10:39:04] [I] Compute Capability: 7.2
[06/08/2023-10:39:04] [I] SMs: 6
[06/08/2023-10:39:04] [I] Compute Clock Rate: 1.109 GHz
[06/08/2023-10:39:04] [I] Device Global Memory: 6857 MiB
[06/08/2023-10:39:04] [I] Shared Memory per SM: 96 KiB
[06/08/2023-10:39:04] [I] Memory Bus Width: 256 bits (ECC disabled)
[06/08/2023-10:39:04] [I] Memory Clock Rate: 1.109 GHz
[06/08/2023-10:39:04] [I]
[06/08/2023-10:39:04] [I] TensorRT version: 8.5.2
[06/08/2023-10:39:04] [V] [TRT] Registered plugin creator - ::BatchedNMSDynamic_TRT version 1
[06/08/2023-10:39:04] [V] [TRT] Registered plugin creator - ::BatchedNMS_TRT version 1
[06/08/2023-10:39:04] [V] [TRT] Registered plugin creator - ::BatchTilePlugin_TRT version 1
[06/08/2023-10:39:04] [V] [TRT] Registered plugin creator - ::Clip_TRT version 1
[06/08/2023-10:39:04] [V] [TRT] Registered plugin creator - ::CoordConvAC version 1
[06/08/2023-10:39:04] [V] [TRT] Registered plugin creator - ::CropAndResizeDynamic version 1
[06/08/2023-10:39:04] [V] [TRT] Registered plugin creator - ::CropAndResize version 1
[06/08/2023-10:39:04] [V] [TRT] Registered plugin creator - ::DecodeBbox3DPlugin version 1
[06/08/2023-10:39:04] [V] [TRT] Registered plugin creator - ::DetectionLayer_TRT version 1
[06/08/2023-10:39:04] [V] [TRT] Registered plugin creator - ::EfficientNMS_Explicit_TF_TRT version 1
[06/08/2023-10:39:04] [V] [TRT] Registered plugin creator - ::EfficientNMS_Implicit_TF_TRT version 1
[06/08/2023-10:39:04] [V] [TRT] Registered plugin creator - ::EfficientNMS_ONNX_TRT version 1
[06/08/2023-10:39:04] [V] [TRT] Registered plugin creator - ::EfficientNMS_TRT version 1
[06/08/2023-10:39:04] [V] [TRT] Registered plugin creator - ::FlattenConcat_TRT version 1
[06/08/2023-10:39:04] [V] [TRT] Registered plugin creator - ::GenerateDetection_TRT version 1
[06/08/2023-10:39:04] [V] [TRT] Registered plugin creator - ::GridAnchor_TRT version 1
[06/08/2023-10:39:04] [V] [TRT] Registered plugin creator - ::GridAnchorRect_TRT version 1
[06/08/2023-10:39:04] [V] [TRT] Registered plugin creator - ::GroupNorm version 1
[06/08/2023-10:39:04] [V] [TRT] Registered plugin creator - ::InstanceNormalization_TRT version 1
[06/08/2023-10:39:04] [V] [TRT] Registered plugin creator - ::InstanceNormalization_TRT version 2
[06/08/2023-10:39:04] [V] [TRT] Registered plugin creator - ::LayerNorm version 1
[06/08/2023-10:39:04] [V] [TRT] Registered plugin creator - ::LReLU_TRT version 1
[06/08/2023-10:39:04] [V] [TRT] Registered plugin creator - ::MultilevelCropAndResize_TRT version 1
[06/08/2023-10:39:04] [V] [TRT] Registered plugin creator - ::MultilevelProposeROI_TRT version 1
[06/08/2023-10:39:04] [V] [TRT] Registered plugin creator - ::MultiscaleDeformableAttnPlugin_TRT version 1
[06/08/2023-10:39:04] [V] [TRT] Registered plugin creator - ::NMSDynamic_TRT version 1
[06/08/2023-10:39:04] [V] [TRT] Registered plugin creator - ::NMS_TRT version 1
[06/08/2023-10:39:04] [V] [TRT] Registered plugin creator - ::Normalize_TRT version 1
[06/08/2023-10:39:04] [V] [TRT] Registered plugin creator - ::PillarScatterPlugin version 1
[06/08/2023-10:39:04] [V] [TRT] Registered plugin creator - ::PriorBox_TRT version 1
[06/08/2023-10:39:04] [V] [TRT] Registered plugin creator - ::ProposalDynamic version 1
[06/08/2023-10:39:04] [V] [TRT] Registered plugin creator - ::ProposalLayer_TRT version 1
[06/08/2023-10:39:04] [V] [TRT] Registered plugin creator - ::Proposal version 1
[06/08/2023-10:39:04] [V] [TRT] Registered plugin creator - ::PyramidROIAlign_TRT version 1
[06/08/2023-10:39:04] [V] [TRT] Registered plugin creator - ::Region_TRT version 1
[06/08/2023-10:39:04] [V] [TRT] Registered plugin creator - ::Reorg_TRT version 1
[06/08/2023-10:39:04] [V] [TRT] Registered plugin creator - ::ResizeNearest_TRT version 1
[06/08/2023-10:39:04] [V] [TRT] Registered plugin creator - ::ROIAlign_TRT version 1
[06/08/2023-10:39:04] [V] [TRT] Registered plugin creator - ::RPROI_TRT version 1
[06/08/2023-10:39:04] [V] [TRT] Registered plugin creator - ::ScatterND version 1
[06/08/2023-10:39:04] [V] [TRT] Registered plugin creator - ::SeqLen2Spatial version 1
[06/08/2023-10:39:04] [V] [TRT] Registered plugin creator - ::SpecialSlice_TRT version 1
[06/08/2023-10:39:04] [V] [TRT] Registered plugin creator - ::SplitGeLU version 1
[06/08/2023-10:39:04] [V] [TRT] Registered plugin creator - ::Split version 1
[06/08/2023-10:39:04] [V] [TRT] Registered plugin creator - ::VoxelGeneratorPlugin version 1
[06/08/2023-10:39:04] [I] Engine loaded in 0.0766629 sec.
[06/08/2023-10:39:05] [I] [TRT] Loaded engine size: 36 MiB
[06/08/2023-10:39:06] [W] [TRT] Using an engine plan file across different models of devices is not recommended and is likely to affect performance or even cause errors.
[06/08/2023-10:39:06] [V] [TRT] Trying to load shared library libcublas.so.11
[06/08/2023-10:39:06] [V] [TRT] Loaded shared library libcublas.so.11
[06/08/2023-10:39:08] [V] [TRT] Using cublas as plugin tactic source
[06/08/2023-10:39:08] [V] [TRT] Trying to load shared library libcublasLt.so.11
[06/08/2023-10:39:08] [V] [TRT] Loaded shared library libcublasLt.so.11
[06/08/2023-10:39:08] [V] [TRT] Using cublasLt as core library tactic source
[06/08/2023-10:39:08] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +261, GPU +245, now: CPU 539, GPU 3515 (MiB)
[06/08/2023-10:39:08] [V] [TRT] Trying to load shared library libcudnn.so.8
[06/08/2023-10:39:08] [V] [TRT] Loaded shared library libcudnn.so.8
[06/08/2023-10:39:08] [V] [TRT] Using cuDNN as plugin tactic source
[06/08/2023-10:39:08] [V] [TRT] Using cuDNN as core library tactic source
[06/08/2023-10:39:08] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +82, GPU +79, now: CPU 621, GPU 3594 (MiB)
[06/08/2023-10:39:08] [V] [TRT] Deserialization required 2698592 microseconds.
[06/08/2023-10:39:08] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +36, now: CPU 0, GPU 36 (MiB)
[06/08/2023-10:39:08] [I] Engine deserialized in 4.37531 sec.
[06/08/2023-10:39:08] [V] [TRT] Trying to load shared library libcublas.so.11
[06/08/2023-10:39:08] [V] [TRT] Loaded shared library libcublas.so.11
[06/08/2023-10:39:08] [V] [TRT] Using cublas as plugin tactic source
[06/08/2023-10:39:08] [V] [TRT] Trying to load shared library libcublasLt.so.11
[06/08/2023-10:39:08] [V] [TRT] Loaded shared library libcublasLt.so.11
[06/08/2023-10:39:08] [V] [TRT] Using cublasLt as core library tactic source
[06/08/2023-10:39:08] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 632, GPU 3603 (MiB)
[06/08/2023-10:39:08] [V] [TRT] Trying to load shared library libcudnn.so.8
[06/08/2023-10:39:08] [V] [TRT] Loaded shared library libcudnn.so.8
[06/08/2023-10:39:08] [V] [TRT] Using cuDNN as plugin tactic source
[06/08/2023-10:39:08] [V] [TRT] Using cuDNN as core library tactic source
[06/08/2023-10:39:08] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +0, now: CPU 632, GPU 3603 (MiB)
[06/08/2023-10:39:08] [V] [TRT] Total per-runner device persistent memory is 1147392
[06/08/2023-10:39:08] [V] [TRT] Total per-runner host persistent memory is 263104
[06/08/2023-10:39:08] [V] [TRT] Allocated activation device memory of size 99675136
[06/08/2023-10:39:08] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +96, now: CPU 0, GPU 132 (MiB)
[06/08/2023-10:39:08] [I] Setting persistentCacheLimit to 0 bytes.
[06/08/2023-10:39:08] [I] Using random values for input Input
[06/08/2023-10:39:09] [I] Created input binding for Input with dimensions 3x576x960
[06/08/2023-10:39:09] [I] Using random values for output generate_detections
[06/08/2023-10:39:09] [I] Created output binding for generate_detections with dimensions 100x6
[06/08/2023-10:39:09] [I] Using random values for output mask_fcn_logits/BiasAdd
[06/08/2023-10:39:09] [I] Created output binding for mask_fcn_logits/BiasAdd with dimensions 100x2x28x28
[06/08/2023-10:39:09] [I] Starting inference
[06/08/2023-10:39:12] [I] Warmup completed 1 queries over 200 ms
[06/08/2023-10:39:12] [I] Timing trace has 22 queries over 3.4116 s
[06/08/2023-10:39:12] [I]
[06/08/2023-10:39:12] [I] === Trace details ===
[06/08/2023-10:39:12] [I] Trace averages of 10 runs:
[06/08/2023-10:39:12] [I] Average on 10 runs - GPU latency: 187.028 ms - Host latency: 187.784 ms (enqueue 4.59743 ms)
[06/08/2023-10:39:12] [I] Average on 10 runs - GPU latency: 102.534 ms - Host latency: 102.927 ms (enqueue 4.06294 ms)
[06/08/2023-10:39:12] [I]
[06/08/2023-10:39:12] [I] === Performance summary ===
[06/08/2023-10:39:12] [I] Throughput: 6.44859 qps
[06/08/2023-10:39:12] [I] Latency: min = 102.583 ms, max = 210.768 ms, mean = 141.486 ms, median = 103.036 ms, percentile(90%) = 209.078 ms, percentile(95%) = 209.108 ms, percentile(99%) = 210.768 ms
[06/08/2023-10:39:12] [I] Enqueue Time: min = 3.78613 ms, max = 6.28809 ms, mean = 4.31235 ms, median = 4.20251 ms, percentile(90%) = 4.64038 ms, percentile(95%) = 5.67135 ms, percentile(99%) = 6.28809 ms
[06/08/2023-10:39:12] [I] H2D Latency: min = 0.310791 ms, max = 2.39571 ms, mean = 0.496043 ms, median = 0.352783 ms, percentile(90%) = 0.526855 ms, percentile(95%) = 0.57019 ms, percentile(99%) = 2.39571 ms
[06/08/2023-10:39:12] [I] GPU Compute Time: min = 102.175 ms, max = 208.511 ms, mean = 140.929 ms, median = 102.662 ms, percentile(90%) = 208.47 ms, percentile(95%) = 208.482 ms, percentile(99%) = 208.511 ms
[06/08/2023-10:39:12] [I] D2H Latency: min = 0.0307617 ms, max = 0.0715942 ms, mean = 0.0609048 ms, median = 0.05896 ms, percentile(90%) = 0.0707397 ms, percentile(95%) = 0.0708008 ms, percentile(99%) = 0.0715942 ms
[06/08/2023-10:39:12] [I] Total Host Walltime: 3.4116 s
[06/08/2023-10:39:12] [I] Total GPU Compute Time: 3.10043 s
[06/08/2023-10:39:12] [W] * GPU compute time is unstable, with coefficient of variance = 35.2005%.
[06/08/2023-10:39:12] [W] If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[06/08/2023-10:39:12] [I] Explanations of the performance metrics are printed in the verbose logs.
[06/08/2023-10:39:12] [V]

Comparing the two models on the latency metric (custom 98.6 ms vs. peoplesegnet 141.5 ms), the custom model actually gives the faster single-inference response.
But in the DeepStream application the fps, as discussed in my previous question, is significantly lower. What do you think about the profiling results? What should the next debugging steps be?

Could you attach your whole pipeline? You can test the latency of the individual plugins by referring to: https://forums.developer.nvidia.com/t/deepstream-sdk-faq/80236/12
Also, what method did you use to obtain the 0-0.2 fps frame rate?
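
For reference, per-plugin latency logging in DeepStream is typically enabled through environment variables before launching the application (see the FAQ linked above; the launch command below is illustrative):

export NVDS_ENABLE_LATENCY_MEASUREMENT=1
export NVDS_ENABLE_COMPONENT_LATENCY_MEASUREMENT=1
python3 deepstream_segmask.py -i <your_uri> -o out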

Greetings,

It seems that I have found the source of the performance bottleneck. I had made minor modifications to the Python DeepStream app named deepstream_segmask.py, mainly in the function tiler_sink_pad_buffer_probe(pad, info, u_data).

deepstream_segmask.py

#!/usr/bin/env python3

################################################################################
# SPDX-FileCopyrightText: Copyright (c) 2020-2022 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
################################################################################

import sys
from syslog import LOG_WARNING

sys.path.append('../')
import gi
import configparser

gi.require_version('Gst', '1.0')
from gi.repository import GLib, Gst
from ctypes import *
import time
import sys
import math
import platform
from common.is_aarch_64 import is_aarch64
from common.bus_call import bus_call
from common.FPS import PERF_DATA
import numpy as np
import pyds
import cv2
import os
import os.path
from os import path
import argparse
try:
    import debugpy
except:
    debugpy = None
import json
import socket

# Create a socket
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)

# Connect to the other script
host = 'localhost'  # Replace with the host IP address
port = 12345  # Replace with the desired port number
sock.connect((host, port))

perf_data = None

MAX_DISPLAY_LEN = 64
MUXER_OUTPUT_WIDTH = 1344
MUXER_OUTPUT_HEIGHT = 832
MUXER_BATCH_TIMEOUT_USEC = 4000000
TILED_OUTPUT_WIDTH = 1344
TILED_OUTPUT_HEIGHT = 832
GST_CAPS_FEATURES_NVMM = "memory:NVMM"

# tiler_sink_pad_buffer_probe will extract metadata received on tiler sink pad
# and re-size and binarize segmentation mask array to save to image

def tiler_sink_pad_buffer_probe(pad, info, u_data):
    frame_number = 0
    num_rects = 0
    gst_buffer = info.get_buffer()
    if not gst_buffer:
        print("Unable to get GstBuffer ")
        return

    # Retrieve batch metadata from the gst_buffer
    # Note that pyds.gst_buffer_get_nvds_batch_meta() expects the
    # C address of gst_buffer as input, which is obtained with hash(gst_buffer)
    batch_meta = pyds.gst_buffer_get_nvds_batch_meta(hash(gst_buffer))

    l_frame = batch_meta.frame_meta_list
    while l_frame is not None:
        try:
            # Note that l_frame.data needs a cast to pyds.NvDsFrameMeta
            # The casting is done by pyds.NvDsFrameMeta.cast()
            # The casting also keeps ownership of the underlying memory
            # in the C code, so the Python garbage collector will leave
            # it alone.
            frame_meta = pyds.NvDsFrameMeta.cast(l_frame.data)
        except StopIteration:
            break

        frame_number = frame_meta.frame_num
        l_obj = frame_meta.obj_meta_list
        num_rects = frame_meta.num_obj_meta
        is_first_obj = True
        save_image = False
        obj_number = 0
        black_image = np.zeros((832, 1344), dtype=np.uint8)
        cv2.imwrite("/home/administrator/black_frame.jpg", black_image)

        while l_obj is not None:
            try:
                # Casting l_obj.data to pyds.NvDsObjectMeta
                obj_meta = pyds.NvDsObjectMeta.cast(l_obj.data)
            except StopIteration:
                break
            rectparams = obj_meta.rect_params  # Retrieve rectparams for re-sizing mask to correct dims
            maskparams = obj_meta.mask_params  # Retrieve maskparams
            print(rectparams.left, rectparams.top)
            mask_image = 255 * np.ones((int(rectparams.height), int(rectparams.width)), np.uint8)
            cv2.imwrite("/home/administrator/white_frame.jpg", mask_image)
            x = int(rectparams.left)
            y = int(rectparams.top)

            black_image[y:y + int(rectparams.height), x:x + int(rectparams.width)] = mask_image

            upper_part_LWA = black_image[0:180, 640:960]
            middle_part_LWA = black_image[180:360, 640:960]
            lower_part_LWA = black_image[360:540, 640:960]

            cv2.imwrite("/home/administrator/yolo_upper.jpg", upper_part_LWA)  # Save mask to image
            cv2.imwrite("/home/administrator/yolo_middle.jpg", middle_part_LWA)  # Save mask to image
            cv2.imwrite("/home/administrator/yolo_lower.jpg", lower_part_LWA)  # Save mask to image
            count_upper = np.count_nonzero(upper_part_LWA == 255)
            count_middle = np.count_nonzero(middle_part_LWA == 255)
            count_lower = np.count_nonzero(lower_part_LWA == 255)
            print("Total white pixels in upper part: ", count_upper)
            print("Total white pixels in middle part: ", count_middle)
            print("Total white pixels in lower part: ", count_lower)
            print("Percentage in upper part ", (count_upper / upper_part_LWA.size) * 100)
            print("Percentage in middle part ", (count_middle / middle_part_LWA.size) * 100)
            print("Percentage in lower part ", (count_lower / lower_part_LWA.size) * 100)

            # Create a dictionary to store the values
            data = {
                'upper_part': (count_upper / upper_part_LWA.size) * 100,
                'middle_part': (count_middle / middle_part_LWA.size) * 100,
                'lower_part': (count_lower / lower_part_LWA.size) * 100
            }

            # Serialize the data to JSON
            json_data = json.dumps(data)

            # Send the JSON data over the socket
            sock.sendall(json_data.encode())
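            # NOTE: the JPEG writes above and this blocking sendall run
            # synchronously inside the buffer probe for every object in every
            # frame, so their cost adds directly to the per-frame latency.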
        
            #mask_image = resize_mask(maskparams, math.floor(rectparams.width), math.floor(rectparams.height)) # Get resized mask array
            #print(rectparams,maskparams)
            #black_image[y:y + int(rectparams.height),x:x + int(rectparams.width)] = mask_image
            #cv2.imwrite("/home/administrator/frame.jpg", black_image)
            #print(mask_image)
            #if is_first_obj and frame_number % 30 == 0:
            #    print("mpika edw")
            #    is_first_obj = False
            #    rectparams = obj_meta.rect_params # Retrieve rectparams for re-sizing mask to correct dims
            #    maskparams = obj_meta.mask_params # Retrieve maskparams
            #    mask_image = resize_mask(maskparams, math.floor(rectparams.width), math.floor(rectparams.height)) # Get resized mask array
            #    x = int(rectparams.left)
            #    y = int(rectparams.top)
            #    black_image[y:y + int(rectparams.height),x:x + int(rectparams.width)] = mask_image

                #img_path = "{}/stream_{}/frame_{}.jpg".format(folder_name, frame_meta.pad_index, frame_number)
                #cv2.imwrite(img_path, mask_image) # Save mask to image

            try:
                l_obj = l_obj.next
                obj_number += 1
            except StopIteration:
                break
        cv2.imwrite("/home/administrator/black_frame_masks.jpg", black_image)

        print("Frame Number=", frame_number, "Number of Objects=", num_rects)
        # update frame rate through this probe
        stream_index = "stream{0}".format(frame_meta.pad_index)
        global perf_data
        perf_data.update_fps(stream_index)
        try:
            l_frame = l_frame.next
        except StopIteration:
            break

    return Gst.PadProbeReturn.OK

def clip(val, low, high):
    if val < low:
        return low
    elif val > high:
        return high
    else:
        return val

# Resize and binarize mask array for interpretable segmentation mask
def resize_mask(maskparams, target_width, target_height):
    src = maskparams.get_mask_array()  # Retrieve mask array
    dst = np.empty((target_height, target_width), src.dtype)  # Initialize array to store re-sized mask
    original_width = maskparams.width
    original_height = maskparams.height
    ratio_h = float(original_height) / float(target_height)
    ratio_w = float(original_width) / float(target_width)
    threshold = maskparams.threshold
    channel = 1

    # Resize from original width/height to target width/height
    for y in range(target_height):
        for x in range(target_width):
            print("mpika edw")
            x0 = float(x) * ratio_w
            y0 = float(y) * ratio_h
            left = int(clip(math.floor(x0), 0.0, float(original_width - 1.0)))
            top = int(clip(math.floor(y0), 0.0, float(original_height - 1.0)))
            right = int(clip(math.ceil(x0), 0.0, float(original_width - 1.0)))
            bottom = int(clip(math.ceil(y0), 0.0, float(original_height - 1.0)))

            for c in range(channel):
                # H, W, C ordering
                # Note: lerp is shorthand for linear interpolation
                left_top_val = float(src[top * (original_width * channel) + left * (channel) + c])
                right_top_val = float(src[top * (original_width * channel) + right * (channel) + c])
                left_bottom_val = float(src[bottom * (original_width * channel) + left * (channel) + c])
                right_bottom_val = float(src[bottom * (original_width * channel) + right * (channel) + c])
                top_lerp = left_top_val + (right_top_val - left_top_val) * (x0 - left)
                bottom_lerp = left_bottom_val + (right_bottom_val - left_bottom_val) * (x0 - left)
                lerp = top_lerp + (bottom_lerp - top_lerp) * (y0 - top)
                if (lerp < threshold):  # Binarize according to threshold
                    dst[y, x] = 0
                else:
                    dst[y, x] = 255
    return dst

def cb_newpad(decodebin, decoder_src_pad, data):
    print("In cb_newpad\n")
    caps = decoder_src_pad.get_current_caps()
    gststruct = caps.get_structure(0)
    gstname = gststruct.get_name()
    source_bin = data
    features = caps.get_features(0)

    # Need to check if the pad created by the decodebin is for video and not
    # audio.
    if (gstname.find("video") != -1):
        # Link the decodebin pad only if decodebin has picked nvidia
        # decoder plugin nvdec_*. We do this by checking if the pad caps contain
        # NVMM memory features.
        if features.contains("memory:NVMM"):
            # Get the source bin ghost pad
            bin_ghost_pad = source_bin.get_static_pad("src")
            if not bin_ghost_pad.set_target(decoder_src_pad):
                sys.stderr.write("Failed to link decoder src pad to source bin ghost pad\n")
        else:
            sys.stderr.write(" Error: Decodebin did not pick nvidia decoder plugin.\n")

def decodebin_child_added(child_proxy, Object, name, user_data):
    print("Decodebin child added:", name, "\n")
    if name.find("decodebin") != -1:
        Object.connect("child-added", decodebin_child_added, user_data)

    if "source" in name:
        source_element = child_proxy.get_by_name("source")
        if source_element.find_property('drop-on-latency') != None:
            Object.set_property("drop-on-latency", True)

def create_source_bin(index, uri):
    print("Creating source bin")

    # Create a source GstBin to abstract this bin's content from the rest of the
    # pipeline
    bin_name = "source-bin-%02d" % index
    print(bin_name)
    nbin = Gst.Bin.new(bin_name)
    if not nbin:
        sys.stderr.write(" Unable to create source bin \n")

    # Source element for reading from the uri.
    # We will use decodebin and let it figure out the container format of the
    # stream and the codec and plug the appropriate demux and decode plugins.
    uri_decode_bin = Gst.ElementFactory.make("uridecodebin", "uri-decode-bin")
    if not uri_decode_bin:
        sys.stderr.write(" Unable to create uri decode bin \n")
    # We set the input uri to the source element
    uri_decode_bin.set_property("uri", uri)
    # Connect to the "pad-added" signal of the decodebin which generates a
    # callback once a new pad for raw data has been created by the decodebin
    uri_decode_bin.connect("pad-added", cb_newpad, nbin)
    uri_decode_bin.connect("child-added", decodebin_child_added, nbin)

    # We need to create a ghost pad for the source bin which will act as a proxy
    # for the video decoder src pad. The ghost pad will not have a target right
    # now. Once the decode bin creates the video decoder and generates the
    # cb_newpad callback, we will set the ghost pad target to the video decoder
    # src pad.
    Gst.Bin.add(nbin, uri_decode_bin)
    bin_pad = nbin.add_pad(Gst.GhostPad.new_no_target("src", Gst.PadDirection.SRC))
    if not bin_pad:
        sys.stderr.write(" Failed to add ghost pad in source bin \n")
        return None
    return nbin

def main(stream_paths, output_folder):
    global perf_data
    perf_data = PERF_DATA(len(stream_paths))
    number_sources = len(stream_paths)

    global folder_name
    folder_name = output_folder
    if path.exists(folder_name):
        sys.stderr.write("The output folder %s already exists. Please remove it first.\n" % folder_name)
        sys.exit(1)
    os.mkdir(folder_name)
    print("Frames will be saved in ", folder_name)

    # Standard GStreamer initialization
    Gst.init(None)

    # Create gstreamer elements */
    # Create Pipeline element that will form a connection of other elements
    print("Creating Pipeline \n ")
    pipeline = Gst.Pipeline()
    is_live = False

    if not pipeline:
        sys.stderr.write(" Unable to create Pipeline \n")
    print("Creating streammux \n ")

    # Create nvstreammux instance to form batches from one or more sources.
    streammux = Gst.ElementFactory.make("nvstreammux", "Stream-muxer")
    if not streammux:
        sys.stderr.write(" Unable to create NvStreamMux \n")

    pipeline.add(streammux)
    for i in range(number_sources):
        os.mkdir(folder_name + "/stream_" + str(i))
        print("Creating source_bin ", i, " \n ")
        uri_name = stream_paths[i]
        if uri_name.find("rtsp://") == 0:
            is_live = True
        source_bin = create_source_bin(i, uri_name)
        if not source_bin:
            sys.stderr.write("Unable to create source bin \n")
        pipeline.add(source_bin)
        padname = "sink_%u" % i
        sinkpad = streammux.get_request_pad(padname)
        if not sinkpad:
            sys.stderr.write("Unable to create sink pad bin \n")
        srcpad = source_bin.get_static_pad("src")
        if not srcpad:
            sys.stderr.write("Unable to create src pad bin \n")
        srcpad.link(sinkpad)

    print("Creating Pgie \n ")
    pgie = Gst.ElementFactory.make("nvinfer", "primary-inference")
    if not pgie:
        sys.stderr.write(" Unable to create pgie \n")
    print("Creating tiler \n ")
    tiler = Gst.ElementFactory.make("nvmultistreamtiler", "nvtiler")
    if not tiler:
        sys.stderr.write(" Unable to create tiler \n")
    print("Creating nvvidconv \n ")
    nvvidconv = Gst.ElementFactory.make("nvvideoconvert", "convertor")
    if not nvvidconv:
        sys.stderr.write(" Unable to create nvvidconv \n")
    print("Creating nvosd \n ")
    nvosd = Gst.ElementFactory.make("nvdsosd", "onscreendisplay")
    if not nvosd:
        sys.stderr.write(" Unable to create nvosd \n")
    if is_aarch64():
        print("Creating nv3dsink \n")
        sink = Gst.ElementFactory.make("nv3dsink", "nv3d-sink")
        if not sink:
            sys.stderr.write(" Unable to create nv3dsink \n")
    else:
        print("Creating EGLSink \n")
        sink = Gst.ElementFactory.make("nveglglessink", "nvvideo-renderer")
        if not sink:
            sys.stderr.write(" Unable to create egl sink \n")

    if is_live:
        print("At least one of the sources is live")
        streammux.set_property('live-source', 1)

    streammux.set_property('width', 1344)
    streammux.set_property('height', 832)
    streammux.set_property('batch-size', number_sources)
    streammux.set_property('batched-push-timeout', 4000000)
    pgie.set_property('config-file-path', "dstest_segmask_config.txt")
    pgie_batch_size = pgie.get_property("batch-size")
    if (pgie_batch_size != number_sources):
        print("WARNING: Overriding infer-config batch-size", pgie_batch_size, " with number of sources ",
              number_sources, " \n")
        pgie.set_property("batch-size", number_sources)
    tiler_rows = int(math.sqrt(number_sources))
    tiler_columns = int(math.ceil((1.0 * number_sources) / tiler_rows))
    tiler.set_property("rows", tiler_rows)
    tiler.set_property("columns", tiler_columns)
    tiler.set_property("width", TILED_OUTPUT_WIDTH)
    tiler.set_property("height", TILED_OUTPUT_HEIGHT)

    nvosd.set_property("display_mask", True)  # Note: display-mask is supported only for process-mode=0 (CPU)
    nvosd.set_property('process_mode', 0)

    sink.set_property("sync", 0)
    sink.set_property("qos", 0)

    queue1 = Gst.ElementFactory.make("queue", "queue1")
    queue2 = Gst.ElementFactory.make("queue", "queue2")
    queue3 = Gst.ElementFactory.make("queue", "queue3")
    queue4 = Gst.ElementFactory.make("queue", "queue4")
    queue5 = Gst.ElementFactory.make("queue", "queue5")
    pipeline.add(queue1)
    pipeline.add(queue2)
    pipeline.add(queue3)
    pipeline.add(queue4)
    pipeline.add(queue5)

    print("Adding elements to Pipeline \n")
    pipeline.add(pgie)
    pipeline.add(tiler)
    pipeline.add(nvvidconv)
    pipeline.add(nvosd)
    pipeline.add(sink)

    print("Linking elements in the Pipeline \n")
    streammux.link(queue1)
    queue1.link(pgie)
    pgie.link(queue2)
    queue2.link(tiler)
    tiler.link(queue3)
    queue3.link(nvvidconv)
    nvvidconv.link(queue4)
    queue4.link(nvosd)
    nvosd.link(queue5)
    queue5.link(sink)

    # create an event loop and feed gstreamer bus messages to it
    loop = GLib.MainLoop()
    bus = pipeline.get_bus()
    bus.add_signal_watch()
    bus.connect("message", bus_call, loop)

    tiler_sink_pad = tiler.get_static_pad("sink")
    if not tiler_sink_pad:
        sys.stderr.write(" Unable to get sink pad \n")
    else:
        tiler_sink_pad.add_probe(Gst.PadProbeType.BUFFER, tiler_sink_pad_buffer_probe, 0)
        # perf callback function to print fps every 5 sec
        GLib.timeout_add(5000, perf_data.perf_print_callback)

    # List the sources
    print("Now playing...")
    for i, source in enumerate(stream_paths):
        print(i, ": ", source)

    print("Starting pipeline \n")
    # start playback and listen to events
    pipeline.set_state(Gst.State.PLAYING)
    try:
        loop.run()
    except:
        pass
    # cleanup
    print("Exiting app\n")
    pipeline.set_state(Gst.State.NULL)

def parse_args():
    parser = argparse.ArgumentParser(prog="deepstream_segmask.py",
                                     description="deepstream-segmask takes multiple URI streams as input"
                                                 " and re-sizes and binarizes segmentation mask arrays to save to image")
    parser.add_argument(
        "-i",
        "--input",
        help="Path to input streams",
        nargs="+",
        metavar="URIs",
        default=["a"],
        required=True,
    )
    parser.add_argument(
        "-o",
        "--output",
        metavar="output_folder_name",
        default="out",
        help="Name of folder to output mask images",
    )

    args = parser.parse_args()
    stream_paths = args.input
    output_folder = args.output
    return stream_paths, output_folder


if __name__ == '__main__':
    stream_paths, output_folder = parse_args()
    sys.exit(main(stream_paths, output_folder))
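
For reference, the script is launched like this (the URI and output folder are illustrative):

python3 deepstream_segmask.py -i file:///opt/nvidia/deepstream/deepstream/samples/streams/sample_qHD.mp4 -o out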

It seems that the bottleneck is mainly caused by the mask resize (the resize_mask function). When I use PeopleSegNet the resize does not "cost" very much, since the destination image is small (probably around 30x30 pixels), but when I resize the mask from my model it is expensive because the destination image is much bigger (1920x1080). I am not entirely sure that this is the main reason, but I am sure that the problem is caused in the image resize function.
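
A vectorized resize would avoid the per-pixel Python loop (and the debug print inside it) entirely. A minimal sketch of a drop-in replacement, assuming maskparams.get_mask_array() returns a flat float array of maskparams.height * maskparams.width values, which is how the loop above indexes it (uses the cv2 and numpy imports already at the top of the file):

def resize_mask_fast(maskparams, target_width, target_height):
    # Reshape the flat mask array to 2-D, let OpenCV do the bilinear
    # interpolation in native code, then binarize against the threshold.
    src = np.asarray(maskparams.get_mask_array(), dtype=np.float32).reshape(
        maskparams.height, maskparams.width)
    resized = cv2.resize(src, (target_width, target_height),
                         interpolation=cv2.INTER_LINEAR)
    return np.where(resized < maskparams.threshold, 0, 255).astype(np.uint8)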

Lastly, the performance was measured with PERF_DATA (from common.FPS import PERF_DATA), as used in the code above.

There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks

Could you just attach your code file?

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.