A30 Triton Inference slower than trtexec for increasing batch sizes

I'm running RetinaNet (with an EfficientNet-B0 backbone) on an A30.

Inference throughput from trtexec looks fine, but the DeepStream-with-Triton numbers seem wrong: they are slightly higher than trtexec for batch_size 1, then dramatically lower for batch_size 8.

Any ideas why this might be? (Triton config file below.) Thanks, Brandt

sudo docker run --gpus all -it --restart always  -v /tmp/.X11-unix:/tmp/.X11-unix -e DISPLAY=$DISPLAY -v /home/dell:/home/dell -w /opt/nvidia/deepstream/deepstream-6.0 nvcr.io/nvidia/deepstream:6.0-triton
for batch size 1:  trtexec ≈ 368 FPS    DeepStream+Triton ≈ 424 FPS
for batch size 8:  trtexec ≈ 835 FPS    DeepStream+Triton ≈ 94 FPS
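For context, the trtexec side of a comparison like this is typically measured with something along these lines (the engine path and input tensor name here are assumptions, not taken from the thread):

```
# Hypothetical trtexec invocation for an explicit-batch engine:
trtexec --loadEngine=retinanet_b8.engine --shapes=Input:8x3x544x960
# For an implicit-batch engine, use --batch=8 instead of --shapes.
```

Note that trtexec feeds the engine fully-formed batches with no decode or preprocessing in the loop, so it is an upper bound that the DeepStream pipeline can only approach if the muxer actually forms full batches.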
################################################################################
# Copyright (c) 2020 NVIDIA Corporation.  All rights reserved.
#
# Permission is hereby granted, free of charge, to any person obtaining a
# copy of this software and associated documentation files (the "Software"),
# to deal in the Software without restriction, including without limitation
# the rights to use, copy, modify, merge, publish, distribute, sublicense,
# and/or sell copies of the Software, and to permit persons to whom the
# Software is furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
# THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
# FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
# DEALINGS IN THE SOFTWARE.
################################################################################

[application]
enable-perf-measurement=1
##perf-measurement-interval-sec=5 ## original
perf-measurement-interval-sec=1
#gie-kitti-output-dir=kitti-trtis

[tiled-display]
enable=1
rows=1
#rows=1
columns=1
width=1280
height=720

gpu-id=0

[source0]
enable=1
#Type - 1=CameraV4L2 2=URI 3=MultiURI 4=RTSP
type=3
uri=file://../../streams/sample_1080p_h264.mp4
num-sources=1
#drop-frame-interval=2
gpu-id=0
# (0): memtype_device   - Memory type Device
# (1): memtype_pinned   - Memory type Host Pinned
# (2): memtype_unified  - Memory type Unified
## cudadec-memtype=0 ##original | 204.05 FPS
## cudadec-memtype=1 | 181.22 FPS

#disabled to match DeepStream config
###cudadec-memtype=2 | 202.20 FPS

[sink0]
enable=0
#Type - 1=FakeSink 2=EglSink 3=File
type=2
sync=0
source-id=0
gpu-id=0
#nvbuf-memory-type=0

[sink1]
enable=0
type=3
#1=mp4 2=mkv
container=1
#1=h264 2=h265
codec=1
##codec=2
sync=0
## added to be the same as DeepStream Config
#encoder type 0=Hardware 1=Software
enc-type=1

#iframeinterval=10
##bitrate=2000000 # commented to be the same as yolov4
#output-file=/home/dell/Deepstream_6.0_Triton/output_videos/retinanet_ds-trtion_v6.0_outputvideo_delete_h264_1.mp4
output-file=./retinanet_ds-trtion_v6.0_outputvideo_delete_h264_1.mp4
source-id=0

[sink2]
enable=0
#Type - 1=FakeSink 2=EglSink 3=File 4=RTSPStreaming
type=4
#1=h264 2=h265
codec=1
sync=0
bitrate=4000000
# set below properties in case of RTSPStreaming
rtsp-port=8554
udp-port=5400

[osd]
enable=1
gpu-id=0
border-width=3
text-size=15
text-color=1;1;1;1;
text-bg-color=0.3;0.3;0.3;1
font=Serif
## commented to be the same as DeepStream config
##show-clock=0
##clock-x-offset=800
##clock-y-offset=820
##clock-text-size=12
##clock-color=1;0;0;0


[streammux]
gpu-id=0
##Boolean property to inform muxer that sources are live
live-source=0
batch-size=4
##time out in usec, to wait after the first buffer is available
##to push the batch even if the complete batch is not formed
batched-push-timeout=40000
## Set muxer output width and height

## original setup
##width=1280
##height=720

## Configuration set to match DeepStream Config
width=1920
height=1080
##Enable to maintain aspect ratio wrt source, and allow black borders, works
##along with width, height properties
## commented to be the same as DeepStream config
#enable-padding=0
#nvbuf-memory-type=0
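One thing worth double-checking in the [streammux] section above: with a single 30 fps source, batched-push-timeout=40000 (40 ms) expires long before a batch-size=4 batch can fill, so the muxer mostly pushes one-frame batches and the engine never runs at its configured batch size. A quick back-of-the-envelope check (num-sources and the timeout are taken from this config; the 30 fps rate is an assumption about the sample stream):

```shell
# frames a single 30 fps source can contribute within one 40 ms timeout window
awk 'BEGIN { printf "%.1f\n", 1 * 30 * 40000 / 1000000 }'
# prints 1.2
```

So on average only 1-2 frames are available per window, which is consistent with batch-8 throughput collapsing while batch-1 throughput looks fine.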

# config-file property is mandatory for any gie section.
# Other properties are optional and if set will override the properties set in
# the infer config file.
[primary-gie]
enable=1
#(0): nvinfer; (1): nvinferserver
plugin-type=1
#infer-raw-output-dir=trtis-output
batch-size=4

gie-unique-id=1
config-file=config_infer_primary_retinanet_resnet18.txt

#Required by the app for OSD, not a plugin property
bbox-border-color0=1;0;0;1
bbox-border-color1=0;1;1;1
bbox-border-color2=0;0;1;1
bbox-border-color3=0;1;0;1

interval=0
## interval=0 # NOTE: SET INTERVAL TO 2 IN ORDER TO BOOST PERFORMANCE
gie-unique-id=1
##nvbuf-memory-type=0

[tracker]
enable=0
.
.
.
[tests]
file-loop=0

Can you share config_infer_primary_retinanet_resnet18.txt as well?

infer_config {
  unique_id: 1
  gpu_ids: [0]
  max_batch_size: 8
  backend {
    trt_is {
      model_name: "retinanet_resnet18_mod"
      version: -1
      model_repo {
        root: "../../trtis_model_repo"
        log_level: 2
        tf_gpu_memory_fraction: 0
        tf_disable_soft_placement: 0
        strict_model_config: true
      }
    }
  }

  preprocess {
    network_format: IMAGE_FORMAT_RGB
    tensor_order: TENSOR_ORDER_LINEAR
    maintain_aspect_ratio: 0
    frame_scaling_hw: FRAME_SCALING_HW_DEFAULT
    frame_scaling_filter: 1
    normalize {
      scale_factor: 1.0
      channel_offsets: [0, 0, 0]
    }
  }

  postprocess {
    labelfile_path: "../../trtis_model_repo/retinanet_resnet18_mod/retinanet_labels.txt"
    detection {
      num_detected_classes: 5
      custom_parse_bbox_func: "NvDsInferParseCustomNMSTLT"
      nms {
        confidence_threshold: 0.3
        iou_threshold: 0.6
        topk : 100
      }
    }
  }

  extra {
    copy_input_to_host_buffers: false
  }

  custom_lib {
    path: "/opt/nvidia/deepstream/deepstream-6.0/lib/libnvds_infercustomparser.so"
  }
}
input_control {
  process_mode: PROCESS_MODE_FULL_FRAME
  interval: 0
}

output_control {
  detect_control {
    default_filter { bbox_filter { min_width: 32, min_height: 32 } }
  }
}

I believe the problem was that num-sources was set to 1. After increasing num-sources, I ended up with ~1904 FPS for DeepStream-Triton running the same EfficientNet-B0 network. I also increased the number of model instances, but the ~1904 FPS was achieved with as few as 2 instances.
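For anyone hitting the same wall, the two changes described above map to config edits roughly like this (the instance count of 2 is from this thread; the num-sources value and the config.pbtxt layout are illustrative assumptions):

```
# [source0] in the DeepStream app config: feed the muxer enough
# streams to actually fill a batch
num-sources=8

# config.pbtxt of the Triton model: run multiple execution
# instances of the model on the GPU
instance_group [
  { count: 2, kind: KIND_GPU }
]
```

With more sources the streammux can form full batches before batched-push-timeout fires, so the engine finally runs at the batch size it was built for.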
