Deepstream-Triton vs perf_analyzer throughputs

I am running DeepStream 6.0 on an A30 and just wanted to sanity-check the throughputs I am seeing.
With DeepStream-Triton running a RetinaNet with a ResNet18 backbone, I am seeing ~539 fps with 1 source.

Meanwhile, with perf_analyzer I am seeing:
~384 fps with concurrency of 1
~622 fps with concurrency of 2
~777 fps with concurrency of 3
~835 fps with concurrency of 4, 5, and 6

Do these numbers make sense?

This works as expected; higher fps at higher concurrency means the server can run more inference requests concurrently. Please refer to the official doc: server/perf_analyzer.md at main · triton-inference-server/server · GitHub
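For example, a basic sweep over concurrency levels against a running tritonserver looks something like this, where <model> is the name of a model in your repository:

perf_analyzer -m <model> --concurrency-range=1:6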

Hi Fanzh, thanks for your comments. I just had another look at the perf_analyzer document in your link. One thing I should have mentioned is that I am using perf_analyzer to try to double-check the DeepStream-Triton timings, so I have been running perf_analyzer with "--service-kind=triton_c_api".

My understanding from the document is that once the C API is used, the --shared-memory options are no longer available. So I am wondering whether running perf_analyzer with "--service-kind=triton_c_api" and concurrency=1 (throughput ~384) should match the single-source DeepStream-Triton case (throughput ~539) more closely? What might cause that discrepancy?

source1_primary_retinanet_resnet18.txt

[application]
enable-perf-measurement=1
perf-measurement-interval-sec=1


[tiled-display]
enable=1
rows=1
columns=1
width=1280
height=720

gpu-id=0

[source0]
enable=1
#Type - 1=CameraV4L2 2=URI 3=MultiURI 4=RTSP
type=3
uri=file://../../streams/sample_1080p_h264.mp4
num-sources=1
#drop-frame-interval=2
gpu-id=0
# (0): memtype_device   - Memory type Device
# (1): memtype_pinned   - Memory type Host Pinned
# (2): memtype_unified  - Memory type Unified
## cudadec-memtype=0 ##original | 204.05 FPS
## cudadec-memtype=1 | 181.22 FPS

[sink0]
enable=0
#Type - 1=FakeSink 2=EglSink 3=File
type=2
sync=0
source-id=0
gpu-id=0
#nvbuf-memory-type=0

[sink1]
enable=0
type=3
#1=mp4 2=mkv
container=1
#1=h264 2=h265
codec=1
##codec=2
sync=0
## added to be the same as DeepStream Config
#encoder type 0=Hardware 1=Software
enc-type=0

#iframeinterval=10
##bitrate=2000000 commented out to be the same as yolov4
output-file=/home/dell/Deepstream_6.0_Triton/output_videos/retinanet_ds-trtion_v6.0_outputvideo_delete_h264_1.mp4
source-id=0

[sink2]
enable=0
#Type - 1=FakeSink 2=EglSink 3=File 4=RTSPStreaming
type=4
#1=h264 2=h265
codec=1
sync=0
bitrate=4000000
# set below properties in case of RTSPStreaming
rtsp-port=8554
udp-port=5400

[osd]
enable=0
gpu-id=0
border-width=3
text-size=15
text-color=1;1;1;1;
text-bg-color=0.3;0.3;0.3;1
font=Serif

##show-clock=0
##clock-x-offset=800
##clock-y-offset=820
##clock-text-size=12
##clock-color=1;0;0;0


[streammux]
gpu-id=0
##Boolean property to inform muxer that sources are live
live-source=0
batch-size=1
##time out in usec, to wait after the first buffer is available
##to push the batch even if the complete batch is not formed
batched-push-timeout=40000
## Set muxer output width and height

width=1920
height=1080

# config-file property is mandatory for any gie section.
# Other properties are optional and if set will override the properties set in
# the infer config file.
[primary-gie]
enable=1
#(0): nvinfer; (1): nvinferserver
plugin-type=1
#infer-raw-output-dir=trtis-output
batch-size=1
#interval=0
gie-unique-id=1
config-file=config_infer_primary_retinanet_resnet18.txt

#Required by the app for OSD, not a plugin property
bbox-border-color0=1;0;0;1
bbox-border-color1=0;1;1;1
bbox-border-color2=0;0;1;1
bbox-border-color3=0;1;0;1
interval=0
gie-unique-id=1
##nvbuf-memory-type=0

##TO DELETE
[tracker]
enable=0
tracker-width=640
tracker-height=384
gpu-id=0
## IOU TRACKER
ll-lib-file=/opt/nvidia/deepstream/deepstream-6.0/lib/libnvds_mot_iou.so
# DS-Triton throughput with IOU tracker -> **PERF:  243.74 (202.81)

[tests]
file-loop=0

config_infer_primary_retinanet_resnet18.txt

infer_config {
  unique_id: 1
  gpu_ids: [0]
  max_batch_size: 1
  backend {
    trt_is {
      model_name: "retinanet_resnet18_mod"
      version: -1
      model_repo {
        root: "../../trtis_model_repo"
        log_level: 2
        tf_gpu_memory_fraction: 0
        tf_disable_soft_placement: 0
        strict_model_config: true
      }
    }
  }

  preprocess {
    network_format: IMAGE_FORMAT_RGB
    tensor_order: TENSOR_ORDER_LINEAR
    maintain_aspect_ratio: 0
    frame_scaling_hw: FRAME_SCALING_HW_DEFAULT
    frame_scaling_filter: 1
    normalize {
      scale_factor: 1.0
      channel_offsets: [0, 0, 0]
    }
  }

  postprocess {
    labelfile_path: "../../trtis_model_repo/retinanet_resnet18_mod/retinanet_labels.txt"
    detection {
      num_detected_classes: 5
      custom_parse_bbox_func: "NvDsInferParseCustomNMSTLT"
      nms {
        confidence_threshold: 0.3
        iou_threshold: 0.6
        topk: 100
      }
    }
  }

  extra {
    copy_input_to_host_buffers: false
  }

  custom_lib {
    path: "/opt/nvidia/deepstream/deepstream-6.0/lib/libnvds_infercustomparser.so"
  }
}
input_control {
  process_mode: PROCESS_MODE_FULL_FRAME
  interval: 0
}

output_control {
  detect_control {
    default_filter { bbox_filter { min_width: 32, min_height: 32 } }
  }
}

config.pbtxt

name: "retinanet_resnet18_mod"
platform: "tensorrt_plan"

default_model_filename: "saved.engine"

max_batch_size: 1
input [
  {
    name: "Input"
    format: FORMAT_NCHW
    data_type: TYPE_FP32
    dims: [ 3, 384, 1248 ]
  }
]
output [
  {
    name: "NMS"
    data_type: TYPE_FP32
    dims: [ 1, 200, 7 ]
  },
  {
    name: "NMS_1"
    data_type: TYPE_FP32
    dims: [ 1, 1, 1 ]
  }
]

# Specify GPU instance.
instance_group {
  count: 2
  gpus: 0
  kind: KIND_GPU
}

optimization {
  execution_accelerators {
    gpu_execution_accelerator: [ { name: "tensorrt" } ]
  }
}

## Pending to Test
##dynamic_batching {
##    preferred_batch_size: [ 8 ]
##}
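A side note on the commented-out dynamic_batching block above: it only takes effect when max_batch_size is greater than 1, so testing it would also mean raising max_batch_size here and the batch-size values in the DeepStream configs to match. A hypothetical, untested variant might look like:

max_batch_size: 8
dynamic_batching {
  preferred_batch_size: [ 8 ]
  max_queue_delay_microseconds: 100
}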

If I then change:

# Specify GPU instance.
instance_group {
  count: 2
  gpus: 0
  kind: KIND_GPU
}

to

# Specify GPU instance.
instance_group {
  count: 1
  gpus: 0
  kind: KIND_GPU
}

the throughput drops to ~484 fps!

So 484 is closer to the 384 I am seeing from perf_analyzer with concurrency=1, but it is still substantially higher (484 >> 384). I am wondering why that is? For reference, this is the perf_analyzer command I am running:

perf_analyzer -m retinanet_resnet18_mod --service-kind=triton_c_api --triton-server-directory=/opt/tritonserver --model-repository=/opt/nvidia/deepstream/deepstream-6.0/samples/trtis_model_repo --concurrency-range=1:7

Hi Fanzh, thanks again for your earlier reply. I posted some more info in the additional comments above and was wondering if you might be able to take a look. The main question I have is why the perf_analyzer throughput (with concurrency 1) comes in lower than that of the DeepStream-Combination? Any suggestions appreciated. Thanks, Brandt

What do you mean by "DeepStream-Combination"?

Hi Fanzh,

Thanks for your note. We had been talking earlier with Carlos, and he suggested using perf_analyzer to cross-check the throughput numbers we were seeing with the DeepStream-Triton combination. So when I said "combination" I should have said the DeepStream-Triton combination that runs from the DeepStream Triton docker. The comments above refer to the discrepancy between the perf_analyzer throughputs and the DeepStream-Triton throughputs, where the inference requested by DeepStream is carried out by the Triton server.

Thanks,

Brandt

Please refer to the link above; perf_analyzer is a tool that measures the throughput and latency of inference requests.
I don't fully understand your test. By default, perf_analyzer sends random data to the server, while the other test uses an mp4 file. Please use the same input and model if you want to make a comparison.
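For example, instead of random data you can point perf_analyzer at real input via its --input-data option. Something like the command below should work, where ./real_input is just a placeholder for a directory containing a binary file named after the model input ("Input") holding one FP32 tensor of shape [3, 384, 1248]:

perf_analyzer -m retinanet_resnet18_mod --service-kind=triton_c_api --triton-server-directory=/opt/tritonserver --model-repository=/opt/nvidia/deepstream/deepstream-6.0/samples/trtis_model_repo --input-data=./real_input --concurrency-range=1:7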

Ah OK, great, thanks. I will check that they are using the same size image so the comparison is apples to apples. That must be it.

