GRPC Data Corruption/Issue with Yolo Object Detection with Triton on Jetson

• Hardware Platform (Jetson / GPU)
Xavier NX
• DeepStream Version
6.0
• JetPack Version (valid for Jetson only)
4.6.2
• Triton Server Version
2.19.0
• TensorRT Version
8.2.1.8
• Issue Type( questions, new requirements, bugs)
Bug

I am running a Yolov5 object detection model on Jetson with Triton via the nvinferserver plugin. When I run this model, the Triton server performs inference for a random number of frames and then crashes with a segfault. Sometimes this happens almost immediately, and sometimes it happens after 10+ seconds.

I am using this project to post-process the Yolo model output, and I have applied the patch referenced in this thread. With that patch, the same model runs fine with Triton on my dGPU system and the gstreamer pipeline completes successfully. The new issue occurs only when I move to Jetson.

I believe the error is again related to the GRPC server, since if I remove the post-processing block the pipeline completes successfully and the Triton server log shows every frame processed successfully. If I inspect the data returned from the model via GRPC directly, the values are outside the expected bounds and appear corrupted.

Here is the gstreamer pipeline I’m running:

gst-launch-1.0 filesrc location=/home/nick/Data/mp4s/20231019-164703-593_2310191643_SK5SS6UL_S.mp4 ! decodebin ! nvvideoconvert ! m.sink_0 nvstreammux name=m batch_size=1 width=1920 height=1080 ! nvvideoconvert ! nvinferserver config-file-path=/home/nick/nvinferserver_configs/vi_triton_jetson_pgie_tool_detection_config.yml ! nvdsosd ! nvvideoconvert ! nvv4l2h264enc ! h264parse ! mux.video_0 qtmux name=mux ! filesink location=test_triton_jetson_tool_det_other.mp4

Here is my nvinferserver config file:
vi_triton_jetson_pgie_tool_detection_config.txt (1023 Bytes)

Here is my Triton model repository config file:

name: "tool_detection" # Must have the same name as the folder containing it
platform: "tensorrt_plan"
max_batch_size: 1

default_model_filename: "vic_yolov5n_09_29_2023.trt" # Will look specifically for a model with this name ignoring versioning

input [
  {
    name: "input"
    data_type: TYPE_FP32
    format: FORMAT_NCHW
    dims: [3,640,640]
  }
]
output [
  {
    name: "boxes"
    data_type: TYPE_FP32
    dims: [25200,4]
  },
  {
    name: "scores"
    data_type: TYPE_FP32
    dims: [25200,1]
  },
  {
    name: "classes"
    data_type: TYPE_FP32
    dims: [25200,1]
  }
]

model_warmup [
  {
    name: "Warmup_Request"
    batch_size: 1
    inputs: {
      key: "input"
      value: {
        data_type: TYPE_FP32
        dims: [3,640,640]
        random_data: true
      }
    }
  }
]

The ONNX and TRT models for this TensorRT version are included here.

The post processing code is also included in that Google Drive link.

When I add extra printouts to the post-processing library and inspect the returned classes, I see something like this:

In loop over objects: 25193 scores[b]= 0 classes[b]= 0 maxIndex= 0
In loop over objects: 25194 scores[b]= 0 classes[b]= 0 maxIndex= 0
In loop over objects: 25195 scores[b]= 0 classes[b]= 0 maxIndex= 0
In loop over objects: 25196 scores[b]= 0 classes[b]= 0 maxIndex= 0
In loop over objects: 25197 scores[b]= 0 classes[b]= 0 maxIndex= 0
In loop over objects: 25198 scores[b]= 0 classes[b]= 0 maxIndex= 0
In loop over objects: 25199 scores[b]= 0 classes[b]= 0 maxIndex= 0
In loop over objects: 0 scores[b]= 8.82692e-05 classes[b]= -1.06506e+34 maxIndex= -2147483648
Caught SIGSEGV
#0  0x0000007f7a544ef8 in __GI___poll (fds=0x55649b8d40, nfds=547514295176, timeout=<optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:41
#1  0x0000007f7a651f58 in  () at /usr/lib/aarch64-linux-gnu/libglib-2.0.so.0
#2  0x0000005563d0cb90 in  ()
Spinning.  Please run 'gdb gst-launch-1.0 10781' to continue debugging, Ctrl-C to quit, or Ctrl-\ to dump core.

The returned data appears correct until it suddenly breaks down and the returned class value becomes a wildly out-of-range number. Sometimes the scores are wrong as well, like so:

In loop over objects: 25188 scores[b]= 2.53125e+35 classes[b]= 1 maxIndex= 1
In loop over objects: 25189 scores[b]= -9.8719e+32 classes[b]= 1 maxIndex= 1
In loop over objects: 25190 scores[b]= 3.05701 classes[b]= 1 maxIndex= 1
In loop over objects: 25191 scores[b]= 195.843 classes[b]= 2 maxIndex= 2
In loop over objects: 25192 scores[b]= -9.20844e+23 classes[b]= 0 maxIndex= 0
In loop over objects: 25193 scores[b]= -1.61735e+37 classes[b]= 0 maxIndex= 0
In loop over objects: 25194 scores[b]= 4.04623e-23 classes[b]= 0 maxIndex= 0
In loop over objects: 25195 scores[b]= -2.53121e-24 classes[b]= 0 maxIndex= 0
In loop over objects: 25196 scores[b]= 1.81606e-07 classes[b]= 0 maxIndex= 0
In loop over objects: 25197 scores[b]= -3.83942e-29 classes[b]= 0 maxIndex= 0
In loop over objects: 25198 scores[b]= 801514 classes[b]= 0 maxIndex= 0
In loop over objects: 25199 scores[b]= -8.41138e+11 classes[b]= 0 maxIndex= 0
In loop over objects: 0 scores[b]= -1.04696e+34 classes[b]= -1.06519e+34 maxIndex= -2147483648
Caught SIGSEGV

The segfault then occurs once maxIndex is set to a bad value and a subsequent array access uses that index. Guarding against these bad indices doesn’t work, since they occur too frequently.
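For reference, a minimal sketch of the kind of guard meant here. The helper names are hypothetical and are not part of the post-processing library, and the [0, 1] score range is an assumption about this model's output. Converting a float as large as -1.06506e+34 to int is undefined behaviour in C++, and the maxIndex of -2147483648 in the printout is consistent with a saturating conversion. A check like this only avoids the out-of-bounds access; it does nothing about the corrupted data itself.

```cpp
#include <cmath>

// Hypothetical helper: turn a value read from the "classes" tensor into an
// array index. Returns -1 when the value is not finite or not in
// [0, numClasses), so the caller can skip the detection instead of indexing.
int safeClassIndex(float rawClass, int numClasses) {
    if (!std::isfinite(rawClass)) {
        return -1;
    }
    if (rawClass < 0.0f || rawClass >= static_cast<float>(numClasses)) {
        return -1;
    }
    // Safe: the value is known to fit in an int at this point.
    return static_cast<int>(rawClass);
}

// Hypothetical helper: assumes confidence scores lie in [0, 1]; anything
// else is treated as corrupted.
bool isPlausibleScore(float score) {
    return std::isfinite(score) && score >= 0.0f && score <= 1.0f;
}
```

In the parsing loop, a detection would be skipped whenever safeClassIndex returns -1 or isPlausibleScore returns false. Counting how often that happens per frame would also show how widespread the corruption is.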

Here is a log block from Triton via log-verbose=3, which doesn’t appear to include any helpful information:

I0604 15:27:36.363289 10450 grpc_server.cc:3420] Process for ModelInferHandler, rpc_ok=1, 5 step START
I0604 15:27:36.363744 10450 grpc_server.cc:3413] New request handler for ModelInferHandler, 7
I0604 15:27:36.363898 10450 model_repository_manager.cc:590] GetModel() 'tool_detection' version 1
I0604 15:27:36.364018 10450 model_repository_manager.cc:590] GetModel() 'tool_detection' version 1
I0604 15:27:36.364173 10450 infer_request.cc:675] prepared: [0x0x7ea8007c30] request id: 2, model: tool_detection, requested version: 1, actual version: 1, flags: 0x0, correlation id: 0, batch size: 1, priority: 0, timeout (us): 0
original inputs:
[0x0x7ea8007f28] input: input, type: FP32, original shape: [1,3,640,640], batch + shape: [1,3,640,640], shape: [3,640,640]
override inputs:
inputs:
[0x0x7ea8007f28] input: input, type: FP32, original shape: [1,3,640,640], batch + shape: [1,3,640,640], shape: [3,640,640]
original requested outputs:
boxes
classes
scores
requested outputs:
boxes
classes
scores

I0604 15:27:36.364412 10450 tensorrt.cc:5381] model tool_detection, instance tool_detection, executing 1 requests
I0604 15:27:36.364505 10450 tensorrt.cc:1613] TRITONBACKEND_ModelExecute: Issuing tool_detection with 1 requests
I0604 15:27:36.364537 10450 tensorrt.cc:1672] TRITONBACKEND_ModelExecute: Running tool_detection with 1 requests
I0604 15:27:36.364602 10450 tensorrt.cc:2808] Optimization profile default [0] is selected for tool_detection
I0604 15:27:36.364819 10450 tensorrt.cc:2186] Context with profile default [0] is being executed for tool_detection
I0604 15:27:36.369558 10450 infer_response.cc:166] add response output: output: boxes, type: FP32, shape: [1,25200,4]
I0604 15:27:36.369896 10450 grpc_server.cc:2498] GRPC: unable to provide 'boxes' in CPU_PINNED, will use CPU
I0604 15:27:36.370457 10450 grpc_server.cc:2509] GRPC: using buffer for 'boxes', size: 403200, addr: 0x7f1f533550
I0604 15:27:36.370778 10450 infer_response.cc:166] add response output: output: scores, type: FP32, shape: [1,25200,1]
I0604 15:27:36.370948 10450 grpc_server.cc:2498] GRPC: unable to provide 'scores' in CPU_PINNED, will use CPU
I0604 15:27:36.371242 10450 grpc_server.cc:2509] GRPC: using buffer for 'scores', size: 100800, addr: 0x7f1f595c60
I0604 15:27:36.371453 10450 infer_response.cc:166] add response output: output: classes, type: FP32, shape: [1,25200,1]
I0604 15:27:36.371641 10450 grpc_server.cc:2498] GRPC: unable to provide 'classes' in CPU_PINNED, will use CPU
I0604 15:27:36.371883 10450 grpc_server.cc:2509] GRPC: using buffer for 'classes', size: 100800, addr: 0x7f1f5ae630
I0604 15:27:36.470593 10450 grpc_server.cc:3572] ModelInferHandler::InferResponseComplete, 5 step ISSUED
I0604 15:27:36.471071 10450 grpc_server.cc:2591] GRPC free: size 403200, addr 0x7f1f533550
I0604 15:27:36.471190 10450 grpc_server.cc:2591] GRPC free: size 100800, addr 0x7f1f595c60
I0604 15:27:36.471306 10450 grpc_server.cc:2591] GRPC free: size 100800, addr 0x7f1f5ae630
I0604 15:27:36.471945 10450 grpc_server.cc:3148] ModelInferHandler::InferRequestComplete
I0604 15:27:36.472027 10450 grpc_server.cc:3420] Process for ModelInferHandler, rpc_ok=1, 5 step COMPLETE
I0604 15:27:36.474334 10450 grpc_server.cc:2419] Done for ModelInferHandler, 5
I0604 15:27:36.474234 10450 tensorrt.cc:2665] TRITONBACKEND_ModelExecute: model tool_detection released 1 requests
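
One thing that can be checked from this log is whether the GRPC buffer sizes match the output shapes. A minimal sketch of that arithmetic follows; the function name is hypothetical. It suggests the server allocates buffers of the expected size, so a size mismatch on the server side looks unlikely, although this does not rule out other causes.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Bytes needed for an FP32 tensor with the given shape, batch dimension
// included: the product of all dimensions times sizeof(float).
std::size_t fp32TensorBytes(const std::vector<std::int64_t>& shape) {
    std::size_t count = 1;
    for (std::int64_t dim : shape) {
        count *= static_cast<std::size_t>(dim);
    }
    return count * sizeof(float);
}
```

For 'boxes' with shape [1,25200,4] this gives 403200 bytes, and for 'scores' and 'classes' with shape [1,25200,1] it gives 100800 bytes, which matches the sizes logged by grpc_server.cc.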

Something definitely appears wrong with the GRPC data being sent back. I’m not sure if there is a way I can configure Triton to get around this or not, but would love any help that might be provided.

Thanks for sharing! Do you mean the app crashes on Jetson while running well on dGPU? Since 6.0 is an old version, can you reproduce this issue on the latest DeepStream 7.0?

This model with nvinferserver and Triton ran fine with no issues on dGPU - it’s only when going to Jetson that the problem arose.

I would love to move to DeepStream 7.0 and start using new stuff, but alas our company has medical devices in the field which are on JetPack 4.6, so I’m trying to resolve the issue as such. (In the coming months I’m aiming to push for updates to use the newer stuff.)

As a follow-on question, is system shared memory enabled by default when using Triton on Jetson with nvinferserver? It’s not entirely clear to me from the documentation whether I need to explicitly configure Triton to use system shared memory or not.

On dGPU I set enable_cuda_buffer_sharing: 1 which appears to use CUDA shared memory properly, but that is not available on Jetson. Triton lives on the same machine as the pipeline, so if I could forgo GRPC entirely in favor of shared memory then that might be a way around this.

Please refer to TritonGrpcParams in /opt/nvidia/deepstream/deepstream/sources/includes/nvdsinferserver_config.proto; enable_cuda_buffer_sharing is not supported on Jetson.
On the same machine, you can use nvinferserver native mode.

Yes, I knew that. I was under the impression that shared memory and GRPC were different ways to communicate with Triton, but my understanding now is that nvinferserver uses GRPC and shared memory at the same time, at least for sending the input tensor to Triton. It looks like only GRPC is used when returning the output tensors - can you confirm or deny?

On the original issue with corrupted returned GRPC data, have you had a chance to try to reproduce it, or possibly ask others on the team? I appreciate the help.

  1. What do you mean by "It looks like only GRPC is used when returning the output tensors"? As the comment in nvdsinferserver_config.proto says, if enabled, the input CUDA buffers are shared with the Triton server to improve performance. BTW, both nvinferserver and Triton are open source, so you can check the code if interested.
  2. I will try to reproduce this issue.

I have looked through the source code for nvinferserver and some of Triton, but it is not easy to figure out how things work. The files are long and complex, with a lot of different functionality, and are difficult to understand in many cases.

From my understanding, when CUDA buffers are shared, only the input tensors are shared with the server; the output tensors are returned only via GRPC. Regardless, since CUDA buffers cannot be shared on Jetson, it doesn’t matter for my problem. I was just unsure whether Triton could return output tensors via shared memory instead of GRPC, but from my understanding that is not how it works: it is only GRPC.

  1. On Orin with DeepStream 7.0, I can’t reproduce this issue. I recreated a TRT engine in nvinfer mode because an engine is bound to the GPU. Here are the relevant files:
    deepstream_app_config.txt (1.3 KB)
    log.txt (8.3 KB)
    vi_triton_jetson_pgie_tool_detection_config.txt (937 Bytes)
    config.pbtxt (797 Bytes)
  2. On Xavier with DS 6.1.1, I still can’t reproduce this issue using the same cfg. Here is the log: ds611.txt (1.6 KB)

Thank you for checking it out, I really appreciate it. Quick question - did you use the post-processing library I attached earlier in the thread? I assume so but just wanted to double check.

If so, then all I can think of that’s different is either a) the slightly higher DeepStream version has fixed this issue, or b) the TRT engine created in nvinfer mode is somehow different from the engine I created manually with trtexec.

I will see about checking a higher DS version. Do you think there is any possibility b) has anything to do with this? I doubt it but thought I’d ask.

Yes. I used this code, DeepStream-Yolo, without any code modification. Please refer to my last comment for the full configuration.
Does the DeepStream sample /opt/nvidia/deepstream/deepstream/samples/configs/deepstream-app-triton-grpc/ work well on the device?
Could you share the trtexec command line? If you use the engine created by nvinfer, does the app work well?

Hmm, without any code modification to that repo you should have run into an issue with the returned output tensor layers, since gRPC can’t guarantee that layers are returned in order (at least according to that other thread). Perhaps the higher DS version fixes it.
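
To illustrate the ordering concern, here is a minimal sketch of looking up output layers by name instead of by position. The LayerInfo struct is a hypothetical stand-in that models only a name and a data pointer; it is not the real DeepStream layer descriptor, and findLayer is not a function from that repo or from the patch.

```cpp
#include <cstring>
#include <vector>

// Hypothetical stand-in for the per-layer descriptor handed to a custom
// parser; only the two fields needed for this sketch are modelled.
struct LayerInfo {
    const char* layerName;
    const void* buffer;
};

// Returns the layer with the given name, or nullptr if it is absent.
// This makes the parser independent of the order in which layers arrive.
const LayerInfo* findLayer(const std::vector<LayerInfo>& layers,
                           const char* name) {
    for (const LayerInfo& layer : layers) {
        if (layer.layerName != nullptr &&
            std::strcmp(layer.layerName, name) == 0) {
            return &layer;
        }
    }
    return nullptr;
}
```

A parser written this way would fetch "boxes", "scores" and "classes" by name and bail out if any lookup returns nullptr, rather than assuming index 0, 1 and 2.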

I was able to run /opt/nvidia/deepstream/deepstream/samples/configs/deepstream-app-triton-grpc/ without issue. I’m not too surprised, since the issue appears to be with returning the output tensors via gRPC to the custom post-processing library, and not with sending the tensors to Triton via gRPC, but it was good to check regardless.

I tried using the engine created by nvinfer, and ran into the same issue. The trtexec command I used to create the model I am using is:

/usr/src/tensorrt/bin/trtexec --onnx=vic_yolov5n_09_29_2023.onnx --saveEngine=vic_yolov5n_09_29_2023.trt --explicitBatch

I think the next thing I have to check is a higher DS version, 6.1.1 or higher. I’ve asked my company to send me another Jetson that I can flash a higher version to, since I don’t want to overwrite the current stuff I have set up. Once I have that I will test this out. Thank you for the help. Is it okay to leave this thread open in the meantime while I wait?

OK. Thanks for the update!