• Hardware Platform (Jetson / GPU)
Xavier NX
• DeepStream Version
6.0
• JetPack Version (valid for Jetson only)
4.6.2
• Triton Server Version
2.19.0
• TensorRT Version
8.2.1.8
• Issue Type (questions, new requirements, bugs)
Bug
I am running a YOLOv5 object detection model on Jetson with Triton via the nvinferserver
plugin. When I run this model, the Triton server performs inference for a seemingly random number of frames, and then the pipeline crashes with a segfault. Sometimes this happens almost immediately, and sometimes it happens after 10+ seconds.
I am using this project to post-process the YOLO model output, and I have applied the patch referenced in this thread. The patch works on my dGPU system with Triton and the same model with no problems: the model runs fine and the GStreamer pipeline completes successfully. When I move to Jetson, this new issue occurs.
I believe the error is again related to the gRPC server, since if I remove the post-processing block, the pipeline completes successfully and the Triton server log shows every frame processed successfully. If I inspect the data returned from the model over gRPC directly, some values are outside the expected bounds and appear corrupted.
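For reference, here is roughly the standalone check I use to inspect the raw gRPC output (a sketch using the Python tritonclient package; the server URL, the dummy input, and the class count are placeholders rather than my exact code):

# Sketch: query the model directly over gRPC and check the output ranges.
# localhost:8001, the random input, and NUM_CLASSES are assumptions.
import numpy as np
import tritonclient.grpc as grpcclient

NUM_CLASSES = 3  # placeholder; set to the real number of classes

client = grpcclient.InferenceServerClient(url="localhost:8001")

# Dummy tensor with the shape/dtype from the model config; in practice a
# preprocessed video frame goes here.
frame = np.random.rand(1, 3, 640, 640).astype(np.float32)

inp = grpcclient.InferInput("input", [1, 3, 640, 640], "FP32")
inp.set_data_from_numpy(frame)
outputs = [grpcclient.InferRequestedOutput(n) for n in ("boxes", "scores", "classes")]

result = client.infer(model_name="tool_detection", inputs=[inp], outputs=outputs)
scores = result.as_numpy("scores")    # expected shape (1, 25200, 1)
classes = result.as_numpy("classes")  # expected shape (1, 25200, 1)

# Class IDs should be small non-negative values and scores should be finite;
# on Jetson I intermittently see values far outside these bounds.
bad = (~np.isfinite(scores)) | (~np.isfinite(classes)) \
      | (classes < 0) | (classes >= NUM_CLASSES)
print("entries outside expected bounds:", int(bad.sum()), "of", bad.size)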
Here is the gstreamer pipeline I’m running:
gst-launch-1.0 filesrc location=/home/nick/Data/mp4s/20231019-164703-593_2310191643_SK5SS6UL_S.mp4 ! decodebin ! nvvideoconvert ! m.sink_0 nvstreammux name=m batch_size=1 width=1920 height=1080 ! nvvideoconvert ! nvinferserver config-file-path=/home/nick/nvinferserver_configs/vi_triton_jetson_pgie_tool_detection_config.yml ! nvdsosd ! nvvideoconvert ! nvv4l2h264enc ! h264parse ! mux.video_0 qtmux name=mux ! filesink location=test_triton_jetson_tool_det_other.mp4
Here is my nvinferserver config file:
vi_triton_jetson_pgie_tool_detection_config.txt (1023 Bytes)
Here is my Triton model repository config file:
name: "tool_detection" # Must have the same name as the folder containing it
platform: "tensorrt_plan"
max_batch_size: 1
default_model_filename: "vic_yolov5n_09_29_2023.trt" # Will look specifically for a model with this name ignoring versioning
input [
{
name: "input"
data_type: TYPE_FP32
format: FORMAT_NCHW
dims: [3,640,640]
}
]
output [
{
name: "boxes"
data_type: TYPE_FP32
dims: [25200,4]
},
{
name: "scores"
data_type: TYPE_FP32
dims: [25200,1]
},
{
name: "classes"
data_type: TYPE_FP32
dims: [25200,1]
}
]
model_warmup [
{
name: "Warmup_Request"
batch_size: 1
inputs: {
key: "input"
value: {
data_type: TYPE_FP32
dims: [3,640,640]
random_data: true
}
}
}
]
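For completeness, a quick way to cross-check that the shapes and datatypes Triton actually serves match this config is to query the model metadata over gRPC (again just a sketch; the server URL and version are assumed):

# Sketch: confirm the served output names/datatypes/shapes match the config above.
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")
meta = client.get_model_metadata(model_name="tool_detection", model_version="1")
for out in meta.outputs:
    print(out.name, out.datatype, out.shape)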
The ONNX and TensorRT models for this TensorRT version are included here.
The post-processing code is also included in that Google Drive link.
When I add additional printouts to the post-processing library and inspect the returned classes, I see something like this:
In loop over objects: 25193 scores[b]= 0 classes[b]= 0 maxIndex= 0
In loop over objects: 25194 scores[b]= 0 classes[b]= 0 maxIndex= 0
In loop over objects: 25195 scores[b]= 0 classes[b]= 0 maxIndex= 0
In loop over objects: 25196 scores[b]= 0 classes[b]= 0 maxIndex= 0
In loop over objects: 25197 scores[b]= 0 classes[b]= 0 maxIndex= 0
In loop over objects: 25198 scores[b]= 0 classes[b]= 0 maxIndex= 0
In loop over objects: 25199 scores[b]= 0 classes[b]= 0 maxIndex= 0
In loop over objects: 0 scores[b]= 8.82692e-05 classes[b]= -1.06506e+34 maxIndex= -2147483648
Caught SIGSEGV
#0 0x0000007f7a544ef8 in __GI___poll (fds=0x55649b8d40, nfds=547514295176, timeout=<optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:41
#1 0x0000007f7a651f58 in () at /usr/lib/aarch64-linux-gnu/libglib-2.0.so.0
#2 0x0000005563d0cb90 in ()
Spinning. Please run 'gdb gst-launch-1.0 10781' to continue debugging, Ctrl-C to quit, or Ctrl-\ to dump core.
The returned data appears to be correct until, all of a sudden, it breaks down and the returned class value is some nonsensical number. Sometimes the scores are wrong as well, like so:
In loop over objects: 25188 scores[b]= 2.53125e+35 classes[b]= 1 maxIndex= 1
In loop over objects: 25189 scores[b]= -9.8719e+32 classes[b]= 1 maxIndex= 1
In loop over objects: 25190 scores[b]= 3.05701 classes[b]= 1 maxIndex= 1
In loop over objects: 25191 scores[b]= 195.843 classes[b]= 2 maxIndex= 2
In loop over objects: 25192 scores[b]= -9.20844e+23 classes[b]= 0 maxIndex= 0
In loop over objects: 25193 scores[b]= -1.61735e+37 classes[b]= 0 maxIndex= 0
In loop over objects: 25194 scores[b]= 4.04623e-23 classes[b]= 0 maxIndex= 0
In loop over objects: 25195 scores[b]= -2.53121e-24 classes[b]= 0 maxIndex= 0
In loop over objects: 25196 scores[b]= 1.81606e-07 classes[b]= 0 maxIndex= 0
In loop over objects: 25197 scores[b]= -3.83942e-29 classes[b]= 0 maxIndex= 0
In loop over objects: 25198 scores[b]= 801514 classes[b]= 0 maxIndex= 0
In loop over objects: 25199 scores[b]= -8.41138e+11 classes[b]= 0 maxIndex= 0
In loop over objects: 0 scores[b]= -1.04696e+34 classes[b]= -1.06519e+34 maxIndex= -2147483648
Caught SIGSEGV
The segfault then occurs once maxIndex gets set to a bad value and a subsequent array access uses the wrong index. Guarding against these bad indices doesn't work, since they occur too frequently.
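The guard I tried is conceptually just a bounds check on the class index before it is used to index into the per-class arrays; the sketch below shows the equivalent check in Python against arrays fetched with the script above (the real check lives in the C++ post-processing library, so this is only an illustration):

# Illustration only: keep detections whose score/class values look sane.
# NUM_CLASSES is a placeholder; the actual guard is in the C++ parser.
import numpy as np

NUM_CLASSES = 3  # placeholder

def valid_detection_mask(scores: np.ndarray, classes: np.ndarray) -> np.ndarray:
    scores = scores.reshape(-1)
    classes = classes.reshape(-1)
    return (
        np.isfinite(scores)
        & np.isfinite(classes)
        & (classes >= 0)
        & (classes < NUM_CLASSES)
    )

With the corrupted frames, so many entries fail this check that simply dropping them is not a usable workaround.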
Here is a log block from Triton via log-verbose=3, which doesn’t appear to include any helpful information:
I0604 15:27:36.363289 10450 grpc_server.cc:3420] Process for ModelInferHandler, rpc_ok=1, 5 step START
I0604 15:27:36.363744 10450 grpc_server.cc:3413] New request handler for ModelInferHandler, 7
I0604 15:27:36.363898 10450 model_repository_manager.cc:590] GetModel() 'tool_detection' version 1
I0604 15:27:36.364018 10450 model_repository_manager.cc:590] GetModel() 'tool_detection' version 1
I0604 15:27:36.364173 10450 infer_request.cc:675] prepared: [0x0x7ea8007c30] request id: 2, model: tool_detection, requested version: 1, actual version: 1, flags: 0x0, correlation id: 0, batch size: 1, priority: 0, timeout (us): 0
original inputs:
[0x0x7ea8007f28] input: input, type: FP32, original shape: [1,3,640,640], batch + shape: [1,3,640,640], shape: [3,640,640]
override inputs:
inputs:
[0x0x7ea8007f28] input: input, type: FP32, original shape: [1,3,640,640], batch + shape: [1,3,640,640], shape: [3,640,640]
original requested outputs:
boxes
classes
scores
requested outputs:
boxes
classes
scores
I0604 15:27:36.364412 10450 tensorrt.cc:5381] model tool_detection, instance tool_detection, executing 1 requests
I0604 15:27:36.364505 10450 tensorrt.cc:1613] TRITONBACKEND_ModelExecute: Issuing tool_detection with 1 requests
I0604 15:27:36.364537 10450 tensorrt.cc:1672] TRITONBACKEND_ModelExecute: Running tool_detection with 1 requests
I0604 15:27:36.364602 10450 tensorrt.cc:2808] Optimization profile default [0] is selected for tool_detection
I0604 15:27:36.364819 10450 tensorrt.cc:2186] Context with profile default [0] is being executed for tool_detection
I0604 15:27:36.369558 10450 infer_response.cc:166] add response output: output: boxes, type: FP32, shape: [1,25200,4]
I0604 15:27:36.369896 10450 grpc_server.cc:2498] GRPC: unable to provide 'boxes' in CPU_PINNED, will use CPU
I0604 15:27:36.370457 10450 grpc_server.cc:2509] GRPC: using buffer for 'boxes', size: 403200, addr: 0x7f1f533550
I0604 15:27:36.370778 10450 infer_response.cc:166] add response output: output: scores, type: FP32, shape: [1,25200,1]
I0604 15:27:36.370948 10450 grpc_server.cc:2498] GRPC: unable to provide 'scores' in CPU_PINNED, will use CPU
I0604 15:27:36.371242 10450 grpc_server.cc:2509] GRPC: using buffer for 'scores', size: 100800, addr: 0x7f1f595c60
I0604 15:27:36.371453 10450 infer_response.cc:166] add response output: output: classes, type: FP32, shape: [1,25200,1]
I0604 15:27:36.371641 10450 grpc_server.cc:2498] GRPC: unable to provide 'classes' in CPU_PINNED, will use CPU
I0604 15:27:36.371883 10450 grpc_server.cc:2509] GRPC: using buffer for 'classes', size: 100800, addr: 0x7f1f5ae630
I0604 15:27:36.470593 10450 grpc_server.cc:3572] ModelInferHandler::InferResponseComplete, 5 step ISSUED
I0604 15:27:36.471071 10450 grpc_server.cc:2591] GRPC free: size 403200, addr 0x7f1f533550
I0604 15:27:36.471190 10450 grpc_server.cc:2591] GRPC free: size 100800, addr 0x7f1f595c60
I0604 15:27:36.471306 10450 grpc_server.cc:2591] GRPC free: size 100800, addr 0x7f1f5ae630
I0604 15:27:36.471945 10450 grpc_server.cc:3148] ModelInferHandler::InferRequestComplete
I0604 15:27:36.472027 10450 grpc_server.cc:3420] Process for ModelInferHandler, rpc_ok=1, 5 step COMPLETE
I0604 15:27:36.474334 10450 grpc_server.cc:2419] Done for ModelInferHandler, 5
I0604 15:27:36.474234 10450 tensorrt.cc:2665] TRITONBACKEND_ModelExecute: model tool_detection released 1 requests
Something definitely appears to be wrong with the gRPC data being sent back. I'm not sure whether there is a way to configure Triton to get around this, but I would love any help that can be provided.