Custom detection parser error with nvinferserver and a custom Python model with more than one stream

Hello,
I am setting up a DeepStream application (using deepstream-app) to run inference on a custom RNN model with the help of Triton. I am deploying on a GPU with DeepStream 6.3, using the Docker container.
My Triton model is, in fact, an ensemble of a Python BLS model (which performs some input preprocessing, calls the TensorRT model and returns the result) and another Python model that postprocesses the RNN segmentation mask into a bounding rectangle.
I first set up the Triton model repo without DeepStream and tested inference with an external Python script using CUDA shared memory. For debugging, I also print the final ensemble result before returning it.
The Triton model itself seems to work perfectly fine. The issue appears when I try to parse my output format into DeepStream NvDsInferObjectDetectionInfo using a custom detection parse function.
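For context, the BLS step roughly follows the pattern below (a simplified sketch, not the attached model code; the tensor and output names are placeholders, and “topdown” is the name the TensorRT model gets in the config shared further down the thread):

# Illustrative only - calling the TensorRT model from the Python BLS model.
import numpy as np
import triton_python_backend_utils as pb_utils


def run_segmentation(preprocessed: np.ndarray):
    """BLS call made from inside the Python model's execute()."""
    model_in = pb_utils.Tensor("ImageInput", preprocessed.astype(np.float32))  # placeholder name
    infer_request = pb_utils.InferenceRequest(
        model_name="topdown",
        requested_output_names=["SegmentationOutput"],  # placeholder name
        inputs=[model_in])
    infer_response = infer_request.exec()
    if infer_response.has_error():
        raise pb_utils.TritonModelException(infer_response.error().message())
    return pb_utils.get_output_tensor_by_name(infer_response, "SegmentationOutput")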

  1. First of all, I tried setting output_mem_type : MEMORY_TYPE_CPU in the nvinferserver pbtxt config file. For a single stream the parsing is perfect. For two streams there are some glitches - in the log below, the “PP Boxes” lines are printed by the Triton Python model and the “Rect” lines by the custom parser:
Rect: nvdsinferserver_custom_process.cpp:183extraInputProcess: primary input *SmokeBlsImageInput*, shape: [2x512x512x3], dataType:0, memType:1
nvdsinferserver_custom_process.cpp:184extraInputProcess: extra input *SmokeBlsCorrelationIdsInput*, shape: [2x1], dataType:3, memType:3
0 0 0 0 0 0.65
Rect: 243 390 62 273 1 0.65
**PERF:  11.28 (11.02)	11.28 (11.34)	
Stream IDS [1, 2]
Converting seg to box torch.Size([2, 256, 256, 1])
PP Boxes tensor([[  0,   0,   0,   0,   0,   0],
        [244,  59, 390, 273, 100,   0]], device='cuda:0', dtype=torch.int32)
Rect: 0 0 0 0 0 0.65
Rect: 238 391 60 274 1 0.65
nvdsinferserver_custom_process.cpp:183extraInputProcess: primary input *SmokeBlsImageInput*, shape: [2x512x512x3], dataType:0, memType:1
nvdsinferserver_custom_process.cpp:184extraInputProcess: extra input *SmokeBlsCorrelationIdsInput*, shape: [2x1], dataType:3, memType:3
Stream IDS [1, 2]
Converting seg to box torch.Size([2, 256, 256, 1])
PP Boxes tensor([[  0,   0,   0,   0,   0,   0],
        [248,  61, 389, 270, 100,   0]], device='cuda:0', dtype=torch.int32)
Rect: 0nvdsinferserver_custom_process.cpp:183extraInputProcess: primary input *SmokeBlsImageInput*, shape: [2x512x512x3], dataType:0, memType:1
nvdsinferserver_custom_process.cpp:184extraInputProcess: extra input *SmokeBlsCorrelationIdsInput*, shape: [2x1], dataType:3, memType:3
 0 0 0 0 0.65
Rect: 244 390 59 273 1 0.65
Stream IDS [1, 2]
Converting seg to box torch.Size([2, 256, 256, 1])
PP Boxes tensor([[  0,   0,   0,   0,   0,   0],
        [254,  66, 389, 267,  99,   0]], device='cuda:0', dtype=torch.int32)
Rect: 0 0 0 0 0 0.65
Rect: 248 389 61 270 1 0.65
nvdsinferserver_custom_process.cpp:183extraInputProcess: primary input *SmokeBlsImageInput*, shape: [2x512x512x3], dataType:0, memType:1
nvdsinferserver_custom_process.cpp:184extraInputProcess: extra input *SmokeBlsCorrelationIdsInput*, shape: [2x1], dataType:3, memType:3
Stream IDS [1, 2]
Converting seg to box torch.Size([2, 256, 256, 1])
PP Boxes tensor([[  0,   0,   0,   0,   0,   0],
        [263,  70, 387, 266,  99,   0]], device='cuda:0', dtype=torch.int32)
Rect: 0 0 0 0 0 0.65
Rect: 254 389 66 267 0.99 nvdsinferserver_custom_process.cpp:183extraInputProcess: primary input *SmokeBlsImageInput*, shape: [2x512x512x3], dataType:0, memType:1
nvdsinferserver_custom_process.cpp:184extraInputProcess: extra input *SmokeBlsCorrelationIdsInput*, shape: [2x1], dataType:3, memType:3
0.65
Stream IDS [1, 2]
Converting seg to box torch.Size([2, 256, 256, 1])
PP Boxes tensor([[  0,   0,   0,   0,   0,   0],
        [264,  70, 387, 266,  99,   0]], device='cuda:0', dtype=torch.int32)
Rect: 0 0 0 0 0 0.65
Rect: 263 387 70 266 0.99 0.65nvdsinferserver_custom_process.cpp:183extraInputProcess: primary input *SmokeBlsImageInput*, shape: [2x512x512x3], dataType:0, memType:1
nvdsinferserver_custom_process.cpp:184extraInputProcess: extra input *SmokeBlsCorrelationIdsInput*, shape: [2x1], dataType:3, memType:3

Stream IDS [1, 2]
Converting seg to box torch.Size([2, 256, 256, 1])
PP Boxes tensor([[  0,   0,   0,   0,   0,   0],
        [262,  69, 387, 267,  99,   0]], device='cuda:0', dtype=torch.int32)
Rect: 0 0 0 0 0 0.65
Rect: 264 387 70 266 0.99 0.65
nvdsinferserver_custom_process.cpp:183extraInputProcess: primary input *SmokeBlsImageInput*, shape: [2x512x512x3], dataType:0, memType:1
nvdsinferserver_custom_process.cpp:184extraInputProcess: extra input *SmokeBlsCorrelationIdsInput*, shape: [2x1], dataType:3, memType:3
Stream IDS [1, 2]
Converting seg to box torch.Size([2, 256, 256, 1])
PP Boxes tensor([[  0,   0,   0,   0,   0,   0],
        [260,  66, 387, 269,  99,   0]], device='cuda:0', dtype=torch.int32)
Rect: 0 0 0 0 0 0.65
Rect: 262 387 69 267 0.99 0.65
nvdsinferserver_custom_process.cpp:183extraInputProcess: primary input *SmokeBlsImageInput*, shape: [2x512x512x3], dataType:0, memType:1
nvdsinferserver_custom_process.cpp:184extraInputProcess: extra input *SmokeBlsCorrelationIdsInput*, shape: [2x1], dataType:3, memType:3
Stream IDS [1, 2]
Converting seg to box torch.Size([2, 256, 256, 1])
PP Boxes tensor([[  0,   0,   0,   0,   0,   0],
        [260,  65, 387, 269,  99,   0]], device='cuda:0', dtype=torch.int32)
Rect: 0 0 0 0 0 0.65
nvdsinferserver_custom_process.cpp:183extraInputProcess: primary input *SmokeBlsImageInput*, shape: [2x512x512x3], dataType:0, memType:1
nvdsinferserver_custom_process.cpp:184extraInputProcess: extra input *SmokeBlsCorrelationIdsInput*, shape: [2x1], dataType:3, memType:3
Rect: 260 387 66 269 0.99 0.65
^C** ERROR: <_intr_handler:140>: User Interrupted.. 

Stream IDS [1, 2]
Converting seg to box torch.Size([2, 256, 256, 1])
PP Boxes tensor([[  0,   0,   0,   0,   0,   0],
        [245,  44, 391, 273, 102,   0]], device='cuda:0', dtype=torch.int32)
Rect: 0 0 0 0 0 0.65
Rect: 0 0 0 0 0 0.65
nvdsinferserver_custom_process.cpp:183extraInputProcess: primary input *SmokeBlsImageInput*, shape: [2x512x512x3], dataType:0, memType:1
nvdsinferserver_custom_process.cpp:184extraInputProcess: extra input *SmokeBlsCorrelationIdsInput*, shape: [2x1], dataType:3, memType:3
Stream IDS [1, 2]
Converting seg to box torch.Size([2, 256, 256, 1])
PP Boxes tensor([[  0,   0,   0,   0,   0,   0],
        [253,  59, 390, 275, 101,   0]], device='cuda:0', dtype=torch.int32)
nvdsinferserver_custom_process.cpp:183extraInputProcess: primary input *SmokeBlsImageInput*, shape: [2x512x512x3], dataType:0, memType:1
nvdsinferserver_custom_process.cpp:184extraInputProcess: extra input *SmokeBlsCorrelationIdsInput*, shape: [2x1], dataType:3, memType:3
Rect: 0 0 0 0 0 0.65
Rect: 0 0 0 0 0 0.65
Stream IDS [1, 2]
Converting seg to box torch.Size([2, 256, 256, 1])
PP Boxes tensor([[  0,   0,   0,   0,   0,   0],
        [256,  58, 389, 274, 100,   0]], device='cuda:0', dtype=torch.int32)
nvdsinferserver_custom_process.cpp:183extraInputProcess: primary input *SmokeBlsImageInput*, shape: [2x512x512x3], dataType:0, memType:1
nvdsinferserver_custom_process.cpp:184extraInputProcess: extra input *SmokeBlsCorrelationIdsInput*, shape: [2x1], dataType:3, memType:3
Rect: 1052293305 1052819650 1054398682 1052819650 1.054e+07 0.65
Rect: 1052293305 1052819650 1054398682 1052819650 1.054e+07 0.65
Stream IDS [1, 2]
Converting seg to box torch.Size([2, 256, 256, 1])
PP Boxes tensor([[  0,   0,   0,   0,   0,   0],
        [254,  55, 389, 275, 100,   0]], device='cuda:0', dtype=torch.int32)
Rect: nvdsinferserver_custom_process.cpp:183extraInputProcess: primary input *SmokeBlsImageInput*, shape: [2x512x512x3], dataType:0, memType:1
nvdsinferserver_custom_process.cpp:184extraInputProcess: extra input *SmokeBlsCorrelationIdsInput*, shape: [2x1], dataType:3, memType:3
1052293305 1052819650 1054398682 1052819650 1.054e+07 0.65
Rect: 1052293305 1052819650 1054398682 1052819650 1.054e+07 0.65
**PERF:  11.37 (11.16)	11.37 (11.43)	
Quitting
Stream IDS [1, 2]
Converting seg to box torch.Size([2, 256, 256, 1])
PP Boxes tensor([[  0,   0,   0,   0,   0,   0],
        [250,  52, 389, 277, 101,   0]], device='cuda:0', dtype=torch.int32)
nvdsinferserver_custom_process.cpp:183extraInputProcess: primary input *SmokeBlsImageInput*, shape: [2x512x512x3], dataType:0, memType:1
nvdsinferserver_custom_process.cpp:184extraInputProcess: extra input *SmokeBlsCorrelationIdsInput*, shape: [2x1], dataType:3, memType:3
Rect: 1052293305 1052819650 1054398682 1052819650 1.054e+07 0.65
Rect: 1052293305 1052819650 1054398682 1052819650 1.054e+07 0.65
[NvMultiObjectTracker] De-initialized
Stream IDS [1, 2]
Converting seg to box torch.Size([2, 256, 256, 1])
PP Boxes tensor([[  0,   0,   0,   0,   0,   0],
        [249,  49, 390, 277, 101,   0]], device='cuda:0', dtype=torch.int32)
Rect: 1052293305 1052819650 1054398682 1052819650 1.054e+07 0.65
Rect: 1052293305 1052819650 1054398682 1052819650 1.054e+07 0.65
nvdsinferserver_custom_process.cpp:183extraInputProcess: primary input *SmokeBlsImageInput*, shape: [2x512x512x3], dataType:0, memType:1
nvdsinferserver_custom_process.cpp:184extraInputProcess: extra input *SmokeBlsCorrelationIdsInput*, shape: [2x1], dataType:3, memType:3

The last lines, for example, are “bad” parsed values, while the others are completely fine!
  2. Afterwards, I tried switching to output_mem_type : MEMORY_TYPE_GPU. I spotted very weird behavior - DeepStream runs for a few seconds, parses the first values well, and then gets stuck: the EGL sink freezes and no additional output appears in the terminal. This is the log:

nvdsinferserver_custom_process.cpp:183extraInputProcess: primary input *SmokeBlsImageInput*, shape: [2x512x512x3], dataType:0, memType:1
nvdsinferserver_custom_process.cpp:184extraInputProcess: extra input *SmokeBlsCorrelationIdsInput*, shape: [2x1], dataType:3, memType:3
nvdsinferserver_custom_process.cpp:183extraInputProcess: primary input *SmokeBlsImageInput*, shape: [1x512x512x3], dataType:0, memType:1
nvdsinferserver_custom_process.cpp:184extraInputProcess: extra input *SmokeBlsCorrelationIdsInput*, shape: [1x1], dataType:3, memType:3
nvdsinferserver_custom_process.cpp:183extraInputProcess: primary input *SmokeBlsImageInput*, shape: [1x512x512x3], dataType:0, memType:1
nvdsinferserver_custom_process.cpp:184extraInputProcess: extra input *SmokeBlsCorrelationIdsInput*, shape: [1x1], dataType:3, memType:3
Stream IDS [1, 2]
Stream IDS [2]
Converting seg to box torch.Size([2, 256, 256, 1])
PP Boxes tensor([[0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0]], device='cuda:0', dtype=torch.int32)
nvdsinferserver_custom_process.cpp:183extraInputProcess: primary input *SmokeBlsImageInput*, shape: [2x512x512x3], dataType:0, memType:1
nvdsinferserver_custom_process.cpp:184extraInputProcess: extra input *SmokeBlsCorrelationIdsInput*, shape: [2x1], dataType:3, memType:3
PostProcessBoxesOutput
Rect: 0 0 0 0 0 0.65
Rect: 0 0 0 0 0 0.65
Stream IDS [1]
Converting seg to box torch.Size([1, 256, 256, 1])
PP Boxes tensor([[0, 0, 0, 0, 0, 0]], device='cuda:0', dtype=torch.int32)
nvdsinferserver_custom_process.cpp:183extraInputProcess: primary input *SmokeBlsImageInput*, shape: [2x512x512x3], dataType:0, memType:1
nvdsinferserver_custom_process.cpp:184extraInputProcess: extra input *SmokeBlsCorrelationIdsInput*, shape: [2x1], dataType:3, memType:3
Rect: 0 0 0 0 0 0.65
Stream IDS [1, 2]
Converting seg to box torch.Size([1, 256, 256, 1])
PP Boxes tensor([[0, 0, 0, 0, 0, 0]], device='cuda:0', dtype=torch.int32)
nvdsinferserver_custom_process.cpp:183extraInputProcess: primary input *SmokeBlsImageInput*, shape: [2x512x512x3], dataType:0, memType:1
nvdsinferserver_custom_process.cpp:184extraInputProcess: extra input *SmokeBlsCorrelationIdsInput*, shape: [2x1], dataType:3, memType:3
Stream IDS [1, 2]
Converting seg to box torch.Size([2, 256, 256, 1])
PP Boxes tensor([[0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0]], device='cuda:0', dtype=torch.int32)
nvdsinferserver_custom_process.cpp:183extraInputProcess: primary input *SmokeBlsImageInput*, shape: [2x512x512x3], dataType:0, memType:1
nvdsinferserver_custom_process.cpp:184extraInputProcess: extra input *SmokeBlsCorrelationIdsInput*, shape: [2x1], dataType:3, memType:3
Stream IDS [1, 2]
Converting seg to box torch.Size([2, 256, 256, 1])
PP Boxes tensor([[0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0]], device='cuda:0', dtype=torch.int32)
nvdsinferserver_custom_process.cpp:183extraInputProcess: primary input *SmokeBlsImageInput*, shape: [2x512x512x3], dataType:0, memType:1
nvdsinferserver_custom_process.cpp:184extraInputProcess: extra input *SmokeBlsCorrelationIdsInput*, shape: [2x1], dataType:3, memType:3
**PERF:  0.00 (0.00)	13486.51 (3.97)	**<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<**
Stream IDS [1, 2]
Converting seg to box torch.Size([2, 256, 256, 1])
PP Boxes tensor([[0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0]], device='cuda:0', dtype=torch.int32)
nvdsinferserver_custom_process.cpp:183extraInputProcess: primary input *SmokeBlsImageInput*, shape: [2x512x512x3], dataType:0, memType:1
nvdsinferserver_custom_process.cpp:184extraInputProcess: extra input *SmokeBlsCorrelationIdsInput*, shape: [2x1], dataType:3, memType:3
Stream IDS [1, 2]
Converting seg to box torch.Size([2, 256, 256, 1])
PP Boxes tensor([[0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0]], device='cuda:0', dtype=torch.int32)
nvdsinferserver_custom_process.cpp:183extraInputProcess: primary input *SmokeBlsImageInput*, shape: [2x512x512x3], dataType:0, memType:1
nvdsinferserver_custom_process.cpp:184extraInputProcess: extra input *SmokeBlsCorrelationIdsInput*, shape: [2x1], dataType:3, memType:3
Stream IDS [1, 2]
Converting seg to box torch.Size([2, 256, 256, 1])
PP Boxes tensor([[0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0]], device='cuda:0', dtype=torch.int32)
Stream IDS [1, 2]
**PERF:  0.00 (0.00)	0.00 (0.80)	
**PERF:  0.00 (0.00)	0.00 (0.44)

It seems like the video is running super fast - see the marked line (marked with <<<<<<<<<<<<< in the log): the perf value is extremely high, while the first perf measurement with CPU output memory was, for example, **PERF: 11.62 (7.76) 17.44 (11.63)! I believe this issue is due to faulty memory access in the parser, because I addressed the GPU buffer as if it were a CPU one.
  3. So I kept output_mem_type : MEMORY_TYPE_GPU, just adding a cudaMemcpy to copy the output buffer to host before parsing (v2 of the parser). The result is basically the same as (1): some values are just bad, but most seem valid.

My guess: some kind of bad memory dereference occurs, or perhaps a race condition or buffer re-use causes this issue.
parserV1.cpp (4.5 KB)
parserV2.cpp (4.8 KB)
My nvinferserver pbtxt config:
config_triton_inferserver_primary_smoke.pbtxt (1.6 KB)
My ensemble config:
config.pbtxt (1.1 KB)
My postprocess config and model.py (2nd ensemble model):
model.py (6.7 KB)
config.pbtxt (311 Bytes)

Thanks.

Please provide complete information as applicable to your setup.

• Hardware Platform (Jetson / GPU)

• DeepStream Version

• JetPack Version (valid for Jetson only)

• TensorRT Version

• NVIDIA GPU Driver Version (valid for GPU only)

• Issue Type( questions, new requirements, bugs)

• How to reproduce the issue ? (This is for bugs. Including which sample app is using, the configuration files content, the command line used and other details for reproducing)

• Requirement details( This is for new requirement. Including the module name-for which plugin or for which sample application, the function description)

Hello,
We’re using the latest DeepStream 6.3 Docker container. Attaching the Dockerfile, requirements.txt and requirements-base.txt used to set up the environment.
I am using an x86_64 machine with an RTX A4000 on Ubuntu 22.04, local GPU driver version 535 (latest, CUDA 12.2). Docker is also the latest version.
This is a bug. Please note my first issue above: I have attached the model configuration, and I believe the issue can be reproduced from it together with the postprocessor model.py code.
requirements.txt (29 Bytes)
requirements-base.txt (223 Bytes)
Dockerfile (1.2 KB)
Please note: “deepstream-deploy” is an internal package. You may ignore this requirement; it is only a CLI tool.

  1. Please help us reproduce this issue. If possible, please share the whole project, including the simplified code, input video, models and configuration files. You can share it via the forum's private email.
  2. nvinferserver and Triton are open source; you can add logs to check, if interested.

I will build a docker container and post a pull link ASAP.

@fanzh
I built and pushed an image to recreate the issue.
Use the following commands:

xhost + # for display output using eglsink
docker run -it --gpus all --shm-size '2gb' -v /tmp/.X11-unix:/tmp/.X11-unix --ipc host --privileged -e DISPLAY=${DISPLAY} --entrypoint /smoke/run.sh public.ecr.aws/d7v4u7y1/captain-eye-pub-tests:mock-triton

I can’t start the Docker container; here is the log:
docker run --gpus device=0 --name fan-user -itd --shm-size '2gb' -v /tmp/.X11-unix:/tmp/.X11-unix --ipc host --privileged -e DISPLAY=${DISPLAY} --entrypoint /smoke/run.sh public.ecr.aws/d7v4u7y1/captain-eye-pub-tests:mock-triton
docker exec -it 8691fd5f147cb4f95d6ffbbd1748ab3df334377fdfb2f36245820fada0cd1a39 bash
Error response from daemon: Container 8691fd5f147cb4f95d6ffbbd1748ab3df334377fdfb2f36245820fada0cd1a39 is not running

Please see the last comment. There is also an ensemble sample in DS6.3 at opt\nvidia\deepstream\deepstream-6.3\sources\TritonBackendEnsemble; can you modify this sample to reproduce the issue?

The container runs just like that, perfectly fine.
Did you run xhost + to allow the X11 connection? Are you running on Linux?
The container is probably crashing due to some issue. Check it by running with -it in the docker run arguments.

After doing “xhost +” and using “-it”, I ran it again. Here is the error log:
error-0828.txt (1.8 KB)
Can you use the TritonBackendEnsemble sample to reproduce this issue?

I am trying to modify the example, and will update you ASAP.

@fanzh The problem now seems to be caused by enabling the sequence batcher in Triton for the main TensorRT model (the segmentation model).
I commented out the sequence_batching section in the model's config.pbtxt and it no longer got stuck. Re-enabling it caused DeepStream to freeze and Triton to crash with a “Segmentation fault (core dumped)” error.
My model config.pbtxt:

name : "topdown"
platform : "tensorrt_plan"
max_batch_size: 2
default_model_filename : "smoke333.onnx_fp32_min1_opt2_max2.engine"

sequence_batching {
  oldest
    {
      max_candidate_sequences: 4
    }
  state: [
    {
      input_name: "PreviousState"
      output_name: "leaky_re_lu_607"
      data_type: TYPE_FP32
      dims: [ 128, 128, 160 ]
      initial_state: {
       data_type: TYPE_FP32
       dims: [ 128, 128, 160 ]
       zero_data: true
       name: "initial state"
      }
    }
  ]
}

parameters: {
  key: "FORCE_CPU_ONLY_INPUT_TENSORS"
  value: {
    string_value:"no"
  }
}

My model spec (screenshot of the model's layers, attached):


All model configurations are the same, and I have commented out the C++ code that parses bboxes. The txt configs:
config_infer_secondary_ensemble.txt (2.8 KB)
deepstream_app_config_triton_backend_ensemble.txt (5.4 KB)
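One more note on the sequence batcher, since the BLS model is what calls this model: if sequence control has to be passed explicitly, the BLS request would look roughly like this (a sketch only, assuming the standard python_backend BLS API - not my actual BLS code, and “SegmentationOutput” is a placeholder name):

import triton_python_backend_utils as pb_utils

# Sketch only: passing a correlation ID and sequence-start flag from the BLS
# model to the sequence-batched "topdown" model.
def call_topdown(model_in, correlation_id: int, first_frame: bool):
    flags = pb_utils.TRITONSERVER_REQUEST_FLAG_SEQUENCE_START if first_frame else 0
    infer_request = pb_utils.InferenceRequest(
        model_name="topdown",
        requested_output_names=["SegmentationOutput"],
        inputs=[model_in],
        correlation_id=correlation_id,
        flags=flags)
    return infer_request.exec()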

Thanks for sharing.

  1. What is the model used for?
  2. Why are the config.pbtxt settings inconsistent with the model's layer names?
  3. Please help us reproduce this issue: please provide the simplified code, model, configuration and input data. Thanks!

Python can’t process GPU data, so setting “output_mem_type : MEMORY_TYPE_GPU” is not reasonable.

First, I'd like to point out that my goal is to use a semantic segmentation NN with custom inputs to detect boxes, which is why I am using a Triton ensemble with BLS to drive the NN.
The Triton pipeline is:

  • Ensemble:
    – Smoke BLS (preprocessing, input construction, calling the NN running on the TRT backend via BLS)
    – Postprocess
  1. The model provided in the screenshot is a semantic segmentation model. The postprocessor model converts the masks to boxes (x,y,x,y,conf,class) - see the sketch after this list.
  • The segmentation model gets a normalized 6-channel RGBRGB image: the first 3 channels are the “current frame” and the last 3 channels a “background frame”.
  • The segmentation model has an LSTM loop (the leaky_re_lu_607 output layer is fed back into the PreviousState input layer).
  • It also has a control vector input (InitVector) that changes the model’s behavior (its content is either all zeros, or 0.5 in the first item and zeros in the rest). The BLS model sets the content of that vector.
  • The output layer is (H, W, CLASS_CONF); it is a “scaled” output mask. This example is a single-class segmentation model!
  2. I did not attach the config.pbtxt of the NN, because it contains nothing beyond the model file name setting (dims are inferred). See below the full modified DS6.3 opt\nvidia\deepstream\deepstream-6.3\sources\TritonBackendEnsemble example archive.
  3. Here’s the example mentioned just above, as you asked me to construct:
    TritonBackendEnsemble.tar.gz (72.3 KB)
    Note! I have removed the NN onnx/engine files from the archive because it’s a proprietary NN. You might want to create a mock model instead.
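As referenced in item 1, a simplified sketch of that mask-to-box conversion (not the attached model.py; it assumes an (N, H, W, 1) mask and rows laid out as [x1, y1, x2, y2, conf, class] with confidence scaled to 0-100):

import torch

def masks_to_boxes(masks: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    """masks: (N, H, W, 1) float mask on any device -> (N, 6) int32 box rows."""
    boxes = torch.zeros((masks.shape[0], 6), dtype=torch.int32, device=masks.device)
    for i in range(masks.shape[0]):
        mask = masks[i, :, :, 0]
        ys, xs = torch.where(mask > threshold)
        if ys.numel() == 0:
            continue  # no detection: keep the all-zero row
        conf = int(mask[ys, xs].mean().item() * 100)  # confidence scaled to 0-100
        boxes[i] = torch.tensor(
            [int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()), conf, 0],
            dtype=torch.int32, device=masks.device)
    return boxes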

@fanzh Python seems to handle GPU data perfectly fine with CUDA shared memory and dlpack, using torch. Am I wrong here?
Attaching the relevant Triton Python Backend README section on DLPack interoperability and GPU tensor support, including the note for PyTorch 2.0:
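In short, the pattern that section describes looks roughly like this (a simplified sketch, not my actual model.py; tensor names other than PostProcessBoxesOutput are placeholders):

import torch
from torch.utils.dlpack import from_dlpack, to_dlpack
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            seg = pb_utils.get_input_tensor_by_name(request, "SegmentationMask")  # placeholder

            if seg.is_cpu():
                # Default / FORCE_CPU_ONLY_INPUT_TENSORS="yes": tensor arrives on the host
                seg_t = torch.from_numpy(seg.as_numpy()).cuda()
            else:
                # FORCE_CPU_ONLY_INPUT_TENSORS="no": zero-copy handoff on cuda:0
                # (with PyTorch >= 2.0, torch.from_dlpack(seg) should also work directly)
                seg_t = from_dlpack(seg.to_dlpack())

            boxes = torch.zeros((seg_t.shape[0], 6), dtype=torch.int32, device="cuda:0")
            # ... fill boxes from the mask here ...

            out = pb_utils.Tensor.from_dlpack("PostProcessBoxesOutput", to_dlpack(boxes))
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses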

Thanks!

Sorry for the late reply, and thanks for sharing. I checked the code, but it is hard to find the root cause without being able to reproduce and debug it.

The workflow is “model process -> python postprocessor -> custom_parse_bbox_func”. Can you check whether the python postprocessor's output data is correct? If it is, the nvinferserver plugin and low-level library are open source; you can add logs to debug.

nvinferserver lets Triton do the inference by calling the Triton API. nvinferserver and Triton are open source; can you narrow this issue down by adding logs and simplifying the code?

@fanzh
I got DeepStream + Triton + custom parser to work perfectly fine.

  1. I had built the *.engine file with the --fp16 flag, but apparently the build result was damaged - the results inside Triton were inconsistent. Switching to a non-fp16 engine fixed the inference issue. Note that even though I was building with trtexec and fp16, the input/output layers' dtype was still fp32; I then switched to fp32 only.
  2. The timing issue was resolved. I don't know exactly what the problem was, but using a minimally modified configuration of the provided TritonBackendEnsemble sample resolved it completely.
  3. Inference now works perfectly end to end, including multiurisourcebin, REST, the tracker and everything else.
  4. As I mentioned above, setting parameters: { key: "FORCE_CPU_ONLY_INPUT_TENSORS" value: { string_value: "no" } } in the Python backend model's config.pbtxt file does change the device of the tensors from cpu to cuda:0, so that should be noted as well.

Anyway, thanks a lot for your help. I believe this thread can be closed now, and I'd be happy to send you any configuration that may be relevant if you'd like to investigate further.


This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.