DeepStream Triton gRPC example does not run with DeepStream Triton Docker images

Hi all,

I am trying to run DeepStream and Triton Inference Server on different computers (nodes). The plan is to use one dedicated computer to handle inference and manage models, and several other computers to handle the streams. I can start 2 containers on the same computer and run the example successfully. But when I run it across 2 computers, I get

ERROR: infer_grpc_client.cpp:223 Failed to register CUDA shared memory.

Can you advise how to connect DeepStream to Triton Inference Server through gRPC, and how to get through the tutorial?

Please provide complete information as applicable to your setup.

• Hardware Platform (Jetson / GPU)
Computer 1: RTX3090
Computer 2: RTX3090
• DeepStream Version
Both computers: deepstream:6.1.1-triton
• NVIDIA GPU Driver Version (valid for GPU only)
515.86.01
• Issue Type( questions, new requirements, bugs)
bugs
• How to reproduce the issue ? (This is for bugs. Including which sample app is using, the configuration files content, the command line used and other details for reproducing)
Computer 1 (Server):

  1. Run the docker image

    docker run --gpus all -it --rm -v /tmp/.X11-unix:/tmp/.X11-unix -e DISPLAY=$DISPLAY --net=host --name=triton-server nvcr.io/nvidia/deepstream:6.1.1-triton

Inside the container

  1. cd samples

  2. ./prepare_ds_triton_model_repo.sh

  3. tritonserver --model-repository triton_model_repo/

Verify that the application can communicate through gRPC from a different container

  1. docker run --gpus all -it --rm -v /tmp/.X11-unix:/tmp/.X11-unix -e DISPLAY=$DISPLAY --net=host --name=triton-client nvcr.io/nvidia/deepstream:6.1.1-triton

  2. cd /opt/nvidia/deepstream/deepstream-6.1/samples/configs/deepstream-app-triton-grpc

  3. deepstream-app -c source30_1080p_dec_infer-resnet_tiled_display_int8.txt

Computer 2 (Client):

  1. Run the docker image

    docker run --gpus all -it --rm -v /tmp/.X11-unix:/tmp/.X11-unix -e DISPLAY=$DISPLAY --net=host --name=triton-client nvcr.io/nvidia/deepstream:6.1.1-triton

Inside the container

  2. Change the gRPC URL to the server IP

    cd /opt/nvidia/deepstream/deepstream-6.1/samples/configs/deepstream-app-triton-grpc

  3. vim config_infer_plan_engine_primary.txt

    grpc {
      url: "192.168.51.13:8001"
      # url: "localhost:8001"
      enable_cuda_buffer_sharing: true
    }

  4. deepstream-app -c source30_1080p_dec_infer-resnet_tiled_display_int8.txt

Error Messages:

WARNING: infer_proto_utils.cpp:144 auto-update preprocess.network_format to IMAGE_FORMAT_RGB
INFO: infer_grpc_backend.cpp:169 TritonGrpcBackend id:1 initialized for model: Primary_Detector
ERROR: infer_grpc_client.cpp:223 Failed to register CUDA shared memory.
ERROR: infer_grpc_client.cpp:311 Failed to set inference input: failed to register CUDA shared memory region ‘inbuf_0x55558a0e0800’: failed to open CUDA IPC handle: invalid resource handle
ERROR: infer_grpc_backend.cpp:140 gRPC backend run failed to create request for model: Primary_Detector
ERROR: infer_trtis_backend.cpp:350 failed to specify dims when running inference on model:Primary_Detector, nvinfer error:NVDSINFER_TRITON_ERROR
0:00:00.127168214 1531 0x55558ab10920 ERROR nvinferserver gstnvinferserver.cpp:375:gst_nvinfer_server_logger:<primary_gie> nvinferserver[UID 1]: Error in specifyBackendDims() <infer_grpc_context.cpp:154> [UID = 1]: failed to specify input dims triton backend for model:Primary_Detector, nvinfer error:NVDSINFER_TRITON_ERROR
0:00:00.127178191 1531 0x55558ab10920 ERROR nvinferserver gstnvinferserver.cpp:375:gst_nvinfer_server_logger:<primary_gie> nvinferserver[UID 1]: Error in createNNBackend() <infer_grpc_context.cpp:210> [UID = 1]: failed to specify triton backend input dims for model:Primary_Detector, nvinfer error:NVDSINFER_TRITON_ERROR
0:00:00.127188907 1531 0x55558ab10920 ERROR nvinferserver gstnvinferserver.cpp:375:gst_nvinfer_server_logger:<primary_gie> nvinferserver[UID 1]: Error in initialize() <infer_base_context.cpp:79> [UID = 1]: create nn-backend failed, check config file settings, nvinfer error:NVDSINFER_TRITON_ERROR
0:00:00.127192287 1531 0x55558ab10920 WARN nvinferserver gstnvinferserver_impl.cpp:547:start:<primary_gie> error: Failed to initialize InferTrtIsContext
0:00:00.127194044 1531 0x55558ab10920 WARN nvinferserver gstnvinferserver_impl.cpp:547:start:<primary_gie> error: Config file path: /opt/nvidia/deepstream/deepstream-6.1/samples/configs/deepstream-app-triton-grpc/config_infer_plan_engine_primary.txt
0:00:00.127210681 1531 0x55558ab10920 WARN nvinferserver gstnvinferserver.cpp:473:gst_nvinfer_server_start:<primary_gie> error: gstnvinferserver_impl start failed
** ERROR: main:716: Failed to set pipeline to PAUSED
Quitting
ERROR from primary_gie: Failed to initialize InferTrtIsContext
Debug info: gstnvinferserver_impl.cpp(547): start (): /GstPipeline:pipeline/GstBin:primary_gie_bin/GstNvInferServer:primary_gie:
Config file path: /opt/nvidia/deepstream/deepstream-6.1/samples/configs/deepstream-app-triton-grpc/config_infer_plan_engine_primary.txt
ERROR from primary_gie: gstnvinferserver_impl start failed
Debug info: gstnvinferserver.cpp(473): gst_nvinfer_server_start (): /GstPipeline:pipeline/GstBin:primary_gie_bin/GstNvInferServer:primary_gie
App run failed

• Requirement details( This is for new requirement. Including the module name-for which plugin or for which sample application, the function description)

Many Thanks!

Why is enable_cuda_buffer_sharing set to true? According to its description, it is meant for a local server:
Enable sharing of CUDA buffers with local Triton server for input tensors. If enabled, the input CUDA buffers are shared with the Triton server to improve performance. This feature should be enabled only when the Triton server is on the same machine.
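
A minimal sketch of that rule, assuming the decision can be made from the gRPC URL alone. The helper names and the set of loopback hosts are hypothetical and are not part of DeepStream or Triton; the sketch only renders the grpc block shown in the question with the flag set accordingly.

```python
# Hypothetical helper, not part of DeepStream or Triton.
# Assumption: a server is "local" only when addressed by a loopback name.
LOOPBACK_HOSTS = {"localhost", "127.0.0.1", "::1"}


def is_local_server(url: str) -> bool:
    """Return True when the host part of "host:port" is a loopback name."""
    host = url.rsplit(":", 1)[0].strip("[]")
    return host in LOOPBACK_HOSTS


def render_grpc_block(url: str) -> str:
    """Render the grpc section of the nvinferserver config as text."""
    sharing = "true" if is_local_server(url) else "false"
    return (
        "grpc {\n"
        f'  url: "{url}"\n'
        f"  enable_cuda_buffer_sharing: {sharing}\n"
        "}"
    )


# Remote server: CUDA buffer sharing is turned off.
print(render_grpc_block("192.168.51.13:8001"))
# Local server: CUDA buffer sharing is turned on.
print(render_grpc_block("localhost:8001"))
```

Note that this check is conservative: a server on the same machine that is addressed by its LAN IP would be treated as remote.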

Thanks! I commented out the line and it works. But now the frame rate and GPU utilization are much lower.

For example:
The source30_1080p_dec_infer-resnet_tiled_display_int8.txt example drops from 30 fps to less than 1 fps (different machines).

The source30_1080p_dec_infer-resnet_tiled_display_int8.txt example drops from 30 fps to less than 15 fps (same machine).

Is Triton Inference Server intended to run on the same host as DeepStream? It will be very hard to manage multiple streams with multiple Triton Inference Servers.

Thanks!

What kind of network card do you have: 1G, 2.5G, or 10G?
You can use "iftop -i ethX" (replace ethX with your NIC name) to monitor the real-time bandwidth while the program is running.

Thanks, I am using a 1G network, and the network is indeed the bottleneck.
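
A rough estimate of why a 1G link ends up below 1 fps. This is a sketch under a simplifying assumption: every frame crosses the network as uncompressed 1080p RGB (3 bytes per pixel), with no protocol overhead. The real payload depends on the tensor shape and data type the model expects, so the numbers are order-of-magnitude only. The function name is illustrative.

```python
def max_fps_per_stream(link_gbit_per_s: float, streams: int,
                       width: int = 1920, height: int = 1080,
                       bytes_per_pixel: int = 3) -> float:
    """Upper bound on per-stream frame rate that a network link can carry.

    Assumes uncompressed frames and ignores protocol overhead.
    """
    link_bytes_per_s = link_gbit_per_s * 1e9 / 8
    frame_bytes = width * height * bytes_per_pixel
    return link_bytes_per_s / frame_bytes / streams


# The example config decodes 30 streams.
for link in (1, 10, 100):
    fps = max_fps_per_stream(link, streams=30)
    print(f"{link:>3} Gbit/s: {fps:.2f} fps per stream")
```

Under these assumptions a 1 Gbit/s link carries about 0.67 fps per stream for 30 streams, which is in line with the observed "less than 1 fps".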

The network is indeed the bottleneck. Is there anything else we can help with? We’ll close this topic if no further support is needed.

For separate machines, yes.

I ran the server and client on the same machine with the same settings to eliminate the network bottleneck (which could be solved by buying a better switch). With enable_cuda_buffer_sharing: true commented out, I saw a very large drop in frame rate for the configuration source30_1080p_dec_infer-resnet_tiled_display_int8.txt (from ~30 fps to ~15 fps, eventually lagging out). nvtop shows ~1-3 GB/s of transfer for both RX and TX.

Is it correct that with enable_cuda_buffer_sharing disabled, the raw image has to be copied back to the host and then to the GPU again for inference, which significantly lowers the frame rate? A naive calculation puts the transfer at around 2 x 5.6 GB/s, which is well below the theoretical limit of PCIe Gen 4.
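
The naive calculation can be reproduced as follows. This is a sketch that assumes 30 streams at 30 fps of uncompressed 1080p RGB (3 bytes per pixel), and a PCIe Gen 4 x16 slot at roughly 31.5 GB/s per direction; the constant and function names are illustrative.

```python
# Approximate theoretical one-direction throughput of a PCIe Gen 4 x16 link.
PCIE_GEN4_X16_GB_PER_S = 31.5


def raw_stream_bandwidth_gb_per_s(streams: int, fps: float,
                                  width: int = 1920, height: int = 1080,
                                  bytes_per_pixel: int = 3) -> float:
    """Bandwidth needed to move every uncompressed frame once, in GB/s."""
    return streams * fps * width * height * bytes_per_pixel / 1e9


one_way = raw_stream_bandwidth_gb_per_s(streams=30, fps=30)
round_trip = 2 * one_way  # GPU -> host, then host -> GPU

print(f"one way:    {one_way:.2f} GB/s")
print(f"round trip: {round_trip:.2f} GB/s")
print(f"share of PCIe Gen 4 x16: {round_trip / PCIE_GEN4_X16_GB_PER_S:.0%}")
```

The one-way figure comes out at about 5.6 GB/s, matching the estimate above.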

The goal is to test whether I can separate the model from the DeepStream application, so that the model can be changed quickly without modifying the DeepStream application.

Please see the description of enable_cuda_buffer_sharing on the Gst-nvinferserver page of the DeepStream 6.1.1 Release documentation.
As the doc says, enable_cuda_buffer_sharing=true improves performance when the Triton server is on the same machine.

Thanks for the reply. Yes, I understand that. I am exploring options for using it on different machines, where the bottleneck does not seem to be the network (10G and even 100G network switches are available). The performance loss may be due to the data transfer to GPU memory, and I wonder whether there is a way or option to optimize it.

When the server and client are on the same machine, do you still see the “~30fps to ~15fps, and eventually lag out” issue if enable_cuda_buffer_sharing is true?

No, it runs smoothly at around 26 fps, just like using TensorRT as the backend. The bottleneck becomes the decoder, whose utilisation stays at 100% all the time.

There has been no update from you for a while, so we assume this is no longer an issue.
Hence we are closing this topic. If you need further support, please open a new one.
Thanks

Here are some options to improve performance:

  1. Use sample_720p.mp4 as the source.
  2. Use nvinferserver's interval property; see the Gst-nvinferserver page of the DeepStream 6.1.1 Release documentation.
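
To get a feel for how much these two options could reduce the data moved per second, here is a rough sketch. It assumes the payload scales with source resolution (uncompressed RGB, 3 bytes per pixel), which may not hold if frames are scaled to the network input size before they are sent, and that interval=N means N consecutive batches are skipped between inferences. The function name is illustrative.

```python
def inference_payload_gb_per_s(streams: int, fps: float,
                               width: int, height: int,
                               interval: int = 0,
                               bytes_per_pixel: int = 3) -> float:
    """Estimated data sent for inference per second, in GB/s.

    With interval=N, only one batch out of every N + 1 is inferred.
    """
    inferred_fps = fps / (interval + 1)
    return streams * inferred_fps * width * height * bytes_per_pixel / 1e9


baseline = inference_payload_gb_per_s(30, 30, 1920, 1080)
hd720 = inference_payload_gb_per_s(30, 30, 1280, 720)
hd720_skip = inference_payload_gb_per_s(30, 30, 1280, 720, interval=2)

print(f"1080p, interval=0: {baseline:.2f} GB/s")
print(f"720p,  interval=0: {hd720:.2f} GB/s")
print(f"720p,  interval=2: {hd720_skip:.2f} GB/s")
```

Under these assumptions, switching to 720p cuts the payload by a factor of 2.25, and interval=2 cuts it by a further factor of 3.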