Excess memory copies in standalone Triton server (DeepStream → Triton server over gRPC)

Setup Information

• Hardware Platform (Jetson / GPU): GPU (RTX 4000 SFF)
• DeepStream Version: 8.0 (docker container: nvcr.io/nvidia/deepstream:8.0-gc-triton-devel)
• TensorRT Version: TensorRT v100900 (as reported by trtexec)
• NVIDIA GPU Driver Version (valid for GPU only): 550.107.02, CUDA Version: 12.8
• Issue Type (questions, new requirements, bugs):

Questions regarding Triton standalone server and deepstream communication


• How to reproduce the issue? (This is for bugs. Include which sample app is used, the configuration file contents, the command line used, and other details for reproducing.)

I am using deepstream-test1.cpp in the service-maker directory, with two changes: the starting part of the pipeline is replaced with nvurisrcbin, and the infer config file path is replaced with
/opt/nvidia/deepstream/deepstream-8.0/samples/configs/deepstream-app-triton-grpc/config_infer_plan_engine_primary.txt.

The config has one change: in the grpc section, localhost:8001 has been replaced with 127.0.0.1:8001.
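For reference, the relevant part of the config looks roughly like this (a sketch only; the field layout follows the stock nvinferserver gRPC sample, and only the url value was changed):

```
infer_config {
  backend {
    triton {
      grpc {
        url: "127.0.0.1:8001"
        enable_cuda_buffer_sharing: true
      }
    }
  }
}
```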

The application is fed an RTSP stream and runs the normal primary detector as the infer config file specifies.

I profile the standalone Triton server with nsys (the server may or may not run in the same container; the issue remains either way) and see the following:

The following APIs use PAGEABLE memory which causes asynchronous CUDA memcpy
operations to block and be executed synchronously. This leads to low GPU
utilization.

Suggestion: If applicable, use PINNED memory instead.

Duration (ns)  Start (ns)  Src Kind  Dst Kind  Bytes (MB)   PID    Device ID  Context ID  Green Context ID  Stream ID        API Name

    752908   221944393  Pageable  Device        11.011  498702          0           1                           14  cudaMemcpyAsync_v3020
    519305   236956500  Pageable  Device        11.011  498702          0           1                           34  cudaMemcpyAsync_v3020
    269892   221228894  Pageable  Device         4.652  498702          0           1                           13  cudaMemcpyAsync_v3020
    225188   221616868  Pageable  Device         2.683  498702          0           1                           15  cudaMemcpyAsync_v3020
    216803   230406670  Pageable  Device         4.652  498702          0           1                           26  cudaMemcpyAsync_v3020
    115778   241155478  Pageable  Device         2.683  498702          0           1                           37  cudaMemcpyAsync_v3020
       736   221506242  Pageable  Device         0.007  498702          0           1                           13  cudaMemcpyAsync_v3020
       736   230851093  Pageable  Device         0.007  498702          0           1                           26  cudaMemcpyAsync_v3020
       544   222840983  Pageable  Device         0.002  498702          0           1                           14  cudaMemcpyAsync_v3020
       512   237479965  Pageable  Device         0.002  498702          0           1                           34  cudaMemcpyAsync_v3020
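Note that the flagged copies come in two identical groups (stream IDs 13/14/15 and 26/34/37), so the pageable traffic in this window totals roughly 36.7 MB:

```shell
# Sum of the Bytes (MB) column above: two identical groups of five copies each
awk 'BEGIN { printf "%.3f MB\n", 2 * (11.011 + 4.652 + 2.683 + 0.007 + 0.002) }'
# -> 36.710 MB
```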

So nsys reports pageable memory even though everything in the pipeline is set to pinned memory. As a result there is a large number of memory copies, which lowers GPU utilization.

Also, increasing the verbosity of the Triton server produces logs like:


I0106 05:34:15.224568 498702 infer_handler.h:1540] “Done for ModelInferHandler, 0”
I0106 05:34:15.224729 498702 grpc_server.cc:150] “Process for CudaSharedMemoryUnregister, rpc_ok=1, 274 step START”
I0106 05:34:15.224741 498702 grpc_server.cc:100] “Ready for RPC ‘CudaSharedMemoryUnregister’, 275”
I0106 05:34:15.224898 498702 grpc_server.cc:150] “Process for CudaSharedMemoryUnregister, rpc_ok=1, 274 step COMPLETE”
I0106 05:34:15.224904 498702 grpc_server.cc:356] “Done for CudaSharedMemoryUnregister, 274”
I0106 05:34:15.248688 498702 grpc_server.cc:150] “Process for CudaSharedMemoryRegister, rpc_ok=1, 275 step START”
I0106 05:34:15.248698 498702 grpc_server.cc:100] “Ready for RPC ‘CudaSharedMemoryRegister’, 276”
I0106 05:34:15.248931 498702 grpc_server.cc:150] “Process for CudaSharedMemoryRegister, rpc_ok=1, 275 step COMPLETE”
I0106 05:34:15.248938 498702 grpc_server.cc:356] “Done for CudaSharedMemoryRegister, 275”
I0106 05:34:15.249131 498702 infer_handler.cc:745] “Process for ModelInferHandler, rpc_ok=1, 0 step START”
I0106 05:34:15.249139 498702 infer_handler.cc:675] “New request handler for ModelInferHandler, 0”
I0106 05:34:15.249149 498702 infer_request.cc:133] “[request id: 275] Setting state from INITIALIZED to INITIALIZED”
I0106 05:34:15.249152 498702 infer_request.cc:914] “[request id: 275] prepared: [0x0x7f3f90018130] request id: 275, model: Primary_Detector, requested version: -1, actual version: 1, flags: 0x0, correlation id: 0, batch size: 1, priority: 0, timeout (us): 0\noriginal inputs:\n[0x0x7f3f90014628] input: input_1:0, type: FP32, original shape: [1,3,544,960], batch + shape: [1,3,544,960], shape: [3,544,960]\noverride inputs:\ninputs:\n[0x0x7f3f90014628] input: input_1:0, type: FP32, original shape: [1,3,544,960], batch + shape: [1,3,544,960], shape: [3,544,960]\noriginal requested outputs:\noutput_bbox/BiasAdd:0\noutput_cov/Sigmoid:0\nrequested outputs:\noutput_bbox/BiasAdd:0\noutput_cov/Sigmoid:0\n”
I0106 05:34:15.249157 498702 infer_request.cc:133] “[request id: 275] Setting state from INITIALIZED to PENDING”
I0106 05:34:15.249248 498702 infer_request.cc:133] “[request id: 275] Setting state from PENDING to EXECUTING”
I0106 05:34:15.249256 498702 tensorrt.cc:390] “model Primary_Detector, instance Primary_Detector_0_0, executing 1 requests”
I0106 05:34:15.249258 498702 instance_state.cc:359] “TRITONBACKEND_ModelExecute: Issuing Primary_Detector_0_0 with 1 requests”
I0106 05:34:15.249260 498702 instance_state.cc:408] “TRITONBACKEND_ModelExecute: Running Primary_Detector_0_0 with 1 requests”
I0106 05:34:15.249272 498702 instance_state.cc:1463] “Optimization profile default [0] is selected for Primary_Detector_0_0”
I0106 05:34:15.249295 498702 instance_state.cc:938] “Context with profile default [0] is being executed for Primary_Detector_0_0”
I0106 05:34:15.249461 498702 infer_response.cc:193] “add response output: output: output_cov/Sigmoid:0, type: FP32, shape: [1,4,34,60]”
I0106 05:34:15.249466 498702 infer_handler.cc:871] “GRPC: unable to provide ‘output_cov/Sigmoid:0’ in GPU, will use CPU”
I0106 05:34:15.249471 498702 infer_handler.cc:882] “GRPC: using buffer for ‘output_cov/Sigmoid:0’, size: 32640, addr: 0x7f409ca11770”
I0106 05:34:15.249474 498702 pinned_memory_manager.cc:198] “pinned memory allocation: size 32640, addr 0x7f414a000090”
I0106 05:34:15.249484 498702 infer_response.cc:193] “add response output: output: output_bbox/BiasAdd:0, type: FP32, shape: [1,16,34,60]”
I0106 05:34:15.249485 498702 infer_handler.cc:871] “GRPC: unable to provide ‘output_bbox/BiasAdd:0’ in GPU, will use CPU”
I0106 05:34:15.249493 498702 infer_handler.cc:882] “GRPC: using buffer for ‘output_bbox/BiasAdd:0’, size: 130560, addr: 0x7f409ca197d0”
I0106 05:34:15.249494 498702 pinned_memory_manager.cc:198] “pinned memory allocation: size 130560, addr 0x7f414a008020”
I0106 05:34:15.250288 498702 infer_handler.cc:1073] “ModelInferHandler::InferResponseComplete, 0 step ISSUED”
I0106 05:34:15.250293 498702 infer_handler.cc:153] “GRPC free: size 32640, addr 0x7f409ca11770”
I0106 05:34:15.250295 498702 infer_handler.cc:153] “GRPC free: size 130560, addr 0x7f409ca197d0”
I0106 05:34:15.250331 498702 infer_request.cc:133] “[request id: 275] Setting state from EXECUTING to RELEASED”
I0106 05:34:15.250333 498702 infer_handler.cc:642] “ModelInferHandler::InferRequestComplete”
I0106 05:34:15.250333 498702 infer_handler.h:1510] “Received notification for ModelInferHandler, 0”
I0106 05:34:15.250335 498702 instance_state.cc:1344] “TRITONBACKEND_ModelExecute: model Primary_Detector_0_0 released 1 requests”
I0106 05:34:15.250336 498702 infer_handler.cc:745] “Process for ModelInferHandler, rpc_ok=1, 0 step COMPLETE”
I0106 05:34:15.250337 498702 pinned_memory_manager.cc:226] “pinned memory deallocation: addr 0x7f414a000090”
I0106 05:34:15.250339 498702 pinned_memory_manager.cc:226] “pinned memory deallocation: addr 0x7f414a008020”
I0106 05:34:15.250341 498702 infer_handler.cc:745] “Process for ModelInferHandler, rpc_ok=1, 0 step FINISH”
I0106 05:34:15.250343 498702 infer_handler.h:1540] “Done for ModelInferHandler, 0”
I0106 05:34:15.250489 498702 grpc_server.cc:150] “Process for CudaSharedMemoryUnregister, rpc_ok=1, 275 step START”
I0106 05:34:15.250497 498702 grpc_server.cc:100] “Ready for RPC ‘CudaSharedMemoryUnregister’, 276”
I0106 05:34:15.250599 498702 grpc_server.cc:150] “Process for CudaSharedMemoryUnregister, rpc_ok=1, 275 step COMPLETE”
I0106 05:34:15.250604 498702 grpc_server.cc:356] “Done for CudaSharedMemoryUnregister, 275”
I0106 05:34:15.288864 498702 grpc_server.cc:150] “Process for CudaSharedMemoryRegister, rpc_ok=1, 276 step START”
I0106 05:34:15.288877 498702 grpc_server.cc:100] “Ready for RPC ‘CudaSharedMemoryRegister’, 277”
I0106 05:34:15.289124 498702 grpc_server.cc:150] “Process for CudaSharedMemoryRegister, rpc_ok=1, 276 step COMPLETE”
I0106 05:34:15.289131 498702 grpc_server.cc:356] “Done for CudaSharedMemoryRegister, 276”
I0106 05:34:15.289303 498702 infer_handler.cc:745] “Process for ModelInferHandler, rpc_ok=1, 0 step START”
I0106 05:34:15.289310 498702 infer_handler.cc:675] “New request handler for ModelInferHandler, 0”
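Incidentally, the pinned-buffer sizes in the log are exactly what the declared FP32 output shapes imply (4 bytes per element), so the server-side pinned staging itself looks consistent:

```shell
# FP32 tensors: bytes = product(shape) * 4
echo $((1 * 4 * 34 * 60 * 4))    # output_cov/Sigmoid:0  [1,4,34,60]  -> 32640
echo $((1 * 16 * 34 * 60 * 4))   # output_bbox/BiasAdd:0 [1,16,34,60] -> 130560
```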


How can this "GRPC: unable to provide 'output_bbox/BiasAdd:0' in GPU, will use CPU" be eliminated?

I have tried using --ipc=host when running the Triton server and the app pipeline in separate containers, but to no avail. I also checked for a Triton server version issue by using a newer server image (nvcr.io/nvidia/tritonserver:25.12-py3) instead of the one already shipped in the deepstream-8.0 container; that was not it either.
Memory copies seem to happen even though flags like enable_cuda_buffer_sharing: true are set.

  1. The standalone Triton server has its own mechanism; the logs you posted here are all Triton server logs. Triton server is open source, so you can find the answer in the code: server/src/grpc/infer_handler.cc at main · triton-inference-server/server.
  2. The DeepStream nvinferserver plugin is just a client of the Triton server; you chose gRPC as the communication and buffer-exchange method between the client and the server. The gRPC protocol stack runs on the CPU and can only handle system memory, not GPU memory, so the GPU-to-CPU and CPU-to-GPU memory copies are necessary.

Ok will check the code. Thanks.

Also, will using the in-process version of nvinferserver result in multiple model instances being loaded into memory, similar to nvinfer?
Say I run multiple DeepStream pipelines, all in different processes: will each individually load a model instance into memory to perform inference, or will the processes share a single model that is loaded once (like the standalone Triton server)?

If it is the former, what are the primary advantages or intended use cases of nvinferserver over nvinfer when Triton is not deployed as an external server?

Also: what is enable_cuda_buffer_sharing responsible for, if gRPC is CPU-only?

It is the former. nvinferserver is similar to nvinfer when it is used with Triton C APIs.

This is for sharing input CUDA buffers between the Triton client and the Triton server. It applies only to input tensor buffers.

I see,

  1. So input buffers can be shared with the standalone Triton server via enable_cuda_buffer_sharing, but the outputs returned to the client go through a memory copy. Is my understanding correct?

  2. If that is true, why does nsys report pageable memory as discussed above? Shouldn't the CUDA tensor buffer be free of memory copies on input, and only be copied when the inference results are returned?

  3. Also, is there a way to get the input and output tensors to and from the GPU for the same client DeepStream pipeline after inference by a shared model (multiple pipelines calling the same model), minimizing both memory copies and model-instance loading?

Yes, it is true.

Output tensor should be copied from GPU to CPU before sending back from server to client.

What do you mean? Are you asking for both input and output tensor buffers to be shared between nvinferserver and the Triton Inference Server?

Can you consider using the C API mode of nvinferserver?

Yes, if both input and output tensors could be shared between Triton and nvinferserver, it would be helpful. Is it possible?

Considering the C API version: won't that mean I will have to deal with multiple models being loaded into memory, one per pipeline? Say I have 100 DS-8 pipelines each running nvinferserver (C API mode), with 60-ish of them using the same primary detection model. Won't the model then be loaded into memory individually for all 60 clients (which is the same as how nvinfer behaves)?

What do you mean by 100 pipelines with 60 clients?

Something like

gst-launch-1.0 nvurisrcbin ! nvstreammux ! nvinferserver ! fakesink

So just 100 copies of the above command
(or the deepstream-test1-app executable run multiple times in different processes).

Basically, any generic DeepStream pipeline that performs some kind of inference, with multiple copies of it running against different sources (files, RTSP URLs, etc.) in different processes.

Something along the lines of this.

for ((c=0; c<100; c++)); do
    GST_DEBUG=3 \
    /opt/nvidia/deepstream/deepstream-8.0/service-maker/sources/apps/cpp/deepstream_test1_app/build/deepstream-test1-app "$URL" \
        >"$LOG_FILE" 2>&1 &
    PID_CONS=$!
    CONS_PIDS+=("$PID_CONS")
done

Since you have only one model, you can run multiple sources in the same pipeline. Please refer to /opt/nvidia/deepstream/deepstream/service-maker/sources/apps/cpp/deepstream_test3_app
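For example, a single pipeline batching two sources through one nvstreammux and one nvinferserver instance could look roughly like this (a sketch only: the URIs and the config path are placeholders, and the nvstreammux properties needed depend on which streammux version is enabled):

```
gst-launch-1.0 \
  nvstreammux name=mux batch-size=2 width=1920 height=1080 batched-push-timeout=40000 ! \
  nvinferserver config-file-path=<your_config.txt> ! fakesink \
  nvurisrcbin uri=rtsp://camera-1/stream ! mux.sink_0 \
  nvurisrcbin uri=rtsp://camera-2/stream ! mux.sink_1
```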

The example above has one model. So something like multiurisrcbin would work if I need to dynamically add or remove sources (RTSP URLs, files, etc.), right?

What if I have multiple models to choose from and may need to switch between them (whether a different version of the same model or a different model entirely)? What approach is recommended then?

Yes.

What is the relationship between the models?

For example, an INT8 version versus an FP16 version, and so on.
Different models could be a detection model being switched for a recognition model, and so on.

Can you also tell us the reason for switching between the models dynamically? Even with the Triton server, all models are loaded after the server starts running, so it is not for resource reasons, right?