Setup Information
• Hardware Platform (Jetson / GPU): GPU → RTX 4000 SFF
• DeepStream Version: 8.0 (docker container → nvcr.io/nvidia/deepstream:8.0-gc-triton-devel)
• TensorRT Version: v10.9.0 (trtexec reports [TensorRT v100900])
• NVIDIA GPU Driver Version (valid for GPU only): 550.107.02, CUDA Version: 12.8
• Issue Type (questions, new requirements, bugs):
Questions regarding Triton standalone server and DeepStream communication
• How to reproduce the issue? (This is for bugs. Include which sample app is used, the configuration file contents, the command line used, and other details for reproducing.)
I am using deepstream-test1.cpp in the Service Maker directory, with two changes: the starting part of the pipeline is replaced with nvurisrcbin, and the infer config file path is replaced with
"/opt/nvidia/deepstream/deepstream-8.0/samples/configs/deepstream-app-triton-grpc/config_infer_plan_engine_primary.txt".
The config itself has one change: in the grpc section, localhost:8001 has been replaced with 127.0.0.1:8001.
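For reference, the relevant backend section of that config looks roughly like this (a sketch assuming the standard nvinferserver protobuf layout; only the url field was changed from the shipped file, and the model name shown is the one from my setup):

```
infer_config {
  backend {
    triton {
      model_name: "Primary_Detector"
      grpc {
        url: "127.0.0.1:8001"
        enable_cuda_buffer_sharing: true
      }
    }
  }
}
```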
The application is fed an RTSP stream and runs the normal primary detector, as the infer config file specifies.
I run nsys on the standalone Triton server (which may or may not be in the same container; the issue remains either way) and see the following:
```
The following APIs use PAGEABLE memory which causes asynchronous CUDA memcpy
operations to block and be executed synchronously. This leads to low GPU
utilization.
Suggestion: If applicable, use PINNED memory instead.

Duration (ns)  Start (ns)  Src Kind  Dst Kind  Bytes (MB)  PID     Device ID  Context ID  Green Context ID  Stream ID  API Name
       752908   221944393  Pageable  Device        11.011  498702  0          1                             14         cudaMemcpyAsync_v3020
       519305   236956500  Pageable  Device        11.011  498702  0          1                             34         cudaMemcpyAsync_v3020
       269892   221228894  Pageable  Device         4.652  498702  0          1                             13         cudaMemcpyAsync_v3020
       225188   221616868  Pageable  Device         2.683  498702  0          1                             15         cudaMemcpyAsync_v3020
       216803   230406670  Pageable  Device         4.652  498702  0          1                             26         cudaMemcpyAsync_v3020
       115778   241155478  Pageable  Device         2.683  498702  0          1                             37         cudaMemcpyAsync_v3020
          736   221506242  Pageable  Device         0.007  498702  0          1                             13         cudaMemcpyAsync_v3020
          736   230851093  Pageable  Device         0.007  498702  0          1                             26         cudaMemcpyAsync_v3020
          544   222840983  Pageable  Device         0.002  498702  0          1                             14         cudaMemcpyAsync_v3020
          512   237479965  Pageable  Device         0.002  498702  0          1                             34         cudaMemcpyAsync_v3020
```
Pageable memory shows up even though everything in the pipeline is set to pinned memory. As a result there is a large number of memory copies going on, which leads to lower GPU utilization.
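To put a number on it, here is a quick tally of the pageable-copy volume from the nsys rows above (the tuples are just the duration and size columns copied from that table):

```python
# (duration_ns, bytes_mb) pairs for each pageable cudaMemcpyAsync row above.
rows = [
    (752908, 11.011),
    (519305, 11.011),
    (269892, 4.652),
    (225188, 2.683),
    (216803, 4.652),
    (115778, 2.683),
    (736, 0.007),
    (736, 0.007),
    (544, 0.002),
    (512, 0.002),
]

total_mb = sum(mb for _, mb in rows)          # total data copied from pageable host memory
total_ms = sum(ns for ns, _ in rows) / 1e6    # total time spent in these blocking copies
print(f"{total_mb:.3f} MB copied from pageable host memory in {total_ms:.2f} ms")
```

That is roughly 36.7 MB per sampled window going through pageable host memory, which is consistent with the frame-sized FP32 input tensors in the logs below rather than just the small output tensors.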
Also, increasing the verbosity of the Triton server leads to logs like:
```
I0106 05:34:15.224568 498702 infer_handler.h:1540] "Done for ModelInferHandler, 0"
I0106 05:34:15.224729 498702 grpc_server.cc:150] "Process for CudaSharedMemoryUnregister, rpc_ok=1, 274 step START"
I0106 05:34:15.224741 498702 grpc_server.cc:100] "Ready for RPC 'CudaSharedMemoryUnregister', 275"
I0106 05:34:15.224898 498702 grpc_server.cc:150] "Process for CudaSharedMemoryUnregister, rpc_ok=1, 274 step COMPLETE"
I0106 05:34:15.224904 498702 grpc_server.cc:356] "Done for CudaSharedMemoryUnregister, 274"
I0106 05:34:15.248688 498702 grpc_server.cc:150] "Process for CudaSharedMemoryRegister, rpc_ok=1, 275 step START"
I0106 05:34:15.248698 498702 grpc_server.cc:100] "Ready for RPC 'CudaSharedMemoryRegister', 276"
I0106 05:34:15.248931 498702 grpc_server.cc:150] "Process for CudaSharedMemoryRegister, rpc_ok=1, 275 step COMPLETE"
I0106 05:34:15.248938 498702 grpc_server.cc:356] "Done for CudaSharedMemoryRegister, 275"
I0106 05:34:15.249131 498702 infer_handler.cc:745] "Process for ModelInferHandler, rpc_ok=1, 0 step START"
I0106 05:34:15.249139 498702 infer_handler.cc:675] "New request handler for ModelInferHandler, 0"
I0106 05:34:15.249149 498702 infer_request.cc:133] "[request id: 275] Setting state from INITIALIZED to INITIALIZED"
I0106 05:34:15.249152 498702 infer_request.cc:914] "[request id: 275] prepared: [0x0x7f3f90018130] request id: 275, model: Primary_Detector, requested version: -1, actual version: 1, flags: 0x0, correlation id: 0, batch size: 1, priority: 0, timeout (us): 0\noriginal inputs:\n[0x0x7f3f90014628] input: input_1:0, type: FP32, original shape: [1,3,544,960], batch + shape: [1,3,544,960], shape: [3,544,960]\noverride inputs:\ninputs:\n[0x0x7f3f90014628] input: input_1:0, type: FP32, original shape: [1,3,544,960], batch + shape: [1,3,544,960], shape: [3,544,960]\noriginal requested outputs:\noutput_bbox/BiasAdd:0\noutput_cov/Sigmoid:0\nrequested outputs:\noutput_bbox/BiasAdd:0\noutput_cov/Sigmoid:0\n"
I0106 05:34:15.249157 498702 infer_request.cc:133] "[request id: 275] Setting state from INITIALIZED to PENDING"
I0106 05:34:15.249248 498702 infer_request.cc:133] "[request id: 275] Setting state from PENDING to EXECUTING"
I0106 05:34:15.249256 498702 tensorrt.cc:390] "model Primary_Detector, instance Primary_Detector_0_0, executing 1 requests"
I0106 05:34:15.249258 498702 instance_state.cc:359] "TRITONBACKEND_ModelExecute: Issuing Primary_Detector_0_0 with 1 requests"
I0106 05:34:15.249260 498702 instance_state.cc:408] "TRITONBACKEND_ModelExecute: Running Primary_Detector_0_0 with 1 requests"
I0106 05:34:15.249272 498702 instance_state.cc:1463] "Optimization profile default [0] is selected for Primary_Detector_0_0"
I0106 05:34:15.249295 498702 instance_state.cc:938] "Context with profile default [0] is being executed for Primary_Detector_0_0"
I0106 05:34:15.249461 498702 infer_response.cc:193] "add response output: output: output_cov/Sigmoid:0, type: FP32, shape: [1,4,34,60]"
I0106 05:34:15.249466 498702 infer_handler.cc:871] "GRPC: unable to provide 'output_cov/Sigmoid:0' in GPU, will use CPU"
I0106 05:34:15.249471 498702 infer_handler.cc:882] "GRPC: using buffer for 'output_cov/Sigmoid:0', size: 32640, addr: 0x7f409ca11770"
I0106 05:34:15.249474 498702 pinned_memory_manager.cc:198] "pinned memory allocation: size 32640, addr 0x7f414a000090"
I0106 05:34:15.249484 498702 infer_response.cc:193] "add response output: output: output_bbox/BiasAdd:0, type: FP32, shape: [1,16,34,60]"
I0106 05:34:15.249485 498702 infer_handler.cc:871] "GRPC: unable to provide 'output_bbox/BiasAdd:0' in GPU, will use CPU"
I0106 05:34:15.249493 498702 infer_handler.cc:882] "GRPC: using buffer for 'output_bbox/BiasAdd:0', size: 130560, addr: 0x7f409ca197d0"
I0106 05:34:15.249494 498702 pinned_memory_manager.cc:198] "pinned memory allocation: size 130560, addr 0x7f414a008020"
I0106 05:34:15.250288 498702 infer_handler.cc:1073] "ModelInferHandler::InferResponseComplete, 0 step ISSUED"
I0106 05:34:15.250293 498702 infer_handler.cc:153] "GRPC free: size 32640, addr 0x7f409ca11770"
I0106 05:34:15.250295 498702 infer_handler.cc:153] "GRPC free: size 130560, addr 0x7f409ca197d0"
I0106 05:34:15.250331 498702 infer_request.cc:133] "[request id: 275] Setting state from EXECUTING to RELEASED"
I0106 05:34:15.250333 498702 infer_handler.cc:642] "ModelInferHandler::InferRequestComplete"
I0106 05:34:15.250333 498702 infer_handler.h:1510] "Received notification for ModelInferHandler, 0"
I0106 05:34:15.250335 498702 instance_state.cc:1344] "TRITONBACKEND_ModelExecute: model Primary_Detector_0_0 released 1 requests"
I0106 05:34:15.250336 498702 infer_handler.cc:745] "Process for ModelInferHandler, rpc_ok=1, 0 step COMPLETE"
I0106 05:34:15.250337 498702 pinned_memory_manager.cc:226] "pinned memory deallocation: addr 0x7f414a000090"
I0106 05:34:15.250339 498702 pinned_memory_manager.cc:226] "pinned memory deallocation: addr 0x7f414a008020"
I0106 05:34:15.250341 498702 infer_handler.cc:745] "Process for ModelInferHandler, rpc_ok=1, 0 step FINISH"
I0106 05:34:15.250343 498702 infer_handler.h:1540] "Done for ModelInferHandler, 0"
I0106 05:34:15.250489 498702 grpc_server.cc:150] "Process for CudaSharedMemoryUnregister, rpc_ok=1, 275 step START"
I0106 05:34:15.250497 498702 grpc_server.cc:100] "Ready for RPC 'CudaSharedMemoryUnregister', 276"
I0106 05:34:15.250599 498702 grpc_server.cc:150] "Process for CudaSharedMemoryUnregister, rpc_ok=1, 275 step COMPLETE"
I0106 05:34:15.250604 498702 grpc_server.cc:356] "Done for CudaSharedMemoryUnregister, 275"
I0106 05:34:15.288864 498702 grpc_server.cc:150] "Process for CudaSharedMemoryRegister, rpc_ok=1, 276 step START"
I0106 05:34:15.288877 498702 grpc_server.cc:100] "Ready for RPC 'CudaSharedMemoryRegister', 277"
I0106 05:34:15.289124 498702 grpc_server.cc:150] "Process for CudaSharedMemoryRegister, rpc_ok=1, 276 step COMPLETE"
I0106 05:34:15.289131 498702 grpc_server.cc:356] "Done for CudaSharedMemoryRegister, 276"
I0106 05:34:15.289303 498702 infer_handler.cc:745] "Process for ModelInferHandler, rpc_ok=1, 0 step START"
I0106 05:34:15.289310 498702 infer_handler.cc:675] "New request handler for ModelInferHandler, 0"
```
How can this "GRPC: unable to provide 'output_bbox/BiasAdd:0' in GPU, will use CPU" fallback be eliminated?
I have tried using --ipc=host when running the Triton server and the app pipeline in separate containers, but to no avail. I also checked for a Triton server version issue by using a newer server image, nvcr.io/nvidia/tritonserver:25.12-py3, instead of the one shipped in the deepstream-8.0 container. That was not it either.
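For completeness, the separate-container setup looks roughly like this (a sketch: the model repository path is a placeholder, and I have omitted volume mounts and the exact app command line):

```shell
# Standalone Triton server with verbose logging; --ipc=host and --net=host
# so the DeepStream container can reach gRPC on 127.0.0.1:8001 and share IPC.
docker run --gpus all --rm --ipc=host --net=host \
  nvcr.io/nvidia/tritonserver:25.12-py3 \
  tritonserver --model-repository=/models --log-verbose=1

# DeepStream Service Maker app container, same IPC and network namespaces.
docker run --gpus all --rm --ipc=host --net=host -it \
  nvcr.io/nvidia/deepstream:8.0-gc-triton-devel
```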
There seems to be memory copying happening even though flags like enable_cuda_buffer_sharing: true are set in the config.