Order within the Triton Inference Server Python backend

1). Regarding Batch Order in Triton Inference Server with Python Backend:
We’re developing a high-performance video analytics system using DeepStream with Triton Inference Server and a Python backend. Our requirement is to maintain a fixed order of channels within a batch. For example, camera 1 should correspond to batch[0], camera 2 to batch[1], and so on. However, we’re encountering issues where the batch length varies randomly and the order changes. What solutions or configurations can ensure a consistent batch order?

2). Mitigating Video Stuttering with Long Inference Times:
In our system, the execute method in model.py takes up to 100 ms to process a single batch of frames, which leads to stuttering in the output video. Are there mechanisms, such as asynchronous inference or asynchronous overlay drawing through a buffer, that would let us process all frames while avoiding the stuttering?

Please find below the model.py used in the DeepStream application.

# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
#  * Redistributions of source code must retain the above copyright
#    notice, this list of conditions and the following disclaimer.
#  * Redistributions in binary form must reproduce the above copyright
#    notice, this list of conditions and the following disclaimer in the
#    documentation and/or other materials provided with the distribution.
#  * Neither the name of NVIDIA CORPORATION nor the names of its
#    contributors may be used to endorse or promote products derived
#    from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

import numpy as np
import cv2
import cupy as cp
import time



# triton_python_backend_utils is available in every Triton Python model. You
# need to use this module to create inference requests and responses. It also
# contains some utility functions for extracting information from model_config
# and converting Triton input/output types to numpy types.
import triton_python_backend_utils as pb_utils

class TritonPythonModel:
    """Your Python model must use the same class name. Every Python model
    that is created must have "TritonPythonModel" as the class name.
    """

    # def initialize(self, args):
    #     """`initialize` is called only once when the model is being loaded.
    #     Implementing `initialize` function is optional. This function allows
    #     the model to initialize any state associated with this model.

    #     Parameters
    #     ----------
    #     args : dict
    #       Both keys and values are strings. The dictionary keys and values are:
    #       * model_config: A JSON string containing the model configuration
    #       * model_instance_kind: A string containing model instance kind
    #       * model_instance_device_id: A string containing model instance device ID
    #       * model_repository: Model repository path
    #       * model_version: Model version
    #       * model_name: Model name
    #     """

    #     # You must parse model_config. JSON string is not parsed here
    #     self.model_config = model_config = json.loads(args["model_config"])

    #     # Get OUTPUT0 configuration
    #     output0_config = pb_utils.get_output_config_by_name(model_config, "OUTPUT_0")

    #     # Convert Triton types to numpy types
    #     self.output0_dtype = pb_utils.triton_string_to_numpy(
    #         output0_config["data_type"]
    #     )
 
    def initialize(self, args):
        pass

    def execute(self, requests):
        """`execute` MUST be implemented in every Python model. `execute`
        function receives a list of pb_utils.InferenceRequest as the only
        argument. This function is called when an inference request is made
        for this model. Depending on the batching configuration (e.g. Dynamic
        Batching) used, `requests` may contain multiple requests. Every
        Python model, must create one pb_utils.InferenceResponse for every
        pb_utils.InferenceRequest in `requests`. If there is an error, you can
        set the error argument when creating a pb_utils.InferenceResponse

        Parameters
        ----------
        requests : list
          A list of pb_utils.InferenceRequest

        Returns
        -------
        list
          A list of pb_utils.InferenceResponse. The length of this list must
          be the same as `requests`
        """
        responses = []

        logger = pb_utils.Logger
        # logger.log_error(f"Info Msg!:::::::::{test_module.value}")
        # logger.log_warn("Warning Msg!")
        # logger.log_error("Error Msg!")
        # logger.log_verbose("Verbose Msg!")
        for request in requests:
            
            input_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT0")
            frame_cp = cp.fromDlpack(input_tensor.to_dlpack())
            logger.log_warn(f"Warning Msg!{frame_cp.device}")
            
            # frame = input_tensor.as_numpy()
            print(f"frame size is {frame_cp.shape}")
            frame = cp.asnumpy(frame_cp)
            batch_size = frame.shape[0]
            out_tensor = pb_utils.Tensor.from_dlpack(
                "OUTPUT0", input_tensor.to_dlpack()
            )

            try:
                frame_0 = frame[0].astype(np.uint8)
                cv2.imshow('Video_0', frame_0)
                
                if batch_size>1:
                    frame_1 = frame[1].astype(np.uint8)
                    cv2.imshow('Video_1', frame_1)
                cv2.waitKey(1)
            except Exception as e:
                logger.log_warn(f"exception:{e}")

            stats = np.array([
                [360, 780, 360, 360, -1]  # Middle rectangle
            ])
            

            #replicated_array=np.array([stats1,stats2])
            replicated_array = np.tile(stats, (batch_size, 1, 1))

            logger.log_warn(f"Warning Msg:{replicated_array.shape}")
            stats = replicated_array.astype(np.float32)

            out_tensor_1 = pb_utils.Tensor(
                "OUTPUT1", stats
            )
            # self.MEM1.Set(frame[0,:,:200,:200])

            responses.append(pb_utils.InferenceResponse([out_tensor, out_tensor_1]))
        return responses

    def finalize(self):
        """`finalize` is called only once when the model is being unloaded.
        Implementing `finalize` function is OPTIONAL. This function allows
        the model to perform any necessary clean ups before exit.
        """
        print("Cleaning up...")

I will attach the minimal reproducible code via a GitHub URL as well.
This is the reference application: deepstream-rtsp-in-rtsp-out
Please check out the ensemble branch.
github link


Please provide complete information as applicable to your setup.

• Hardware Platform (Jetson / GPU)

• DeepStream Version

• JetPack Version (valid for Jetson only)

• TensorRT Version

• NVIDIA GPU Driver Version (valid for GPU only)

• Issue Type( questions, new requirements, bugs)

• How to reproduce the issue ? (This is for bugs. Including which sample app is using, the configuration files content, the command line used and other details for reproducing)

• Requirement details( This is for new requirement. Including the module name-for which plugin or for which sample application, the function description)

• x86 Machine with dGPU

• DeepStream Version : deepstream-6.3

• JetPack Version (valid for Jetson only)

• TensorRT Version: 8.6.1.

• NVIDIA GPU Driver Version (valid for GPU only) : Driver Version: 535.86.10 CUDA Version: 12.2

• Issue Type( questions, new requirements, bugs): questions

• How to reproduce the issue ? (This is for bugs. Including which sample app is using, the configuration files content, the command line used and other details for reproducing)
deepstream-rtsp-in-rtsp-out from the Python samples is the core of the sample app.

• Requirement details: A Dockerfile and the entire code, which depends on the container environment, are being shared. Please follow up.

  1. How do you know the order changes? Could you share a test log, or provide simplified code based on a DeepStream sample to reproduce this issue? Is this the case you mentioned: batch size is 1 and batch[0] corresponds to camera 1 when there is no camera 0 data?
  2. Which part of the model.py code consumes too much time?
  1. I am writing the image at batch index 0 with cv2.imwrite in order to see it, and the order changes randomly. Also, if the maximum batch size is n, I get anywhere from 0 to n-1 frames at random.
  2. model.py contains a sophisticated Python-based procedure, including a deep neural network element.
  1. The batch is generated by nvstreammux, which is not open source. The order changes because the muxer uses a round-robin algorithm to collect frames from the sources. nvinferserver is open source; you can reorder the data by source_id in GstNvInferServerImpl::processFullFrame (a hedged Python-side sketch of the reordering idea follows after this list).

  2. What are the functionalities of the Python backends centerface1 and centerface2? Since preprocessing is set in the nvinferserver configuration, the Python backend will receive preprocessed data. Why do you need to view the preprocessed data in the Python backend? You can view or save the frame after inference.

  3. About “stuttering in the output video”, do you mean “cv2.imshow”? For acceleration in the Python backend, please refer to pycuda or pynvvideocodec.
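To illustrate the reordering idea from point 1 in Python rather than in the nvinferserver C++ code, here is a minimal sketch. It assumes, hypothetically, that a per-frame SOURCE_ID tensor were forwarded to the Python backend alongside INPUT0; the configuration attached in this thread does not do that, so treat this purely as a sketch of the technique rather than the fix described above.

import numpy as np
import cupy as cp
import triton_python_backend_utils as pb_utils


def reorder_batch_by_source(request):
    """Hypothetical helper: make row i of the batch correspond to source/camera i.
    Assumes a per-frame SOURCE_ID input tensor, which is NOT part of the attached
    config and is used here only for illustration."""
    frames = cp.fromDlpack(
        pb_utils.get_input_tensor_by_name(request, "INPUT0").to_dlpack()
    )
    source_ids = pb_utils.get_input_tensor_by_name(
        request, "SOURCE_ID"
    ).as_numpy().reshape(-1)

    # Sort rows by source id. Note that when a camera contributes no frame,
    # the batch is simply shorter, so row i equals camera i only when every
    # camera with a smaller id also delivered a frame for this batch.
    order = np.argsort(source_ids)
    return frames[cp.asarray(order)], source_ids[order]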

Thank you @fanzh for some insights.

  1. I see the code is in C++; is there any way to make changes to it and develop or control it using Python 3?

  2. It is not just about viewing the data in the backend; we run a sophisticated computer vision algorithm there, and for development and optimisation of that algorithm, access to those tensors is necessary.

  3. The entire output video stutters; I think the batch is not inferred within a reasonable time. Whenever I increase the interval, everything is fine, but we need the interval to be at least 3.
    With the following config, the video is smooth:

input_control {
  process_mode: PROCESS_MODE_FULL_FRAME
  interval: 10
}

With 3 as the interval, the output video stutters.

Our computer vision algorithm that runs in the Python backend includes the following:

  1. A deep neural network
  2. Multiple cv2 operations (filtering, thresholding, etc.)
  3. Event capture based on scheduling
  4. Generating alarms and overlays for all the videos in the batch
  5. Logging and archiving features

Please advise us on how to optimise the Triton Python backend.

No. Triton is a separate inference module: DeepStream's nvinferserver passes the preprocessed data (tensors) to Triton through the Triton interface. nvinferserver is open-source C/C++ code.

The nvinferserver interval means: “Specifies the number of consecutive batches to be skipped for inference. Default is 0.”

Optimising the Python backend is on the Triton side; this is outside of DeepStream. You can find a reference in the Triton python_backend repository. You could also try asking in the Triton community.

Thank you @fanzh for some insights.

Regarding input_control::interval:

What I am saying is that we need the minimum batch skip to be 3 in order to do the AI detection in a meaningful way.

But the system performs badly: the video stutters with 3 as the value.

For example, please see the attached video.

WhatsApp Video 2024-04-09 at 9.18.45 AM.zip (2.3 MB)

  1. Noticing the source is an RTSP source, please check whether the issue is related to the source with the following command line:
gst-launch-1.0 uridecodebin uri=rtsp://127.0.0.1:8554/test ! nveglglessink
  1. To avoid network issues, please play the output RTSP video on the same machine, or use filesink to save an mp4 instead of the RTSP sink.
  2. If the stuttering problem still persists, please simplify the code to narrow it down. For example:
    If you remove the nvinferserver plugin, does the problem persist? If not, the problem is in nvinferserver.
    If you remove all processing in the Python backend, does the problem persist? If not, the problem is in the Python backend.
    Then measure the time consumption in the Python backend to check which part of the code consumes the most time (a minimal timing sketch follows below).
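Following that last step, here is a minimal sketch of per-stage timing inside execute, using only time.perf_counter and the pb_utils logger; the stage names are placeholders for the actual steps in the posted model.py (tensor copy, DNN, cv2 work), so adapt them to your code.

import time
import triton_python_backend_utils as pb_utils

class TritonPythonModel:
    def execute(self, requests):
        logger = pb_utils.Logger
        responses = []
        for request in requests:
            timings = {}

            t0 = time.perf_counter()
            # ... fetch INPUT0 and copy it to the host, as in the original model.py ...
            timings["input_copy_ms"] = (time.perf_counter() - t0) * 1000.0

            t0 = time.perf_counter()
            # ... deep neural network inference (e.g. the TensorFlow call) ...
            timings["dnn_ms"] = (time.perf_counter() - t0) * 1000.0

            t0 = time.perf_counter()
            # ... cv2 filtering / thresholding / overlay drawing ...
            timings["cv2_ms"] = (time.perf_counter() - t0) * 1000.0

            logger.log_warn(f"per-stage timings: {timings}")
            # ... build OUTPUT0/OUTPUT1 tensors and append the InferenceResponse ...
        return responses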

The problem is in the Python backend. In all other cases, the video is smooth.

about " 1. Deep Neural Network", please find “GPU” in this gpu-support. where did you do postprocessing? there is a postprocess sample with cuda acclerlation.
2. about other applications, could you share the use cases? we usually let triton only do inference. preprocessing and postprocessing are done in on DeepStream side(nvinferserver). can you do these applications in async mode to avoid consuming too much time python backend.

About " 5. logging and archiving features", there is a ready-made DeepStream plugin called nvmsgbroker. This plugin sends payload messages to the server using a specified communication protocol. please refer to deepstream-test4 for a nvmsgbroker sample.

We are doing the DNN work within the Python backend, using TensorFlow installed for Python 3 there. We don't have any issues with postprocessing; we are able to convert the tensors into bounding boxes with ease.

  1. Can we achieve async mode in the Python backend, e.g. spending more time on a batch than the frame rate allows, and still get a smooth output video?

  2. This is good feedback; we will try to use it.

Some ideas: work that is not related to the output video, such as events, logging, and archiving, can be done in parallel (a minimal sketch follows below).
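A minimal sketch of that idea in the Python backend: hand the non-critical work (events, logging, archiving) to a background worker thread through a queue so that execute can return the response without waiting for it. This is a sketch under the assumption that the queued work does not need to modify the output tensors.

import queue
import threading

class TritonPythonModel:
    def initialize(self, args):
        # Background worker for work that does not affect the output video.
        # Sketch only; merge with the initialize/execute of the posted model.py.
        self._jobs = queue.Queue(maxsize=64)
        self._stop = object()
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def _drain(self):
        while True:
            job = self._jobs.get()
            if job is self._stop:
                break
            try:
                job()  # e.g. write an alarm record, archive a crop, push a log line
            except Exception as exc:
                print(f"background job failed: {exc}")

    def execute(self, requests):
        responses = []
        for request in requests:
            # ... inference and overlay drawing stay on this thread ...
            # Non-critical work is queued instead of being done inline:
            self._jobs.put(lambda: print("archive/log this batch"))
            # ... append pb_utils.InferenceResponse as in the original model.py ...
        return responses

    def finalize(self):
        self._jobs.put(self._stop)
        self._worker.join()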

That's a good suggestion, thank you.

One doubt:
https://docs.nvidia.com/metropolis/deepstream/dev-guide/text/DS_plugin_gst-nvdspreprocess.html
Can I do Python-based preprocessing here? It says a .so file is needed.

No. nvdspreprocess is open source; it uses dlopen to load the .so file. nvdspreprocess will use the default .so, but you can use a new custom .so if you need to do custom preprocessing. Please refer to the deepstream-3d-action-recognition sample.

Thank you for the answers.