It seems that the DeepStream examples use a different version of the Triton server? This means the model has to be loaded every time I run an app. With the standalone Triton server, it would load the model once and keep it loaded, making app startup quicker, I assume?
Or am I missing some important aspect of quickly debugging a new app without loading the model each time?
I had originally posted this in the Jetson Xavier NX forum but it must have been moved. I had hoped the answer would be platform agnostic, but the details are:
• Hardware Platform (Jetson / GPU)
Jetson Xavier NX
• DeepStream Version
Any
• JetPack Version (valid for Jetson only)
Any
• TensorRT Version
Any
• Issue Type (questions, new requirements, bugs)
Question
• How to reproduce the issue? (This is for bugs. Include which sample app is used, the configuration file contents, the command line used, and other details for reproducing.)
N/A
• Requirement details (This is for new requirements. Include the module name - for which plugin or which sample application - and the function description.)
Thanks for the reply. I'm not sure what part of your link is most relevant - are you saying that it's not supported at all, or only on the Xavier NX? I see the "Triton for Jetson" section doesn't mention the Xavier NX; however, the release section for the standalone Triton app doesn't list compatibility at that level, it only mentions "Jetson": Releases · triton-inference-server/server · GitHub
In any case, I am able to run the standalone server on the Xavier NX.
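For reference, here is roughly how I am verifying that the standalone server is up and has a model resident - just a minimal sketch using the Python tritonclient package, assuming the default HTTP endpoint on port 8000, with "my_model" as a placeholder for whatever is in my model repository:

```python
# Minimal check that the standalone Triton server (already running on the
# Xavier NX) is live and has the model loaded. Assumes the default HTTP
# endpoint on localhost:8000; "my_model" is a placeholder model name.
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

print("server live: ", client.is_server_live())
print("server ready:", client.is_server_ready())
print("model ready: ", client.is_model_ready("my_model"))
```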
If you could please clarify:
1) Does the DeepStream app launch its own version of the Triton server? Is it invoking a "special" DeepStream version, or is it launching the standalone one, albeit in the same process? Or does it not use the Triton server at all?
2) Is it possible (on any platform) to have the DeepStream app use the standalone server?
3) If the answer to 2) is no, I would hope that the nvinfer plugin uses the C-API interface to the Triton server and could therefore be modified slightly to not launch its own instance and instead use an already running one?
Does the DeepStream app launch its own version of the Triton server? ==> DS 5.1 is based on the Triton 2.5 (20.11) interface. Since it's wrapped as a GST plugin, it can only receive GST buffers from other GST plugins.
Is it possible (on any platform) to have the DeepStream app use the standalone server? ==> No.
the nvinfer plugin uses the C-API interface to the Triton server ==> nvinfer is based on TensorRT; nvinferserver is based on Triton.
could therefore be modified slightly to not launch its own instance and instead use an already running one? ==> nvinferserver is not open source, so you can't modify it.
If you want to use standalone Triton, as in the link and the screenshot I shared above, you can install the Jetson Triton build on the Xavier NX.
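As a rough illustration only (this is not one of the DeepStream samples): once the standalone Jetson Triton server is running as its own process, any client can send requests to the already-loaded model over HTTP or gRPC, for example with the Python tritonclient package. The model name, tensor names, shape and dtype below are placeholders for your own model:

```python
# Sketch of a client-side request to a standalone Triton server; the server
# keeps the model loaded between client runs. Model name, tensor names,
# shape and dtype are placeholders.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Dummy input matching the placeholder model's expected shape/dtype.
data = np.zeros((1, 3, 224, 224), dtype=np.float32)
inp = httpclient.InferInput("input_0", list(data.shape), "FP32")
inp.set_data_from_numpy(data)

result = client.infer(model_name="my_model", inputs=[inp])
print(result.as_numpy("output_0").shape)
```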
To summarise as I understand it: DeepStream cannot use the standalone Triton server, so I cannot take advantage of a server running constantly with the model already loaded, and I also cannot edit the plugin to allow this. This is the case on any platform.
I guess in a perfect world there would be a DeepStream plugin allowing use of the standalone Triton server, using CUDA shared memory or some other fast, low-latency, zero-copy path for inference - if you have a system for tracking such requests, I would appreciate it if you added this one.
One last clarification - is there another way of speeding up DeepStream startup time, or must I always wait in some way for the model to load each time I start an app?
Sorry! I'm not clear about what this means. For DS nvinferserver (Triton), it only needs to load the model once at the initialization stage; after that, it can use the loaded model for inference.
For example, with the Triton server running inference on a TF saved model, doesn't it also need to load the saved model on every boot-up?
An example: say I open a DeepStream app 4 times. Each time I close the app and reopen it, I must wait for the model to be loaded (perhaps a few minutes), so each of the 4 runs costs a few minutes of waiting. I want to use DeepStream since it gives the most optimised pipeline for maximum speed (is this true?); however, when developing/debugging my own app based on the DeepStream examples, the long startup / model load time is a hindrance to fast debug cycles.
Therefore, perhaps naively, my thought is that if DeepStream could instead use the standalone Triton server, the model would be loaded once and stay loaded in the separate Triton process, meaning the DeepStream app itself would start quickly since it is not loading the model, and my debug cycles would be shorter.
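To make that concrete, this is the kind of throwaway check I have in mind (a sketch only, with a placeholder model name), where each run of the script only pays for the request itself because the model stays loaded in the separate Triton process:

```python
# Sketch: each invocation talks to the long-running Triton process, so no
# per-run model load is paid. "my_model" is a placeholder model name.
import time
import tritonclient.http as httpclient

start = time.perf_counter()
client = httpclient.InferenceServerClient(url="localhost:8000")
ready = client.is_model_ready("my_model")
print(f"model ready={ready} after {time.perf_counter() - start:.3f}s")
```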
I hope that's clearer - basically I am looking for the most efficient way to debug apps.
Edit: Perhaps I am misunderstanding due to my weak knowledge - do the DeepStream Triton examples spawn a Triton server that remains running once the DeepStream app is closed? I have only recently come to appreciate (and more so with your help) that there are separate examples for Triton, and perhaps I haven't run one yet to see how it works.