Question (also related to another topic that @mchi had replied to, but rephrased):
How does one set up DeepStream to run on 1, 2, or 4 GPUs of a 4-GPU server? I would like to do that and benchmark our hardware serving inference (for the same model) using 1, 2, or 4 GPUs simultaneously.
I believe that one could (or should) use Triton to do this.
I would like to do this using the following command:
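A typical invocation, assuming the standard deepstream-app launcher and the source config named below, would be:

deepstream-app -c source1_primary_retinanet_resnet18.txt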
After pulling and running the DeepStream 6.1-Triton docker image (docker pull nvcr.io/nvidia/deepstream:6.1-triton), I set up the chosen model in the Triton model repository under samples. There is a config.pbtxt there, which I use to run 2 instances on each GPU:
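For the container run itself, the set of GPUs visible inside it can be limited at docker run time; a sketch, with volume mounts and display flags omitted:

docker run --gpus all -it --rm nvcr.io/nvidia/deepstream:6.1-triton
# or, for the 1- or 2-GPU benchmarks, expose only a subset of devices:
docker run --gpus '"device=0,1"' -it --rm nvcr.io/nvidia/deepstream:6.1-triton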
instance_group {
  count: 2        # two instances of the model per GPU
  kind: KIND_GPU  # with no gpus field, instances are created on every visible GPU
}
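As I understand it, instance_group also accepts a gpus field to pin instances to specific devices rather than all visible GPUs, e.g. for a 2-GPU run (device ids assumed):

instance_group {
  count: 2
  kind: KIND_GPU
  gpus: [ 0, 1 ]   # create instances only on GPU 0 and GPU 1
}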
Then there are two other config files that also refer to the GPU id:
source1_primary_retinanet_resnet18.txt
config_infer_primary_retinanet_resnet18.txt
The question is: how should those two config files be written so that they use 1, 2, or 4 GPU devices at the same time to process incoming requests from, say, 16 sources?
Could you please let me know how to do this within those two config files? What sections are needed in each?
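For reference, my current guess at where the GPU id shows up in each file (the values and excerpts below are assumptions, not a verified setup):

# source1_primary_retinanet_resnet18.txt (deepstream-app config, excerpt)
[streammux]
gpu-id=0

[primary-gie]
gpu-id=0
config-file=config_infer_primary_retinanet_resnet18.txt

# config_infer_primary_retinanet_resnet18.txt (nvinferserver pbtxt, excerpt)
infer_config {
  gpu_ids: [0]
  # backend, preprocess, etc. omitted
}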
Wow, thank you @mchi, this is very helpful, and I can hopefully use it as a basis to work from.
If I understand correctly, this Triton server could then serve requests for bodypose2d, utilizing both GPU 0 and GPU 3 as needed.
This is very close to what I would like to do with RetinaNet.
Let’s say that I run a Triton server for RetinaNet in the same way and then want to send it requests. There are two ideas I’d appreciate your clarification on:
The way I thought this would work was through the DeepStream-Triton integration: the command I listed above would effectively do that, using those two files:
source1_primary_retinanet_resnet18.txt
config_infer_primary_retinanet_resnet18.txt
If the Triton server offered the service on GPU 0 and GPU 3, would the DeepStream configuration direct requests to a GPU based on the gpu-id in the config file? If so, maybe I would have four pairs of configs, one to target each GPU with a load of requests, and could then run all four tasks at the same time to get the total loading?
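For concreteness, the parallel run I have in mind would look something like this (the per-GPU config file names are hypothetical):

# one source/infer config pair per GPU, all launched at once
deepstream-app -c source1_primary_retinanet_resnet18_gpu0.txt &
deepstream-app -c source1_primary_retinanet_resnet18_gpu1.txt &
deepstream-app -c source1_primary_retinanet_resnet18_gpu2.txt &
deepstream-app -c source1_primary_retinanet_resnet18_gpu3.txt &
wait   # wait for all four before reading off the total loading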
Or maybe, once the server is running, a Triton client (not invoked by the DeepStream command above) should send these requests?
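If the standalone-client route is the right one, I assume Triton's perf_analyzer tool could generate that load; a sketch, assuming the model name is retinanet and the server's default gRPC port:

perf_analyzer -m retinanet -u localhost:8001 -i grpc --concurrency-range 4:16:4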
Also, it sounds like the metamux under development would bring outputs from multiple GPUs together (and split the requests to begin with). Will that be part of the Triton server package, or is it going to be part of DeepStream?
Once again, I appreciate all of your help looking at these cases yesterday. Thanks!
Brandt
There has been no update from you for a while, so we are assuming this is no longer an issue.
Hence we are closing this topic. If you need further support, please open a new one.
Thanks