Multi-GPU setup

Question (also related to another topic that @mchi had replied to, but rephrased):

How does one set up DeepStream to run on 1, 2, or 4 GPUs of a 4-GPU server? I would like to do that and benchmark our hardware serving inference responses (for the same model) using 1, 2, or 4 GPUs simultaneously.

I believe one could (or should) use Triton to do this.
I would like to run it with the command:

deepstream-app -c source1_primary_retinanet_resnet18.txt

After pulling and running the DeepStream 6.1 Triton container (docker pull nvcr.io/nvidia/deepstream:6.1-triton), I set up the chosen model in the Triton model repository under samples. There is a config.pbtxt there, which I use to run two instances on each GPU:

instance_group {
    count: 2
    kind: KIND_GPU
}

Then there are two other config files that also refer to the GPU ID:
source1_primary_retinanet_resnet18.txt
config_infer_primary_retinanet_resnet18.txt

The question is how to write those two config files so they can use 1, 2, or 4 GPU devices at the same time to process incoming requests from 16 sources (for example).

Could you please let me know how to do this within those two config files? What sections are needed in each?

Or do I need 8 config files to accomplish this?

Thank you very much,

Brandt

Hi @brandt33
Here is an example of running bodypose2d on GPU#1 and GPU#3 of a 4-GPU platform (GPU#0 ~ GPU#3).
The Triton server command is:

$ docker run --gpus all -it --rm --shm-size=1g --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 -p 9000:8000 -p 9001:8001 -p 9002:8002 -v $(pwd)/triton_server_yolov4_bodypose2d/models:/models nvcr.io/nvidia/tritonserver:22.02-py3 tritonserver --model-repository=/models --strict-model-config=false --grpc-infer-allocation-pool-size=16 --log-verbose=1

And the model's config.pbtxt is:

name: "bodypose2d"
platform: "onnxruntime_onnx"
backend: "onnxruntime"
max_batch_size: 4
input [
    {
        name: "input"
        data_type: TYPE_FP32
        dims: [
            3,
            224,
            224
        ]
    }
]
output [
    {
        name: "266"
        data_type: TYPE_FP32
        dims: [
            18,
            56,
            56
        ]
    },
    {
        name: "268"
        data_type: TYPE_FP32
        dims: [
            42,
            56,
            56
        ]
    }
]
instance_group [
    {
      count: 3
      kind: KIND_GPU
      gpus: [ 1,3 ]
    }
]
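
To connect deepstream-app to a model served this way, the nvinferserver config file (config_infer_primary_*.txt in Brandt's case) points at the Triton model rather than at a local engine file. A minimal sketch, assuming the gRPC setup above (host port 9001) — the preprocess values are placeholders that would need to match the real model:

```
# Sketch of a config_infer_primary_*.txt for Gst-nvinferserver (DeepStream 6.x).
# Assumes the standalone Triton server above is reachable over gRPC.
infer_config {
  unique_id: 1
  # GPU used by DeepStream for pre/post-processing buffers; which GPUs run
  # inference is decided by the instance_group in the model's config.pbtxt.
  gpu_ids: [0]
  max_batch_size: 4
  backend {
    triton {
      model_name: "bodypose2d"
      version: -1
      grpc {
        url: "localhost:9001"
      }
    }
  }
  preprocess {
    network_format: IMAGE_FORMAT_RGB
    tensor_order: TENSOR_ORDER_LINEAR
    normalize {
      scale_factor: 0.0039215697
    }
  }
}
input_control {
  process_mode: PROCESS_MODE_FULL_FRAME
  interval: 0
}
```

In CAPI mode (Triton embedded in the DeepStream process instead of a separate server), the grpc block would be replaced by a model_repo block pointing at the repository root.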

Wow, thank you @mchi, this is very helpful, and I can hopefully use it as a basis to work from.

If I understand correctly, this Triton server could then serve requests for bodypose2d utilizing both GPU1 and GPU3 as needed.

This is very close to what I would like to do with RetinaNet.

Let’s say that I run a Triton server for RetinaNet in the same way and then want to send it requests. There are two ideas I’d appreciate your clarification on:

  1. The way I thought this would work was via the DeepStream-Triton integration: the command I listed above
deepstream-app -c source1_primary_retinanet_resnet18.txt

would effectively do that, using those two files:
source1_primary_retinanet_resnet18.txt
config_infer_primary_retinanet_resnet18.txt

If service were available from the Triton server on GPUs 1 and 3, would the DeepStream configuration direct requests to a GPU based on the gpu-id in the config file? If so, maybe I would have four pairs of configs, one targeting each GPU with a load of requests? I could then run all four tasks at the same time to get the total loading.

  2. Or maybe, once the server is running, a Triton client (not invoked by the deepstream-app command above) should send these requests?
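
The per-GPU configuration in idea 1 could be sketched as a deepstream-app config like the one below — one hypothetical file per GPU (e.g. a source config for GPU 1), with gpu-id and the referenced nvinferserver config changed in each copy. The uri, resolutions, and batch sizes are placeholders:

```
# Hypothetical source config targeting GPU 1 (deepstream-app ini format).
[application]
enable-perf-measurement=1
perf-measurement-interval-sec=5

[source0]
enable=1
# type=3: multi-URI source; num-sources fans out the same stream
type=3
uri=file:///opt/nvidia/deepstream/deepstream/samples/streams/sample_1080p_h264.mp4
num-sources=4
gpu-id=1

[streammux]
gpu-id=1
batch-size=4
width=1920
height=1080

[primary-gie]
enable=1
# plugin-type=1 selects nvinferserver (Triton) instead of nvinfer
plugin-type=1
batch-size=4
config-file=config_infer_primary_retinanet_gpu1.txt

[sink0]
enable=1
# type=1: fakesink, convenient for pure inference benchmarking
type=1
```

Running four deepstream-app instances, one per such config, would load all four GPUs at once; the total throughput would be the sum of the FPS each instance reports via its perf measurement.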

Also, it sounds like the metamux under development, which would bring outputs from multiple GPUs together (and split the requests to begin with), will be part of the Triton server package. Is that correct, or is it going to be part of DeepStream?

Once again, I appreciate all of your help looking at these cases yesterday. Thanks!
Brandt