Scaling a DeepStream app to multiple servers/nodes

In the past few months I have been designing and developing an IVA application using the excellent DeepStream SDK and its Python bindings, after first experimenting with a Jetson Xavier NX and the Jetson libraries, which inspired the application.

This application runs on-premise on a closed network and processes many camera streams. It uses the NVOF and NVINFER elements to gather metadata from the incoming video stream (RTSP) and does further analysis in Python. For reasons I’m not going into, each instance of the application runs in its own Docker container, so the analysis of every video stream is separate from the others: one stream per pipeline per Docker container. We have developed a fairly simple management script in Python that manages the containers on request.

We are currently happily running about 20 to 30 instances (different streams and analyses) on a server with 10 x 2080Ti cards. The containers share resources without getting in each other’s way, and the setup works fine and is quite reliable.

I have now been tasked with researching how we are going to scale this to many servers. We will be able to invest in some new hardware, possibly even proper A100 cards.

But how do we scale this properly without losing the single point of contact we have now in our controller/management script? It currently just uses the Docker API to run, stop, etc. the containers, and that works amazingly well. I’m assuming I have to add a management layer in between, something like Docker Swarm, Kubernetes or some other cloud container management layer, but how well does that play with sharing GPU resources? I’ve recently learned about the MIG capabilities of the newer generation cards, but doesn’t that still mean I can deploy only 7 instances per A100 GPU? The containers are now happily sharing a GPU and I’m wondering: would this still be possible with, for instance, Kubernetes?

We would like to scale this somehow to 300 to even 500 instances, so running multiple servers is a must. Also the single point of contact with a management script is a must, together with the fact that the applications need to be running in separate containers (for proprietary reasons I’m not going into).

Is this even possible and if so: how? What should I read up on?

By management layer do you mean orchestration? Are you looking for something that’s not handled by k8s?

Yes, I mean orchestration. As far as I know, it’s not possible for containers to share a GPU when one uses Kubernetes for orchestration. A pod can request a GPU and will get one assigned if available, and then that GPU is taken. Hence the need for Multi-Instance GPUs. I get that. Also, MIG offers greater separation between workloads etc., which is obviously needed in a multi-tenant cloud environment.

But in our case, we are running this application in a closed environment and all containers in essence run the same application, just different instances of it. They have a fairly low GPU memory footprint (2 - 3 GB) and we currently have no problem running multiple containers on one GPU. Our users can request a job in a web frontend we have made, and a request is then sent to a management script on the server. The script finds an available GPU (one that has one or fewer containers running on it) and spins up a container in which only that specific GPU is made available (using --gpus device={}, essentially). So every instance “sees” only one GPU device and we run 2 or sometimes 3 instances per GPU with no problem. We have 10 GPUs. Python has a great library for controlling Docker instances that made this all possible quite easily.
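To make the selection step concrete, here is a minimal sketch of how such a script could pick a GPU and start a container with the Docker SDK for Python. It is not our actual script: the function names, the `STREAM_URL` environment variable and the `gpu` label are hypothetical, and the limit of two containers per GPU is just an assumption based on the "one or fewer" rule described above.

```python
from collections import Counter
from typing import Iterable, Optional

# Assumption: a GPU counts as "available" while it hosts one or fewer containers.
MAX_PER_GPU = 2


def pick_gpu(assigned: Iterable[int], num_gpus: int,
             max_per_gpu: int = MAX_PER_GPU) -> Optional[int]:
    """Return the index of the least-loaded GPU with a free slot, or None.

    `assigned` holds one GPU index per running container.
    """
    load = Counter(assigned)
    candidates = [g for g in range(num_gpus) if load[g] < max_per_gpu]
    if not candidates:
        return None
    # Prefer the emptiest GPU; break ties by the lowest index.
    return min(candidates, key=lambda g: (load[g], g))


def start_instance(image: str, gpu_index: int, stream_url: str):
    """Start one detached container that can only see the chosen GPU."""
    import docker  # lazy import so pick_gpu works without Docker installed

    client = docker.from_env()
    return client.containers.run(
        image,
        detach=True,
        environment={"STREAM_URL": stream_url},  # hypothetical variable name
        labels={"gpu": str(gpu_index)},          # hypothetical bookkeeping label
        device_requests=[
            # SDK equivalent of `--gpus device=<index>` on the command line
            docker.types.DeviceRequest(
                device_ids=[str(gpu_index)], capabilities=[["gpu"]]
            )
        ],
    )
```

With ten GPUs and containers already on GPUs 0, 0 and 1, `pick_gpu([0, 0, 1], 10)` would return 2, the first empty card.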

But now we run into the problem of needing a second server. Of course we could design and build functionality in our management script that takes two or more servers into account and connects to the Docker APIs on those different servers, but AFAIK that’s exactly what container orchestration was invented for. I’m trying NOT to re-invent the wheel here ;) I would rather have our management script talk to (something like) K8s and let that orchestration layer figure out where to run the container.
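As a rough sketch of what handing placement over to Kubernetes could look like from the management script’s side: instead of choosing a server and GPU itself, the script would build a Pod manifest that requests one `nvidia.com/gpu` resource and leave the node choice to the scheduler. The pod name, label, container name and `STREAM_URL` variable below are hypothetical. Whether several such pods can end up on the same physical GPU depends on how the cluster’s GPU device plugin is configured, which this sketch does not cover.

```python
import json


def build_pod_manifest(job_id: str, image: str, stream_url: str) -> dict:
    """Build a Pod manifest for one analysis instance (one stream per pod)."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": f"iva-{job_id}",             # hypothetical naming scheme
            "labels": {"app": "iva-instance"},   # hypothetical label
        },
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "analysis",
                    "image": image,
                    "env": [{"name": "STREAM_URL", "value": stream_url}],
                    # Ask the scheduler for one GPU resource; it picks the node.
                    "resources": {"limits": {"nvidia.com/gpu": 1}},
                }
            ],
        },
    }


if __name__ == "__main__":
    manifest = build_pod_manifest("42", "registry.local/iva:latest",
                                  "rtsp://camera.local/stream")
    print(json.dumps(manifest, indent=2))
```

The resulting dictionary can be serialized to JSON or YAML and submitted through whichever Kubernetes client the script ends up using.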

But I’m wondering whether a solution for our specific problem exists or not. I think K8s would restrict us to “only” 7 instances per A100 GPU, as that’s currently the upper limit of MIG. Our application would not even come close to saturating 1/7th of an A100 GPU. I think we should be able to run 4 or 5 instances on one GPU instance of 1g.10gb, giving us a total of 28 instances per card (100+ on a server with 4 cards). But what if even that is not enough? I need some way to scale this horizontally onto multiple servers.
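The back-of-the-envelope numbers above can be written out as a small calculation. This is only a sketch: 4 containers per MIG instance is the lower end of my estimate, and 4 cards per server is an assumed configuration.

```python
import math

MIG_INSTANCES_PER_GPU = 7    # upper limit of MIG instances on one A100
CONTAINERS_PER_INSTANCE = 4  # estimate: 4 (or 5) containers per 1g.10gb instance
GPUS_PER_SERVER = 4          # assumed server configuration


def containers_per_gpu(per_instance: int = CONTAINERS_PER_INSTANCE) -> int:
    """Containers one A100 could host if every MIG instance is shared."""
    return MIG_INSTANCES_PER_GPU * per_instance


def containers_per_server(per_instance: int = CONTAINERS_PER_INSTANCE,
                          gpus: int = GPUS_PER_SERVER) -> int:
    """Containers one server could host."""
    return containers_per_gpu(per_instance) * gpus


def servers_needed(target: int, per_server: int) -> int:
    """Whole servers required to host `target` containers."""
    return math.ceil(target / per_server)


if __name__ == "__main__":
    per_server = containers_per_server()
    print(containers_per_gpu())             # 28
    print(per_server)                       # 112
    print(servers_needed(300, per_server))  # 3
    print(servers_needed(500, per_server))  # 5
```

So even under these optimistic assumptions, the 300 to 500 instances we are aiming for would need three to five such servers.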

Any insight would be greatly appreciated.

GPU sharing in k8s has been added in GPU Operator. Please check if this would suit your purpose:

We have yet to test this with DeepStream but plan to in the near future.


This seems to be EXACTLY what I’m looking for. Is this a new feature? Because this hasn’t come up yet in the quite extensive desk research I did into this topic. All I found was leveraging MIG, which is not what we need or want.

Thanks for sharing! We will definitely give this a try.

Yes, it’s a new feature that just became GA in the last month. Enjoy and please let us know how it goes :)
