I’m sorry, I’m not at all familiar with vGPUs, so please forgive me if this question is stupid.
I have a DeepStream 6.4 inference app running successfully on an AWS EC2 instance fitted with a T4 GPU. Of course, a T4 is overkill for this kind of application when it serves only one customer at a time.
Multiplexing/demultiplexing within the DeepStream app generally works, but due to some unknown problem with nvstreamdemux it adds 2 seconds of latency (details here: Multisession inference, segmentation - #21 by foreverneilyoung). Otherwise I would favour this solution.
I have a vague idea that it might be possible to virtualize the T4, so that more than one instance of my app could run, dockerized, in parallel.
Is it generally possible to split one T4 on an AWS EC2 instance into several vGPUs and let clients run on demand on such a vGPU from one or more Docker containers in parallel?
If so, what are the steps to set up such a scenario?
Sorry again if this is completely OT.
Regards