For XenApp / RDSH workloads, the 8GB profile started with the M10 (which NVIDIA created specifically to provide a low cost entry point for workloads like these). Best practice was (and still is) to assign the entire 8GB of a GPU to a single RDSH VM. That way, the VM gets the full power of the GPU and doesn’t have to share it with a competing VM via the Scheduler. The more VMs you add to the same GPU, the less consistent the performance becomes, as the resources now need to be Scheduled. This is especially true with RDSH, as you have multiple users per RDSH VM. The only way to then provide more consistent performance (bearing in mind that one user on the RDSH VM can still impact another) is to modify the Scheduler accordingly, trading peaky performance for consistent performance at a lower level. However, by doing that, neither VM will ever get the full power of the GPU, so the user experience will ultimately suffer. If you wanted to run the M10 and allocate 4GB to each RDSH VM, then each RDSH VM would only get 50% of the performance of an already not very powerful GPU, shared between multiple users on each RDSH VM.
With the T4, that same scenario gets slightly worse. As the M10 has 8GB GPUs, running two 4GB VMs on one only halves that GPU’s performance. The T4, even though it’s more powerful than a single GPU on an M10, is still a single 16GB GPU, so if you run four 4GB RDSH VMs on it, what you’re actually doing is giving each RDSH VM a maximum of 25% of the GPU’s performance (assuming you’ve set the Scheduler to “Fixed” for consistent performance). Each set of users on an RDSH VM then gets up to 25% of the GPU, divided by however many users are on that VM using the GPU at the same time.
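To make that arithmetic concrete, here’s a back-of-the-envelope sketch (my own illustrative maths, not an NVIDIA tool) of the best-case GPU share each user can count on once the Fixed Scheduler has carved up the GPU:

```python
# Per-user GPU share under the Fixed scheduler: each vGPU VM gets an
# equal, capped time slice, and the users on an RDSH VM then divide
# that VM's slice between them.

def per_user_share(framebuffer_gb: int, profile_gb: int, users_per_vm: int) -> float:
    """Fraction of one physical GPU each concurrent user can draw on."""
    vms_per_gpu = framebuffer_gb // profile_gb   # VMs packed onto one GPU
    per_vm_share = 1.0 / vms_per_gpu             # Fixed scheduler: equal slices
    return per_vm_share / users_per_vm

# One M10 GPU (8GB) carved into 4GB profiles: each VM gets 50% of the GPU.
print(per_user_share(8, 4, users_per_vm=1))    # 0.5

# T4 (16GB) carved into four 4GB RDSH VMs with 20 users each:
# each user can count on at most 25% / 20 = 1.25% of the GPU.
print(per_user_share(16, 4, users_per_vm=20))  # 0.0125
```

The numbers are upper bounds: they ignore everything else competing for the GPU, which is exactly why the experience degrades faster than the framebuffer maths alone suggests.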
All of that, and that’s before we even get on to encoding. The Framebuffer is the only part of the GPU that isn’t shared between VMs, meaning that everything else is. If you overload the encoders, you’ll further impact the user experience. Even though the encoders on the Turing GPUs are much more efficient than those on the older architectures, there are fewer of them, so it’s still possible to overload them. A great way to do that is by running too many RDSH VMs on a GPU, as there is no hard limit on the number of users (individual sessions that require encoding) per VM. This is in contrast to Desktop based VMs: as the Framebuffer is a fixed resource, each GPU can only support a finite number of VMs. With the M10, forgetting that pointless 512MB profile, the maximum number of VMs you can get per GPU is 8 (using the 1GB profile). That means the Scheduler only has to share the resources between a maximum of 8 VMs (users), unlike RDSH, where you can easily get 20+ users per VM.
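The encoder load is simple multiplication, but it’s worth seeing the Desktop and RDSH cases side by side. A quick illustrative sketch (the session counts are the ones from this post, not hard limits):

```python
# Rough count of concurrent encode streams one GPU's encoders must
# service: every active remoting session needs an encode stream, so
# RDSH multiplies streams per VM where Desktop VMs stay at one.

def encode_streams(vms_per_gpu: int, sessions_per_vm: int) -> int:
    """Total concurrent sessions hitting one GPU's encoders."""
    return vms_per_gpu * sessions_per_vm

# One M10 GPU with 1GB Desktop profiles: 8 VMs x 1 session each.
print(encode_streams(8, 1))    # 8

# One T4 with four 4GB RDSH VMs at 20 users each: 80 streams
# landing on a single GPU's encoders.
print(encode_streams(4, 20))   # 80
```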
For best results running RDSH on the T4, use the 8A profile, assign it to 2 RDSH VMs, and change the Scheduler to "Fixed" to give your users consistent performance (or as consistent as a VM shared by 20 - 25 users can be). That way, the users on one RDSH VM will get 50% of a T4 without the ability to impact the 20 - 25 users on the second RDSH VM sharing the GPU, which will be as good as or better than an entire 8GB GPU on an M10.
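For reference, the vGPU scheduling policy is changed on the host via the RmPVMRL registry key in the NVIDIA driver. The commands below are a sketch for a vSphere / ESXi host based on NVIDIA’s vGPU documentation; check the docs for your hypervisor and vGPU release before running anything:

```shell
# Set the vGPU Scheduler to Fixed Share on an ESXi host.
# RmPVMRL values (per NVIDIA's vGPU docs): 0x00 = Best Effort,
# 0x01 = Equal Share, 0x11 = Fixed Share (default time slice).
esxcli system module parameters set -m nvidia -p "NVreg_RegistryDwords=RmPVMRL=0x11"

# A host reboot is required for the new policy to take effect.
reboot

# After the reboot, confirm the module parameter stuck.
esxcli system module parameters list -m nvidia | grep NVreg_RegistryDwords
```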
If you were hoping to run four 4GB RDSH VMs with 20 - 25 users on each T4 (totalling 100 users per T4), I’ll save you the trouble of running a POC … Don’t bother, the user experience won’t be good enough. You’ll need the configuration I’ve mentioned above :-) If 4 T4s don’t fit your budget, then use 2 M10s (per server) instead, again with the 8A profile (that’s 4 RDSH VMs per M10), but you’ll still need 2 DL380 servers to hit your number (3 servers if you want N+1 resilience), assuming you can get 25 users per RDSH VM to hit your 400 user peak.
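The sizing maths for that 400-user target, as a sketch using this post’s assumptions (25 users per 8A RDSH VM, 4 RDSH VMs per M10, 2 M10s per DL380):

```python
# Server-count sizing using the figures from this post. Everything
# here is an assumption to plug in, not a benchmark result.
import math

def servers_needed(total_users: int, users_per_vm: int, vms_per_card: int,
                   cards_per_server: int, n_plus_one: bool = False) -> int:
    """Servers required to host total_users, optionally with N+1 resilience."""
    users_per_server = users_per_vm * vms_per_card * cards_per_server
    servers = math.ceil(total_users / users_per_server)
    return servers + 1 if n_plus_one else servers

# 2 M10s per DL380: 25 users x 4 VMs x 2 cards = 200 users per server.
print(servers_needed(400, 25, vms_per_card=4, cards_per_server=2))                   # 2
print(servers_needed(400, 25, vms_per_card=4, cards_per_server=2, n_plus_one=True))  # 3
```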