Looking for advice on an optimal config for a latest-gen Citrix XenApp vGPU solution

Hi,

I’m designing a new setup for NVIDIA GRID vApps, user-density RDSH-based sessions, based on the following hardware:

HP DL380 G10 with dual Xeon 6254
The best GPU for the job. I suppose that’s the Tesla T4 right now, for flexibility, codec support and future-proofing. Ideally I would have wanted something like a T6 or T8 (a new-gen M10 with 64GB of memory), but that doesn’t exist, right?
10Gb backbone network to connect everything.

I figure I will end up with either 16 smaller or 8 larger virtual Citrix XenApp servers per physical host.

  1. What can you recommend?

  2. Are there any reference configs or case-study documents available?

  3. How well can I scale per physical T4 GPU adapter?

  4. How is vGPU memory used in this RDSH model? Does it limit the maximum number of XenApp sessions per virtual Citrix server (which will have 1GB or 2GB of vGPU memory assigned to its virtual machine)? Does it limit the resolutions users can run their XenApp sessions at?

  5. Do I need VMware, or can XenServer work just as well for NVIDIA GRID? (VMware will require Enterprise licenses, etc…) Are there limitations?

Thx in advance for any replies


Hi

You can use either XenServer or vSphere for the Hypervisor.

XenServer licensing is included with XenDesktop / XenApp so it’s a bit cheaper, but at the end of the day you’re stuck using XenCenter to manage your deployment (which is an extremely outdated and massively underdeveloped management console). The only nice thing about XenCenter is the way it visualises the GPUs and makes it really easy to see where they are allocated. This is a feature vSphere is lacking.

By contrast, with vSphere you’ll need Enterprise licensing for the hypervisor and at least Standard licensing for vCenter. However, you’re getting a platform that’s much, much nicer to manage and support, with better overall functionality. If you plan to use NetScaler VPX appliances, be aware that XenServer does not support live migration of them (they migrate and then crash), whereas vSphere does.

Honestly, apart from vSphere’s poor vGPU management, the only reason you’d choose XenServer over vSphere is that it’s cheaper. If they were the same price or a lot closer, it would be vSphere every time, without hesitation.

As for your T4s, change the scheduler on the GPUs to "Fixed" and allocate 8GB (using the 8A vGPU profile) to each of your XenApp VMs. You’ll get 2 XenApp VMs per GPU; allocate your CPU and RAM resources to each XenApp VM accordingly, based on that. The downside of using XenApp is that you have no control over how the framebuffer is allocated per user: one user could consume the entire 8GB if their workload required it. If you want more granular control and a fixed framebuffer allocation per user, then you need to use XenDesktop.
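
In case it helps, here’s a minimal sketch of how that scheduler change is applied on the host. The RmPVMRL registry values and command forms are my reading of the vGPU Software User Guide ("Changing vGPU Scheduling Behavior"), so treat them as assumptions and verify against the docs for your release:

```python
# Hedged sketch: builds the host-side setting that switches the NVIDIA vGPU
# scheduler. Assumed RmPVMRL values: 0x00 best effort, 0x01 equal share,
# 0x11 fixed share. A host reboot is required after applying either form.
SCHEDULERS = {"best-effort": 0x00, "equal-share": 0x01, "fixed-share": 0x11}

def scheduler_setting(hypervisor: str, policy: str) -> str:
    dword = f"NVreg_RegistryDwords=RmPVMRL={SCHEDULERS[policy]:#04x}"
    if hypervisor == "vsphere":
        # Run on the ESXi host.
        return f'esxcli system module parameters set -m nvidia -p "{dword}"'
    if hypervisor == "xenserver":
        # Append to /etc/modprobe.d/nvidia.conf on the XenServer host.
        return f"options nvidia {dword}"
    raise ValueError(f"unknown hypervisor: {hypervisor}")

print(scheduler_setting("vsphere", "fixed-share"))
print(scheduler_setting("xenserver", "fixed-share"))
```

Either way, check the vGPU user guide for the exact procedure on your hypervisor version before rolling it out.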

How many concurrent users do you plan to support?

Regards

MG

Hi Mr GRID,

thanks for your response. You seem like just the right man to talk to!

I’m planning for 350-400 concurrent users capacity-wise. All the intended worker profiles are office level, so I’m looking to offload ‘normal’ applications. We don’t have AutoCAD or other GPU power users to service.

Outside of the GPU part, I think I’ll have 8 or 16 virtual XenApp servers per physical host in the current design.

Since the T4 cards are very costly and PCIe slots are limited in general, I was hoping to be able to service them with 1 or 2 T4 cards per physical host at most. What is the exact criterion here, or how do I calculate the need for specific vGPU profiles? How does that work?

I also remember having seen an overview of all possible vGPU profiles for the M10 last year, but now I cannot seem to find the same document for the T4.

PS: for some odd reason I’m not getting update emails about your reply (not in the spam folder either), so it’s a good thing I checked back manually.

Hi

With XenApp / RDSH, it’s relatively straightforward to design, as the vGPU configuration options are pretty standard. Basically, 8GB is the number you should "typically" be looking to use for XenApp, and the configuration options would be as follows:

The most cost-effective (cheapest) solution is still to use the M10 for XenApp deployments. The M10 has 4 GPUs on a single board, and you’ll put 2 of those boards in a single 2U server. This gives you the capability of running 8 XenApp VMs per server, each with an 8A vGPU profile.

A more future-proof configuration would be to replace 1 M10 with 2 T4s (or 2 M10s with 4 T4s). This gives you the same amount of framebuffer to share between your XenApp VMs, but the T4 provides better performance and functionality, and it’s more power-efficient as well. Then (as mentioned earlier) change the vGPU scheduler on the T4 to "Fixed" and allocate the same 8A vGPU profile to the XenApp VMs. You don’t need to change the scheduler on the M10, as each XenApp VM has its own dedicated GPU.

You will want more than 2 T4s per server, or you’ll need more servers to cater for that number of users. So scale up, not out. If you have 400 users and want 16 XenApp VMs, that equates to 25 users per XenApp VM. With 4 T4s installed, you’ll have 8 XenApp VMs per DL380.

To account for N+1 (physical resilience, image updates, user load balancing), you’re going to need 3 DL380 servers, each with 4 T4s installed, to cater for those numbers, assuming that you can actually support 25 users per XenApp VM without impacting the experience. User density on the XenApp VMs will vary depending on utilisation, so it’s very important to test in a POC before finalising any specifications or the quantity of servers required.

The DL380 G10 will actually support up to 5 T4s (Virtual GPU Certified Servers | NVIDIA GRID), which means you could host 10 XenApp VMs per DL380. This would reduce your user density per XenApp VM down to 20, which may be a better number to target.
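
To make that arithmetic concrete, here’s a hedged little sizing sketch (the hosts_needed helper is purely illustrative, not an official calculator), assuming 2 XenApp VMs per T4 on the 8A profile and N+1 host resilience:

```python
import math

# Illustrative sizing helper: a 16GB T4 split into 2x 8A (8GB) XenApp VMs.
def hosts_needed(users: int, users_per_vm: int, t4s_per_host: int,
                 n_plus: int = 1) -> int:
    vms_needed = math.ceil(users / users_per_vm)   # XenApp VMs required
    vms_per_host = t4s_per_host * 2                # 2 VMs per T4 (8A profile)
    return math.ceil(vms_needed / vms_per_host) + n_plus

print(hosts_needed(400, 25, 4))  # 16 VMs / 8 per host -> 2 hosts + 1 = 3
print(hosts_needed(400, 20, 5))  # 20 VMs / 10 per host -> 2 hosts + 1 = 3
```

As always, treat the output as a starting point; your POC results should drive the final numbers.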

vGPU Profile options for the M10 and T4 are available here:

M10: https://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/index.html#vgpu-types-tesla-m10
T4: https://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/index.html#vgpu-types-tesla-t4

But as said, the best profile for higher-density XenApp VMs is the 8A profile.

If you’re supporting 10 XenApp VMs / 200 users per server, don’t forget to consider the CPU. You should be looking at something with more cores rather than a higher clock. Here are some better options to consider:

Platinum 8280: Intel Xeon Platinum 8280 Processor 38.5M Cache 2.70 GHz Product Specifications
Platinum 8260: Intel Xeon Platinum 8260 Processor 35.75M Cache 2.40 GHz Product Specifications
Gold 6252N: Intel Xeon Gold 6252N Processor 35.75M Cache 2.30 GHz Product Specifications

Due to the nature of the workload you don’t need such a high clock, and having more cores will reduce the CPU overcommit.

As a starting point for your POC, you should be looking at 8 vCPUs / 32GB RAM / an 8A vGPU, with the aim of supporting 20-25 users per XenApp VM. Your VMs should be running on all-flash / SSD storage as well (not cheap, slow spinning disks). You can then monitor the hardware utilisation of each component and tailor the specs to suit the user experience, performance and user density.
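
To sanity-check the host-level totals for that baseline, here’s a hypothetical helper (the per-VM numbers are just the starting point named above, and host_totals is mine, not an official tool):

```python
# Baseline per-VM spec from the POC starting point above.
VM_SPEC = {"vcpus": 8, "ram_gb": 32, "vgpu_profile": "T4-8A"}

def host_totals(vms_per_host: int) -> dict:
    return {
        "vcpus": vms_per_host * VM_SPEC["vcpus"],
        "ram_gb": vms_per_host * VM_SPEC["ram_gb"],
        "t4s": vms_per_host // 2,   # 2x 8A VMs per T4
    }

# 10 XenApp VMs per DL380: 80 vCPUs, 320GB RAM, 5 T4s.
print(host_totals(10))
```

Compare the vCPU total against your physical core count to see how much overcommit you’re signing up for.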

Regards

MG

Thanks for the comprehensive answer. Very clear, except for one thing:

why is 8GB the number you should "typically" be looking to use for XenApp vGPU profiles? What is the impact if I end up choosing 4GB, for instance?

Hi

For XenApp / RDSH workloads, the 8GB figure started with the M10 (which was specifically created by NVIDIA to provide a low-cost entry point for workloads like these). Best practice was (and still is) to assign the entire 8GB of a GPU to a single RDSH VM. That way, the VM gets the full power of the GPU and doesn’t have to share it with a competing VM via the scheduler. The more VMs you add to the same GPU, the less consistent the performance, as the resources now need to be scheduled; this is especially true with RDSH, as you have multiple users per RDSH VM. The only way to then provide more consistent performance (bearing in mind that one user on the RDSH VM can still impact another) is to modify the scheduler accordingly, trading peaky performance for consistent performance at a lower level. However, by doing that, neither VM will ever get the full power of the GPU, so the user experience will ultimately suffer. If you wanted to run the M10 and allocate 4GB to each RDSH VM, then each RDSH VM would only be getting 50% of the performance of an already not very powerful GPU, shared between multiple users on each RDSH VM.

With the T4, that same scenario gets slightly worse. As the M10 has 8GB GPUs, running two 4GB VMs on it only halves the GPU’s performance. With the T4, even though it’s more powerful than a single GPU on an M10, it’s still a single 16GB GPU, so if you run four 4GB RDSH VMs on it, what you’re actually doing is giving each RDSH VM a maximum of 25% of the GPU’s performance (assuming you’ve configured the scheduler to "Fixed" to give consistent performance). Each set of users on the RDSH VMs then only gets up to 25% of the GPU, divided by however many users are on that VM using that GPU at the same time.
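
As a back-of-envelope illustration of that point (a hedged sketch, assuming the "Fixed" scheduler caps each VM at its framebuffer fraction of the GPU):

```python
# Worst-case per-user GPU share: the VM's fixed share of the GPU divided by
# the number of concurrently active users on that VM.
def per_user_share(gpu_fb_gb: int, vm_fb_gb: int, users_per_vm: int) -> float:
    vm_share = vm_fb_gb / gpu_fb_gb      # e.g. a 4GB VM on a 16GB T4 -> 25%
    return vm_share / users_per_vm

print(f"{per_user_share(16, 4, 20):.2%}")  # 4x 4GB VMs, 20 users -> 1.25%
print(f"{per_user_share(16, 8, 20):.2%}")  # 2x 8A VMs, 20 users -> 2.50%
```

In practice not every user hammers the GPU at the same moment, so these are worst-case figures rather than what each user sees every second, but they show why smaller profiles hurt RDSH.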

All of that, and that’s before we even get on to encoding. The framebuffer is the only part of the GPU that isn’t shared between VMs, meaning that everything else is. If you overload the encoders, you’ll further impact the user experience. Even though the encoders on the Turing GPUs are much more efficient than those on the older architectures, there are fewer of them, so it’s still possible to overload them. A great way to do that is by running too many RDSH VMs on a GPU, as there is no hard limit on the number of users (individual sessions that require encoding) per VM. This is in contrast to desktop-based VMs: as the framebuffer is a fixed resource, each GPU can only support a finite number of VMs. With the M10, forgetting that pointless 512MB profile, the maximum number of VMs you can get per GPU is 8 (1GB each). This means the scheduler only has to share the resources between a maximum of 8 VMs (users), unlike RDSH, where you can easily get 20+ users per VM.
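
If you want to keep an eye on encoder load during your POC, here’s a hedged monitoring sketch. The nvidia-smi query field names (encoder.stats.sessionCount, encoder.stats.averageFps) are my assumption from the --help-query-gpu output, so verify them on your driver version:

```python
# Polls per-GPU encoder statistics via nvidia-smi so you can spot encoder
# saturation before users feel it.
import subprocess
import time

def encoder_stats():
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=index,encoder.stats.sessionCount,encoder.stats.averageFps",
         "--format=csv,noheader"],
        text=True,
    )
    return [line.split(", ") for line in out.strip().splitlines()]

while True:
    for gpu_index, sessions, avg_fps in encoder_stats():
        print(f"GPU {gpu_index}: {sessions} encode sessions, avg {avg_fps} fps")
    time.sleep(10)
```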

For best results running RDSH on the T4, use the 8A profile, assign it to 2 RDSH VMs, and change the scheduler to "Fixed" to give your users consistent performance (or as consistent as a VM shared by 20-25 users can be). That way, the users on one RDSH VM will get 50% of a T4 without the ability to impact the 20-25 users on the second RDSH VM sharing the GPU, which will be as good as or better than an entire 8GB M10 GPU.

If you were hoping to run four 4GB RDSH VMs with 20-25 users each on a T4 (totalling 100 users per T4), I’ll save you the trouble of running a POC… Don’t bother, the user experience won’t be good enough. You’ll need the configuration I’ve mentioned above :-) If 4 T4s don’t fit your budget, then use 2 M10s (per server) instead, again with the 8A profile (that’s 4 RDSH VMs per M10), but you’ll still need 2 DL380 servers to hit your number (3 servers if you want N+1 resilience), assuming that you can get 25 users per RDSH VM to hit your 400-user peak.

Regards

MG

Awesome information! That’s exactly what I was looking for. Had to read it 3 times before it fully sank in :)

"Each set of users on the RDSH VMs, then only gets up to 25% of the GPU divided by however many users are on that VM using that GPU at the same time"

=> Ouch, yes. I can see how that will affect my scaling options as well as the maximum potential vGPU performance a single user can reach.

"If you were hoping to run 4 4GB RDSH VMs with 20 - 25 users on each T4 (totalling 100 users per T4), I’ll save the the trouble of running a POC … Don’t bother, the user experience won’t be good enough"

=> Thx, I think you just saved me quite some ‘hard lessons learned’ time :)

Is there any technical documentation where I can further educate myself on how the scheduler and framebuffer work at a technical level?

I will take all this information into consideration in my design and total cost considerations. I can already see how this will affect my choice and scaling options.

Hi

Sure. All vGPU documentation is available here: NVIDIA Virtual GPU (vGPU) Software Documentation. When your POC begins you’ll be running the latest version (currently 9.0), so just select "Latest Release" for the most up-to-date features and functionality.

The piece of information you’re looking for relating to the Scheduler is located here: https://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/index.html#changing-vgpu-scheduling-policy

There’s a lot of information in those documents, and sometimes specific details aren’t that easy to locate. In that case, just use the search box (top right) to scan all of the documentation for specific keywords to help find the information.

Regards

MG

Update: After digging into the documentation, I’m now one week into a POC I set up, and I really like it. It’s working perfectly, just the way I’d hoped.

It’s basically one ‘reference’ physical host server, as follows:

1 HP DL380 G10
dual Xeon 6254 (yes we need the single core performance for our legacy applications)
12 memory modules of 64GB (2933MHz)
8 XenApp server VMs, each with 8GB of its memory used as the PVS write-cache disk

Also, as you said, I indeed need at the very least an 8GB profile per XenApp server VM with an average of 15-20 users on it, and as soon as I have 15 users on the server the framebuffer can be maxed out without them doing anything special. This limit worries me, to be honest.

This is probably my only complaint so far: why do these Tesla T4 cards only come in 16GB editions? For our knowledge-worker profile it seems we need a lot more memory to be future-proof. With 8 of these VMs per physical host server, I would need no fewer than 4 T4 adapters. For 5 host servers that would mean 20(!) cards, while their actual raw GPU power is hardly being used.

Why are there no T4 cards with 64GB (yet)? That’s my big question to NVIDIA as I’m testing this.

Maybe they plan a T6 or T60 sometime soon, which would be perfect for XenApp scenarios? That’s what we really need here.

Besides that, in combination with Citrix’s latest technologies (VDA 1906.2 and Workspace app 1909) and their "only accelerate the moving parts of the screen" approach, this is definitely my platform of choice for the future. Loving it already.

Hi

Glad the POC is going well.

However, 15 users is a little below expectation and certainly below average.

  • What are you using to monitor the GPU utilisation?
  • How many monitors and what resolution are your users running?
  • Have you optimised / tuned the Operating System?
  • What kind of applications are you using and which ones are using the most framebuffer?
  • Have you classified your applications / users correctly? (I ask because you have a strange choice of CPU for Knowledge Workers)

Regards

MG

Oh, I totally missed your reply again. That’s odd; I even remember checking about a day after.

- Using GPU Profiler 1.07a and RD Analyzer to monitor the XenApp servers as a whole, and Process Explorer to monitor usage per application per XenApp server.
- Most users have 1-2 Full HD monitors, with some having QHD screens.
- I’m not sure exactly what tuning of the OS you are referring to.
- This Citrix farm has a design where the resource pools are shared. That means no dedicated setups for different user groups: each XenApp server has to be able to service all applications at their fullest, from the lightest to the most intensive. I say ‘knowledge workers’ because there are no graphic designers, but a lot of legacy applications (Lotus Notes, custom reporting tools, AS/400, …) are purely single-threaded. By their nature they completely bottleneck one core and run in a single thread, so the highest single-thread performance translates directly into an equally scaled increase in performance and speed. That theory has already been verified and confirmed in practice during the POC.

The 15 users is the guaranteed average I’m designing and aiming for. The last few days I’ve run 18-20 users max in reality. On the few days I ran with 23 users, the servers with a vGPU stalled while the control servers without a vGPU didn’t, under the exact same conditions. As soon as I have more than 15 users, all framebuffer memory is being consumed, so I suspect some overload was being reached somewhere when 23 users were on it. However, overload testing is out of scope in this POC since these are production users, and I only have the hardware for 2 more days before the POC ends.

Hi

OK, so designing a XenApp VM to cater for 15 users minimum: if you use 5 T4s in a DL380, that’s at least 150 users per DL380 (so you’ll want 4 of those to account for N+1, which will also give you a little headroom). As your users have differing monitor configurations and don’t all use the same applications, your density will vary depending on who logs on to which XenApp VM.
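
Plugging those numbers into the hypothetical hosts_needed sizing helper from earlier in the thread gives the same answer:

```python
# 400 users at 15 per VM, 5 T4s (10x 8A VMs) per DL380, plus N+1:
# ceil(400 / 15) = 27 VMs -> ceil(27 / 10) = 3 hosts -> 4 with N+1.
print(hosts_needed(400, 15, 5))  # -> 4
```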

Regarding OS optimisation: once everything has been installed, you should be running something like this from Citrix: Citrix Optimizer Tool, or this from VMware: VMware OS Optimization Tool | VMware Flings, to remove any unrequired services, tasks, etc., and maybe even running "Citrix Workspace Environment Management" as well.

I’ve spoken to someone from the NVIDIA Performance Engineering team and also other members of the NGCA team, and across all of our deployments the average for an 8GB GPU is 20-25 users, which is pretty much what was discussed above. The person from NVIDIA Performance Engineering said that he was managing 25-35 users on an M10 (per 8GB GPU) using a single HD monitor with his specific set of applications. The T4 split into 8GB vGPUs gave the same density for his workload. So with your 18-20 users maximum on 2 HD or QHD monitors, you’re not a million miles away.

Regards

MG

Thanks for your reply, MrGRID.

OK, my test build is most likely not optimised as well as it could be. That could squeeze out a few more users, I guess. But roughly speaking, and choosing to be on the safe and future-proof side, with my current 18-20 users I guess I’m basically there.

Still, that tells me there’s really not enough memory on those T4 cards, for two reasons:

- Future growth: we all know that if this is what you need today, you’ll need double in a few years.

- We don’t have 5 free slots, because we need some of them. First of all, we need the base riser for its flexibility with the M.2 slots, and we actively use 10Gb PCIe network cards. We could use the second riser to put 2 cards in and the third riser for a third card. And if we really want to min-max and saturate the whole server’s PCIe capacity from the very start (bad design imho), we could share the x8 slots on the first riser with 1 additional T4 card, making it 4 in total. BUT then you would be saturated from the start with your maximum of 4 cards, AND it becomes ridiculously expensive because we plan 5 servers like this: we’d need no fewer than 20 of these expensive cards! And all this while average GPU usage right now is around 20-30%, meaning we need more memory on them, not more GPU power, for our XenApp worker profile.

If only NVIDIA would understand the message that even though T4 cards are well suited to XenDesktop/Horizon scenarios, they are not optimised for XenApp scenarios, simply because of the lack of memory. Even though we can squeeze and force it using the most flexible servers such as the HP DL380 G10, the fact remains that the card falls short in this particular area. A shame, because apart from that the solution technically works and I like it.

I definitely want to implement this tech sometime in the future when 32/64GB comes to life, but right now, in its current state, it’s a hard sell unless money doesn’t matter and you can scale out over as many servers as you like :(

Maybe the NVIDIA marketing team will see the light soon and bring out a T6 adapter with 64GB of memory for this purpose, to replace the M6, M10, …

Thanks @MrGRID for your very detailed sizing responses and @Profundido for your honest feedback.


Update: By now we have new production servers up and running, some of which use a number of Tesla T4 cards to make the most of vGPU acceleration. It works really well, apart from the framebuffer memory limitations described above.

Question: Now that Ampere has been announced, is there any news on a planned Ampere-driven successor to the Tesla T4 with full support for GRID software or equivalent?

@MrGrid

Hi

Glad you have the platform up and running!

Regarding Ampere… Even if there were an announcement from NVIDIA about new GPUs, it’s unlikely they would be available for a few months, and then demand would undoubtedly be greater than supply, causing a shortage.

If you need more GPU capacity, there are a couple of initial options:

  1. Purchase an additional DL380 with the same spec you have been testing, to reduce the load across the other hosts.

  2. Purchase additional T4s and run 1x T4 per XenApp VM with a full 16GB profile.

How much additional Framebuffer do you need on your XenApp VMs?

If you were to increase your Framebuffer, how much CPU and RAM do you have in reserve at the moment to account for additional density?

Regards

MG

Damn, still not receiving email notifications for new posts even though I’m following… It’s a good thing I check back here manually.

I’ve basically loaded up HP DL380 G10 machines, just as used in the POC, with the maximum realistically possible, which is 4 T4 cards per physical HP host server. However, this directly restricts the maximum number of users I can load onto the servers. Without vGPU acceleration I’ve easily tested up to 30 concurrent users per virtual server, whereas with an 8GB vGPU profile assigned to that same virtual server the maximum becomes restricted to 15-20 because of the limited framebuffer memory available, and because a server hard-crashes (hangs) when the limit is exceeded you can never risk scaling up to maximum density. With 8 of those virtual servers, I cannot assign more than an 8GB vGPU profile per server using the current generation (T4) of vGPU cards.

So in other words, I basically need at least double the current amount of memory to be safe for now and the coming years and to be able to use the full potential of my servers. Until such cards are available, I’m forced to scale down the maximum number of users by 30-40% because of the T4 memory limit. So I see no other option than to pray for successor cards with at least double the amount of memory.

Addendum: I have 768GB of memory in each HP physical server and 2 Intel Xeon 6254s, so currently each of the 8 virtual servers has access to 16 dedicated cores and 64GB of dedicated memory, which I could easily increase to almost 96GB. So no problems there.

Hi

Thanks for the update.

Not that it’s any consolation, but if you’d used the M10, you’d be having the exact same issue. I know the 8GB limitation has been reported before; hopefully when the next GPU models arrive there’ll be a specific model targeted at higher density.

Out of interest, what else is installed in the remaining PCIe slots that’s stopping you from adding more T4s?

Both HP and NVIDIA are showing that the DL380 G10 will support up to 7x T4s now. The T4s can go in either the x8 or x16 slots (or both) for this use case, if that helps.

Regards

MG

Thanks for your answer, MrGRID.

To my current knowledge, based on extensive research into the available HP riser options (from the POC period until right before the purchase of these servers) and on manually installing those extra risers myself, I’ve verified that the absolute maximum number of simultaneously available x16 slots in an HP DL380 G10 server is 5, not 7, especially if you want to be sure you can add external power to those slots (for, e.g., future beefy NVIDIA vGPU cards like the generations before the T4). May I ask what source mentions that 7 PCIe x16 slots are possible?

Since we need the original riser, with its extra x4/x8 slots, for 10Gb+ PCIe NIC and/or PCIe solid-state drive options, we have explicitly chosen not to sacrifice the entire default riser and all its extra functionality just to get 2 PCIe x16 slots instead of the one x16 slot already present on that riser by default. This way our maximum of 5 becomes 4, while we keep all the extra functionality of the default riser, which is the most versatile and future-proof configuration imho, based on my research into all the possibilities.

For reference, I used the following sources when preparing my POC and purchase:

(the matrix on page 9 shows all riser options and combinations)

(The T4 datasheet clearly shows a PCIe x16 "system interface", so I knew I could not consider alternative risers with, e.g., more x8 slots instead of fewer x16 slots.)

Besides all that, I have to note that even if 7 PCIe x16 slots had been possible and we had chosen to go that route, I would still not be able to provide all 8 of my virtual servers (plus room for more) with 16GB of framebuffer memory. Ideally, I’m hoping for 4 future vGPU adapters with around 32-48GB of dynamically divisible memory each; then I can maximise the servers’ full potential.