vGPU for AutoCAD/RDSH questions

Hello Gurus,

We have taken on a new client who has experienced considerable growth in the past 2 years. There are offices in multiple locations across the country (3 offices, approximately 30 employees total, for now); each office is connected with high-end Cisco/Meraki VPN gear.

We want to go "server-less" at the remote offices by way of thin clients, with a single robust server (or potentially two, for redundancy) at the head office running Server 2019 and RDSH for the "typical" employees. They also have 3 engineers who need to use AutoCAD.

I am not familiar with Nvidia products, but I have done quite a lot of reading on your forums. In theory, I should be able to use something like an M10 (quad GPU?) and allocate one of its GPUs to each of 4 virtual machines, correct?

In this scenario, I would deploy a Windows Server 2019 RDSH VM - this would serve ~20 CCUs, and a few of these users need the capability to open, say, Google Earth. Google Earth does NOT respond well in a terminal server environment with no graphics hardware. Hence the possible solution of using 1 of the 4 GPUs on the M10.

And then, in addition, I would also like to deploy 3x Windows 10 Pro VMs and install AutoCAD on them for the engineers, then assign one of the 3 remaining GPUs to each of these VMs. Should this work?

And then what about future growth? Let’s say a year from now they have 6-8 engineers needing to utilize AutoCAD. Is it simply a matter of slapping in another Nvidia M10 and assigning more GPUs to more VMs?

Thanks in advance, and my apologies if any of these questions have been answered before.

Correct observation. And yes, you could add another M10 next year to accommodate additional growth. In addition to the hardware you will need vApps and QvDWS licenses.

regards
Simon

I realize there will be licensing costs involved. Thank you for the quick reply. I truly appreciate it.

Would you recommend starting with an M10? Or is there something better suited to the proposed deployment?

Are there any vendor-specific pre-configured servers you would recommend? We typically build our own servers in-house; my usual Intel-based 1U rackmount server will obviously not be ideal in this situation.

For professional 3D (like AutoCAD) I would recommend the Tesla T4, so it might make more sense to start with 2x T4.

Hi

Just to add to Simon's comments, although the M10 is still available and fully supported, the fact that the architecture has been superseded by three generations should not be overlooked. Bottom line: you’re buying old technology. In my opinion, unless you’re really looking to deliver the cheapest solution, in which case the M10 is perfect (note cheapest, not best), you’re better off with multiple T4s to replace it. The T4s are the latest (Turing) architecture. They support the latest codecs, have better encoding capabilities, more scheduler options, will be supported for longer, use less power and are just generally a more future-proof investment than an M10. Also, if you’re running the latest GPU architectures, you can always take advantage of the latest software (vGPU) enhancements.

As you have CAD users, making sure you select the proper hardware specifications is even more important. Although you can run CAD on an M10 using QvDWS licensing, it would be my last choice out of the current GPU lineup available. Also, as you’re supporting CAD users, don’t forget the CPU clock speed is critical. Typically, nothing below 3.0GHz.

Regards

MG

Gentlemen, I appreciate the feedback immensely, and this is the exact kind of input I signed on for. I’ve deployed hundreds of terminal servers over the years, but I’ve never been asked for an AutoCAD-capable remote environment.

I will look into the T4 GPUs as recommended.

I typically utilize Intel Xeon Silver 4208s, which have a base clock of 2.10GHz and a turbo clock of 3.20GHz, but again, as you’ve seen these systems in production, I’m certainly open to suggestions.

-Rob

Seriously, 2.1GHz? You can never, ever count on turbo clock with server systems; therefore this CPU is totally useless for the use case given. As MrGrid already mentioned, you should have at least 3GHz for 3D use cases.

Hi

Please don’t use 2.1GHz for AutoCAD. Yes, it will load and run, but the user experience will usually be poor when scrolling in and out or manipulating designs and objects. These CPUs are for entry-level workloads, typically classed as "Digital Worker", not CAD.

Forget about Turbo; there are way too many caveats to get it to work consistently and properly in a virtualised environment, and just because the boost clock will go to 3.2GHz doesn’t mean your users will always get that if or when the CPU decides to boost. On a physical workstation with no hypervisor, sure, but in a virtualised environment it’s a whole different thing. Then you’re into "All Core" vs "Single Core" boost and all sorts of other variables that make it deliver inconsistent performance. A far simpler approach is to use a CPU with a faster base clock and forget about Turbo. The users will have a far more consistent experience, as the minimum they’ll be working with is 3.xGHz.

Either of these two CPUs is fine; however, personally I’d go for the current version (Gen 2):

Gen 2 - 18 Core 3.1GHz

Gen 1 - 18 Core 3.0GHz

I typically use the 18 Core variants because this gives more flexibility per Host and provides the best balance of Cores vs Clock (there are obviously other CPU variants with either higher Clock or Core count for more specialised use cases). With most OEMs supporting 6 > 8 T4s per Host, having 36 physical cores will allow you to provide the best experience and give you the most headroom for additional density (It’s far easier to add additional GPUs than change out the CPUs when you need additional capacity!).

3.xGHz is a bit of a waste for typical Digital Workers, but you need to look at the overall bigger picture for the hosting environment and whether you want to run RDSH and CAD Workstations on the same physical Hosts. It would be a slightly unusual configuration, but it is completely possible, as the Workstations and RDSH VMs would run on different GPUs. Use the software to carve up the platform, not the hardware.

Regarding how you support the RDSH and CAD VMs, you can do this in a couple of ways. You can either have two different CPU specifications in your hosting environment (one specification running 2.1GHz and the other running 3.xGHz CPUs) for each workload, or you can have a single specification environment where all hosts are the same and you control performance through the Hypervisor (as mentioned above, again using the software layer, not hardware). My preference would be the single specification as this gives you more flexibility in terms of resilience, migration, maintenance and manageability and it gives you a single price for scaling out the environment. You may or may not be using it, but this is something that Hyperconverged is really good for as everything comes in a single Node. Yes, Hyperconverged is overkill for just this customer, but looking at the bigger picture if you have multiple customers …

For the RDSH VMs, allocate each of them the 8A vGPU Profile (8GB of FB) from the T4 (1x vApps license is required per CCU) and change the vGPU Scheduler to Fixed. Changing the Scheduler is really important for the RDSH VMs. Ensure that "GPU Consolidation" (named differently depending on the Hypervisor choice) is configured so that the RDSH VMs all end up on the same GPU. The alternative to that setting, depending on your Hypervisor choice, is that you can select which vGPU Profiles run on which specific GPUs. For example, this means that you could have two T4s that will only run 8A vGPU Profiles, and all other T4s run other vGPU Profiles (except 8A). What this does is let you configure a different Scheduler (Best Effort or Fixed) on specific GPUs, and then specific VMs (CAD or RDSH) will only start on those GPUs and have appropriate access to them. You wouldn’t want the CAD VMs running a Fixed Scheduler, for example. (There’s a small sketch of that placement idea below.)
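To make the "pin profiles to specific GPUs" idea a little more concrete, here’s a toy Python sketch of the placement logic only. The GPU names, profile names and policy layout are illustrative assumptions; the real setting lives in your hypervisor’s GPU/host configuration, not in code you write:

```python
# Toy illustration only: consolidate 8A RDSH VMs onto a dedicated T4 running
# the Fixed scheduler, and keep Q-profile CAD VMs on a separate card running
# Best Effort. The names and structure below are assumptions for the sketch.

gpu_policy = {
    "T4-0": {"allowed_profiles": {"T4-8A"}, "scheduler": "Fixed"},
    "T4-1": {"allowed_profiles": {"T4-2Q", "T4-4Q"}, "scheduler": "Best Effort"},
}

vms = [
    ("RDSH-01", "T4-8A"),
    ("CAD-01", "T4-4Q"),
    ("CAD-02", "T4-4Q"),
    ("CAD-03", "T4-4Q"),
]

for vm, profile in vms:
    eligible = [gpu for gpu, pol in gpu_policy.items()
                if profile in pol["allowed_profiles"]]
    print(f"{vm} ({profile}) can start on: {', '.join(eligible)}")
```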

For the CAD VMs, do some homework first; it’ll save you a lot of issues in the long run. Find out what kind of monitor configurations the CAD users are currently running for a starter (1080P / 2x 1080P / QHD / 4K …) and whether there are plans to upgrade these in the future (if there are, factor that in now, not at a later date). This will at least give you an indication of which vGPU Profile to use initially. Unless they’re doing something crazy, you’ll more than likely be looking at either a 2Q or a 4Q vGPU Profile (2GB or 4GB) (a QvDWS license is required per CCU; for this specific use case, think of it as per Workstation, it’s easier). If you’re unsure, go up in vGPU Profile size, not down! The reason for this is simple … If you go down in vGPU Profile size trying to cram on as many users as possible to hit your ROI, and then build out the environment with that specification based on a specific user density (defined by vGPU Profile size) and cost model, and your users suddenly start:

  • Running larger models
  • Getting an application upgrade (AutoCAD 2018 > 2019 > 2020)
  • Introducing additional applications that have not been considered or evaluated
  • Upgrading monitors from 1080P to 4K
  • Updating Windows 10 to a newer version (e.g. 1803 > 1909 > …)
  • … (this list is not exhaustive)

… Then you need to increase the vGPU Profile size by just one step (2Q > 4Q on a T4) because there wasn’t enough headroom in your original configuration. As a direct result, you halve the density of your GPU (platform), meaning you either need to buy more physical servers, or you have to limit the customers in what they want to do; neither is a good option. If unsure about monitor configurations, again, always go up, so configure for 4K; that way you know you’re going to be covered. The quick sketch below puts some numbers on how one profile step affects density.
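To put some numbers on how a single profile step affects density, here’s a minimal Python sketch. The 16GB T4 framebuffer and the homogeneous-profile rule (all vGPUs on one physical GPU use the same profile) are per NVIDIA’s documentation; the 6-GPUs-per-host figure is just an assumption for illustration:

```python
# Back-of-envelope density check: how many VMs fit per T4 / per host when the
# vGPU profile size moves up a step. GPUS_PER_HOST is an assumption.

T4_FRAMEBUFFER_GB = 16   # total framebuffer on a single Tesla T4
GPUS_PER_HOST = 6        # assumed: T4s your chosen server can take

def vms_per_gpu(profile_gb: int, fb_gb: int = T4_FRAMEBUFFER_GB) -> int:
    """Homogeneous profiles: every vGPU on one physical GPU uses the same
    profile, so density is simply total framebuffer / profile size."""
    return fb_gb // profile_gb

for profile_gb in (2, 4, 8):   # 2Q, 4Q, 8Q on a T4
    per_gpu = vms_per_gpu(profile_gb)
    print(f"{profile_gb}Q: {per_gpu} VMs per T4, {per_gpu * GPUS_PER_HOST} per host")

# 2Q: 8 VMs per T4, 48 per host
# 4Q: 4 VMs per T4, 24 per host
# 8Q: 2 VMs per T4, 12 per host
# One step from 2Q to 4Q halves the host density -- the headroom problem above.
```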

Something else you’ll ideally want to do is monitor utilisation of the CAD users’ existing workstations to see how much resource they’re currently using before deciding on an overall CAD VM profile. You can do this with a great little tool called GPUProfiler ( Releases · JeremyMain/GPUProfiler · GitHub ). The tool was created by a friend of mine (Jeremy Main) who works at NVIDIA and who manages and updates it. It’s a portable .exe so no installation is required; just set the amount of time you want it to monitor for and you can export the results as .csv to check resource utilisation (it’s a fantastic tool). This will give you a pretty accurate idea of how much resource is being used. Run it on as many CAD users’ workstations as possible to get the best range of metrics. Don’t forget to factor in headroom when you configure your spec based on the results! (See the short example below for the kind of summary to pull out of the export.)
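As an example of the kind of summary worth pulling out of a GPUProfiler export, here’s a short Python sketch. The CSV column names and the file name are assumptions, so check them against the header row of your own export and adjust accordingly:

```python
# Summarise a GPUProfiler .csv export before choosing a vGPU profile.
# Column names below are assumptions -- match them to your export's header row.
import csv

GPU_UTIL_COL = "GPU Utilization (%)"      # assumed column name
FB_USED_COL = "Frame Buffer Usage (MB)"   # assumed column name

def summarise(path: str) -> None:
    gpu_util, fb_used = [], []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            gpu_util.append(float(row[GPU_UTIL_COL]))
            fb_used.append(float(row[FB_USED_COL]))
    print(f"Peak GPU utilisation  : {max(gpu_util):.0f}%")
    print(f"Peak framebuffer used : {max(fb_used):.0f} MB")
    # Add headroom before picking a profile, e.g. peak FB * 1.5, then round up
    # to the next profile size (2Q = 2048 MB, 4Q = 4096 MB on a T4).
    print(f"Peak FB + 50% headroom: {max(fb_used) * 1.5:.0f} MB")

summarise("engineer1_workstation.csv")    # hypothetical export file
```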

This is something I see a lot. Customers forget that if they get the vGPU sizing wrong by trying to save money at the POC stage, at best, it’s going to halve their server density in a production deployment when updates happen (which they will at some point). This is really important, and something that I hardly ever see talked about.

Also remember that SSD / All Flash storage is now a de facto standard with VDI / RDSH. Sometimes NVMe is required, but only for higher performance workloads. Typically (there’s always the odd exception), mechanical spinning disks are now considered for bulk storage only.

You haven’t mentioned which protocol you plan to use for the environment (HDX 3D Pro, Blast, PCoIP, TGX)? You need to check with the CAD users about their peripherals (3D Space Mice, Tablets etc) and then make sure they are all supported with your chosen Protocol.

Remember, if in doubt, go up in vGPU Profile size. You can always scale down if it’s too much, but you can’t (easily) go up if you start too low.

Out of interest, which Thin Client solution are you planning to use?

The whole point of my comments above and how I personally work, is to try and mitigate future developments and changes as much as possible so you don’t have to upgrade specifications for the foreseeable future. If you do, it will typically cause issues in one way or another. A lot of things that will cause future issues can be mitigated if taken in to account at the beginning of the deployment. There are always things that catch us out, but the majority of things can easily be planned for. There’s nothing worse than a customer coming along and asking for more performance from a newly deployed platform, only for the system architects / engineers to realise they don’t have it without significant changes.

Apologies, I do tend to rattle on when I get going, but there’s a lot to consider and be aware of right from the start so the correct expectations can be set. When deployments under-perform or go wrong, those deploying them have a habit of blaming the technology, which is rarely at fault; it’s typically that the person deploying it didn’t understand what they were doing. Not saying that’s the case here, just covering all bases :-)

Best of luck! Let us know if you need any guidance.

Regards

MG

Wow, what a summary!!!
Really appreciate your effort.

Regards
Simon

Good day once again.

And holy-moly your input is invaluable.

Clearly the Xeon Silver 4208 isn’t going to cut it. If I’m not going to use the Xeon Silver 4208, I was looking towards the Xeon Gold 6244. I will continue to research and look into the processors you’ve recommended. 30 users will be typical "knowledge workers" and only 3 users will be "power users" requiring access to CAD software (for the time being), and obviously planning ahead/future-proofing as much as possible in the beginning will no doubt save headaches down the road.

I am not presently entertaining the idea of building two servers with different hardware configs. If I’m going to build one "cheap" server for traditional terminal services, and then another "expensive" server for the engineers… I might as well just deploy a terminal server and get the 3 engineers really good desktops. Right now, every employee has their own desktop/laptop PC. Many of which are old and need replacing.

I will try to sit down with the engineers and log their utilization. That’s a brilliant looking little app.

I haven’t deployed spinning disks in any equipment for years, so worry not on that point. I typically use Samsung for my NVMe devices, and Intel Datacenter SSDs for 2.5" applications.

I am presently planning to use Hyper-V for everything.

For thin clients, either Dell/WYSE terminals or Lenovo. (There will also be several users working remotely using RDP for Mac or RDP on their Windows-based laptops.)

Don’t apologize at all. I came here looking for a wealth of information, from people who have successfully implemented these systems. And I have not been disappointed or overwhelmed at all. Rattle on. Please.

Hi

You’ve gone a little bit too far the other way regarding CPU choice. Ideally you’re after a balance of Clock vs Cores unless it’s a specific workload you’re targeting. When you move much above 3.0GHz, you start trading Cores for Clock, which is why 3.0 / 3.1GHz is the sweet spot. You “could” look to configure the server with a single socket populated with the 18 Core / 3.1GHz. This would give you the base configuration to scale up when needed, but keep costs down initially. Obviously consider Memory and PCIe architecture at the same time during configuration.

There’s a couple of ways in which to configure for 30 Concurrent Digital Workers: you can scale up or out. To scale up, you’ll need the whole T4 assigned to a single VM. Configure CPU and Memory accordingly, dependent on testing. As a ballpark, start with at least 8 vCPUs and 32GB Memory and increase as required (I would expect you to need more than that, but those CPU and Memory amounts are a good place to start). My expectation would be that your limiting factor will ultimately be GPU Framebuffer, as this is a hard limit, and I would expect you to hit “approx” 40 > 50 Concurrent users (depending on applications, usage and OS optimisations). If you have a well optimised OS, really light workloads and single 1080P monitors on the client side, then 60 is a strong possibility. But if the users start pushing Google Earth around, you’ll be on the lower end of those numbers. There isn’t a definitive number as there are a lot of variables to consider, but you should be in that area; much less than that and you have a configuration issue in the stack. (There’s a rough framebuffer-per-user sketch below to show where those numbers come from.)
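To show roughly where those concurrent-user numbers come from, here’s a back-of-envelope Python sketch. The per-user framebuffer figures and the reserved overhead are assumptions for illustration, not measured values:

```python
# Rough sanity check for the "framebuffer is the hard limit" point above.
# Per-user figures and overhead are assumptions -- measure your own users
# (browser tabs, Google Earth, Office) before trusting any of these numbers.

T4_FRAMEBUFFER_MB = 16 * 1024   # whole T4 passed to the RDSH VM
RESERVED_MB = 1024              # assumed OS / session host overhead

per_user_mb = {
    "light (Office, a couple of browser tabs)": 256,
    "typical (Office + Chrome + the odd map)": 320,
    "heavy (Google Earth open most of the day)": 512,
}

usable = T4_FRAMEBUFFER_MB - RESERVED_MB
for workload, mb in per_user_mb.items():
    print(f"{workload}: ~{usable // mb} concurrent users")

# Lands roughly in the 30-60 user range quoted above, which is why the real
# answer depends on what the users actually do, not on the spec sheet.
```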

The other option (scaling out) is going to be problematic. Unfortunately, out of all the Hypervisors available that support GPUs, Hyper-V is the worst choice for GPU virtualisation (in my opinion). Microsoft (for some reason) made a real mess of adding this. Hyper-V doesn’t actually support vGPU (Plan for GPU acceleration in Windows Server | Microsoft Docs). You can run GPUs in Passthrough (referred to by Microsoft as DDA) but you can’t share the GPU between multiple VMs. You either need to add the whole GPU to a VM, or none of it. That’s ok (it’s not really ok, but it works) with RDSH (kind of), but you’re limited to 1 VM per GPU. Now, if you were running an M10 which has 4 GPUs on it, you can run 4 RDSH VMs (each GPU passed through to a separate VM). But with a T4, where you’d typically split the GPU and run 2 8A vGPU Profiles to support 2 RDSH VMs, all you can do is allocate the whole GPU in Passthrough. So if you go down the Hyper-V route, you’re going to run into limitations very quickly.

If you were just running RDSH VMs, this would be ok. Not ideal, but ok, as you could just use M10s to scale out, or a whole T4 to scale up. But you also want to support CAD VMs, and here’s your problem. With XenServer, vSphere, AHV or KVM, you can run all those CAD VMs (3 at the moment) on the same GPU; if they need a 4GB Profile you have additional capacity for 1 more, and with a 2GB Profile you could get 5 more on there. However, with Hyper-V, each CAD VM will need its own T4(!) and that’ll blow your cost model right out of the water. (A quick comparison of the two is sketched below.)
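A quick comparison of the two approaches for the 3 CAD VMs, as a Python sketch; the 4GB (T4-4Q) profile per CAD VM is an assumption based on the sizing discussion above:

```python
# Why Hyper-V DDA hurts the cost model: passthrough needs a whole T4 per CAD
# VM, while a vGPU-capable hypervisor shares one T4 between several CAD VMs.
import math

CAD_VMS = 3
PROFILE_GB = 4   # assumed T4-4Q profile per CAD VM
T4_FB_GB = 16

# vGPU (XenServer / vSphere / AHV / KVM): share each card between VMs
t4s_with_vgpu = math.ceil(CAD_VMS / (T4_FB_GB // PROFILE_GB))

# Hyper-V DDA: one whole card per VM
t4s_with_dda = CAD_VMS

print(f"T4s needed with vGPU profiles: {t4s_with_vgpu}")   # 1
print(f"T4s needed with Hyper-V DDA  : {t4s_with_dda}")    # 3
```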

Great that you only use SSD / NVMe!

Regarding Thin Client choice, make sure you test various models / specifications before deciding on a final model. WYSE, IGEL, 10Zig will all send you evaluations if you request them. It’s also well worth considering using something from NCOMPUTING ( https://www.ncomputing.com/ ) who will also send you one for testing. The NCOMPUTING clients are great little devices that are inexpensive and perform really well, they also have a management platform so can easily be configured before deployment. Regardless of which you select, don’t underestimate the importance of a “good” end point. Just because the back end has high performance components where the heavy lifting is done, doesn’t mean that the end point can be skimped on. Your choice in Protocol will also impact your choice in Thin Client. Treat the solution as an end-to-end system, and every component impacts the user experience.

Regards

MG


I will take what you’ve said regarding Clock Speed vs Cores/Threads to heart.

30 users is the TOTAL number of employees at this point. I wouldn’t expect concurrent users to be above 20.

Going with a single CPU for cost-savings isn’t much of a concern. At least not at this point. We’ll see where we’re at when I start putting the final numbers together. (As a side note, where can I see Nvidia licensing costs?)

I also realize that HYPER-V is the "worst" hypervisor currently available, but, it’s what I’ve been using for over a decade, and I have no experience with VMware, Citrix, Linux, etc.

I don’t mind the limitation of dedicating 1 physical GPU per VM. Hence why, initially, I asked about a quad-GPU card (M10) to power 4 VMs. Is there a card like the T4, but with multiple GPUs per PCB?

If I have to start re-thinking my methodology, I certainly will. As Leela’s office poster says: "You gotta do what you gotta do."

Hi

Yes, official pricing is available here: https://images.nvidia.com/content/grid/pdf/Virtual-GPU-Packaging-and-Licensing-Guide.pdf. However, depending on your location (I’m UK based) I can provide you with a quote if needed, or you can use the NVIDIA Partner Locator ( Find an NVIDIA Partner | NVIDIA ) to find someone who you’re more familiar and comfortable with and / or local to.

Going with the single populated socket approach is a good way to build in scalability at a later date while keeping setup costs to a minimum until that extra density or performance is needed. A single 18 Core / 3.1GHz CPU would easily support your 30 (20 concurrent) users (including the 3 CAD VMs). The largest expense outlay is always the first one. After that, components and resources can start to be shared, and scaling therefore typically becomes cheaper than the initial outlay. Unless you need everything populated from the start, it’s better to start small but ensure you account for scaling up; otherwise, just getting off the ground can be expensive.

Hyper-V … When I say worst, it’s important to keep that in context. For GPU virtualisation where you want to share the GPU between multiple workloads or users (which is typically where everything is moving to now), yes, Hyper-V is currently the least favourable to use, as Microsoft decided to do their own thing away from the other Hypervisor vendors (RemoteFX, which was limited anyway and still lacking full functionality, but has now been discontinued). Due to a technology shift, it’s now becoming difficult to find consistent use cases for it where you can get the best out of the technology stack, as is proven in your use case. That said, GPU-P (GPU Partitioning) is coming. However, how this will work with Hyper-V and NVIDIA is unclear at the moment. I’ve not seen much about it to date, but it’s always worth keeping tabs on Azure for indications of what may feed down to Hyper-V in the future. If you wanted to use another Hypervisor but are unsure due to experience, there are ways around that …

The only publicly available, supported multi-GPU boards are the M10 and M60. There is nothing newer than that which can be purchased with multiple GPUs on one board. After Maxwell, NVIDIA moved to a more software-defined offering. This allows for greater development and flexibility; you just need to get past Maxwell to get the full benefits of it.

If you wanted to use Hyper-V with the M10 and use 1 GPU per VM in DDA, with scalability for adding more M10s again running in DDA (server dependent), then this will work. But at this stage it’s a limited solution (due to the functionality of Hyper-V and the age of the M10) and you won’t be getting the full benefits out of the technology; importantly, though, it will work. Just be aware of the limitations :-)

Regards

MG

Providing the customer with a crippled solution to start with, due to my own shortcomings, obviously isn’t ideal. Clearly this technology is rapidly changing, and staying ahead of the curve is difficult. It looks like the CPUs are going to be far more expensive than the GPUs any way you slice it. (I haven’t yet had a chance to look through the licensing costs.)

Sharing a T4 for the RDSH users doesn’t concern me. It’s not like 20 of them are going to be using Google Earth at the same time.

Sharing a single T4 for 3x engineers all trying to use AutoCAD at the same time, however, does concern me. Is the card robust enough for that?

Hi

Barring anything too out of the ordinary in terms of working habits (which is why it’s really important to monitor your customer’s utilisation and then have them test on a POC before finalising a specification), a single T4 will be more than capable of handling those CAD users. This is due in part to the way in which CAD users typically work, but also to the latest Turing architecture and encoders; however, as mentioned above, don’t skimp on the CPU base clock speed.

If the CAD users are doing any design or "creative work" other than CAD (rendering, for example) then you will need to review the technology options, just to be safe, and make sure they’re still correct. Again, monitor their utilisation first (including all applications, to make sure nothing’s been missed off the test) to confirm how much resource they currently use, then build your platform accordingly. A good indication is to look at their current hardware specifications. If they’re powerful, then they may be doing something resource intensive. If they’re low powered, then maybe they need an upgrade so they can work more freely; it’s always nice for users to move to a more powerful machine, so deliver this if possible.

I did touch on this further up, but now you’ve mentioned Hyper-V and RDP connectivity, it’s just reminded me … Unless I’m mistaken (wouldn’t be the first time), I’m pretty sure RDP doesn’t support USB forwarding. What this could mean is that if your CAD users use 3D SpaceMice (like these: 3Dconnexion UK - SpaceMouse, CadMouse, Drivers ) they may not work in your VMs over RDP. It’s well worth speaking to your CAD users about their peripherals to make sure you know what they use, and those 3D SpaceMice are pretty much industry standard (they’re beautiful, very high quality precision devices. I have one myself; it’s like working with a piece of jewellery). I’ve done a little digging for you, and it would appear that the users in this forum may have a potential solution, although I can’t confirm or recommend it as I haven’t used it: Remote Desktop use of local Space Navigator - 3Dconnexion Forum. Again, you may not need it and it may not be an issue anyway, but double check with all of the CAD users, and make sure you don’t get caught out by their peripherals.

The main thing to be aware of with the RDSH users is Framebuffer and Encoding. As you’ve worked with RDSH a lot, I’m sure you’ll understand just how much memory Google Chrome or other modern browsers can consume when they have multiple tabs open. Even things like MS Office "expect" there to be a GPU in the system and will look to use it by default. When you monitor the GPU utilisation on a loaded RDSH server, you’ll be surprised where the limitations are :-)

Don’t beat yourself up, it’s difficult to stay up to date as this kind of technology is evolving fast and we all have our shortcomings. Mine is Networking. I just never got in to it in my early days. I know what my platforms need from a Network and know the very basics, but configuring a Cisco Switch, Router or Firewall would leave me struggling. It’s just not my thing.

I’m going to drop you a quick PM on here …

Regards

MG

to MrGRID: "I also realize that HYPER-V is the "worst" hypervisor currently available, but, it’s what I’ve been using for over a decade, and I have no experience with VMware, Citrix, Linux, etc."

I do not see any difference between MS DDA and other passthrough techniques: all of them use the same PCIe functions for their own purposes.

There is only one difference: there are no profiles to divide such cards (a physical PCIe function), like the NVIDIA T4, into virtual PCIe function resources the way it is done for other hypervisors.

Could anybody explain what the word "GRID" means in the MS Windows Server driver if such profiles are not supported?

If T4 resources can be partitioned in ESXi, Xen and KVM, why can they not be divided in the same manner with MS DDA?

Hi

As I mentioned in a previous post further up the thread, the term "Worst" needs to be kept in context. "Worst" being its overall graphics support. I’m sure Hyper-V may have other advantages, but as I solely work with GPU workloads, I stopped using it many years ago so am not able to articulate them.

Passthrough and DDA are the same thing. It’s just Microsoft trying to look like they’re doing something different by calling it something else.

When using vGPU, you do not pass the GPU through to the VM(s) (there is no PCIe Passthrough); you allocate a portion of the GPU’s Framebuffer through the vGPU Manager, and each VM’s access to the GPU’s resources is time-sliced by its Scheduler. This requires virtualisation, and this is the part that Microsoft do not support.

For clarity, NVIDIA vGPU (originally named "GRID") is a package of components (Software and Hardware) that form the solution. By using specific GPUs with specific Software, you are using vGPU (GRID).

With all of that said, if you look at what NVIDIA have done with the A100, then Microsoft might be an ok choice once NVIDIA release the new graphics line for Tesla and Quadro. We’ll have to wait and see what gets released …

Regards

MG
