13 months with NVidia GRID and XenServer

Virge · November 23, 2016, 9:33pm

Morning all,

As I enter my 2nd year on this journey with Citrix and GRID, we run 90 Autodesk employees on a myriad of configurations. I have to say that this tech is still what the Toyota Prius is to a McLaren P1.

We have spent around $460,000 implementing the solution and even though it sort of works, if I had my choice again, I would have just bought 90 workstations @ $2200 each. Would have saved myself $250K and would have a smaller workload.

Let’s go through why, and I would ask fellow readers to chirp in with their opinions.

AutoCAD workspace is very hard to configure over Citrix, even with mouse cycles and various other reg hacks. It is still inaccurate, laggy and a pain.
Citrix support is non-existent for grid. You have to trawl the web for various blogs, twitter for more updates and 3rd party tools to tune the system to make it remotely workable.
NVidia Grid Support - well, this is all I can find from the supplier, and a tonne of marketing stuff in social media
We use K2, it was the best i could buy last year. K220 is not powerful enough, K240 wont do 3D very well, K260 is usable and a full GPU is what the user wants for a decent drawing. So at best i can put 8 people on a $35K server. That is $8K per user.
The software is very buggy. Right now our Win10 image will boot with a K240, but if i change to anything higher - i get continuous reboots.
The incessant patching, i think we are on Patch 17 for Xen server and we just had a point release for the GRID cards, which as above killed half my workforce with the K260. (K260 works on Windows 7 though…)
The other 3D apps are quite good, SpaceGass, inventor, Navisworks and Revit all seem to be much better to use than AutoCAD. But they all need 2GB vid cards which is the 260 profile.
Management is easy as you only have the one workstation image to update. You do need to update the personal vDisk on each machine after each update - allow half an hour of downtime for each machine whilst this is done.
Network traffic and bandwidth is just weird. We run a LAN setup and some VM run at 60Mbit for about 20 minutes at a time and others sit at around 6. NFI why.
Power management and getting VM’s to switch on and off when you want them to is a bit of hit and miss, the DC have power management - but it doesn’t work, so you have to use the shell.
Dont get me started on NVEnc and the various video modes…

Actually - I will stop there as I am starting to sound like a whingey old man. The takeaway from my experience is that we have the system working, but both the end users and the IT team wish we hadn’t implemented it. Whilst the tech seems good and we can all make a demo that looks good, it is simpler and cheaper just to run workstations and servers, whilst using DFS or PeerSync to organise the data between sites.

The main problem is that you are on your own when you implement this, Citrix support has no idea what GRID cards are and you are on the bleeding edge working around bug after bug after bug. Every new update we get from Citrix or GRID may fix one issue, but the next few days you are working around new ones. I have a team of 5 guys all with Citrix Qualifications and we are just tired, worn out and the last GRID driver update that broke the 260 profile has forced me to write this.

We are in too deep to change it back now - so we will continue to try and get this to a professional functional level, maybe the M60 cards are better, but do i want to throw another $50k at video cards that are only a year old ???

i wish you all the very best, thanks for reading this so far.

sschaber · November 24, 2016, 1:47pm

Hi Virge,

I’m sorry to hear that your GRID solution is not working as you would expect, but I would like to comment on a few things. First of all, there is much more to take into consideration than just Citrix and the GRID card. AutoCAD in particular is one of the worst applications to run on VDI because it uses a server-rendered cursor, which adds latency and causes the laggy mouse feeling. You should really start testing with NVENC to reduce the latency, which definitely helps to massively improve the AutoCAD user experience.
Apart from that you didn’t mention the hardware sizing of the hosts at all.
Autodesk products also need a CPU with high clock frequency (>3Ghz) and to be honest I don’t understand why AutoCAD should need more than a K220Q profile. It’s all about Framebuffer and for Win7 a K220Q profile should be fully OK in most cases. For sure Win10 is a different story as the OS itself needs much more FB and therefore you will need the next bigger profile in most cases.

I’d like to better understand what issues you had with the Citrix support in terms of GRID and with your GRID updates in general as I didn’t hear this before and I’m happy to assist you guys to sort this out. Just send me a PM.

Best regards

Simon

Virge · November 25, 2016, 2:30am

Howdy Simon,

Thanks for the reply.

I agree on the rendered cursor in VDI - it is awful and our main issue until Revit is rolled out next year. Our models are quite complex and we need the 240 just to get twin screen and NVENC running.

This issue with K260 not working in Win10 has "aggravated" me more than normal as i cant see why this would be an issue… I am presently building a new Win7 VDI with all the AutoCAD software to see if we can alleviate some of the issues.

We can get the mouse down to something usable, but a lot of our employees are plain out refusing to use the system due to the lag and delays.

all of our servers are dual intel xeon 2695 with 256GB of RAM and RAID10 for about 4TB of SAS 15K, they are damn meaty machines. Twin K2 cards in each machine. base setup is 12 on the K240 and 8 on the K220 for each GPU

The only downside is the architecture being 10/100 POE as opposed to Gigabit.

BJones · November 25, 2016, 12:00pm

Hi Virge

Interesting read above, thanks for taking the time to post on here.

Would you mind adding a few more details so we can better understand your environment?

You mention your employees have multiple monitors, what resolution do they run at?
What’s the exact spec of the XenDesktop VM that each user has access to? (CPU Cores / RAM & K240 / K220)
What do your users have as endpoint devices and can you please detail the specs? (CPU (Cores and Clock), GPU, RAM, Windows or Linux OS)
Which version of AutoCAD are you running? (Also, is it fully patched and up to date?)
Which version of XenDesktop are you running?
Which version of XenServer are you running?
Which vendor server do you use? (HP, IBM, Cisco, Dell, SuperMicro)
What kind of Keyboard / Mouse combinations do your users have? (Are they generic Keyboards and Mice or are they designed specifically for CAD like something from 3Dconnexion? Any wireless Keyboard / Mice combinations?)
As you’re on a LAN, I take it you do not use a Netscaler and go direct to Storefront?
Which Citrix Policies do you have assigned to the VM Delivery Groups?
I take it you’re using MCS for your images, not PVS?
Are you running your VMs from a SAN or are all the disks local to the servers?
Have you measured network latency / contention on your LAN?
Does the experience change depending on load? (Is it the same with 1 user on the system as it is with 90?)

Sorry, that’s quite a few questions, but you appear to have quite a few issues, so it’s good to know what we’re dealing with before we get going, and this list of questions is certainly not exhaustive ;-)

As Simon mentions above, it’s not just Citrix or NVIDIA you have to consider, it’s the entire end-to-end system.

=======

Let’s see if we can offer any advice with those issues …

Q: AutoCAD workspace is very hard to configure over Citrix, even with mouse cycles and various other reg hacks. It is still inaccurate, laggy and a pain.

A: Agreed with Simon and yourself. The AutoCAD rendered mouse is not great, but it can be tuned with the correct peripherals, registry keys and system specifications. I’ve delivered AutoCAD to a company in a different country and out-performed one of their local CAD workstations in terms of user experience and performance (which is absolutely insane when you think about what is actually happening in terms of connectivity), so I know it’s possible.

Q: Citrix support is non-existent for grid. You have to trawl the web for various blogs, twitter for more updates and 3rd party tools to tune the system to make it remotely workable.

A: Citrix first and foremost support Citrix technologies, not NVIDIA. So turning to them for support on specific GRID related issues (performance or other) may not be the best way to find a resolution. However, the Citrix and VMware forums are a good place to start overall. I mention VMware even though you run Citrix because some of the GPU issues people have are not actually Citrix or VMware related, but as they use either VMware or Citrix systems, they post in those forums.

Unless someone posts in the wrong section - for Citrix, you want to be looking at these:

For VMware, you want to be looking here:

https://communities.vmware.com/welcome

(Scroll down and on the right there is an "NVIDIA VMware Community" with sub-options)

Those are just the GPU areas. People also post in the more technology-specific areas (such as the hypervisor or desktop delivery system areas (Horizon / XenDesktop)), so it’s worth checking those too.

Then you have the various Blogs. It’s always guess-work trying to work out who knows what they’re talking about against those who don’t, but here are some great ones well worth keeping an eye on:

Rachel Berry (Ex CAD Architect at Siemens, Product Manager for Citrix HDX Graphics, and now NVIDIA GRID Product Manager. Extremely knowledgeable lady, great blog to follow!)

Marius Sandbu (Senior Systems Engineer and NVIDIA GRID Community Advisor)

Thomas Poppelgaard (Independent Technology Consultant - Really knows his stuff!)

Magnar Johnsen (Another great blog, IT Engineer and Architect also NVIDIA GRID Community Advisor. Has designed some really great GPU monitoring tools)
http://www.virtualexperience.no/

Benny Tritsch (Technology Evangelist, Architect, Market Analyst and NVIDIA GRID Community Advisor)
http://drtritsch.com/

And there are many many other really great quality blogs out there (I could carry on for ages listing them all). If you aren’t already, the best thing to do to find them, is follow the tier 1 vendors on Twitter, then look to see who they follow and who Tweets them. This is where you’ll find A LOT of information.

And last but certainly not least, you have the NVIDIA GRID Forums with varying technology areas for both Citrix and VMware and the hardware vendors as well. I see this is your first post, welcome to the forum :-)

Hopefully the above will start you off in the right direction, and if it doesn’t, then let us know :-)

Q: NVidia Grid Support - well, this is all I can find from the supplier, and a tonne of marketing stuff in social media

A: Regarding actual GRID support, the official party line from NVIDIA is that with Kepler (K1 / K2) you need to go back to your re-seller for support. With Maxwell (M6, M60, M10) and newer, you have the option of direct NVIDIA support through their SUMs program.

There are also System Integrators and specialist GPU / 3D Consultancies that may be able to help. NVIDIA Partners can be located here: Find an NVIDIA Partner | NVIDIA. These can be used for support, and some of them offer application performance based SLAs if a managed service approach were to appeal.

Q: We use K2, it was the best i could buy last year. K220 is not powerful enough, K240 wont do 3D very well, K260 is usable and a full GPU is what the user wants for a decent drawing. So at best i can put 8 people on a $35K server. That is $8K per user.

A: Can you please confirm that figure? When dividing $35k by 8 users I come out with a different answer?
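For reference, the cost figures quoted in this thread can be checked in a few lines of Python. This is only a sketch of the arithmetic using the numbers as posted ($35K per server, 8 users per server, $460,000 project spend, 90 workstations at $2,200 each); the function names are made up for illustration.

```python
def cost_per_user(server_cost: float, users: int) -> float:
    """Hardware cost per user for a single host."""
    return server_cost / users


def workstation_saving(project_spend: float, seats: int, price_per_seat: float) -> float:
    """Difference between the project spend and buying physical workstations."""
    return project_spend - seats * price_per_seat


if __name__ == "__main__":
    # Figures as quoted in the thread.
    print(cost_per_user(35_000, 8))                 # 4375.0
    print(workstation_saving(460_000, 90, 2_200))   # 262000
```

On those figures, the server hardware alone works out to roughly $4.4K per user rather than $8K; the $8K figure may include costs not itemised in the post.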

Regarding GPU density, the K1 / K2 was the first generation of GRID cards, so there were bound to be limitations. These limits have been raised with the Maxwell architecture and the M60 which has double the performance, double the capacity and offers many more H.264 encode streams compared to Kepler. Unfortunately, it is also double the price and just for the Cherry on top, NVIDIA have introduced a CCU licensing model as well. Make of that what you will, I know many customers are very vocal about it… But it is what it is.

AutoCAD performance is CPU limited, not GPU. If you want to improve performance, you need a faster CPU. When you move to Revit, this may not solve your problem, it may make it worse and the performance inconsistent. Revit renders on CPU, the GPU is not used for rendering. If you plan to render in Revit, you will need a fast CPU or performance will suck.

If you’d like an example of how Revit works and the issues it can cause for an under-spec’d system, here’s a link: https://gridforums.nvidia.com/default/topic/1020/xenapp-with-nvidia-grid/revit-cpu-load-on-xenapp-7-11/ Be sure to check out the attachment in the first post.

If you plan to do something a little clever (like use the Octane plug-in to make Revit render on the GPU), then that will massively help as it will remove a lot of CPU load, but I’d still recommend a faster CPU than what you currently have, 2.3GHz is not a fast CPU.

Q: The software is very buggy. Right now our Win10 image will boot with a K240, but if i change to anything higher - i get continuous reboots.

A: I’ve not experienced this before so cannot really offer much advice. Obviously, as a starting point, I’d recommend making sure that you’re using the latest version of Windows 10 (I believe 1607 is the current version) and XenServer 7.0 (and not an earlier version), which is fully up to date (usual support requirements, however slightly contradicting my comments further below), and that you’re running the latest GRID drivers. If all of that is true, then we start looking at other options … Have you removed any Ghost devices within the Windows OS? Maybe this is causing an issue. If you build a clean, unaltered VM and assign the K260 / K280 vGPU profile without having previously assigned a vGPU profile, does it still reboot? Without actual investigation, it’s all guess work …

Q: The incessant patching, i think we are on Patch 17 for Xen server and we just had a point release for the GRID cards, which as above killed half my workforce with the K260. (K260 works on Windows 7 though…)

A: XenServer Ely (currently in Alpha) has support for "Live-Patching". I believe when this reaches production it will be fully available, so your "incessant patching" schedule should become much easier (You know you don’t have to install every patch Citrix release? If it’s a stable, secure system, leave it alone until you do actually need the patches). Add that to the fact that XenTools is included with Windows Updates rather than Citrix Updates, and you can automate the entire process now without having to reboot any hosts, which is nice :-)

Also, just because NVIDIA release new drivers, doesn’t mean you need to upgrade to them (forcing you to carry out a Master Image update for every Catalog that uses vGPU and simultaneous Host update for every Host that has a GPU installed as the Host and VM drivers must match) if you have a stable platform that is performing as it should. If they are introducing additional performance or bug fixes and you need them, then fair enough.

Always test your updates, especially drivers, on a QA XenDesktop Catalog before pushing it out to your Master Image. I know this can sometimes be difficult, because you need a spare Host so the hypervisor drivers match the VM, but hopefully you have a QA area on your platform you can use for this. Failing that, if you’re using MCS / PVS, just roll it back until you have resolved the issue on your QA Catalog / QA host.

Q: The other 3D apps are quite good, SpaceGass, inventor, Navisworks and Revit all seem to be much better to use than AutoCAD. But they all need 2GB vid cards which is the 260 profile.

A: Excellent, glad you are having some success. Just watch out for those CPU requirements mentioned earlier …

Q: Management is easy as you only have the one workstation image to update. You do need to update the personal vDisk on each machine after each update - allow half an hour of downtime for each machine whilst this is done.

A: This is as expected. Faster storage and network will reduce the time this takes.

Q: Network traffic and bandwidth is just weird. We run a LAN setup and some VM run at 60Mbit for about 20 minutes at a time and others sit at around 6. NFI why.

A: When delivering interactive 3D workloads over Citrix that require precision input, the network performance, latency, contention and stability are absolutely critical for a great user experience as Citrix is an adaptive technology… As you are having random bandwidth consumption on an already limited network, this will be worth investigating. As a heads up, Citrix very much prefers a slower, more stable network, compared to a high performing but unstable one. If you have a network that is both high performance and stable, then that’s great. Not in any way suggesting Citrix works better on a slower network, just that the key requirement is stability! :-)

Something you could try to stop those random bandwidth peaks: set session bandwidth limits in the Citrix policies. This way, the system knows exactly how much bandwidth it has to play with.
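As a rough way to reason about what a session limit buys you, here is a minimal Python sketch. It uses the 60Mbit and 6Mbit session figures and the 10/100 endpoint ports mentioned in this thread; the 1000Mbit shared link and the 20Mbit cap are assumptions picked purely for illustration, not recommended values, and the function names are hypothetical.

```python
ENDPOINT_PORT_MBIT = 100   # 10/100 ports at the desk, as described in the thread
SHARED_LINK_MBIT = 1000    # assumption: a single shared gigabit link


def port_utilisation(session_mbit: float, port_mbit: float = ENDPOINT_PORT_MBIT) -> float:
    """Fraction of an endpoint port consumed by one session."""
    return session_mbit / port_mbit


def sessions_that_fit(link_mbit: float, session_mbit: float) -> int:
    """How many sessions at a given rate fit on a shared link."""
    return int(link_mbit // session_mbit)


def capped(session_mbit: float, cap_mbit: float) -> float:
    """Bandwidth a session can use once a per-session limit is applied."""
    return min(session_mbit, cap_mbit)


if __name__ == "__main__":
    for rate in (60, 6):
        print(rate, port_utilisation(rate), sessions_that_fit(SHARED_LINK_MBIT, rate))
    # With a hypothetical 20 Mbit cap, a bursting session is held to the cap.
    print(capped(60, 20), sessions_that_fit(SHARED_LINK_MBIT, capped(60, 20)))
```

A session bursting to 60Mbit takes more than half of a 100Mbit desk port, which is one reason a predictable cap can feel better than an uncapped, bursty session.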

Q: Power management and getting VM’s to switch on and off when you want them to is a bit of hit and miss, the DC have power management - but it doesn’t work, so you have to use the shell.

A: Difficult to troubleshoot without seeing it. You mention that your team are Citrix Certified, so I’m going to assume it is all configured properly. If you have Citrix support, despite not knowing about GRID, they should be able to help with this.

Hope that helps with a few of your issues (I now need a coffee and a break, that took a while to write :-) )

Regards

Ben

RachelBerry · November 25, 2016, 6:29pm

Wow thanks for that Ben! :-D

RachelBerry · November 25, 2016, 6:34pm

Hi Virge,

As Ben says I’m a product manager for GRID and I will get some more info together for you next week (I’m in UK so after hours now).

Could you let me know:

what country/area you are in
what version of Win10 - I do know of an issue in Win10 AU and up that is resolved in last weeks GRID 4.1 release (there is a KB article queued to be published on it)
What version of GRID / drivers are you on

Q: Network traffic and bandwidth is just weird. We run a LAN setup and some VM run at 60Mbit for about 20 minutes at a time and others sit at around 6. NFI why.

This sounds like a classic symptom of an anti-virus update in a guest or similar…

Best wishes,
Rachel

BJones · November 25, 2016, 6:40pm

No problem :-)

Virge · November 25, 2016, 8:30pm

Thanks Ben,

Let me digest all that and I will be back to you shortly. Will provide full specs and see if I can bring back the enthusiasm. I must admit that this has broken me.

BJones · November 25, 2016, 8:51pm

Sure no worries, writing all the above nearly broke me too ;-)

Virge · November 25, 2016, 9:45pm

Righto – here we go

Server Spec. best I could buy last year
Dell R730 Host running Xen Server 7.0, running patch 17
Twin 2695v3 CPU with 256GB of RAM and 4TB of RAID 10 @ 15K SAS
Twin K2 video cards on each host
6 NICs in each host, 4 in use, 2 x bonds.
Broadcom cards with NIC offloading turned off on every setting we can find.
LACP load balancing based on IP and Port.

Workstation setup
Basic collection of Lenovo ThinkStations. S40, S60 and some older S20 with various Quadro cards,
Got some new IBM Tiny’s with 16GB and i5 and i7 CPU as well, shared onboard intel video
XEON CPU and at least 12GB of RAM
Running on 10/100 POE daisy chained off their phones. (no Gigabit available)
Basic IBM keyboard and Mouse setups. (happy to change if required)
Windows 7 patched up to October

VM Setup – (Lets assume 10 of the 20 guests are in use at any one time)
We use PVS
We have 2 x PVS servers per location and 2 locations, with one remote office running on 100Mbit fibre
All VM’s are run locally – No SAN
Each host has

8 x K220 with 4 CPU and 8GB RAM
12 x K240 with 8CPU and 12GB of RAM
Running latest NVidia drivers from a week or so ago.
Windows 7 patched to October
Windows 10 Patch till last week. Anniversary edition.

Citrix versions
Latest and greatest on all 7.11 for everything, we run all authentication through the NetScalers also running latest and greatest firmware. Happy to adjust if you think direct is better.
All systems are run over a LAN setup.

Citrix policy - Now it gets tricky.
I spent a few hours with Magnar going through this. And we came up with below.
DCR disabled
Wallpaper off
Display memory 65536
Dynamic windows off
Extra colour compression disabled
Legacy graphics off
Menu animation allowed
Preferred colour depth 24 bit
Target frame rate 60
Minimum frame rate 10
Async write disabled
Hardware encoding enabled
Video codec for active changed regions
Show Windows contents disabled
Visual quality HIGH
Network architecture
We have all servers going into a Cisco Gigabit switch
The switch has a single 1GB uplink to 3 other 10/100 POE switches
All workstation run their workstations from the POE switches piggy backed off their VOIP phones
The switches are 3750 catalyst switches and I have no idea how to find a bottleneck of data on those things. Any suggestions here would be brilliant. (I prefer HP models)

The experience is the same if there are 4 people on or 30 people on.

REG Hacks and other stuff
Mouse timer adjusted to 5 or 1 or 10 through group policy, this helped a lot.
Monterey enabled = true
NVENC – pretty sure now done with policy.
Bram’s Remote display analyser will not run on our Windows 7 boxes… it just crashes.
Magnar’s GPU Perf will run, but with 7.11 we get no GPU usage and can’t tell if NVENC is enabled.
Mouse cursors in control panel has been set to nothing

AutoCAD setup – we run 2016 and 2017
Options has video acceleration on.
All the other settings are off.
Under display modes. “Apply fill” and all those checkboxes are off.

AutoCAD experience with settings.

2D drawing… turning off graphics acceleration is better than having it on.
3D drawing – turning graphics acceleration off is sometimes better than having it on.
CPU use is high with both settings – 30-35% a lot of the time. Even when I give them 8 cores to play with. (this I think is most telling about CPU vs GPU, but can’t confirm with tools.)

CPU Notes.
The 2695v3 was the best I could get last year. It spins up to 3.30Ghz. Surely that is enough for AutoCAD ?

The Windows 10 issue.
We have all the latest and greatest in here. The Guest Tools software really is a POS to install and maintain. It has about 9 PV network drivers installed that if I remove – the system crashes completely.
Guess I will be building yet another Win10 VM

MCS vs PVS
Read heaps on this and as we have local drives and not SAN, we went PVS, it was also easier to maintain and update.

With regards to Updates
We don’t have a stable system; we have never had one. So we keep hoping the next update will get us to a level and stable release so we can just leave the bloody thing alone. The black screen thing with K220 with the last NVidia update killed our entire team. Worked on our base 240 test machine but ALL our 220 boxes crapped out.
If we can get a stable and usable system working… I won’t be touching this ever again. Period.

BJones · November 26, 2016, 12:38pm

Firstly, apologies, it’s another essay! … Grab a coffee before you get started …

Thanks for the additional system information, much appreciated and it is all extremely relevant, right down to the user peripherals! It is all part of the system!

So, reading back through right from the top so we know where we are:

Citrix Support

As said, Citrix support will focus primarily on Citrix technologies. They may have specific knowledge about other technologies in your stack, but it’s not always a certainty, and there are a lot of times when the customer is actually doing more advanced things than the vendor and the vendor simply doesn’t know the answer to an issue. Also, depending on who picks up the phone or answers your email or forum post, you will get a different response (you shouldn’t, but in the real world, you do). I’ve provided insight into other locations for support and information, so I hope you now have additional sources to help with any issues, and no, you’re not on your own when you implement this stuff ;-)

NVIDIA GRID Support

Kepler GPUs require you to go back to your place of purchase in the first instance for hardware support. That place of purchase may also be able to offer support on configurations and usage, unless they’re just a hardware vendor. As with Citrix support above, you now have information on where you can get support and find additional references. If you require direct NVIDIA Support, you’re going to need those Maxwell (or newer) GPUs with SUMs.

GRID K2 not being powerful enough.

As a first generation technology, it has its scalability and performance limits. NVIDIA have seen the limitations of Kepler, listened to their customers and done an absolutely cracking job with the second generation Maxwell architecture in increasing those limits and adding features and functionality. Wait till you see what Gen3 (Pascal) can do!! If you need more GPU power but need to keep the density, you’ll need to upgrade to M60s, and yes, they do work in an R730. Or you can hold out for the P60s when they eventually get announced at some point (we all know they’re coming, but I’ve no idea when) …

GRID Software being buggy

I’m not sure on this one. I think you need to do some internal testing to make sure the issue is repeatable on a clean build W10 just to make sure it isn’t an image issue. I’m not saying this is an isolated issue, but I haven’t heard of it before, maybe others who are reading this thread have done, in which case, please let the community know so NVIDIA can investigate. That said, the drivers were only released a week or so ago, maybe more cases will appear. However, if you have now lost vGPU profiles that are required, I suggest like any other update that has been unsuccessful, you roll it back to the previous PVS image until you have isolated the cause.

Managing XenServer updates

These should be better with the next Ely release; that fix is on its way. Until then, unless the release notes specifically say a patch will fix your issue, or it is a functionality or security patch you need, there’s no immediate rush to install them. Likewise with the NVIDIA drivers: unless they give you stability, required performance, or a bug or security fix, there’s no rush to install them.

The 3D Apps that need 2GB profiles

When you have your K260 profile back, this will be resolved. If you have any additional frame-buffer requirements, you will need to scale up (M60) or scale out by purchasing additional R730s of an equal spec to what you have. That said, the K2 may not be available for much longer.

Management of your platform.

You only have 1 Master vDisk to maintain and update. Hopefully you should have 3 vDisks for this purpose; Past, Present, Future (Think of it as a GFS disk rotation). This gives you an easy roll back and a granular way of introducing updates into the platform. Because of the way in which PVS and MCS work, there is no reason to hit all users with the same update at the same time, in fact this is something I strongly discourage for obvious reasons. Using GRID in a platform adds another level of complexity to the update process, as the GRID drivers in the XenServer (or ESXi) Host, and the GRID drivers in your vDisk must match, so must be updated at the same time.

You could do this in a couple of ways. You can do everything at once and hope for the best, or you can introduce the updates in a granular way and assess differences between the updated image and previous image. There are different ways in which you can control VM startup location. You can either limit the vGPU profiles on a XenServer Host in XenCenter, or you could run multiple XenServer Pools with XenDesktop Catalogs assigned to each. Both XenServer Pools would be identical in terms of capacity, performance and vGPU configuration, but Pool 1 would have its Hosts updated first, with the updated vDisk being assigned to those VMs’ XenDesktop Catalog, followed by Pool 2 and the second Catalog after testing of the changes has been successful. Something along those lines, lots of options to play with.
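To make the pool-by-pool idea a little more concrete, here is a minimal Python sketch of a staged rollout. Everything in it is hypothetical: the `Pool` class, the release labels and the `validate` callback are illustrative stand-ins, not Citrix or NVIDIA APIs. The only rule it encodes is the one described here: host and vDisk GRID drivers are updated together, one pool at a time, and the rollout stops (and reverts that pool) as soon as testing fails.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Pool:
    name: str
    host_grid_release: str    # GRID release on the XenServer hosts (hypothetical label)
    vdisk_grid_release: str   # GRID release inside the vDisk assigned to the Catalog


def drivers_match(pool: Pool) -> bool:
    """Host and guest drivers must come from the same GRID release."""
    return pool.host_grid_release == pool.vdisk_grid_release


def staged_rollout(pools: List[Pool], new_release: str,
                   validate: Callable[[Pool], bool]) -> List[str]:
    """Update pools in order; revert and stop at the first failed validation.

    Returns the names of the pools that were updated successfully.
    """
    updated = []
    for pool in pools:
        previous = (pool.host_grid_release, pool.vdisk_grid_release)
        # Host and vDisk are always moved together so they never mismatch.
        pool.host_grid_release = new_release
        pool.vdisk_grid_release = new_release
        if not validate(pool):
            # Roll this pool back; later pools are left untouched.
            pool.host_grid_release, pool.vdisk_grid_release = previous
            break
        updated.append(pool.name)
    return updated


if __name__ == "__main__":
    pools = [Pool("pool1", "4.0", "4.0"), Pool("pool2", "4.0", "4.0")]
    # Pretend testing passes everywhere.
    print(staged_rollout(pools, "4.1", validate=lambda p: True))
```

In practice the `validate` step would be whatever acceptance testing you run on the first Catalog before touching the second.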

Network traffic and bandwidth

Covered off above, and as Rachel suggests: is it the VM that is creating the bandwidth, or the underlying endpoint doing some sort of update? If it’s the VM, Citrix session policies may be able to help; if it’s the endpoint device, then you’ll need to investigate and take appropriate action.

Power Management

As mentioned above, check Delivery Group power settings to make sure they are correct. Also, make sure the hosts have the correct vGPU profiles assigned as GRID uses a “Depth First” approach for VM placement, meaning that you could run out of appropriate locations to start VMs with differing vGPU profiles.
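To illustrate how a host can run out of places to start a VM, here is a small Python sketch of depth-first placement. It is a simplified model, not XenServer's actual scheduler. It assumes that a physical GPU hosts only one vGPU profile type at a time, and that the per-GPU capacities are 8 (K220Q), 4 (K240Q), 2 (K260Q) and 1 (K280Q); treat both as assumptions to check against the GRID documentation for your release.

```python
from typing import Dict, List, Optional

# Assumed vGPUs per physical GPU for each K2 profile.
CAPACITY = {"K220Q": 8, "K240Q": 4, "K260Q": 2, "K280Q": 1}


def new_host(physical_gpus: int) -> List[Dict]:
    """A host is a list of physical GPUs, each with a profile type and VM count."""
    return [{"profile": None, "count": 0} for _ in range(physical_gpus)]


def place(gpus: List[Dict], profile: str) -> Optional[int]:
    """Depth-first: fill the fullest GPU already running this profile,
    otherwise claim an empty GPU. Returns the GPU index, or None if no slot."""
    best = None
    for i, gpu in enumerate(gpus):
        if gpu["profile"] == profile and gpu["count"] < CAPACITY[profile]:
            if best is None or gpu["count"] > gpus[best]["count"]:
                best = i
    if best is None:
        for i, gpu in enumerate(gpus):
            if gpu["profile"] is None:
                gpu["profile"] = profile
                best = i
                break
    if best is None:
        return None
    gpus[best]["count"] += 1
    return best


if __name__ == "__main__":
    host = new_host(4)  # twin K2 cards = 4 physical GPUs
    for _ in range(8):
        place(host, "K220Q")
    for _ in range(12):
        place(host, "K240Q")
    # Every GPU is now committed to K220Q or K240Q, so a K260Q VM has nowhere to start.
    print(place(host, "K260Q"))
```

Under those assumptions, the 8 x K220 plus 12 x K240 layout described earlier in the thread already commits all four physical GPUs, so a K260 VM could not power on until one GPU is fully emptied.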

Something to try, create a load of dummy VMs without GPUs assigned, setup a temporary Catalog and Delivery Group and test the Power On setting. As they have no vGPUs assigned, XenServer should load balance them across the entire Pool. If this is successful, then you know it’s not a Power On issue and can look at other potential causes.

Right, I believe that covers off the top section and should hopefully give you some ideas to investigate.

So, your users don’t appear to be experiencing any massive performance issues, or you haven’t mentioned any, just a poor interactive experience due to latency, and also K260 profile becoming unavailable with the most recent GRID driver update (the K260 profile I believe we’ve dealt with above, and you can either roll back or try a clean build VM to validate the issue, then post back confirming results).

Just going through your details as listed above:

Server Spec:

Looks ok, although I’d need to understand your PVS architecture to know if your local storage is a bottleneck. It’s unusual to not see any Flash based technology.

Workstation Setup:

Those look fine apart from the Network speed which we’ve already covered.
Are you running the latest Citrix Receiver and have you manually enabled Hardware Decode so it uses the GPU not CPU? (You do need to manually enable it, as it is off by default)
If you have any that you can’t enable Hardware Decode on, you need a fast CPU, again, 3.0Ghz+ to handle the decode.

As your users are CAD users, I would highly recommend you evaluate some optimized peripherals to remove any local lag through non-optimized devices. Because of the way CAD users work, Mouse responsiveness and interaction is critical and the CAD users are particularly sensitive to latency, so we need to take every step to remove as much as possible. I have personally used both of these and although I do not use CAD, I can validate how good they are in terms of precision and responsiveness:

CAD Mouse: 3Dconnexion UK - SpaceMouse, CadMouse, Drivers
3D Space Mouse: 3Dconnexion UK - SpaceMouse, CadMouse, Drivers

Do not try to use the SpaceMouse Pro or Enterprise (which is why I haven’t linked them). Although the Mouse will function in terms of movement, the keys won’t work properly. This is due to a difference in the way that 3DConnexion create their USB and the way Citrix maps it. There is a much more technical answer to that, but I’d need to speak to my contacts in Citrix to get it.

The 3D SpaceMouse will require a driver to be installed in the Master Image and you’ll have to open up USB Passthrough on your Citrix Policy. The standard Mouse will work without issue.

These are high precision professional devices, and the difference between them and a generic mouse is like night and day. The SpaceMouse will also give you 6 degrees of movement, which your CAD users may appreciate if they do not already have it.

VM Setup:

The only things to be wary of are CPU over-commit and Clock Speed. Remember, you’re doing Workstation replacement, not VDI, and they are not the same. They are spec’d and designed differently. However, you’re not reporting any outright performance issues, so this looks ok. Be aware though that resource contention can cause what users perceive as latency, so we may need to come back to this at some point.
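
To put a number on over-commit, here is a minimal Python sketch. The host figures in it are made up for illustration (I don’t know your actual core counts), and what ratio is acceptable for workstation replacement is something you would have to establish by testing.

```python
def overcommit_ratio(vcpus_per_vm, vm_count, physical_cores):
    """Return allocated vCPUs divided by physical cores on one host."""
    if physical_cores <= 0:
        raise ValueError("physical_cores must be positive")
    return (vcpus_per_vm * vm_count) / physical_cores


# Hypothetical host: 8 VMs with 4 vCPUs each on 2 x 12 physical cores.
ratio = overcommit_ratio(vcpus_per_vm=4, vm_count=8, physical_cores=24)
print(f"vCPU:pCore ratio = {ratio:.2f}:1")
```

Anything above 1:1 means VMs can end up waiting for a core, which is the kind of contention users tend to report as lag.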

Citrix Versions

7.11 is great.
Netscalers - Are they physical or VPXs? (what model / throughput license are they?)
Have you carried out any tuning on them?
Are you using Insight to track network / session latency?

Citrix Policy

It’s always difficult recommending Citrix Policies as no 2 environments are the same and they all have their own characteristics. Any that I recommend for you may well suck when you test them as I have no experience of your environment. This ideally needs to be done on site, but you have 5 Citrix Certified guys so they should know what they’re doing. Also, if you’ve been through this with Magnar (Johnsen?) then I’m sure you have the best Policy for your environment, as he is very good.

Here’s a Citrix Policy that I used to push a Windows 7 XenDesktop across from one country to another. Spec of the VM was 8x 3.4GHz CPU (base Clock), 16GB RAM and 2GB vGPU from an M60, and it had AutoCAD and Inventor 2017 (fully patched). This was delivered through a pair of Netscaler SDX out to the internet. The customer was using a small (I think it was a) HP tower, with NVIDIA 2GB GPU, 16GB RAM and a 3.4GHz CPU, and had 2x 1080P monitors. He also used the peripherals I recommended earlier. The idea was that we try to replace his CAD workstation with a VM. As said, this was over the internet to a different country, and although our platform has physical Netscalers and a very large connection, it still breaks out onto the internet, where we have no control. The Policy used was as follows:

Visual Quality – Build to Lossless
Allow Visually Lossless Compression - Enabled
Use Hardware Encoding for Video Codec - Enabled
Use Video Codec for Compression - Use When Preferred
Target Frame Rate – 60fps
Client USB Device Redirection – Allowed
Client USB Plug and Play device redirection - Allowed
View Window Content While Dragging – Disabled
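
If you do trial these, it helps to keep a record of exactly what was set for each test run. A minimal Python sketch of that idea follows: the setting names and values are the ones listed above, but the dict is only a note-keeping aid (it is not a format Citrix Studio can import), and the "trial" change is a hypothetical example.

```python
# Settings copied from the list above; this is only a record for change
# tracking, not something Citrix Studio can import.
wan_cad_policy = {
    "Visual Quality": "Build to Lossless",
    "Allow Visually Lossless Compression": "Enabled",
    "Use Hardware Encoding for Video Codec": "Enabled",
    "Use Video Codec for Compression": "Use When Preferred",
    "Target Frame Rate": "60fps",
    "Client USB Device Redirection": "Allowed",
    "Client USB Plug and Play device redirection": "Allowed",
    "View Window Content While Dragging": "Disabled",
}


def diff_policies(old, new):
    """Return {setting: (old_value, new_value)} for every setting that differs."""
    changed = {}
    for name in sorted(set(old) | set(new)):
        if old.get(name) != new.get(name):
            changed[name] = (old.get(name), new.get(name))
    return changed


# Hypothetical trial: drop Visual Quality back from Build to Lossless to High.
trial = dict(wan_cad_policy, **{"Visual Quality": "High"})
print(diff_policies(wan_cad_policy, trial))
```

Changing one setting per test and logging the diff makes it much easier to tell which change actually helped.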

I don’t like posting Citrix Policies, because everyone thinks they should work for their scenario and starts over-analyzing why certain settings have been used or not used. As mentioned above, they typically require tuning for individual circumstances (which is why Citrix only list a couple of them as templates), and there is so much misunderstanding about how and when to use them.

Windows 10 only requires a couple of policies as it works in a different way to Windows 7, hence I would not apply exactly the same to Windows 10.

When tested, the visual experience was identical to the workstation sitting under the desk, the main difference, was that our VMs and data run on an All Flash SAN, so the data load times were just incomparable to what they were using. Needless to say, it was a far superior experience. Just to add, I would not deploy that configuration in a production environment, I just wanted to show the customer what the platform and technology was capable of. As for Latency, there’s no getting around the distance, it was there, but it was such a tiny difference, that they had absolutely no issues with it.

Anyway, that policy is there for you to try if you would like to. Moving on! …

Reg Hacks and Stuff

Mouse Setting I set as 1. Don’t care about the additional bandwidth. The response time is worth the overhead.
The other stuff is fine and I’m sure the developers are working to fix those bugs.

AutoCAD 2016 / 2017

Make sure you have all Service Packs and Updates applied to these. These updates for AutoCAD make a big difference to Mouse performance!

AutoCAD Experience

As mentioned, it’s worth reading what enabling hardware acceleration actually gives you. A lot of these programs still rely heavily on fast CPU, which is why when you look at the system requirements, they don’t make too much of a fuss about GPU, but do ask for a CPU with a high Clock.

CPU Notes

TurboBoost is a complicated topic, and you need to look at "Maximum Boost" and "All Core Boost" to understand what you are going to be getting, as they give considerably different results. Your CPU’s base is 2.3GHz and its Maximum Boost is 3.3GHz; however, its All Core Boost is only 2.8GHz.

There are specific conditions for each mode to kick in, so you get variable performance. Personally, I never rely on TurboBoost and always spec the CPUs according to their base frequency, not their Boost. This gives me a known high level of performance out the box, anything in addition to that is a bonus and this means that everything within that Operating System I use runs at a fast base rate.
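
To show how different those two boost figures are relative to base, here is a quick Python calculation using the three clocks quoted above. It is only arithmetic on the spec-sheet numbers; what the CPUs actually reach under load depends on the conditions mentioned.

```python
def boost_headroom(base_ghz, boost_ghz):
    """Return how far a boost clock sits above base, as a percentage."""
    return (boost_ghz - base_ghz) / base_ghz * 100.0


base, all_core, max_boost = 2.3, 2.8, 3.3  # GHz, figures quoted above
print(f"All Core Boost: +{boost_headroom(base, all_core):.1f}% over base")
print(f"Maximum Boost:  +{boost_headroom(base, max_boost):.1f}% over base")
```

The all-core figure gives roughly half the headroom of the headline Maximum Boost, which is why sizing on base frequency is the safer assumption.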

Make sure the server hardware can give you the experience your users require before you start adding protocols to it. I take it you’ve been through the BIOS and changed everything from Balanced / Economy to Maximum Performance? You will probably need to modify the cooling policy as well to Maximum Performance (TurboBoost has thermal thresholds; the more cooling you can give the server / CPUs, the bigger the TurboBoost thermal window). You mentioned that the CPUs TurboBoost to 3.3GHz; however, have you actually monitored it to see if it does boost that high, or at all?

As for tuning XenServer, with 6.5, you had to tune various factors as it wasn’t set for “Performance Mode” out the box. I believe with XenServer 7, it should now be set for Performance by default, however it’s still worth checking to familiarise yourself:

http://xenserver.org/partners/developing-products-for-xenserver/19-dev-help/138-xs-dev-perf-turbo.html

The Windows 10 Issue

XenTools, I’m unsure why you’re trying to remove anything from them unless it’s causing you issues, in which case, raise it with Citrix Support directly. Personally, I’ve never removed anything XenTools has installed, I just let it do its thing and don’t have any issues with it. I’m sure you’re aware of what happens when you update network drivers on a PVS disk … This doesn’t happen with MCS, which is one reason I prefer it. There are other options to do it, but MCS is far easier to push out updates, and you don’t need any additional infrastructure, resource or Operating System licenses to support it.

Have you been through the Windows Device Manager, enabled “Show Hidden Devices” and removed all the Ghost adaptors? Do this after the GPU drivers and VDA have been installed, not before.

MCS vs PVS

What’s the PVS Spec? (CPU / RAM / Network)
Do you run PVS Virtually or Physically?
Where are your vDisks stored?
Are you using any RAM Caching / Any IO Acceleration?

General

I mentioned (just above) about checking performance before you start adding protocols, have you tried accessing the VMs outside of Citrix? This removes the Netscaler, Storefront, the ICA / HDX Protocols and Citrix Policies at the same time. Try connecting with another Protocol and see what kind of results you get. Make sure it’s a fair test, if you have capacity, use a host that has no other workloads or users on it, try to add some consistency to the testing. When you’ve done that, then connect in your normal way and compare the differences. We’re trying to see where the latency is coming from and if the Protocol or access methods are causing it.
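
As a rough way to put numbers on "where is the latency coming from", here is a minimal Python sketch that times TCP handshakes to a given host and port, so you can compare the direct path to a VM against the path through the gateway. It assumes the usual default ports (3389 for RDP, 1494 for ICA, 443 for the gateway), and any host names you pass in are yours to supply. A handshake time only reflects the network round trip, not the interactive feel of the protocol, so treat it as a first filter rather than a verdict.

```python
import socket
import statistics
import time


def tcp_connect_ms(host, port, timeout=3.0):
    """Time one TCP handshake to host:port in milliseconds."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass
    return (time.perf_counter() - start) * 1000.0


def sample_path(host, port, count=20, pause=0.2):
    """Collect several handshake timings for one access path."""
    samples = []
    for _ in range(count):
        samples.append(tcp_connect_ms(host, port))
        time.sleep(pause)
    return samples


def summarise(samples):
    """Return (median, worst) in milliseconds for a list of timings."""
    return statistics.median(samples), max(samples)
```

For example, summarise(sample_path("cad-vm-01.example.local", 3389)) against summarise(sample_path("gateway.example.com", 443)), where both names are placeholders for your own VM and gateway.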

If there is a difference, then remove the Netscalers and connect to Storefront directly, see how you get on.

Right, I need another Coffee!

Regards

Ben

Virge · November 27, 2016, 5:25am

Just a champion effort mate and it is much appreciated.

I have spent the last two days building a clean Win7 master image. I at least have all the cool tools from Bram and Magnar now working.

2D seems to be better over the 40M/s link i have, 3D is "OK" but i shall ramp up a few more GPU and CPU’s and see what happens.

Thanks for all of the above. I shall spend the week putting what you have written above into practice and come back with a report next weekend.

Again… MASSIVE thanks

BJones · November 27, 2016, 10:27am

No problem at all, hopefully something in my ramblings will help.

This virtualising as much as possible lark is supposed to make our lives easier, be less stressful and free us up to do other more important things (like drink more Tea & Coffee!). But when it goes wrong, it can be difficult to troubleshoot, as there are now so many components that all depend heavily on each other to deliver the expected performance and experience, and if one component has an issue, the symptoms can manifest themselves in strange ways.

Latency can be a difficult thing to resolve, so when encountered, I typically start off by bypassing large chunks of the system to find out where to look, then narrow it down, which is why I suggested removing the Protocol and access method at the same time, it’s also very easy and non-destructive, but the results can be quite surprising as to what the issue is.

Just on your Master Base Image build, are you using Citrix AppDisks? …

Don’t forget to evaluate those high-end peripherals. I’m not into pushing products on people, but they’re beautifully made, really nice to use and give really nice interaction with the 3D models. Your users may really appreciate them, see if you can get a couple on demo :-)

Any issues, get back to us on here and we’ll work through them together.

(Look at that, I managed to not write another book! …)

Regards

Ben

RachelBerry · November 28, 2016, 11:18am

For policies, Citrix advise users to start with the Very High User Experience template; see "Group Policy management template updates for XenApp and XenDesktop" (ctrl-f on "gpu") and you’ll find the recommendation that if you have a GPU you probably want to raise frame rate above 24/30 to 60fps as a tweak on top of the template.

For WAN / low bandwidth you might want to explore other templates https://www.citrix.com/blogs/2015/10/28/simplify-hdx-policy-administration-to-amplify-the-user-experience/

Virge · December 4, 2016, 11:52pm

Post Mortem from the rebuild.

What we did
• Rebuilt the master image with latest guest tools and all the Autodesk software
• Removed all the Windows 10 Images, Rebuild was on Windows 7 only
• NVidia Driver 369.71 (November 2016)
• K2 cards broken up to K240 profiles
• Citrix on 7.11 for everything, XEN Server 7.0 patched to patch #17
• Still working across LAN.
• NVENC is enabled
• Thinwire plus H.264 Codec – policies as listed above

What’s better
• Lag is much better with the new image, not sure why, but newer image is better. (newer XENTools/GuestTools version ???)
• Boot times are better, PVS is using less network traffic after start-up
• Less Graphics compression even though we use the same profile…
• ALL VERSIONS of Autodesk are latest patches, including Hotfix 4, the Citrix patch

The bad stuff
• Lossless is pretty much unusable. The lag with Lossless Codec is way worse on the new image.
• K220 and K260 install won’t use the graphics card in AutoCAD, even after reboot. Each machine keeps trying to install its video card over and over and over again. We can sort of fudge this with disabling items but any clean image needs this all reset.
• Power management of Citrix is dead set witchcraft, we have some machines that boot continuously and others that only boot on demand, no matter what setting we use in both the shell and the Studio.
• Every second or third VM we build has a PVDisk error, “Error #6: The personal vDisk could not find a disk attached to the system for storing data”. We just delete these and create more and we use the ones that work …
• If we turn video compression on with policy, i.e. "Use When Preferred", the compression is overdone. It gets compressed to something that is unusable. But it works fine with little to no lag on the “HIGH” setting.

Conclusion.
We are better off with the new image. My guess is that the new XENTools gave the machines a bit more bandwidth, which eased the lag issue. Removing all ghost devices manually would also have helped. We have noticed the retries are down on the PVS boxes, which also points to the XENTools helping out.

Not quite sure why compression seems to have gone crazy, I will play a bit more with this over the next week and through Xmas.

Thanks for all your help so far.

BJones · December 5, 2016, 2:17pm

Hi Virge

Great job, sounds like you’re on the road to recovery!

Some things to have a look at …

Lossless Compression

When you’re trying this, make sure you enable "Allow Visually Lossless Compression" as well. The best way I can describe it when this is not enabled and you’re using Lossless, is that the image you’re trying to move feels really heavy, which makes it unusable, when you’re obviously looking for a nice light feeling.

To see if the lossless setting actually makes a visual difference, purely in terms of colour and clarity, on the task bar by the clock, there’s a Lossless Switch that the users can enable / disable on the fly by right clicking it. Enable this during a session and see if your users can notice any differences. If they can’t, then you know this isn’t something worth pursuing. When I enabled it for the above use case, the line colour and clarity in AutoCAD was identical to a native workstation, whereas without, the colour was slightly off with magenta lines, and the lines were slightly hazy and not crisp and sharp like they should be. But as said, each environment is different, and you may not need this setting if your users are happy.

vGPU Profiles

With the K220, K240 and K260 can you please clarify something for me. How are you managing the differing vGPU profiles? Are you building the Master Image with the K240, then changing the vGPU profile after you have created the Master Image to give you the various VM specs?

Power-On policy

Were you able to try creating some test VMs without GPUs assigned and start them? If yes, what were the results?

Are there only certain hosts that allow VMs to start? Is there anything in the Logs that indicate an attempt to start them or a failed attempt?

PvDisks

I’m unsure and would need to investigate. Are you using the latest version of PVS?

Compression

Sometimes with the Citrix Policies, it can be worth removing them (disable, not delete) and starting again, otherwise you can get drawn in to applying too many settings and wonder why they’re not working (it’s easily done).

As I said further up, Citrix prefers a stable network, so it may be worth investigating to make sure the network is stable and everything is as it should be, otherwise it is constantly trying to adapt and you’ll get varying levels of quality. With a stable network, you should be able to establish a solid baseline and then build on it.
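
One simple way to judge "stable" is to look at how much the round-trip time varies, not just its average. Here is a minimal Python sketch along those lines. The sample values are made up, and the jitter definition used (mean absolute difference between consecutive samples) is just one common choice.

```python
import statistics


def stability_report(rtt_ms):
    """Summarise round-trip samples: mean, standard deviation and jitter.

    Jitter here is the mean absolute difference between consecutive
    samples, one simple definition among several.
    """
    if len(rtt_ms) < 2:
        raise ValueError("need at least two samples")
    deltas = [abs(b - a) for a, b in zip(rtt_ms, rtt_ms[1:])]
    return {
        "mean": statistics.mean(rtt_ms),
        "stdev": statistics.stdev(rtt_ms),
        "jitter": statistics.mean(deltas),
    }


# Made-up samples: two links with the same average but very different spread.
steady = [12.0, 12.5, 11.8, 12.2, 12.1]
spiky = [5.0, 30.0, 6.0, 4.0, 15.6]
for name, samples in (("steady", steady), ("spiky", spiky)):
    print(name, stability_report(samples))
```

Both made-up links average the same, but the second one is the kind the protocol would be constantly adapting to.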

Are you able to confirm whether the Netscalers are MPX, SDX or VPX and whether you’ve tuned them or not? If they are VPX, can you please confirm the spec of them (CPU (Cores & Clock), RAM, Network and throughput license) and whether they’re on a shared (if so what is sharing the host?) or dedicated host.

Depending on how the Gateway was created (Manually or through the Wizard), you have the option to allow the configuration to apply optimizations to it, was this done or not?

Have you tried bypassing the Netscalers and going to Storefront directly? Also, try connecting with RDP to one of the XDs and compare the difference to ICA.

Obviously I don’t know what your in-house Netscaler skills are, but the default Netscaler TCP Profile should be replaced with a workload specific one. Here’s a couple of guides about tuning the TCP Profiles for XenApp / XenDesktop / ICA:

If you need more information about Netscaler tuning (which is an area that is over looked more often than not) then just google "XenDesktop Netscaler Tuning" (or similar) and review the results, there is definitely additional performance to be found by tuning them correctly. As mentioned at the start, it’s all part of the end-to-end system :-)

The reason I’m interested in the Netscaler so much, is that everything you have runs through it, so a misconfiguration or miscalculation of some sort could potentially cause issues, regardless of what you configure behind it.

There are still so many variables that I don’t know about your environment that can all contribute to the performance. But it at least sounds like you’re making progress to get to a level that you’re happier with.

Keep us posted!

Regards

Ben

Virge · December 15, 2016, 3:52am

More info to come your way shortly…

Quick Q on the Citrix Desktop Manager - have read a bit about it online and it looks like a nice update to AutoCAD

Thoughts ??

https://www.yorkshirecloud.co.uk/solved-poor-autocad-performance-in-citrix-xenapp/

BJones · December 15, 2016, 9:19am

I’ve never seen the issue they mention, and that article is 4 years old (a lifetime in terms of technology and updates), I’m not even sure it’s relevant now. If there was a wider issue with Receiver, I’m sure Citrix would have fixed it.

Which issues are you still experiencing?

RasmusRaunNielsen · January 4, 2017, 8:30pm

Hi everyone

Just curious: Have you tested the experience without piggy-backing network on the phone? In a former job we had a lot of issues with piggy-backed networking on phones acting… well, "sub-optimal" is the nicest word I can come up with.

PS: BJones - great work!!

Cheers!

Virge · January 23, 2017, 3:33am

Hi Rasmus,

Yes - we are trying that now. Also bumping a few up to direct Gigabit.

Quick update was that we moved a bunch of them to XenApp and the mouse was a bit better. But we now have endless crashing of almost all apps.

So guessing there is a setup issue here for temporary files.

Anyone got an idea how Autodesk and SolidWorks handle temp files? We had personal vDisks on XenDesktop, and we are only guessing, but seeing as all of them are crashing in a lot of apps, temp file handling is our main suspect.

Just struggling for information as we seem to be on the cutting edge of this kind of setup. I hear lots of people say that they can run this perfectly from 200 miles away and the end user doesn’t know.

We can’t run it locally without it either being laggy or crashing. So something has to be very wrong here.
