Firstly, apologies, it’s another essay! … Grab a coffee before you get started …
Thanks for the additional system information, much appreciated and it is all extremely relevant, right down to the user peripherals! It is all part of the system!
So, reading back through right from the top so we know where we are:
As said, Citrix Support will focus primarily on Citrix technologies. They may have specific knowledge about other technologies in your stack, but it’s not a certainty, and there are a lot of times when the customer is actually doing more advanced things than the vendor and the vendor simply doesn’t know the answer to an issue. Also, depending on who picks up the phone or answers your email or forum post, you will get a different response (you shouldn’t, but in the real world, you do). I’ve provided insight into other locations for support and information, so I hope you now have additional sources to help with any issues, and no, you’re not on your own when you implement this stuff ;-)
NVIDIA GRID Support
Kepler GPUs require you to go back to your place of purchase in the first instance for hardware support. That place of purchase may also be able to offer support on configurations and usage, unless they’re just a hardware vendor. As with Citrix support above, you now have information on where you can get support and find additional references. If you require direct NVIDIA Support, you’re going to need those Maxwell (or newer) GPUs with SUMs.
GRID K2 not being powerful enough.
As a first generation technology, it has its scalability and performance limits. NVIDIA have seen the limitations of Kepler, listened to their customers and done an absolutely cracking job with the second generation Maxwell architecture in increasing those limits and adding features and functionality. Wait till you see what Gen3 (Pascal) can do! If you need more GPU power but need to keep the density, you’ll need to upgrade to M60s, and yes, they do work in an R730. Or you can hold out for the P60s when they eventually get announced (we all know they’re coming, but I’ve no idea when) …
GRID Software being buggy
I’m not sure on this one. I think you need to do some internal testing to make sure the issue is repeatable on a clean build W10 just to make sure it isn’t an image issue. I’m not saying this is an isolated issue, but I haven’t heard of it before, maybe others who are reading this thread have done, in which case, please let the community know so NVIDIA can investigate. That said, the drivers were only released a week or so ago, maybe more cases will appear. However, if you have now lost vGPU profiles that are required, I suggest like any other update that has been unsuccessful, you roll it back to the previous PVS image until you have isolated the cause.
Managing XenServer updates
These should be better with the next Ely release; that fix is on its way. Until then, unless the release notes specifically say an update will fix your issue, or it is a functionality or security patch you need, there’s no immediate rush to install them. Likewise with the NVIDIA drivers: unless they give you stability, required performance, or a bug or security fix, there’s no rush to install them.
The 3D Apps that need 2GB profiles
When you have your K260 profile back, this will be resolved. If you have any additional frame-buffer requirements, you will need to scale up (M60) or scale out by purchasing additional R730s of an equal spec to what you have. That said, the K2 may not be available for much longer.
Management of your platform.
You only have 1 Master vDisk to maintain and update. Hopefully you should have 3 vDisks for this purpose; Past, Present, Future (Think of it as a GFS disk rotation). This gives you an easy roll back and a granular way of introducing updates into the platform. Because of the way in which PVS and MCS work, there is no reason to hit all users with the same update at the same time, in fact this is something I strongly discourage for obvious reasons. Using GRID in a platform adds another level of complexity to the update process, as the GRID drivers in the XenServer (or ESXi) Host, and the GRID drivers in your vDisk must match, so must be updated at the same time.
You could do this in a couple of ways. You can do everything at once and hope for the best, or you can introduce the updates in a granular way and assess differences between the updated image and previous image. There are different ways in which you can control VM startup location. You can either limit the vGPU profiles on a XenServer Host in XenCenter, or you could run multiple XenServer Pools with XenDesktop Catalogs assigned to each. Both XenServer Pools would be identical in terms of capacity, performance and vGPU configuration, but Pool 1 would have its Hosts updated first, with the updated vDisk being assigned to those VMs’ XenDesktop Catalog, followed by Pool 2 and the second Catalog once testing of the changes has been successful. Something along those lines, lots of options to play with.
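To make the two-Pool idea a bit more concrete, here is a minimal sketch of the decision logic only. Everything in it is hypothetical: the Pool class, the "old" / "new" version labels and the signed_off set are placeholders I've made up for illustration, and it doesn't talk to XenServer, PVS or XenDesktop at all. It just encodes two rules from above: host and vDisk GRID drivers must match, and Pool 2 isn't touched until Pool 1 has been tested.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    name: str
    host_grid_driver: str   # GRID driver release on the XenServer hosts (placeholder label)
    vdisk_grid_driver: str  # GRID driver release inside the vDisk assigned to the Catalog


def drivers_match(pool: Pool) -> bool:
    """Host and guest GRID drivers have to come from the same release."""
    return pool.host_grid_driver == pool.vdisk_grid_driver


def next_pool_to_update(pools, target, signed_off):
    """Return the next Pool to update, or None if we should wait or are finished."""
    for pool in pools:
        if not drivers_match(pool):
            raise ValueError(f"{pool.name}: host and vDisk GRID drivers differ")
        if pool.host_grid_driver != target:
            return pool   # first Pool still on the previous release
        if pool.name not in signed_off:
            return None   # updated but not yet tested, so hold here
    return None           # every Pool is updated and signed off


if __name__ == "__main__":
    pools = [Pool("Pool 1", "old", "old"), Pool("Pool 2", "old", "old")]
    print(next_pool_to_update(pools, "new", signed_off=set()).name)
```

With both Pools on the previous release this picks Pool 1. Once Pool 1 is updated it returns None until you add "Pool 1" to signed_off, at which point it moves on to Pool 2.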
Network traffic and bandwidth
Covered off above and, as Rachel suggests, is it the VM that is creating the bandwidth or the underlying endpoint doing some sort of update? If it’s the VM, Citrix session policies may be able to help; if it’s the endpoint device, then you’ll need to investigate and take appropriate action.
As mentioned above, check Delivery Group power settings to make sure they are correct. Also, make sure the hosts have the correct vGPU profiles assigned as GRID uses a “Depth First” approach for VM placement, meaning that you could run out of appropriate locations to start VMs with differing vGPU profiles.
Something to try, create a load of dummy VMs without GPUs assigned, setup a temporary Catalog and Delivery Group and test the Power On setting. As they have no vGPUs assigned, XenServer should load balance them across the entire Pool. If this is successful, then you know it’s not a Power On issue and can look at other potential causes.
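To illustrate how mixed vGPU profiles can leave a VM with nowhere to start, here is a very simplified toy model of placement. It is not how XenServer actually implements it. The assumptions are mine: a physical GPU only runs one profile type at a time, placement prefers a GPU already running the requested profile before taking an empty one, and the per-GPU counts are what I'd expect for a K2 (please check them against the GRID documentation).

```python
# vGPUs of each profile that fit on one physical GPU of a GRID K2.
# Illustrative values, verify against the GRID documentation for your release.
PER_GPU = {"K220Q": 8, "K240Q": 4, "K260Q": 2}


def place_depth_first(gpus, profile):
    """Place one vGPU and return the physical GPU index used, or None.

    gpus is a list of [running_profile_or_None, vm_count], one entry per
    physical GPU. In this model a GPU only runs one profile type at a time.
    """
    # Prefer a GPU already running this profile that still has room.
    for i, (running, count) in enumerate(gpus):
        if running == profile and count < PER_GPU[profile]:
            gpus[i][1] += 1
            return i
    # Otherwise take the first empty GPU and lock it to this profile.
    for i, (running, _count) in enumerate(gpus):
        if running is None:
            gpus[i] = [profile, 1]
            return i
    return None  # no suitable location, so the VM fails to start


def simulate(num_gpus, requests):
    """Start VMs in order and report where each one landed."""
    gpus = [[None, 0] for _ in range(num_gpus)]
    return [place_depth_first(gpus, profile) for profile in requests]


if __name__ == "__main__":
    # One K2 card = 2 physical GPUs. Two small VMs lock both GPUs to their
    # profiles, so the K260Q VM has nowhere to go despite plenty of free slots.
    print(simulate(2, ["K240Q", "K220Q", "K260Q"]))
```

In that example the third entry comes back as None: both physical GPUs are already locked to other profiles, which is the sort of situation that restricting profiles per Host in XenCenter is meant to avoid.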
Right, I believe that covers off the top section and should hopefully give you some ideas to investigate.
So, your users don’t appear to be experiencing any massive performance issues, or you haven’t mentioned any, just a poor interactive experience due to latency, and also K260 profile becoming unavailable with the most recent GRID driver update (the K260 profile I believe we’ve dealt with above, and you can either roll back or try a clean build VM to validate the issue, then post back confirming results).
Just going through your details as listed above:
- Looks ok, although I’d need to understand your PVS architecture to know if your local storage is a bottleneck. It’s unusual to not see any Flash based technology.
- Those look fine apart from the Network speed which we’ve already covered.
- Are you running the latest Citrix Receiver and have you manually enabled Hardware Decode so it uses the GPU not CPU? (You do need to manually enable it, as it is off by default)
- If you have any that you can’t enable Hardware Decode on, you need a fast CPU, again, 3.0Ghz+ to handle the decode.
As your users are CAD users, I would highly recommend you evaluate some optimized peripherals to remove any local lag through non-optimized devices. Because of the way CAD users work, Mouse responsiveness and interaction is critical and the CAD users are particularly sensitive to latency, so we need to take every step to remove as much as possible. I have personally used both of these and although I do not use CAD, I can validate how good they are in terms of precision and responsiveness:
CAD Mouse: http://www.3dconnexion.co.uk/products/cadmouse.html
3D Space Mouse: http://www.3dconnexion.co.uk/products/spacemouse/spacenavigator.html
Do not try to use the SpaceMouse Pro or Enterprise (which is why I haven’t linked them). Although the Mouse will function in terms of movement, the keys won’t work properly. This is due to a difference in the way that 3DConnexion create their USB and the way Citrix maps it. There is a much more technical answer to that, but I’d need to speak to my contacts in Citrix to get it.
The 3D SpaceMouse will require a driver to be installed in the Master Image and you’ll have to open up USB Passthrough on your Citrix Policy. The standard Mouse will work without issue.
These are high precision professional devices, and the difference between them and a generic mouse is like night and day. The SpaceMouse will also give you 6 degrees of movement, which your CAD users may appreciate if they do not already have it.
The only thing to be wary of is that CPU over-commit and Clock Speed. Remember, you’re doing Workstation replacement, not VDI, and they are not the same. They are spec’d and designed for differently. However, you’re not reporting any outright performance issues, so this looks ok. Be aware though that resource contention can cause what users perceive as latency, so we may need to come back to this at some point.
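If you want a quick number to keep an eye on for the over-commit, it's just allocated vCPUs divided by physical cores. A rough sketch follows; the VM counts and core counts in the example are made up, so substitute your own, and note that I'm counting physical cores rather than hyperthreads, which is my assumption and not a rule.

```python
def cpu_overcommit(vm_count, vcpus_per_vm, physical_cores):
    """Ratio of vCPUs handed out to VMs versus physical cores in the host."""
    if physical_cores <= 0:
        raise ValueError("physical_cores must be positive")
    return (vm_count * vcpus_per_vm) / physical_cores


if __name__ == "__main__":
    # Hypothetical host: 16 VMs with 4 vCPUs each on 32 physical cores.
    print(cpu_overcommit(vm_count=16, vcpus_per_vm=4, physical_cores=32))
```

A ratio of 1.0 means no over-commit at all. The higher it climbs, the more likely it is that contention shows up as what users perceive as latency.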
- 7.11 is great.
- Netscalers - Are they physical or VPXs? (what model / throughput license are they?)
- Have you carried out any tuning on them?
- Are you using Insight to track network / session latency?
It’s always difficult recommending Citrix Policies as no 2 environments are the same and they all have their own characteristics. Any that I recommend for you may well suck when you test them as I have no experience of your environment. This ideally needs to be done on site, but you have 5 Citrix Certified guys so they should know what they’re doing. Also, if you’ve been through this with Magnar (Johnsen?) then I’m sure you have the best Policy for your environment, as he is very good.
Here’s a Citrix Policy that I used to push a Windows 7 XenDesktop across from one country to another. Spec of the VM was 8x 3.4Ghz CPU (base Clock), 16GB RAM and 2GB vGPU from an M60 and it had AutoCAD and Inventor 2017 (Fully patched) and this was delivered through a pair of Netscaler SDX out to the internet. The customer was using a small (I think it was a) HP tower, with NVIDIA 2GB GPU, 16GB RAM and a 3.4Ghz CPU and had 2x 1080P monitors. He also used the peripherals I recommended earlier. The idea was that we try to replace his CAD workstation with a VM. As said, this was over the internet to a different country, and although our platform has physical Netscalers and a very large connection, it still breaks out onto the internet, where we have no control. The Policy used was as follows:
Visual Quality - Build to Lossless
Allow Visually Lossless Compression - Enabled
Use Hardware Encoding for Video Codec - Enabled
Use Video Codec for Compression - Use When Preferred
Target Frame Rate - 60fps
Client USB Device Redirection - Allowed
Client USB Plug and Play Device Redirection - Allowed
View Window Content While Dragging - Disabled
I don’t like posting Citrix Policies, because everyone thinks they should work for their scenario and they start over-analyzing why certain settings have been used or not used. As mentioned above, they typically require tuning for individual circumstances (which is why Citrix only list a couple of them as templates), and there is so much misunderstanding about how and when to use them.
Windows 10 only requires a couple of policies as it works in a different way to Windows 7, hence I would not apply exactly the same to Windows 10.
When tested, the visual experience was identical to the workstation sitting under the desk. The main difference was that our VMs and data run on an All Flash SAN, so the data load times were just incomparable to what they were using. Needless to say, it was a far superior experience. Just to add, I would not deploy that configuration in a production environment, I just wanted to show the customer what the platform and technology was capable of. As for Latency, there’s no getting around the distance. It was there, but it was such a tiny difference that they had absolutely no issues with it.
Anyway, that policy is there for you to try if you would like to. Moving on! …
Reg Hacks and Stuff
Mouse Setting I set as 1. Don’t care about the additional bandwidth. The response time is worth the overhead.
The other stuff is fine and I’m sure the developers are working to fix those bugs.
AutoCAD 2016 / 2017
Make sure you have all Service Packs and Updates applied to these. These updates for AutoCAD make a big difference to Mouse performance!
As mentioned, it’s worth reading what enabling hardware acceleration actually gives you. A lot of these programs still rely heavily on fast CPU, which is why when you look at the system requirements, they don’t make too much of a fuss about GPU, but do ask for a CPU with a high Clock.
TurboBoost is a complicated topic, and you need to look at “Maximum Boost” and “All Core Boost” to understand what you are going to be getting, as they give considerably different results. Your CPU’s base is 2.3Ghz and its Maximum Boost is 3.3Ghz; however, its All Core Boost is only 2.8Ghz.
There are specific conditions for each mode to kick in, so you get variable performance. Personally, I never rely on TurboBoost and always spec the CPUs according to their base frequency, not their Boost. This gives me a known high level of performance out the box, anything in addition to that is a bonus and this means that everything within that Operating System I use runs at a fast base rate.
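To put the figures above side by side, here is a trivial bit of arithmetic showing how much uplift each boost mode gives over base. The clock values are the ones quoted above; the function name is just mine.

```python
BASE_GHZ = 2.3            # base clock quoted above
MAX_BOOST_GHZ = 3.3       # Maximum Boost
ALL_CORE_BOOST_GHZ = 2.8  # All Core Boost


def uplift_percent(base_ghz, boosted_ghz):
    """Percentage gain of a boosted clock over the base clock."""
    return round((boosted_ghz - base_ghz) / base_ghz * 100, 1)


if __name__ == "__main__":
    print("Maximum Boost uplift :", uplift_percent(BASE_GHZ, MAX_BOOST_GHZ), "%")
    print("All Core Boost uplift:", uplift_percent(BASE_GHZ, ALL_CORE_BOOST_GHZ), "%")
```

That works out at roughly 43.5% for Maximum Boost against roughly 21.7% for All Core Boost, so on a busy host the realistic gain is about half of the headline figure, and only while the boost conditions are met.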
Make sure the server hardware can give you the experience your users require before you start adding protocols to it. I take it you’ve been through the BIOS and changed everything from Balanced / Economy to Maximum Performance? You will probably need to modify the cooling policy as well to Maximum Performance (TurboBoost has thermal thresholds, so the more cooling you can give the server / CPUs, the bigger the TurboBoost thermal window). You mentioned that the CPUs TurboBoost to 3.3Ghz; however, have you actually monitored them to see if they boost that high, or at all?
As for tuning XenServer, with 6.5, you had to tune various factors as it wasn’t set for “Performance Mode” out the box. I believe with XenServer 7, it should now be set for Performance by default, however it’s still worth checking to familiarise yourself:
The Windows 10 Issue
XenTools, I’m unsure why you’re trying to remove anything from them unless it’s causing you issues, in which case, raise it with Citrix Support directly. Personally, I’ve never removed anything XenTools has installed, I just let it do its thing and don’t have any issues with it. I’m sure you’re aware of what happens when you update network drivers on a PVS disk … This doesn’t happen with MCS, which is one reason I prefer it. There are other options to do it, but MCS is far easier to push out updates, and you don’t need any additional infrastructure, resource or Operating System licenses to support it.
Have you been through the Windows Device Manager, enabled “Show Hidden Devices” and removed all the Ghost adaptors? Do this after the GPU drivers and VDA have been installed, not before.
MCS vs PVS
- What’s the PVS Spec? (CPU / RAM / Network)
- Do you run PVS Virtually or Physically?
- Where are your vDisks stored?
- Are you using any RAM Caching / Any IO Acceleration?
I mentioned (just above) about checking performance before you start adding protocols, have you tried accessing the VMs outside of Citrix? This removes the Netscaler, Storefront, the ICA / HDX Protocols and Citrix Policies at the same time. Try connecting with another Protocol and see what kind of results you get. Make sure it’s a fair test, if you have capacity, use a host that has no other workloads or users on it, try to add some consistency to the testing. When you’ve done that, then connect in your normal way and compare the differences. We’re trying to see where the latency is coming from and if the Protocol or access methods are causing it.
If there is a difference, then remove the Netscalers and connect to Storefront directly, see how you get on.
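When you compare the different access paths, it helps to record a handful of measurements per path and compare medians rather than going on feel. A minimal sketch is below. The sample numbers are invented purely to show the shape of the comparison, and how you actually capture the latency (Insight, a stopwatch on mouse response, whatever you have) is up to you.

```python
import math
from statistics import median


def summarise(samples_ms):
    """Median and nearest-rank 95th percentile of latency samples in milliseconds."""
    ordered = sorted(samples_ms)
    p95_index = math.ceil(0.95 * len(ordered)) - 1
    return {"median": median(ordered), "p95": ordered[p95_index]}


def added_latency(baseline_ms, candidate_ms):
    """How much the candidate path adds to the baseline path, by median."""
    return summarise(candidate_ms)["median"] - summarise(baseline_ms)["median"]


if __name__ == "__main__":
    # Invented sample data, one list per access path, same host and workload.
    other_protocol = [10, 12, 11, 13, 12]
    storefront_direct = [14, 15, 16, 15, 17]
    via_netscaler = [30, 32, 31, 35, 33]

    print("HDX / Storefront adds:", added_latency(other_protocol, storefront_direct), "ms")
    print("Netscaler adds       :", added_latency(storefront_direct, via_netscaler), "ms")
```

Keeping each path as its own list makes it obvious which layer the latency appears at, which is the whole point of removing one component at a time.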
Right, I need another Coffee!