I am running a VMware 6.5 farm with about 120 GRID-accelerated VDI desktops:
6 HP WS460c Blades with 4 x Tesla M6 in a C7000 HP Blade Chassis
VMware 6.5, patched up to current
120 Windows 10 VDI using LTSB 2016 (1607)
Citrix XenDesktop 7.18
MCS image - Non Persistent VDI
Grid M6-1B Profiles on each VM
Our primary office site is connected to a co-located data center by redundant 1Gb links. We typically see in-session latency and round-trip time between 4 and 8 ms. Each VM has 4 vCPUs and 16GB of RAM, along with the GPU, and the disk runs on the VMware Paravirtual controller. Our Citrix policy is set up to use the video codec for actively changing regions at high quality. We are using NVENC for moving images within the session.
At random points throughout the day, users complain about their VDI freezing for less than a second and then resuming. It’s very obvious when they are running AV materials, because both audio and video pause and then resume as if nothing happened. Sometimes this freeze causes the audio to desync from an online video on sites such as YouTube or Vimeo. Typing is also delayed, and then a number of words suddenly run across the screen. All users see the issue, but it does not happen to everyone at the same time.
Storage is an all-flash array, and as far as I can tell average read and write latency is under 1ms. Network load seems low, and according to our monitoring tool, we aren’t seeing dropped packets or spikes in round-trip latency on our routers. CPU load on the VMs and on the ESXi host itself looks entirely within tolerable limits.
We have one additional host that is used primarily for testing and updating our images. We upgraded it to the latest GRID 7.2 drivers, as we are on 5.x in production right now, but this did not have a meaningful impact on the issue. The worst part about the problem is that I have yet to devise a method of reliably recreating it. I can hammer on a session with multiple videos and applications running simultaneously, and it won’t flinch. Yet other users complain that they have to watch their typing catch up with them multiple times per hour.
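Since I can't trigger the freeze on demand, one thing that might help is a small stall logger left running inside an affected session, so that user complaints can be lined up against timestamps. The following is only a rough sketch: the function names `find_stalls` and `watch`, the 50 ms tick, and the 200 ms threshold are all made up for illustration. It also assumes that a freeze of the guest shows up as a late wake-up from `time.sleep`, which may not hold if the hypervisor hides the pause from the guest clock.

```python
import time


def find_stalls(timestamps, interval, threshold):
    """Return (tick_time, extra_delay) for every gap between consecutive
    ticks that exceeds the expected interval by more than threshold."""
    stalls = []
    for prev, curr in zip(timestamps, timestamps[1:]):
        extra = (curr - prev) - interval
        if extra > threshold:
            stalls.append((prev, extra))
    return stalls


def watch(interval=0.05, threshold=0.2, duration=60.0):
    """Sleep in short ticks for `duration` seconds and print any tick that
    woke up much later than requested. All defaults are arbitrary guesses."""
    stalls = []
    prev = time.monotonic()
    end = prev + duration
    while prev < end:
        time.sleep(interval)
        now = time.monotonic()
        # Reuse the same gap check on the last two ticks.
        for _, extra in find_stalls([prev, now], interval, threshold):
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            print(f"{stamp} stall of about {extra * 1000:.0f} ms")
            stalls.append((stamp, extra))
        prev = now
    return stalls


# Example: log stalls for roughly one working day.
# watch(duration=8 * 3600)
```

If the logged timestamps on several VMs on the same host cluster together, that would point at the host or GPU rather than the individual guest; if nothing is logged while the user still sees a freeze, the stall is more likely in the display or network path than in the VM itself.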
When the issue occurs, the entire VM appears to freeze momentarily, and then resume. If anyone has any insight, I would greatly appreciate it.