vGPU optimization (NUMA, vCPU, etc.)

Hello all,

I received two requests on how to optimize GPU performance in a vGPU environment and thought this would make a great thread for our GRID forum, where everyone can share how they fine-tuned their own vGPU environments.

Let me start the thread with a couple of public sources:

  • The NVIDIA vGPU driver download includes the NVIDIA GRID vGPU User Guide, which has a section on “performance tuning”

  • During last week's NVIDIA GTC (March 2014), Andy Currid gave a great presentation titled “Delivering High-Performance Remote Graphics with NVIDIA GRID Virtual GPU”, which includes a section on tuning vGPU environments (the tuning section starts at minute 28:50, but I encourage you to watch the entire replay). The presentation covers tuning tips and tricks on platform basics, GPU selection and NUMA considerations.

    Session recordings of the Graphics Virtualization Summit at GTC 2014 can be found here:
    https://gridforums.nvidia.com/default/topic/11/announcements/session-recordings-graphics-virtualization-summit-at-gtc-2014/

That was my start to get this topic running. I am looking forward to additional public sources and lessons from the field on how to fine-tune vGPU implementations.

Thanks,
Erik Bohnhorst

GRID Solution Architect

Hello All,

Let me summarise the optimization pieces from the User Guide and presentation mentioned above, and add a few more.

  1. Disable console VGA
  2. Use "breadth-first" allocation (this is not the default value)
  3. Pin the vCPUs of the VM to the socket that is attached to the GPU that is used (NUMA) (see above presentation from Andy Currid)
  4. Use 4 vCPUs (this depends on the application, but remember that HDX 3D Pro uses almost an entire vCPU for encoding, and the OS and application need computing power as well)
  5. Make sure your Client OS sees all vCPUs in task manager (http://support.citrix.com/article/CTX126524)
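
Tips 2, 3 and 5 above can be sketched as XenServer `xe` command lines. This is only a minimal sketch: the parameter names (`allocation-algorithm`, `VCPUs-params:mask`, `platform:cores-per-socket`) are as I recall them from the XenServer and GRID documentation, so please verify them against your version. The UUIDs are placeholders, and the CPU-to-socket topology and the socket the GPU is attached to are hypothetical values that you would need to read from your own host. The script only builds and prints the commands; it does not run them.

```python
from typing import Dict, List


def cpus_for_socket(topology: Dict[int, int], socket: int) -> List[int]:
    """Return the sorted physical CPU ids that `topology` maps to `socket`."""
    return sorted(cpu for cpu, sock in topology.items() if sock == socket)


def breadth_first_cmd(gpu_group_uuid: str) -> List[str]:
    """Tip 2: switch a GPU group to breadth-first vGPU allocation."""
    return ["xe", "gpu-group-param-set", f"uuid={gpu_group_uuid}",
            "allocation-algorithm=breadth-first"]


def pin_vcpus_cmd(vm_uuid: str, cpus: List[int]) -> List[str]:
    """Tip 3: restrict a VM's vCPUs to the given physical CPUs."""
    mask = ",".join(str(cpu) for cpu in cpus)
    return ["xe", "vm-param-set", f"uuid={vm_uuid}",
            f"VCPUs-params:mask={mask}"]


def cores_per_socket_cmd(vm_uuid: str, cores: int) -> List[str]:
    """Tip 5: present the vCPUs as cores of one socket to the guest OS."""
    return ["xe", "vm-param-set", f"uuid={vm_uuid}",
            f"platform:cores-per-socket={cores}"]


if __name__ == "__main__":
    # Hypothetical host: CPUs 0-3 on socket 0, CPUs 4-7 on socket 1.
    topology = {cpu: cpu // 4 for cpu in range(8)}
    gpu_socket = 1  # assumption: the GPU in use is attached to socket 1
    for cmd in (
        breadth_first_cmd("<gpu-group-uuid>"),
        pin_vcpus_cmd("<vm-uuid>", cpus_for_socket(topology, gpu_socket)),
        cores_per_socket_cmd("<vm-uuid>", 4),
    ):
        print(" ".join(cmd))
```

Keeping the topology as an explicit input avoids assuming that physical CPU numbers are contiguous per socket, which is not guaranteed on every platform.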

Please feel free to add your own experiences …

Erik Bohnhorst | Solution Architect – GRID
NVIDIA Corporation

Hi Erik. I agree that Andy’s GTC presentation was excellent. Thanks for recapping the tuning information above.

I am currently rolling out a new infrastructure on XenServer with GRID K1 and K2 cards for XenDesktop 7.x. I would like to pin the VMs to the socket that the GPU is adjacent to. Is there a way to architect the solution so that HA and XenMotion can still function? I am much more concerned about HA than XenMotion.

Thanks in advance for any advice you can provide!

Richard

Hello Again Erik,

Andy made a good point in his GTC presentation about configuring the BIOS and/or hypervisor for high performance modes so that the CPU’s P-states stay elevated. Then there is no lag when high performance graphics apps in a VDI environment need the CPU. Andy did not mention C-states in his presentation and I have not yet found a GPU/VDI expert to make a specific recommendation on C-states. However, Andy and others have recommended enabling TurboBoost for some intensive 3D applications. Here’s an explanation of my question on C-states and TurboBoost:

TurboBoost reaches its highest frequencies only when some cores are inactive, so it is much less likely to reach them if all the cores are active. In other words, C-states should be enabled if you want TurboBoost to function at its highest levels. The higher the C-state number, the deeper the sleep level, and the longer it takes to return to an active state. I am looking for the sweet spot of C-states so that TurboBoost can reach high frequencies but the cores do not lag when being awakened.

Or is it advisable to simply disable C-states, then enable TurboBoost and let it work as much as it can with all cores active?
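
The trade-off described above can be illustrated with a toy model. This is only a sketch of the reasoning, not a description of how any real CPU behaves: the 3.0 GHz base clock comes from the specs below, but the per-core-count turbo headroom values are made-up placeholders, and the model simply assumes that a core which cannot enter a sleep state counts as active.

```python
BASE_GHZ = 3.0  # base clock taken from the CPU listed in the specs

# Hypothetical turbo headroom (GHz) by number of active cores.
# These are placeholder values for illustration, NOT real CPU specifications.
TURBO_HEADROOM_GHZ = {
    1: 0.6, 2: 0.6, 3: 0.5, 4: 0.5, 5: 0.4,
    6: 0.4, 7: 0.3, 8: 0.3, 9: 0.3, 10: 0.3,
}


def turbo_ceiling_ghz(busy_cores: int, cstates_enabled: bool) -> float:
    """Return the modelled maximum frequency for `busy_cores` busy cores."""
    total_cores = len(TURBO_HEADROOM_GHZ)
    if not 1 <= busy_cores <= total_cores:
        raise ValueError("busy_cores must be between 1 and the core count")
    # Model assumption: with C-states disabled, idle cores never sleep,
    # so every core counts as active for the turbo calculation.
    active = busy_cores if cstates_enabled else total_cores
    return round(BASE_GHZ + TURBO_HEADROOM_GHZ[active], 2)


if __name__ == "__main__":
    for busy in (2, 6, 10):
        print(busy,
              turbo_ceiling_ghz(busy, cstates_enabled=True),
              turbo_ceiling_ghz(busy, cstates_enabled=False))
```

In this model, C-states only help while some cores are idle; once every core is busy, the ceiling is the same either way. The model deliberately ignores wake-up latency, which is the other half of the question.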

The C-state options on the SuperMicro BIOS I am working with are C0, C2, C6, and “No Limit.” It also has a separate option for C1E Support.

Our environment specs are below:

SuperMicro X9DRG-HF+II
CPUs: Ivy Bridge 10C E5-2690V2 3.0G 25M 8GT/s QPI
XenServer 6.2 SP1
XenDesktop 7.x
GRID K1s and K2s in the environment
Autodesk apps: Revit, AutoCAD, Maya, Inventor, 3DsMax
VMs: 6 vCPU / 16GB RAM

Do you or Andy have a recommendation on C-state level?

Thanks!

Richard

Richard,
There is still no support for XenMotion for VMs associated with a GPU. Regarding preferred C-state settings and turbo mode, see these two very useful articles:
http://www.poppelgaard.com/citrix-3d-graphics-pack-vgpu and http://www.xenserver.org/partners/developing-products-for-xenserver/19-dev-help/138-xs-dev-perf-turbo.html
Regards,

Thanks for the quick response, Tobias.

Do you know if there are any limitations in HA if the VMs are pinned to a particular CPU? Can the VM fail over without issue to another XenServer host if the first goes offline?

I read through the two articles you posted again. I don’t believe they actually make a recommendation on C-states in relation to TurboBoost. There is certainly some in-depth information there, but my question about finding the sweet spot of C-states vs. TurboBoost is not yet answered.

Thanks again for any other help you can provide!

Richard

Hi, Richard:

Judging from various discussions around the next-generation XenServer (Creedence), I would say that CPU pinning perhaps does not yield as big a gain as it did before. For example, experiments have shown that optimal I/O could be achieved with all CPUs available as dom0 instances and no pinning. This is partly because drivers and their interactions with the kernel are handled somewhat differently.

I experimented with turbo mode earlier and found that on busy systems performance didn’t seem any better on average, as the systems were too loaded to sustain the maximum turbo state. I think your idea of disabling C-states altogether and testing your environment with turbo mode both enabled and disabled might yield quick stats that at least allow you to decide what’s best for your environment. We would all be interested in the outcome.
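
A minimal sketch of how such a test run could be organised: it lists every combination of a maximum C-state and turbo on/off, together with the `xenpm` commands that, as far as I know, select each one on a Xen host (`set-max-cstate`, `enable-turbo-mode`, `disable-turbo-mode`; please check `xenpm` help on your own version). The C-state values in the example are arbitrary, the BIOS settings still limit what the hypervisor can use, and the script only prints the commands rather than running them.

```python
from itertools import product
from typing import Dict, Iterable, List


def xenpm_commands(max_cstate: int, turbo: bool) -> List[List[str]]:
    """Build the xenpm command lines for one test configuration."""
    turbo_cmd = "enable-turbo-mode" if turbo else "disable-turbo-mode"
    return [
        ["xenpm", "set-max-cstate", str(max_cstate)],
        ["xenpm", turbo_cmd],
    ]


def build_test_matrix(max_cstates: Iterable[int]) -> List[Dict]:
    """Return one entry per (max C-state, turbo on/off) combination."""
    return [
        {"max_cstate": cstate, "turbo": turbo,
         "commands": xenpm_commands(cstate, turbo)}
        for cstate, turbo in product(max_cstates, (True, False))
    ]


if __name__ == "__main__":
    # Arbitrary example values; pick the C-state limits your platform offers.
    for config in build_test_matrix([0, 1, 2]):
        print(f"max C-state {config['max_cstate']}, turbo {config['turbo']}:")
        for cmd in config["commands"]:
            print("   ", " ".join(cmd))
```

Running the same application benchmark under each configuration and recording the results next to the matrix entry would give comparable numbers for the enabled and disabled cases.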