vGPU Utilization Per VM

I want to know if it is possible to see the vGPU utilization per VM. Not within the OS but from the Grid K1 card. For instance, if GPU 0 is at 80%, it would be great if I knew 45% of that number is coming from a specific VM. I looked through the forums but didn’t find any specific posts on this.

I tried the nvidia-smi CLI, for instance ‘nvidia-smi -q’, but while it showed detailed info on each physical GPU, including utilization, there was no per-VM utilization.
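For logging the per-pGPU figures over time, a minimal Python sketch along these lines may help. It assumes Python is available wherever nvidia-smi runs (otherwise run the command over SSH and feed the text to the parser) and uses nvidia-smi's --query-gpu CSV output. The function names and sample values are illustrative, and the result still only covers physical GPUs; it does not break utilization down per VM.

```python
import subprocess

# Standard --query-gpu properties; some may report "[Not Supported]" on a given card.
QUERY_FIELDS = "index,name,utilization.gpu,memory.used,memory.total"


def _to_int(field):
    """Return the field as an int, or None if nvidia-smi reported a non-numeric value."""
    try:
        return int(field)
    except ValueError:
        return None


def parse_gpu_csv(text):
    """Parse `nvidia-smi --query-gpu=... --format=csv,noheader,nounits` output."""
    gpus = []
    for line in text.strip().splitlines():
        index, name, util, mem_used, mem_total = [f.strip() for f in line.split(",")]
        gpus.append({
            "index": int(index),
            "name": name,
            "util_pct": _to_int(util),
            "mem_used_mib": _to_int(mem_used),
            "mem_total_mib": _to_int(mem_total),
        })
    return gpus


def query_physical_gpus():
    """Run nvidia-smi locally and return one dict per physical GPU."""
    out = subprocess.check_output(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader,nounits"],
        text=True,
    )
    return parse_gpu_csv(out)


# Illustrative sample only, not captured from a real host.
SAMPLE = """0, GRID K1, 80, 1900, 4095
1, GRID K1, 12, 512, 4095"""

for gpu in parse_gpu_csv(SAMPLE):
    print(f"pGPU {gpu['index']}: {gpu['util_pct']}% busy, "
          f"{gpu['mem_used_mib']}/{gpu['mem_total_mib']} MiB framebuffer")
```

On a real host, calling query_physical_gpus() in a loop and writing the results to a log would give a utilization history per pGPU.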

Thanks and I appreciate any help on this request.

Also, the reason for this request is we are finding high GPU utilization with a low number of users in our first vGPU deployment. Here is the setup.

  • Grid K1
  • K120Q vGPU Profile
  • 3-5 users per physical GPU equals constant 95%+ utilization

User behavior is fairly standard with MS office applications and Chrome/IE with HW acceleration on.

Hi Taskman,

I’m afraid it isn’t possible, but this need has been highlighted to product management. You can, however, monitor the framebuffer for each VM, just not the GPU processing.

https://virtuallyvisual.wordpress.com/2015/07/27/limitations-in-monitoring-shared-nvidia-gpu-technologies/ (this is worth reading as it explains why trying to monitor from inside the VM would be very misleading)

https://virtuallyvisual.wordpress.com/2015/09/09/monitoring-nvidia-gpu-usage-of-the-framebuffer-for-vgpu-and-gpu-passthrough/

The framebuffer usage may give you an idea of which applications are using the GPU, but this has to be done via the process manager in the VM.

I will pass on this feedback to the product managers.

Best wishes,
Rachel

BTW

What is the stack, e.g. XenDesktop+vSphere? The K1 is essentially 4x K600 cards, and Chrome (and browsers in general) can be very GPU hungry; see: https://www.virtualexperience.no/2015/11/05/mythbusting-browser-gpu-usage-on-xenapp/

So with 4-5 users per pGPU, each user gets roughly 1/4 to 1/5 of a K600, which is not much if they are watching a lot of video.

The codecs/graphics mode in use will also use CPU and/or GPU (the new Blast Extreme in View uses the GPU).

Best wishes,
Rachel

Thanks Rachel for the quick response. We are currently using vSphere ESXi 6 with Horizon View 6.2.

I looked at both links and I’m trying to find where they mention how to monitor Frame Buffer. Is that a perfmon counter or CLI command? Thanks!

You can use passthrough mode to see gpusizer for one VM and test how much one VM is using before deciding on a vGPU profile. Another way is to make sure you have only one VM with vGPU active on one pGPU; then you can use GPU-Z or uberAgent to get GPU usage per process and per VM with correct results. If you have multiple VMs on the same physical GPU, you cannot rely on in-VM metrics.

With browser video, a K1 typically supports 3-4 users per physical GPU (pGPU), but CPU usage is still quite intense with GPU-enabled browsers.

Have a look at these blogposts:
http://www.virtualexperience.no/2015/11/05/mythbusting-browser-gpu-usage-on-xenapp/
http://www.virtualexperience.no/2015/01/07/im-100-sure-that-100-is-not-100/

Thank you everyone for the quick responses, this is a very active forum.

Now, let me go back to the reason for this post. We are doing our first vGPU deployment and I’m noticing 99% utilization with 3-5 users on a pGPU.

I have completed more testing with process-explorer and GPU-Z. When a VM is on a pGPU by itself, it is consuming 20-25% GPU utilization. At first it looked like the tools would not help as they showed no process using GPU resources. However, once I loaded Chrome (GPU accelerated), process-explorer then registered it as using GPU resources.

No other process is using any GPU resources, and I have tried stripping down all running processes to just the system, VMware, and NVIDIA processes.

I duplicated the behavior on two different hosts, each with two K1 cards being used. As soon as my user logs into the desktop on the VM, pGPU utilization hits 20-25%. This makes it seem like it’s the parent image or an issue with the configuration on the K1.

Thanks @mcerveny, I didn’t realize the PCoIP protocol played a factor in the pGPU utilization number. As soon as I disconnect from the VM, utilization drops to 0%, as it was the only VM on that pGPU. Then it goes back to 22-25% when I log back in.
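As a rough back-of-the-envelope check of why a handful of users can pin a pGPU, the sketch below assumes each connected session adds a fixed idle cost (the 20-25% seen above) and that these costs simply add up. That linear model is an assumption for illustration, not measured behavior, and the function name is hypothetical.

```python
def estimated_pgpu_utilization(sessions, per_session_idle_pct):
    """Estimate pGPU load if every connected session adds a fixed idle cost.

    Assumes the overhead adds linearly and caps at 100%. Both are
    simplifications, not measured behavior.
    """
    return min(100.0, float(sessions * per_session_idle_pct))


# Idle cost per session taken from the 20-25% range observed with one VM per pGPU.
for idle_pct in (20, 25):
    for sessions in range(1, 6):
        load = estimated_pgpu_utilization(sessions, idle_pct)
        print(f"{sessions} session(s) at {idle_pct}% each: ~{load:.0f}% of the pGPU")
```

Under that assumption, four to five connected sessions are enough to saturate a pGPU before any application does real work, which is roughly in line with the 95%+ seen at 3-5 users once normal application load is added.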

I’m guessing (hoping) this is not normal. In case it is a factor, here are the GPO settings being used for PCoIP.

PCoIP Session Variables/Not Overridable Administrator Settings

  • Configure clipboard redirection: Enabled (Configure clipboard redirection: Enabled client to server only)
  • Configure the PCoIP session bandwidth floor: Enabled (Set PCoIP session bandwidth floor in kilobits per second to: 2000)
  • Turn off Build-to-Lossless feature: Enabled

New update, I was digging through the release notes and came across this in the known issues section. This is exactly what I’m seeing but it is odd that I didn’t find anyone else reporting it online. Can anyone at Nvidia provide a status on Ref# 1735009? Thanks

From the release notes for 361.40/362.13:

nvidia-smi shows high GPU utilization for vGPU VMs with active Horizon sessions

Description: vGPU VMs with an active Horizon connection utilize a high percentage of the GPU on the ESXi host. The GPU utilization remains high for the duration of the Horizon session even if there are no active applications running on the VM.
Version
Workaround: None
Status: Open
Ref. #: 1735009

Hi Taskman,

That issue is open with VMware for resolution. I don’t know the root cause or any workaround I’m afraid, and it doesn’t affect every session.

PCoIP itself doesn’t use the GPU for encoding, but it does query the APIs to read directly from the framebuffer.

BLAST (since 7.0) will use the GPU for encoding when accessing from a client with a single display.

Magnar and mcerveny have both pretty much covered all the other likely causes. Remember the K1 is a pretty small GPU, so it’s easy to load it up with a few browser apps. Often, though it is counterintuitive, the cards with just 2 GPUs (K2 / M60) can give better performance and density if the application load requires GPU resource over graphics memory.

As M.C points out, there are third-party tools like Goliath, which is very good. They use the NVIDIA APIs and those derived from them by the hypervisor vendors, and they work with us closely to ensure the APIs are used properly and interoperability is good. However, like nvidia-smi, they are limited by the underlying capabilities of the card to provide per-VM GPU resource usage, so it’s not functionality a third party can provide either.

Best wishes,
Rachel

Thanks everyone. I just spoke with VMware and my issue matches the known issue in the release notes. They are escalating it on their end to Nvidia.

For the POC we are in, I had changed the deployment to depth-first instead of breadth-first in order to do a load test and identify potential issues like this. For now, I’ll switch back to breadth-first to mitigate this issue until a resolution is released by Nvidia.
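To illustrate the difference between the two placement policies, here is a simplified Python model. It is not VMware's actual placement algorithm; the function name is hypothetical, and the figure of 8 vGPU slots per pGPU is an assumption for the K120Q profile that should be checked against the vGPU documentation.

```python
def place_vms(num_vms, num_pgpus, slots_per_pgpu, policy):
    """Return the number of VMs placed on each pGPU under a simple policy model.

    breadth-first: each new VM goes to the least-loaded pGPU.
    depth-first:   each new VM goes to the most-loaded pGPU that still has a free slot.
    """
    if policy not in ("breadth-first", "depth-first"):
        raise ValueError(f"unknown policy: {policy}")
    if num_vms > num_pgpus * slots_per_pgpu:
        raise ValueError("not enough vGPU slots for the requested VMs")

    load = [0] * num_pgpus
    for _ in range(num_vms):
        # Only pGPUs with a free vGPU slot are candidates.
        candidates = [i for i in range(num_pgpus) if load[i] < slots_per_pgpu]
        if policy == "breadth-first":
            target = min(candidates, key=lambda i: load[i])
        else:
            target = max(candidates, key=lambda i: load[i])
        load[target] += 1
    return load


# One K1 board exposes 4 pGPUs; 8 slots per pGPU is an assumed value for K120Q.
for policy in ("depth-first", "breadth-first"):
    print(policy, place_vms(num_vms=5, num_pgpus=4, slots_per_pgpu=8, policy=policy))
```

With five VMs, the depth-first model stacks all of them on one pGPU while breadth-first leaves at most two per pGPU, which is why switching back to breadth-first softens the per-session overhead.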

Given @JasonSouthern’s comments on the capabilities of the K1s and the numbers we are seeing, I am also going to contact Nvidia about an eval for the M60 as part of our POC.

I’ll update this post once that occurs. Thanks again.

Hi folks,

Support have now published a KB explaining framebuffer monitoring both on host and per VM. So while you can’t get GPU resource per VM this may be of use for understanding your application use:
http://nvidia.custhelp.com/app/answers/detail/a_id/4108/

Best wishes,
Rachel

@Taskman, I think the K1 may be underpowered for your purposes. The M60 solution with passthrough licensing is quite affordable and will give you a lot more GPU power and will scale much better. That’s the route we’re in the process of implementing.

Taskman,

Any update on this 20-25% GPU utilization issue when initiating a Horizon PCoIP session?

This is definitely a contributing factor to my lack of GRID performance.

Pascal

Hi Pascal,

It’s an issue identified in the VMware stack (i.e. not one NVIDIA can resolve), and as such you need to raise a ticket with them and request a fix (although I don’t believe one has been released yet). We are tracking it and passing on cases to VMware. We have a KB article in draft:

Symptom / Error
High GPU load is seen with vSphere/View deployments and NVIDIA GRID vGPU, this may be seen even when sessions/VMs are idle. nvidia-smi shows high GPU utilization for vGPU VMs with active Horizon session. vGPU VMs with an active Horizon connection utilize a high percentage of the GPU on the ESXi host. The GPU utilization remains high for the duration of the Horizon session even if there are no active applications running on the VM. NVIDIA Ref. #1735009

Workaround / Solution
There is no workaround currently and customers affected need to raise a support case with VMware who hope to release a fix in a future release of their product. The issue is within the Horizon View product and as such this is not an issue NVIDIA can resolve.

Affected Products
VMware Horizon View 7.0 and earlier when using NVIDIA GRID vGPU and NVIDIA GRID Cards (K1, K2, M60, M6, M10).

Citrix Products
This issue only affects VMware Horizon View and related Blast Extreme and PCoIP protocols. Citrix XenDesktop/XenApp and HDX/ICA are unaffected by this issue.

References
This issue is documented in the latest release notes (Version 361.40 / 362.13) for NVIDIA GRID vGPU for VMware:

Many thanks, Rachel.

I have just raised a ticket with VMware.

Thanks again.

Pascal

Re: Symptom / Error
High GPU load is seen with vSphere/View deployments and NVIDIA GRID vGPU, this may be seen even when sessions/VMs are idle. nvidia-smi shows high GPU utilization for vGPU VMs with active Horizon session. vGPU VMs with an active Horizon connection utilize a high percentage of the GPU on the ESXi host. The GPU utilization remains high for the duration of the Horizon session even if there are no active applications running on the VM. NVIDIA Ref. #1735009

VMware have released a fix for the Blast Extreme protocol with VMware Horizon 7.0.1 update.

Users with issues on PCoIP need to continue to raise the need for a fix with that protocol with VMware.

Best wishes,
Rachel

Hi Rachel,

Indeed, the 7.0.1 update did not resolve the issue for PCoIP sessions. What is even more confusing is that when I raised a ticket with VMware, they don’t even have anything in their records that touches on this subject.

They would like a contact person at NVIDIA to tell them more about the problem so that they can properly send the issue to their engineering department. I find their answer very strange. Would you be so kind as to tell me who you were in contact with at VMware when you passed along the KB 1735009 to them? Somebody must know about this problem since they fixed it for the BLAST protocol…

Thanks Rachel,

Pascal

I have sent you a contact via private message.

How strange…

Rachel