Performance/power management problem on shared vGPU

Hello.

Situation 0 - one "idle" Windows Aero desktop on one physical GPU:

If I run 1 vGPU per physical GPU, everything seems fine: I get the maximum frame rate of 25 FPS (limited by plugin0.frl_config).
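For reference, the FRL cap comes from the vGPU type configuration. A quick way to check it (a sketch; the /usr/share/nvidia/vgx path is my assumption for a XenServer GRID install and may differ elsewhere):

grep -H frl_config /usr/share/nvidia/vgx/*.conf    # show the frame-rate-limiter setting of each vGPU type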

Situation 1 - two "idle" Windows Aero desktops on one physical GPU:

If I run 2 vGPUs per physical GPU, the frame rate on both drops below 17 FPS.

Observation:

nvidia-smi

+------------------------------------------------------+                       
| NVIDIA-SMI 340.34     Driver Version: 340.34         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GRID K1             On   | 0000:06:00.0     Off |                  N/A |
| N/A   33C    P8    11W /  31W |   1820MiB /  4095MiB |     20%      Default |
+-------------------------------+----------------------+----------------------+
... snip

lspci -s 06:00.0 -vvv | grep LnkSta:

LnkSta:	Speed 2.5GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-

nvidia-smi -q -i 0 -d CLOCK | head -12

==============NVSMI LOG==============

Timestamp                           : Sun Jan 11 15:55:58 2015
Driver Version                      : 340.34

Attached GPUs                       : 4
GPU 0000:06:00.0
    Clocks
        Graphics                    : 324 MHz
        SM                          : 324 MHz
        Memory                      : 324 MHz

Interpretation:

The physical GPU stays in the lowest power state "P8" (lowest memory, graphics, and PCIe clocks) and is not able to deliver the required processing power.
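To watch the power state and clocks live while the guests run, a loop like this works (a sketch; field names as listed by "nvidia-smi --help-query-gpu", which this driver generation should support):

nvidia-smi -i 0 --query-gpu=timestamp,pstate,clocks.gr,clocks.sm,clocks.mem,utilization.gpu --format=csv -l 1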

Situation 2 - one "idle" Windows Aero desktop and one "3D application" on one physical GPU:

If I run a small 3D application on one vGPU, the power state changes to "P0", but the GPU is still unable to achieve the requested frame rate. On both shared vGPUs the frame rate is now about 23 FPS.

Observation:

nvidia-smi

+------------------------------------------------------+                       
| NVIDIA-SMI 340.34     Driver Version: 340.34         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GRID K1             On   | 0000:06:00.0     Off |                  N/A |
| N/A   38C    P0    17W /  31W |   1820MiB /  4095MiB |     31%      Default |
+-------------------------------+----------------------+----------------------+
... snip

lspci -s 06:00.0 -vvv | grep LnkSta:

LnkSta:	Speed unknown, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-

("Speed unknown" here means 8 GT/s, i.e. PCIe Gen3; this lspci version does not know the Gen3 speed encoding.)
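The link generation can be cross-checked without lspci (a sketch; these query fields exist in nvidia-smi, though an old driver may report N/A):

nvidia-smi -i 0 --query-gpu=pcie.link.gen.current,pcie.link.gen.max,pcie.link.width.current --format=csv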

nvidia-smi -q -i 0 -d CLOCK | head -12

==============NVSMI LOG==============

Timestamp                           : Sun Jan 11 16:11:00 2015
Driver Version                      : 340.34

Attached GPUs                       : 4
GPU 0000:06:00.0
    Clocks
        Graphics                    : 680 MHz
        SM                          : 680 MHz
        Memory                      : 891 MHz

Interpretation:

The physical GPU is no longer in the lowest power state, but there is still performance throttling!

QUESTION:

I tried direct clock management, but it is unsupported on the GRID K1.

nvidia-smi -q -d SUPPORTED_CLOCKS           # Show supported clock frequencies
nvidia-smi -ac <MEM clock,Graphics clock>   # Set the memory and graphics clock frequencies
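For comparison, on a board that does support application clocks the sequence would be (the numbers are placeholders, to be taken from the SUPPORTED_CLOCKS output):

nvidia-smi -i 0 -q -d SUPPORTED_CLOCKS      # list the valid <memory,graphics> pairs
nvidia-smi -i 0 -ac 2505,875                # apply one pair from the supported list (placeholder values)
nvidia-smi -i 0 -rac                        # reset to the default application clocks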

How can I reprogram the automatic power management?

Thanks for any answers, Martin Cerveny

This might help: http://xenserver.org/partners/developing-products-for-xenserver/19-dev-help/138-xs-dev-perf-turbo.html Also, are the fans turned up?

Thanks, but it does not help. The problem is with GPU power management, not CPU power management :-(

M.C>

Hi Martin,
What system are you using (from the HCL)?
It looks like the card is in an x8 PCIe slot. It needs to be in an x16 slot. Are you able to switch slots and try again?

Thanks for your hint, but it does not help. The K1 card is inserted in an x16 PCIe Gen3 slot, and there is a PLX bridge on the card (PEX 8747 in 16+48 configuration, http://www.plxtech.com/download/file/1824) that connects the 4 GPUs, each over an x8 PCIe Gen3 link only.
The CPU-facing side of the PLX bridge always runs at x16 and full Gen3 speed (hence "Speed unknown"), but the PLX<->GPU links run at x8, with their speed negotiated down by the GPU's power management.
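The whole bridge topology is visible in the PCI tree view (standard lspci, nothing vendor-specific):

lspci -tv    # the PEX 8747 appears as "PLX Technology, Inc. Device 8747" with four downstream ports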

M.C>

lspci -vvv | egrep '^04:|^05|LnkSta:'

...
04:00.0 PCI bridge: PLX Technology, Inc. Device 8747 (rev ca) (prog-if 00 [Normal decode])
		LnkSta:	Speed unknown, Width x16, TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-
05:08.0 PCI bridge: PLX Technology, Inc. Device 8747 (rev ca) (prog-if 00 [Normal decode])
		LnkSta:	Speed 2.5GT/s, Width x8, TrErr- Train- SlotClk- DLActive+ BWMgmt+ ABWMgmt+
05:09.0 PCI bridge: PLX Technology, Inc. Device 8747 (rev ca) (prog-if 00 [Normal decode])
		LnkSta:	Speed 2.5GT/s, Width x8, TrErr- Train- SlotClk- DLActive+ BWMgmt+ ABWMgmt+
05:10.0 PCI bridge: PLX Technology, Inc. Device 8747 (rev ca) (prog-if 00 [Normal decode])
		LnkSta:	Speed 2.5GT/s, Width x8, TrErr- Train- SlotClk- DLActive+ BWMgmt+ ABWMgmt+
05:11.0 PCI bridge: PLX Technology, Inc. Device 8747 (rev ca) (prog-if 00 [Normal decode])
		LnkSta:	Speed 2.5GT/s, Width x8, TrErr- Train- SlotClk- DLActive+ BWMgmt+ ABWMgmt+
...
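A simple way to watch the GPU-side links renegotiate as load comes and goes (plain shell):

watch -n 1 "lspci -s 06:00.0 -vvv | grep LnkSta:"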

I see.
A couple of suggestions:

  • Verify that your system is capable of providing enough power to the graphics card
  • Try with another card to verify whether it is the system or the specific card that is giving you problems

Any results from this?

I do not understand. The system is OK (DomU (Win7) / Dom0 (Xen) / 2x E5 v2 Xeon / 32 GB RAM and K1), and yes, the problem is that the GPU is throttled down by its own bad power management. As I wrote in the original post, I do not know how to disable or reprogram the GPU's power management. If you mean the 12 V power supply, it is sufficient too (running at half of the 665 W limit, with a maximum of 54 A on 12 V). If you mean cooling, it is OK too (measured temperatures range from 29 °C idle to 43 °C under load; throttling starts at 95 °C per the specs).
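The power and temperature headroom described above is easy to confirm live (a sketch; query fields as listed by "nvidia-smi --help-query-gpu"):

nvidia-smi -i 0 --query-gpu=power.draw,temperature.gpu,pstate --format=csv -l 5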

There are only two cards capable of vGPU, the K1 and the K2, and I have access to only two K1 cards.
I will build a second server after next week and try other XenServer versions and another GPU BIOS version to test whether the problem persists.

M.C>

FYI - resolved without NVIDIA support: the problem with bad P-state management has persisted for many years without any support. OK, I checked the driver settings and found the answer. There is a driver registry setting "RMForcePstate" that fulfills my needs. The setting can be programmed persistently with echo 'options nvidia NVreg_RegistryDwords="…"' > /etc/modprobe.d/nvidia.conf (Xen-based or generic Linux) or with esxcli system module parameters set -m nvidia -p "NVreg_RegistryDwords=…" (ESXi).
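Concretely, with the value demonstrated below (0 forces the highest power state, P0):

echo 'options nvidia NVreg_RegistryDwords="RMForcePstate=0"' > /etc/modprobe.d/nvidia.conf   # Xen-based or generic Linux
esxcli system module parameters set -m nvidia -p 'NVreg_RegistryDwords=RMForcePstate=0'      # ESXi equivalent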

# nvidia-smi -pm 1; sleep 30; nvidia-smi 
Enabled persistence mode for GPU 00000000:01:00.0.
All done.
Fri Apr  5 09:53:23 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.113                Driver Version: 390.113                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    On   | 00000000:01:00.0 Off |                  N/A |
| 27%   45C    P8    14W / 180W |     21MiB /  8191MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

# rmmod nvidia; modprobe nvidia NVreg_RegistryDwords="RMForcePstate=0"; nvidia-smi -pm 1; sleep 30; nvidia-smi 
Enabled persistence mode for GPU 00000000:01:00.0.
All done.
Fri Apr  5 09:54:30 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.113                Driver Version: 390.113                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    On   | 00000000:01:00.0 Off |                  N/A |
| 25%   45C    P0    38W / 180W |     21MiB /  8191MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
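Note the trade-off visible in the two listings: with RMForcePstate=0 the card now idles in P0 at 38 W instead of P8 at 14 W, so the clocks no longer drop but idle power is higher.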