SOLVED: GRID M60 has one engine pegged at 100% utilization with no load

A GRID M60 has one of its engines pegged at 100% utilization while the other shows 0%. I have two identical servers, and the second shows 0% for both engines. Only passthrough mode is enabled and there are no VMs installed, so the load should be zero. Suspecting the card might still be in compute mode, I went looking for the gpumodeswitch utility, but it was nowhere to be found.

Platform is Dell R730, XenServer 6.5 SP1 fully patched up to 028. All BIOS settings appear to be identical. These are in a pool; swapping the pool master makes no difference, either. These are brand new servers and M60 units, so I’m hoping it’s not a faulty GPU card.

So, the odd thing was that only one engine was showing this symptom, and the symptom was not evident on the other server, which has its own M60 card.

However, further poking around revealed that not only the original server with the issue, but the other server as well, had compute mode still turned on!!! So both servers were affected, even though the weirdness only showed up on the one server.

After fixing this, all now looks great and the pegged GPU is a thing of the past. For XenServer, here’s a quick procedure that’s IMO even faster than booting the ISO from the gpumodeswitch download area with its tiny Linux kernel.

Quick and Dirty GPU in Compute Mode Fix for XenServer

Are the GPU engines in compute mode? Check with:

lspci -n | grep 10de

06:00.0 0302: 10de:13f2 (rev a1)
07:00.0 0302: 10de:13f2 (rev a1)

The 0302 class code (3D controller) indicates you are indeed in compute mode; in graphics mode the devices show up as 0300 (VGA compatible controller) instead.
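If you have a few hosts to check, the same test can be wrapped in a tiny script. This is just a rough sketch (the file name is an example; it assumes dom0’s stock awk and that the NVIDIA devices show vendor ID 10de):

#!/bin/bash
# check-gpu-mode.sh -- report graphics vs. compute mode for each NVIDIA device
lspci -n | awk '$3 ~ /^10de:/ {
    mode = "unknown"
    if ($2 ~ /^0300/) mode = "graphics"
    if ($2 ~ /^0302/) mode = "compute"
    print $1, $3, "->", mode
}'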

You can get the utilities from the NVIDIA licensing/software product information center (you need an account to log in):

https://nvidia.flexnetoperations.com/control/nvda/index

Grab the gpumodeswitch package. Within it, you’ll see a plain Linux executable called "gpumodeswitch".

Upload that gpumodeswitch Linux executable onto the XenServer dom0 and run it as root.
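One way to get it onto dom0 is scp from wherever you unpacked the download (the hostname here is only a placeholder):

scp gpumodeswitch root@your-xenserver-host:/root/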

Make executable:

chmod 700 gpumodeswitch

then run these three commands in sequence (first stop the GPU monitoring daemon so the driver can be unloaded, then unload the NVIDIA kernel module, then switch the mode):

service xcp-rrdd-gpumon stop

rmmod nvidia

./gpumodeswitch --gpumode graphics

and afterwards, reboot.
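If you have several hosts to fix, the same sequence is easy to script. This is only a sketch of the manual steps above, assuming the binary sits in /root (gpumodeswitch may still prompt you to confirm):

#!/bin/bash
# switch-to-graphics.sh -- sketch of the manual steps above
set -e
service xcp-rrdd-gpumon stop              # stop the GPU monitoring daemon so the driver can unload
rmmod nvidia                              # unload the NVIDIA kernel module
/root/gpumodeswitch --gpumode graphics    # switch the card(s) to graphics mode
reboot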

After the reboot, check again:

lspci -n | grep 10de

06:00.0 0300: 10de:13f2 (rev a1)
07:00.0 0300: 10de:13f2 (rev a1)

Yes, the values are now 0300 as they should be.
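As a one-liner, this should print 0 once everything is back in graphics mode (just a sketch; it counts NVIDIA devices still reporting the 0302 class):

lspci -n | awk '$3 ~ /^10de:/ && $2 ~ /^0302/' | wc -l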

Now, nvidia-smi shows all is calm, all is bright:

nvidia-smi

Wed May 11 20:56:14 2016
+------------------------------------------------------+
| NVIDIA-SMI 361.40     Driver Version: 361.40         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M60           On   | 0000:06:00.0     Off |                  Off |
| N/A   27C    P8    24W / 150W |     14MiB /  8191MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M60           On   | 0000:07:00.0     Off |                  Off |
| N/A   26C    P8    24W / 150W |     14MiB /  8191MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

:-)
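If you only want the utilization numbers rather than the full table, nvidia-smi’s query output is handy too (these are standard query fields; add or drop fields to taste):

nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used --format=csv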

I’d also like to point out Jason Southern’s really nice video on this topic on YouTube.

For anyone experiencing weird behaviour, particularly from nvidia-smi with a brand new M60 / M6, the sanity check that the card is in graphics mode described in the NVIDIA article “Having problems with new M6/M60 like VMs fail to power on, NVRM BAR1 error, ECC is enabled, or nvidia-smi fails” is a must!

Thank you, Tobias, for the useful explanation.

Rachel