So, the odd thing was that only one GPU engine showed this symptom; the other server, which has a different M60 card, showed nothing unusual.
However, further poking around revealed that both the server with the issue and the other server still had compute mode turned on! So both servers were affected, even though the weirdness only showed up on one of them.
After fixing this, all now looks great and the pegged GPU in the one case is a thing of the past. For XenServer, here’s a quick procedure that’s IMO even faster than using the ISO from the gpumodeswitch download area with the tiny Linux kernel.
Quick and Dirty GPU in Compute Mode Fix for XenServer
Are the GPU engines in compute mode? Check with:
lspci -n | grep 10de
06:00.0 0302: 10de:13f2 (rev a1)
07:00.0 0302: 10de:13f2 (rev a1)
The 0302 indicates you are indeed in compute mode instead of graphics mode (which would otherwise show up as 0300).
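If you have several hosts to check, the same test can be wrapped in a small helper. This is only a sketch: gpu_modes is a made-up name, not part of gpumodeswitch or XenServer, and it assumes the three-column `lspci -n` layout shown above (slot, class, vendor:device). It labels each NVIDIA device (vendor ID 10de) by its PCI class code.

```shell
#!/bin/sh
# gpu_modes: hypothetical helper, not shipped with any NVIDIA or XenServer tool.
# Reads `lspci -n` output on stdin and prints one line per NVIDIA device
# (vendor ID 10de): slot, class code, and the mode that class code implies.
gpu_modes() {
    awk '$3 ~ /^10de:/ {
        class = $2
        sub(/:$/, "", class)              # "0302:" -> "0302"
        if (class == "0302")      mode = "compute"
        else if (class == "0300") mode = "graphics"
        else                      mode = "other"
        print $1, class, mode
    }'
}
# Usage on dom0:  lspci -n | gpu_modes
```

With the output above, this would report both devices as "compute".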
You can get utilities from the license/software product information center (you need an account to be able to log in):
https://nvidia.flexnetoperations.com/control/nvda/index
Grab the gpumodeswitch package. Within it, you’ll see a plain Linux executable called "gpumodeswitch".
Upload that gpumodeswitch Linux executable to the XenServer dom0 and run it as root.
Make it executable:
chmod 700 gpumodeswitch
then run these three commands in sequence:
service xcp-rrdd-gpumon stop
rmmod nvidia
./gpumodeswitch --gpumode graphics
and afterwards, reboot.
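If you have to repeat this on more than one host, the three commands can be chained so that a failure in one step stops the rest. A minimal sketch, assuming gpumodeswitch sits in the current directory; switch_to_graphics, run and DRY_RUN are names invented for this example, and the commands themselves are exactly the ones listed above.

```shell
#!/bin/sh
# Hypothetical wrapper around the three commands from the post.
# DRY_RUN=1 only prints the commands instead of executing them.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Each step runs only if the previous one succeeded.
switch_to_graphics() {
    run service xcp-rrdd-gpumon stop &&
    run rmmod nvidia &&
    run ./gpumodeswitch --gpumode graphics
}
# Preview:   DRY_RUN=1 switch_to_graphics
# For real:  switch_to_graphics   (as root, then reboot)
```

The dry run is there so you can see what will be executed before touching the driver on a production host.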
After the reboot, check again:
lspci -n | grep 10de
06:00.0 0300: 10de:13f2 (rev a1)
07:00.0 0300: 10de:13f2 (rev a1)
Yes, the values are now 0300 as they should be.
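For a scripted post-reboot check, the same comparison can be turned into an exit status. Again just a sketch with a made-up function name (no_compute_mode_left), assuming the `lspci -n` layout shown above: it succeeds only if at least one NVIDIA device is listed and none of them still reports class 0302.

```shell
#!/bin/sh
# no_compute_mode_left: hypothetical check, reads `lspci -n` output on stdin.
# Exit status 0 = NVIDIA devices found and none is class 0302 (compute mode).
# Exit status 1 = a 0302 device remains, or no NVIDIA device was listed.
no_compute_mode_left() {
    awk '$3 ~ /^10de:/ { seen = 1; if ($2 == "0302:") bad = 1 }
         END { if (seen && !bad) exit 0; exit 1 }'
}
# Usage on dom0:
#   lspci -n | no_compute_mode_left && echo "all GPUs out of compute mode"
```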
Now, nvidia-smi shows all is calm, all is bright:
nvidia-smi
Wed May 11 20:56:14 2016
+------------------------------------------------------+
| NVIDIA-SMI 361.40     Driver Version: 361.40         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M60           On   | 0000:06:00.0     Off |                  Off |
| N/A   27C    P8    24W / 150W |     14MiB /  8191MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M60           On   | 0000:07:00.0     Off |                  Off |
| N/A   26C    P8    24W / 150W |     14MiB /  8191MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
:-)
I’d also like to point out Jason Southern’s really nice video on this topic, available on YouTube.