I’ve been looking at this issue for the last week or so and cannot determine the answer. We have a cluster (RHEL 6.3) with GPU nodes (some have M2090s and some have K20s). On the K20 nodes, the GPUs show utilization even though there are no processes running on the boards. Using the NVIDIA SMI tool, I get this simple output:
+------------------------------------------------------+
| NVIDIA-SMI 4.310.32 Driver Version: 310.32 |
|-------------------------------+----------------------+----------------------+
| GPU Name | Bus-Id Disp. | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K20m | 0000:2A:00.0 Off | 0 |
| N/A 29C P0 47W / 225W | 0% 11MB / 4799MB | 23% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K20m | 0000:90:00.0 Off | 0 |
| N/A 30C P0 45W / 225W | 0% 11MB / 4799MB | 78% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Compute processes: GPU Memory |
| GPU PID Process name Usage |
|=============================================================================|
| No running compute processes found |
+-----------------------------------------------------------------------------+
For the M2090 we see a similar result, but GPU utilization is 0%:
+------------------------------------------------------+
| NVIDIA-SMI 4.310.32 Driver Version: 310.32 |
|-------------------------------+----------------------+----------------------+
| GPU Name | Bus-Id Disp. | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla M2090 | 0000:2A:00.0 Off | 0 |
| N/A N/A P0 75W / 225W | 0% 9MB / 5375MB | 0% E. Thread |
+-------------------------------+----------------------+----------------------+
| 1 Tesla M2090 | 0000:90:00.0 Off | 0 |
| N/A N/A P0 77W / 225W | 0% 9MB / 5375MB | 0% E. Thread |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Compute processes: GPU Memory |
| GPU PID Process name Usage |
|=============================================================================|
| No running compute processes found |
+-----------------------------------------------------------------------------+
I’m curious whether anybody has suggestions for tracing GPU utilization beyond SMI so that I can remove the culprit. We have also reset the nodes and reset the cards; it simply didn’t help. I appreciate any help on this, as these K20s are actually significantly slower to compute on than the K20 I have in my workstation. Thanks!
Excuse the single code block, the website apparently does not like two code blocks.
I see that you have ECC Enabled. Do you happen to have Persistence Mode Disabled?
During driver initialization, when ECC is enabled, one can see high GPU and memory utilization readings. This is caused by the ECC memory scrubbing mechanism that runs during driver initialization.
When Persistence Mode is Disabled, the driver deinitializes whenever there are no clients running (CUDA apps, nvidia-smi, or an X server) and has to initialize again before any GPU application (like nvidia-smi) can query its state, which triggers ECC scrubbing again.
As a rule of thumb, always run with Persistence Mode Enabled: just run nvidia-smi -pm 1 as root. This will speed up application launching by keeping the driver loaded at all times.
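For example (a minimal sketch; the exact wording of the query output can vary between driver versions):

# Enable persistence mode on all GPUs (must be run as root)
nvidia-smi -pm 1
# Verify the setting; the full query output includes a "Persistence Mode" field per GPU
nvidia-smi -q | grep -i "persistence mode"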
Let me know if that explains the results you’re seeing.
I just logged into one of the nodes, and persistence mode has done the trick. Thank you!
I vaguely remember hearing about it at GTC 2013 and noting that we should enable it; another sysadmin did the driver install. The nodes all use the same OS image, so they all have pretty much identical settings, which brings me to two more questions.
Does the M2090 not require persistence mode? They seem not to be affected.
And I have read that this is not a permanent change, so we need to add it to our start-up scripts so that persistence mode is always on after boot?
Does the M2090 not require persistence mode? They seem not to be affected.
By the time NVSMI is done initializing, the ECC scrubbing is done as well, so there’s a race between the query and whether the utilization counters still hold the values from the scrubbing.
Some queries, or the initialization itself, might take a bit longer on the M2090 than on the K20. I hope this addresses your concerns.
The M2090 is best used with Persistence Mode Enabled, just like the K20.
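If you want to see the initialization cost for yourself, a quick and rough comparison (run as root) is to time a query with persistence mode off versus on:

# With persistence mode off, the previous nvidia-smi client has exited,
# so the next query has to reload the driver (and redo the ECC scrub) first
nvidia-smi -pm 0 && time nvidia-smi
# With persistence mode on, the driver stays resident and the query returns quickly
nvidia-smi -pm 1 && time nvidia-smi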
And I have read that this is not a permanent change, so we need to add it to our start-up scripts so that persistence mode is always on after boot?
Correct. You need to enable persistence mode after every reboot.
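One way to handle that on a RHEL 6.x image (a sketch; adjust the nvidia-smi path for your install) is to add the command to the script that runs last at boot, e.g. /etc/rc.d/rc.local:

# /etc/rc.d/rc.local -- executed at the end of boot on RHEL 6.x
# Keep the driver loaded by enabling persistence mode on all GPUs
/usr/bin/nvidia-smi -pm 1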
The first time the driver is loaded, it needs to initialize the ECC check bits in device memory. Rebooting or performing a GPU reset will force this initialization to occur (even if persistence mode is enabled). During the initialization process the GPU will report high utilization.
Yet the GPU starts at 99% utilization after a reboot or GPU reset.
My expectation is that the ECC check bit initialization will complete after a few seconds, and the GPU utilization will fall to 0%. Can you confirm if the GPU utilization drops after a few seconds?
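If it helps, something like this (standard nvidia-smi options) prints the utilization readout every few seconds so you can watch whether it falls once the scrub finishes:

# Show the UTILIZATION section of the query output every 5 seconds (Ctrl-C to stop)
nvidia-smi -q -d UTILIZATION -l 5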
No, the GPU remains at 99% utilization as long as the machine is up. I rebooted because I thought some process might have “hung” on the GPU (this happened after trying out some CUDA-enabled NAMD; I rebooted after ~24 hours). After the reboot, utilization has been at 99% for a few hours now.
I have disabled ECC (nvidia-smi -e 0) and rebooted the computer. Now GPU utilization is 0%, but ECC is disabled. Is there a different solution, or is the memory infallible and ECC not really required?
One more thing to note: until now the performance state was P0; since disabling ECC it has been P8. I suspect the ECC scrub never completed correctly. Is there a way to verify that the hardware is working correctly?
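If re-enabling ECC and then checking the error counters is the right way to verify the memory, I assume it would look roughly like this (please correct me if not):

# Re-enable ECC (as root); the setting takes effect after the next reboot or GPU reset
nvidia-smi -e 1
# After rebooting, dump the ECC section of the query output; non-zero aggregate
# single/double bit error counts would point to a memory problem
nvidia-smi -q -d ECC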
I’ve only seen incorrect results with a GTX Titan when I overclock it somewhere above 1200 MHz while performing CUDA calculations at 99-100% GPU usage (same GK110 chip). Of course YMMV, but bench-test your card against some sort of numerically verifiable results, if possible, to determine whether it is stable with ECC disabled. Also, you might want to seek support from your K20c vendor; they should have the pull to elevate your concerns about a possible issue to a knowledgeable NVIDIA rep who should be able to assist. After all, that’s part of the $$$$ you pay for this product.
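Another low-effort check (just a suggestion, not proof of anything either way): look for NVIDIA Xid error messages in the kernel log after a run, since the driver usually logs those when something goes wrong on the GPU:

# Any "NVRM: Xid" lines that appear after a benchmark run are worth investigating
dmesg | grep -i xid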
In addition to following up with your system vendor as suggested by vacaloca, it would make sense to file a bug using the bug reporting form linked from the registered developer website. A GPU reporting P0 and full utilization while no identifiable compute process is running is not expected behavior.
Hi all,
Have you looked at the power draw of the K20c?
I used the SHOC benchmark to test the K20c, and its maximum power draw is only about 150 W; it never goes above 160 W, even though GPU utilization reaches 99%.
I have tried the SHOC benchmark on a Quadro 4000, a Tesla M2090, and a Grid K2, and those cards’ power draw can reach 95% of TDP, so I don’t think the benchmark tool is the problem.
I just suspect something is wrong with my K20c.
By the way, I have checked that the power limit is 225 W.
*******************************************************************************
Mon Jul 15 19:43:51 2013
+------------------------------------------------------+
| NVIDIA-SMI 5.319.32 Driver Version: 319.32 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K20c Off | 0000:06:00.0 Off | Off |
| 35% 47C P0 146W / 225W | 283MB / 5119MB | 99% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Compute processes: GPU Memory |
| GPU PID Process name Usage |
|=============================================================================|
| 0 6316 ./SGEMM 267MB |
+-----------------------------------------------------------------------------+
****************************************************************************************
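For reference, one simple way to watch the power draw while SGEMM is running (standard nvidia-smi options; just one possible approach):

# Print the POWER section of the query output once per second during the run
nvidia-smi -q -d POWER -l 1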
The act of running the nvidia-smi tool can itself generate utilization on a GPU; this is not a matter of concern. Regarding power draw, a benchmark like Rodinia is not sufficient to draw full power from a GPU, even though the reported “utilization” may be 99%.