I’ve been looking at this issue for the last week or so and cannot determine the answer. We have a cluster (RHEL 6.3) with GPU nodes (some have M2090s and some have K20s). On the K20 nodes, the GPUs show utilization even though there are no processes running on the boards. Using the NVIDIA SMI tool, I get this simple output:
+------------------------------------------------------+
| NVIDIA-SMI 4.310.32 Driver Version: 310.32 |
|-------------------------------+----------------------+----------------------+
| GPU Name | Bus-Id Disp. | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K20m | 0000:2A:00.0 Off | 0 |
| N/A 29C P0 47W / 225W | 0% 11MB / 4799MB | 23% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K20m | 0000:90:00.0 Off | 0 |
| N/A 30C P0 45W / 225W | 0% 11MB / 4799MB | 78% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Compute processes: GPU Memory |
| GPU PID Process name Usage |
|=============================================================================|
| No running compute processes found |
+-----------------------------------------------------------------------------+
For the M2090 we see a similar result, but GPU utilization is 0%:
+------------------------------------------------------+
| NVIDIA-SMI 4.310.32 Driver Version: 310.32 |
|-------------------------------+----------------------+----------------------+
| GPU Name | Bus-Id Disp. | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla M2090 | 0000:2A:00.0 Off | 0 |
| N/A N/A P0 75W / 225W | 0% 9MB / 5375MB | 0% E. Thread |
+-------------------------------+----------------------+----------------------+
| 1 Tesla M2090 | 0000:90:00.0 Off | 0 |
| N/A N/A P0 77W / 225W | 0% 9MB / 5375MB | 0% E. Thread |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Compute processes: GPU Memory |
| GPU PID Process name Usage |
|=============================================================================|
| No running compute processes found |
+-----------------------------------------------------------------------------+
I’m curious whether anybody has suggestions for tracing GPU utilization beyond SMI so that I can remove the culprit. We have also reset the nodes and reset the cards; it simply didn’t help. I appreciate any help on this, as these K20s are actually significantly slower to compute on than the K20 I have in my workstation. Thanks!
Excuse the single code block, the website apparently does not like two code blocks.
I see that you have ECC Enabled. Do you happen to have Persistence Mode Disabled?
During driver initialization, when ECC is enabled, one can see high GPU and memory utilization readings. This is caused by the ECC memory scrubbing mechanism that runs during driver initialization.
When Persistence Mode is Disabled, the driver deinitializes whenever there are no clients running (CUDA apps, nvidia-smi, or an X server) and has to initialize again before any GPU application (like nvidia-smi) can query its state, which triggers ECC scrubbing again.
As a rule of thumb, always run with Persistence Mode Enabled: just run nvidia-smi -pm 1 as root. This will speed up application launching by keeping the driver loaded at all times.
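For example (a minimal sketch; the exact wording of the query output can vary between driver versions):

# Enable persistence mode on all GPUs (must be run as root)
nvidia-smi -pm 1
# Verify the setting; the full query output includes a "Persistence Mode" field per GPU
nvidia-smi -q | grep -i "persistence mode"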
Let me know if that explains the results you’re seeing.
I just logged into one of the nodes, and persistence mode has done the trick. Thank you!
I vaguely remember hearing about it at GTC 2013 and noting that we should enable it; another sysadmin did the driver install. The nodes all use the same OS image, so they all have pretty much identical settings, which brings me to two more questions.
Does the M2090 not require persistence mode? They seem not to be affected.
And I have read that this is not a permanent change, so we need to add it to our start-up scripts so that persistence mode is always on after boot?
Does the M2090 not require persistence mode? They seem not to be affected.
By the time NVSMI is done initializing, the ECC scrubbing is done as well, so there’s a race between the query and whether the utilization counters still hold the values from the scrubbing.
Some queries, or the initialization itself, might take a bit longer on the M2090 than on the K20. I hope this addresses your concerns.
The M2090 is best used with Persistence Mode Enabled, just like the K20.
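If you want to see the initialization cost for yourself, a quick and rough comparison (run as root) is to time a query with persistence mode off versus on:

# With persistence mode off, the previous nvidia-smi client has exited,
# so the next query has to reload the driver (and redo the ECC scrub) first
nvidia-smi -pm 0 && time nvidia-smi
# With persistence mode on, the driver stays resident and the query returns quickly
nvidia-smi -pm 1 && time nvidia-smi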
And I have read that this is not a permanent change, so we need to add it to our start-up scripts so that persistence mode is always on after boot?
Correct. You need to enable persistence mode after every reboot.
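One way to handle that on a RHEL 6.x image (a sketch; adjust the nvidia-smi path for your install) is to add the command to the script that runs last at boot, e.g. /etc/rc.d/rc.local:

# /etc/rc.d/rc.local -- executed at the end of boot on RHEL 6.x
# Keep the driver loaded by enabling persistence mode on all GPUs
/usr/bin/nvidia-smi -pm 1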
The first time the driver is loaded, it needs to initialize the ECC check bits in device memory. Rebooting or performing a GPU reset will force this initialization to occur (even if persistence mode is enabled). During the initialization process the GPU will report high utilization.
Yet the GPU starts at 99% utilization after a reboot or GPU reset.
My expectation is that the ECC check bit initialization will complete after a few seconds, and the GPU utilization will fall to 0%. Can you confirm if the GPU utilization drops after a few seconds?
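If it helps, something like this (standard nvidia-smi options) prints the utilization readout every few seconds so you can watch whether it falls once the scrub finishes:

# Show the UTILIZATION section of the query output every 5 seconds (Ctrl-C to stop)
nvidia-smi -q -d UTILIZATION -l 5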
No, the GPU remains at 99% utilization as long as the machine is up. I rebooted because I thought some process might have “hung” on the GPU (this happened after trying out some CUDA-enabled NAMD; I rebooted after ~24 hours). After the reboot, utilization has been at 99% for a few hours now.
I have disabled ECC (nvidia-smi -e 0) and rebooted the computer. Now GPU utilization is 0%, but ECC is disabled. Is there a different solution, or is the memory infallible and ECC not really required?
One more thing to note: until now the performance state was P0; since disabling ECC it has been P8. I suspect the ECC scrub never completed correctly. Is there a way to verify that the hardware is working correctly?
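If re-enabling ECC and then checking the error counters is the right way to verify the memory, I assume it would look roughly like this (please correct me if not):

# Re-enable ECC (as root); the setting takes effect after the next reboot or GPU reset
nvidia-smi -e 1
# After rebooting, dump the ECC section of the query output; non-zero aggregate
# single/double bit error counts would point to a memory problem
nvidia-smi -q -d ECC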
I’ve only seen incorrect results with a GTX Titan when I overclock it somewhere above 1200 MHz while performing CUDA calculations at 99-100% GPU usage (same GK110 chip). Of course YMMV, but bench-test your card against some sort of numerically verifiable results, if possible, to determine whether it is stable with ECC disabled. Also, you might want to seek support from your K20c vendor; they should have the pull to elevate your concerns about a possible issue to a knowledgeable NVIDIA rep who should be able to assist. After all, that’s part of the $$$$ you pay for this product.
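Another low-effort check (just a suggestion, not proof of anything either way): look for NVIDIA Xid error messages in the kernel log after a run, since the driver usually logs those when something goes wrong on the GPU:

# Any "NVRM: Xid" lines that appear after a benchmark run are worth investigating
dmesg | grep -i xid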
In addition to following up with your system vendor as suggested by vacaloca, it would make sense to file a bug using the bug reporting form linked from the registered developer website. A GPU reporting P0 and full utilization while no identifiable compute process is running is not expected behavior.
Hi all,
Have you looked at the power draw of the K20c?
I used the SHOC benchmark to test the K20c, and its maximum power draw is only about 150 W; it never goes above 160 W, even though GPU utilization reaches 99%.
I have tried the SHOC benchmark on a Quadro 4000, a Tesla M2090, and a Grid K2, and those cards’ power draw can reach 95% of TDP, so I don’t think the benchmark tool is the problem.
I just suspect something is wrong with my K20c.
By the way, I have checked that the power limit is 225 W.
*******************************************************************************
Mon Jul 15 19:43:51 2013
+------------------------------------------------------+
| NVIDIA-SMI 5.319.32 Driver Version: 319.32 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K20c Off | 0000:06:00.0 Off | Off |
| 35% 47C P0 146W / 225W | 283MB / 5119MB | 99% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Compute processes: GPU Memory |
| GPU PID Process name Usage |
|=============================================================================|
| 0 6316 ./SGEMM 267MB |
+-----------------------------------------------------------------------------+
****************************************************************************************
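For reference, one simple way to watch the power draw while SGEMM is running (standard nvidia-smi options; just one possible approach):

# Print the POWER section of the query output once per second during the run
nvidia-smi -q -d POWER -l 1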
The act of running the nvidia-smi tool can itself generate utilization on a GPU; this is not a matter of concern. Regarding power draw, a benchmark like Rodinia is not sufficient to draw full power from a GPU, even though the reported “utilization” may be 99%.