NVML overhead

I noticed that when I call nvidia-smi on my NVIDIA GTX 1080 every 0.1 seconds using watch -n 0.1 nvidia-smi, there is a noticeable performance degradation of about 20%.

  1. Why is the overhead so large?

  2. What NVML functions does nvidia-smi call? Here is what I get from my 1080:

| NVIDIA-SMI 410.48                 Driver Version: 410.48                    |
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  GeForce GTX 1080    Off  | 00000000:04:00.0 Off |                  N/A |
| 27%   38C    P2    37W / 180W |   1086MiB /  8119MiB |      0%      Default |
|   1  GeForce GTX 1080    Off  | 00000000:82:00.0 Off |                  N/A |
| 27%   30C    P8    10W / 180W |   1090MiB /  8119MiB |      0%      Default |

| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|    0      8575      C   python                                       579MiB |
|    0      8576      C   python                                       497MiB |
|    1     22118      C   python                                       497MiB |
|    1     83757      C   python                                       583MiB |
  3. How can I get basic information from my graphics card – utilization and DRAM usage – every 0.1 seconds without any significant overhead?

What is the use case that requires you to track utilization and DRAM usage with 0.1 second resolution? What bad things would happen if you query once a second instead?

If you look at the metrics programmers can query with NVML and compare to the output of nvidia-smi, it should be largely apparent which metrics correspond to which output. Maybe a bit of a tedious task, but seems doable. For example nvmlDeviceGetPowerUsage ( nvmlDevice_t device, unsigned int* power ) should correspond to the Pwr:Usage output of nvidia-smi.
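As a hedged illustration of that mapping (assuming the pynvml Python bindings to NVML, installed via the nvidia-ml-py package): nvmlDeviceGetPowerUsage reports milliwatts, which corresponds to the watts shown under Pwr:Usage after a unit conversion.

```python
def mw_to_watts(milliwatts):
    """NVML reports power in milliwatts; nvidia-smi displays whole watts."""
    return milliwatts / 1000.0

def read_power_watts(gpu_index=0):
    """Query the current power draw of one GPU (requires a GPU and pynvml)."""
    import pynvml  # assumption: the nvidia-ml-py package is installed
    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)
        return mw_to_watts(pynvml.nvmlDeviceGetPowerUsage(handle))
    finally:
        pynvml.nvmlShutdown()
```

A raw reading of 37000 mW would correspond to the 37W shown for GPU 0 in the output above.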

I suspect (don’t know for sure; haven’t checked) that the more metrics you query via NVML the more overhead you incur. Have you tried retrieving just a single property to see whether that reduced overhead?
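A sketch of that experiment (again assuming pynvml): query only the two metrics of interest, utilization and memory, instead of everything nvidia-smi collects. The format_sample helper just renders the numbers in nvidia-smi-like columns.

```python
def format_sample(util_pct, mem_used_bytes, mem_total_bytes):
    """Render one sample roughly like nvidia-smi's utilization/memory columns."""
    mib = 1024 * 1024
    return "%3d%%  %dMiB / %dMiB" % (util_pct, mem_used_bytes // mib, mem_total_bytes // mib)

def sample_gpu(gpu_index=0):
    """One minimal query: utilization and memory only (requires a GPU and pynvml)."""
    import pynvml  # assumption: the nvidia-ml-py package is installed
    pynvml.nvmlInit()
    try:
        h = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)
        util = pynvml.nvmlDeviceGetUtilizationRates(h).gpu
        mem = pynvml.nvmlDeviceGetMemoryInfo(h)
        return format_sample(util, mem.used, mem.total)
    finally:
        pynvml.nvmlShutdown()
```

Comparing the cost of this against a full nvidia-smi invocation would show how much of the overhead comes from the extra metrics.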

I would think that most of the work of the GPU memory allocator happens in the driver, i.e. code running on the GPU. Presumably using a faster host system would speed up memory allocator related queries (and maybe others as well). Have you tried that?

Hi njuffa,

Thanks as always for your thoughtful answers.

Our company hosts a real-time text-to-speech service, where any delay experienced by the customer degrades the quality of service by that much. For Slurm or IBM LSF to schedule a job to a specific GPU, it needs to query GPU information at a certain interval. If this interval is 1 second, it could take up to 1 additional second for the job even to be scheduled to the appropriate GPU. One second would be far too long, since it potentially adds an extra second to the pre-existing delay.

I should try this myself.

If most of the work happens in code running on the GPU, why would a faster host system speed up DRAM queries?

I could be wrong, but I don’t think task schedulers like LSF are designed to provide real-time scheduling of the kind you envision. Nor do I think NVML is designed for such soft real-time operation.

I am fairly certain that the control structures (e.g. block descriptors, linked lists) for the GPU memory allocator reside on the host. When nvidia-smi displays the GPU memory usage (lines 14 ff. in your sample output), it walks the GPU memory allocator’s data structures. This is all host-side activity, and it should be faster on a faster host platform (faster single-threaded CPU performance, faster system memory).

I haven’t tried to reverse engineer NVML, but I would think that all queryable metrics that don’t involve physical sensor data involve mostly, if not exclusively, host-side activity. They might still require expensive userland/kernel transitions.

You should be able to time individual NVML queries one by one to find the approximate cost for each. For example, I would expect “GPU name” to be very fast (pull a string from the GPU context), but “GPU temperature” to be very slow (query a physical sensor on the GPU, maybe via an I2C driver or somesuch).
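A sketch of such a per-query measurement (the timing harness itself is generic; the pynvml calls in the commented-out usage are an assumption and require a real GPU):

```python
import time

def time_each_query(queries, iters=100):
    """Average wall-clock cost of each zero-argument callable in queries."""
    results = {}
    for name, fn in queries.items():
        start = time.perf_counter()
        for _ in range(iters):
            fn()
        results[name] = (time.perf_counter() - start) / iters
    return results

# Hypothetical usage on a real GPU (requires pynvml):
# import pynvml
# pynvml.nvmlInit()
# h = pynvml.nvmlDeviceGetHandleByIndex(0)
# print(time_each_query({
#     "name": lambda: pynvml.nvmlDeviceGetName(h),
#     "temperature": lambda: pynvml.nvmlDeviceGetTemperature(h, pynvml.NVML_TEMPERATURE_GPU),
#     "power": lambda: pynvml.nvmlDeviceGetPowerUsage(h),
# }))
```

If the sensor-backed queries (temperature, power) dominate, dropping them from the monitoring loop should recover most of the lost performance.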


I ran matrixMulCUBLAS sample code in a loop.

I timed 100 iterations of matrixMulCUBLAS with and without the watch command running.

with watch: 2 minutes 33 seconds
without watch: 2 minutes 7 seconds

So there does appear to be significant host-side overhead. (The reported GPU throughput was invariant between the two cases at ~5500 GFLOPS.)

Don’t run nvidia-smi that often. If you need simple monitoring at that frequency, write your own code using NVML and benchmark it. There is an SDK for NVML, it is fairly easy to use, and there is sufficient documentation to get started.
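For instance, a minimal polling loop along these lines (a sketch; the read function is whatever NVML query you need, and the interval handling is deliberately simplistic):

```python
import time

def monitor(read_sample, interval_s=0.1, n_samples=10):
    """Poll read_sample() every interval_s seconds; return the collected samples."""
    samples = []
    for _ in range(n_samples):
        samples.append(read_sample())
        time.sleep(interval_s)
    return samples

# Hypothetical usage with pynvml (requires a GPU):
# import pynvml
# pynvml.nvmlInit()
# h = pynvml.nvmlDeviceGetHandleByIndex(0)
# print(monitor(lambda: pynvml.nvmlDeviceGetUtilizationRates(h).gpu))
```

Because NVML is initialized once and the process stays resident, this avoids paying nvidia-smi’s startup cost on every sample, unlike the watch approach.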

Hi isaaclee2313!

I faced the same issue lately. Playing around a bit I ended up with the following solution/workaround:

  • by executing the command nvidia-smi -q -l 86400 in the background, nvidia-smi stays resident in memory. It runs in loop mode and re-executes the query only once every 24 hours.
  • in the meantime, you can execute another nvidia-smi command, e.g.: nvidia-smi --query.

Each such invocation returns almost instantly. I monitored this workaround for a while and didn’t observe any overhead.

Best regards.