[SOLVED] per process statistics API (nvidia-smi pmon)

Hello.

I cannot find an NVML (version R352) API to get per-process statistics (the various % utilization values); only usedGpuMemory is available in nvmlProcessInfo_t.
Is an API with statistics like “nvidia-smi pmon” really unavailable?

The accounting statistics API (nvmlDeviceGetAccountingStats()…) attributes the load of the whole GPU (as reported by nvmlDeviceGetUtilizationRates()) to the statistics of every running process (e.g. a busy process increases the statistics of an idle process).
Is this expected behavior?

Test program:

// # gcc -o nvml_test nvml_test.c -I /usr/include/nvidia/gdk -l nvidia-ml -std=c99

#include <stdio.h>
#include <unistd.h>
#include <assert.h>
#include <strings.h>
#include <string.h>

#include <nvml.h>

int main(int argc, char *argv[]) {
 unsigned int count;
 assert( nvmlInit() == NVML_SUCCESS );
 assert( nvmlDeviceGetCount(&count) == NVML_SUCCESS );

 if ((argc == 2) && (!strcmp(argv[1], "restart"))) {
  for(int i=0; i<count; i++) {
   nvmlDevice_t device;

   assert( nvmlDeviceGetHandleByIndex ( i, &device) == NVML_SUCCESS );

   char name[NVML_DEVICE_NAME_BUFFER_SIZE];
   assert( nvmlDeviceGetName ( device, name, sizeof(name) ) == NVML_SUCCESS );

   // clear statistics
   assert( nvmlDeviceSetAccountingMode( device, NVML_FEATURE_DISABLED) == NVML_SUCCESS );
   assert( nvmlDeviceSetAccountingMode( device, NVML_FEATURE_ENABLED) == NVML_SUCCESS );
 
   nvmlEnableState_t e_pers;
   assert( nvmlDeviceGetPersistenceMode (device, &e_pers) == NVML_SUCCESS );

   nvmlEnableState_t e_acct;
   assert( nvmlDeviceGetAccountingMode (device, &e_acct) == NVML_SUCCESS );
   printf("gpu[%d] '%s' %s%s\n", i, name, e_acct?"accounting ":"", e_pers?"persistent":"");
  }
 }

 // header
 printf("%4s %4s %4s %4s %6s\n", "id", "pid", "gpu%", "mem%", "memsz");

 while(1) {
  for(int i=0; i<count; i++) {
   nvmlDevice_t device;
   assert( nvmlDeviceGetHandleByIndex ( i, &device) == NVML_SUCCESS );

   nvmlMemory_t mem;
   assert( nvmlDeviceGetMemoryInfo (device, &mem)== NVML_SUCCESS );

   nvmlUtilization_t util;
   assert( nvmlDeviceGetUtilizationRates ( device, &util) == NVML_SUCCESS );

   // global statistics
   printf("%4d %4s %4d %4d %6d\n", i, "-", util.gpu, util.memory, mem.used/1024/1024);

   unsigned int pa_count=64;
   unsigned int pa[64];
   assert( nvmlDeviceGetAccountingPids( device, &pa_count, pa ) == NVML_SUCCESS );
    for(unsigned int j=0; j<pa_count; j++) {
      nvmlAccountingStats_t stat;
      assert ( nvmlDeviceGetAccountingStats (device, pa[j], &stat) == NVML_SUCCESS );
      // per process accounting (cumulative) statistics
      printf("%4d %4u %4u %4u %6llu\n", i, pa[j], stat.gpuUtilization, stat.memoryUtilization, stat.maxMemoryUsage/1024/1024);
   }
  }
  fflush(stdout);
  sleep(1);
 }

 assert( nvmlShutdown() == NVML_SUCCESS );
 return 0;
}

When will “nvidia-smi pmon” be available on GRID vGPU, to get performance metrics per VM or per process inside a VM?

Thanks for answers, Martin Cerveny

Due to Nvidia's inability to answer any question I reply to myself. I recently found the requested information in the unreleased CUDA 8 package.

NVML API REFERENCE MANUAL, May 7, 2016, Version 361.55 (nvml.pdf), page 32, now carries a new warning (yes, modified the very day I posted my query!):

Warning:
On Kepler devices per process statistics are accurate only if there’s one process running on a GPU.

When will this bug be fixed (a “warning” is not a bugfix)?

Thanks, M.C>

Due to Nvidia's inability to answer any question I reply to myself. I recently found the requested information in the “Grid 4.0” announcement (https://blogs.nvidia.com/blog/2016/08/24/nvidia-grid-monitoring/). It seems Nvidia has finally realized that big-server and virtualization deployments are useless without in-depth resource observability. Let's wait some more for Nvidia to understand that management of shared resources (a programmable time-sliced vGPU scheduler with capping and share proportions) is also a key element of virtualization (https://gridforums.nvidia.com/default/topic/743/?comment=2558). Nvidia, it is time to reinvent the wheel!

The “new” API is only a vGPU API; the other utilization metrics are only bug fixes.
I downloaded the “Grid 4.0” package and found the following new API functions in libnvidia-ml.so.367.43 (a usage sketch follows the list):

nvmlDeviceGetVirtualizationMode
nvmlDeviceSetVirtualizationMode
nvmlDeviceGetActiveVgpus
nvmlDeviceGetCreatableVgpus
nvmlDeviceGetSupportedVgpus
nvmlDeviceGetVgpuUtilization
nvmlVgpuInstanceGetFbUsage
nvmlVgpuInstanceGetFrameRateLimit
nvmlVgpuInstanceGetLicenseStatus
nvmlVgpuInstanceGetType
nvmlVgpuInstanceGetUUID
nvmlVgpuInstanceGetVmDriverVersion
nvmlVgpuInstanceGetVmID
nvmlVgpuTypeGetClass
nvmlVgpuTypeGetDeviceID
nvmlVgpuTypeGetFramebufferSize
nvmlVgpuTypeGetFrameRateLimit
nvmlVgpuTypeGetLicense
nvmlVgpuTypeGetMaxInstances
nvmlVgpuTypeGetName
nvmlVgpuTypeGetNumDisplayHeads
nvmlVgpuTypeGetResolution

As usual, NVidia has intentionally “forgotten” to publish an updated “nvml.h” in CUDA (8) or to release a new GDK (36x).
Where can I download the updated “nvml.h”?
When will the GRID cards K1/K2 be fully supported?

Thanks, M.C>

Due to Nvidia's inability to answer any question I reply to myself. NVidia has published new “nvml.h” and “nvml_grid.h” headers in a new SDK, the “GRID Software Management SDK” (https://developer.nvidia.com/nvidia-grid-software-management-sdk), i.e. not released under the GDK or CUDA.

M.C>

NVidia has bugfixed the standard guest utilization API (not the new host API) only for the new Maxwell GRID cards. When a Kepler GRID card is used, a zero value is intentionally returned instead of the real value. NVidia should rethink this solution.

  1. This utilization monitoring problem has been a known and confirmed driver bug for a few years, without a fix (false values are returned).
    https://www.youtube.com/watch?v=lW_mt0kKY-w

  2. Nvidia's official documentation (GRID VIRTUAL GPU DU-06920-001_v4.0 (GRID) | August 2016), “Chapter 4. MONITORING GPU PERFORMANCE”, does not limit the monitoring functions to specific GRID cards.

  3. The whole new host API in “nvml_grid.h”, including nvmlDeviceGetVgpuUtilization(), is declared “For Kepler or newer fully supported devices.”

This is definitely a bug and it should be fixed for Kepler GRID cards!

M.C>

Too bad Nvidia hasn’t answered you! I hope you’ll have some luck.

I am developing a GPU monitoring tool for my company, and the code you posted is very helpful, thank you very much! I have a question, though: the following functions (which I want to use) are not covered by Nvidia’s API documentation:

nvmlDeviceGetPersistenceMode
nvmlDeviceGetAccountingMode
nvmlDeviceGetAccountingPids
nvmlDeviceGetAccountingStats

Where did you find the documentation?

Thank you,
M.

Standard NVML API docs:
http://docs.nvidia.com/deploy/nvml-api/group__nvmlAccountingStats.html#group__nvmlAccountingStats
http://docs.nvidia.com/deploy/nvml-api/group__nvmlDeviceQueries.html#group__nvmlDeviceQueries
https://docs.nvidia.com/deploy/pdf/NVML_API_Reference_Guide.pdf

Newer versions are bundled with the CUDA SDK.

Due to Nvidia's inability to answer any question I reply to myself.
A second try at the “NVIDIA Virtual GPU Software Management SDK” (it is NVML) has been released (https://developer.nvidia.com/nvidia-grid-software-management-sdk).

The per-process utilization API has finally been published (after 2 years of waiting), usable with driver >= r375 and >= Maxwell:
nvmlDeviceGetProcessUtilization()
nvmlDeviceGetVgpuProcessUtilization()
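
For completeness, a minimal sketch of how this per-process API could replace the accounting loop from my first post, assuming the documented signature and a fixed sample buffer (not a tested tool):

// Hedged sketch: per-process SM/memory/encoder/decoder utilization samples
// collected since the previous query, via nvmlDeviceGetProcessUtilization().
// Buffer size and timestamp handling are my assumptions; requires an r375+
// driver, a Maxwell or newer GPU and the headers from the SDK linked above.
// # gcc -o procutil_test procutil_test.c -l nvidia-ml -std=c99

#include <stdio.h>
#include <unistd.h>
#include <assert.h>

#include <nvml.h>

int main(void) {
 assert( nvmlInit() == NVML_SUCCESS );

 nvmlDevice_t device;
 assert( nvmlDeviceGetHandleByIndex ( 0, &device) == NVML_SUCCESS );

 unsigned long long last_seen = 0; // only samples newer than this are returned

 while(1) {
  nvmlProcessUtilizationSample_t samples[64];
  unsigned int count = 64;

  // NVML_ERROR_NOT_FOUND may be returned when no new samples are available
  if (nvmlDeviceGetProcessUtilization ( device, samples, &count, last_seen) == NVML_SUCCESS) {
   for(unsigned int i=0; i<count; i++) {
    printf("pid %u sm%% %u mem%% %u enc%% %u dec%% %u\n",
           samples[i].pid, samples[i].smUtil, samples[i].memUtil,
           samples[i].encUtil, samples[i].decUtil);
    if (samples[i].timeStamp > last_seen) last_seen = samples[i].timeStamp;
   }
  }
  fflush(stdout);
  sleep(1);
 }
}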