Monitoring K80 GPUs

Hi, we have a number of Ubuntu servers (minimal install) with K80s. We’re looking to monitor the GPUs on those servers with any available monitoring tool (e.g. Nagios, PRTG, LogicMonitor), by whatever means, such as SNMP. I was looking for MIBs for the K80 but could not find any. We need historical data on temperature and utilization, and we need to run reports, see graphs and, of course, set up alerts. What are my options to grab that data from Ubuntu? I’m by no means a developer; I just need this info from an IT standpoint.

Thanks!

Use nvidia-smi for that; look at the --loop and --filename options, or the daemon options.

Hey, thanks for the info.

nvidia-smi is great if I’m monitoring the box live. The --filename option is definitely a way to store historical data, but if only the output were CSV or used some other meaningful delimiter. As it is now, I would need to do some extensive parsing to make use of the data.

Is there anyone who’s currently monitoring their GPUs? I’m sure there’s a great need for this. I need to get the data into some tool where I can configure graphs and set up alerts.

Though you said ‘by no means a developer’, I think your best shot is to write a small Python script that uses the nvidia-ml-py bindings. It shouldn’t be too hard. Bindings are also available for Perl and of course most other languages.
See: NVML (NVIDIA Management Library)
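
To give an idea of what such a script could look like, here is a minimal sketch, assuming the nvidia-ml-py package (imported as pynvml) is installed and the NVIDIA driver is loaded. The function names, the CSV column names and the output path are made up for illustration; only the pynvml calls themselves come from the bindings.

```python
import csv
import os
import time

# Column names are an assumption for this sketch, not anything NVML defines.
FIELDS = ["timestamp", "gpu", "temp_c", "util_pct"]


def read_gpu_samples():
    """Return one dict per GPU with its temperature and utilization, read via NVML."""
    import pynvml  # provided by nvidia-ml-py; imported here so the rest works without it

    pynvml.nvmlInit()
    try:
        samples = []
        for index in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(index)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
            samples.append({
                "timestamp": int(time.time()),
                "gpu": index,
                "temp_c": temp,          # degrees Celsius
                "util_pct": util.gpu,    # GPU utilization in percent
            })
        return samples
    finally:
        pynvml.nvmlShutdown()


def append_csv(samples, path):
    """Append samples to a CSV file, writing the header only if the file is new or empty."""
    needs_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if needs_header:
            writer.writeheader()
        writer.writerows(samples)
```

Calling append_csv(read_gpu_samples(), "gpu_history.csv") from cron once a minute would build up a history file that most graphing tools can import. Note that a K80 board shows up as two GPUs, so expect two rows per card.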
PS: you can also use the -x switch of nvidia-smi to get XML output, but there is a ton of it and it still needs to be parsed.
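
If the XML is too much, nvidia-smi also accepts --query-gpu together with --format=csv, which prints only the fields you ask for. Below is a rough sketch of wrapping that in a check that follows the usual Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The function names and the 80/90 degree thresholds are assumptions for illustration, so pick limits that suit your hardware.

```python
import subprocess

# Fields requested from nvidia-smi; one CSV line per GPU comes back.
QUERY = "index,temperature.gpu,utilization.gpu"


def _to_int(text):
    """Convert a field to int, or None when nvidia-smi reports something like [N/A]."""
    try:
        return int(text)
    except ValueError:
        return None


def parse_query_output(text):
    """Parse the output of --format=csv,noheader,nounits into a list of dicts."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        index, temp, util = [field.strip() for field in line.split(",")]
        rows.append({"gpu": int(index), "temp_c": _to_int(temp), "util_pct": _to_int(util)})
    return rows


def query_gpus():
    """Run nvidia-smi and return the parsed rows (needs the NVIDIA driver installed)."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_query_output(result.stdout)


def nagios_status(rows, warn_c=80, crit_c=90):
    """Return (exit_code, message) based on the hottest GPU; thresholds are hypothetical."""
    temps = [row["temp_c"] for row in rows if row["temp_c"] is not None]
    if not temps:
        return 3, "UNKNOWN - no temperature readings"
    hottest = max(temps)
    if hottest >= crit_c:
        return 2, f"CRITICAL - hottest GPU at {hottest} C"
    if hottest >= warn_c:
        return 1, f"WARNING - hottest GPU at {hottest} C"
    return 0, f"OK - hottest GPU at {hottest} C"
```

A plugin script would call nagios_status(query_gpus()), print the message and exit with the returned code. The same rows can also be appended to a file for history.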