Hi, I am interested in setting up DCGM on Kubernetes to monitor GPU health. I was hoping to get some answers to these questions:
- What is the extent of overlap between `nvidia-smi` and `dcgmi diag` / `dcgmi health`?
- How does `dcgmi health` work? Is it running any diagnostics or passively monitoring some system logs?
- Which errors here are recoverable without any external action? (DCGM Diagnostics, NVIDIA DCGM Documentation)