Configuring DCGM in production

Hi, I am interested in setting up DCGM on Kubernetes to monitor GPU health. I was hoping to get answers to the following questions:

  1. How much overlap is there between nvidia-smi and dcgmi diag / dcgmi health?
  2. How does dcgmi health work? Does it actively run diagnostics, or does it passively monitor system logs?
  3. Which of the errors listed on the "DCGM Diagnostics" page of the NVIDIA DCGM documentation are recoverable without any external action?