Currently, nvidia-smi reports high GPU utilization even when the GPU is waiting on the network (i.e., inside one of the NCCL ops).
However, this is different from high GPU utilization caused by compute kernels actually running: the network-wait case suggests an opportunity for performance optimization.
Is there a way to tell these cases apart? That is, are there other measurements (maybe temperature or power draw?) that would indicate that the GPU compute cores are underutilized?
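
For reference, here is a minimal sketch of how I would collect the candidate signals next to the reported utilization, so they can be compared during a known NCCL wait and during a known compute-bound phase. It assumes nvidia-smi is on the PATH and that the GPU exposes the utilization.gpu, power.draw and temperature.gpu query fields. The helper names (parse_gpu_samples, query_gpus) are made up for illustration, and whether power or temperature actually separates the two cases is exactly what I don't know.

```python
import subprocess

# Fields passed to `nvidia-smi --query-gpu`; the choice of power draw and
# temperature as candidate signals is an assumption, not a known answer.
FIELDS = ["utilization.gpu", "power.draw", "temperature.gpu"]


def parse_gpu_samples(csv_text):
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    samples = []
    for line in csv_text.strip().splitlines():
        values = [v.strip() for v in line.split(",")]
        if len(values) != len(FIELDS):
            continue  # skip malformed lines
        try:
            samples.append({name: float(v) for name, v in zip(FIELDS, values)})
        except ValueError:
            continue  # skip rows with non-numeric values such as "[N/A]"
    return samples


def query_gpus():
    """Run nvidia-smi once and return the parsed samples (hypothetical helper)."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=" + ",".join(FIELDS),
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_samples(result.stdout)


if __name__ == "__main__":
    for index, sample in enumerate(query_gpus()):
        print(index, sample)
```

Running this in a loop while the job is inside an NCCL op, and again while it is in a compute-heavy phase, would show whether any of these fields differ when utilization.gpu reads the same.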