Request: GPU Memory Junction Temperature via nvidia-smi or NVML API

I run/manage a rendering farm with a bunch of RTX 30XX cards. They are slightly overclocked and run 24/7. We try to keep them as cool as possible, but from time to time we get crashes because some cards start running hotter.

To understand what was going on, we had to run a 12-hour benchmark on Windows with HWiNFO64. It showed the memory junction temperature sitting at around 100 °C (max: 108 °C) until the card crashed.

We understand that this is related to GDDR6X memory and how the chips work.
However, we are unable to correctly monitor the memory temperature because we run our cluster under Linux.

Our goal is to keep our investment in these cards, but without proper monitoring, and with the risk of losing cards to high temperatures, it is hard to justify not moving to AMD or older NVIDIA cards.

Note: we use Prometheus Exporters/Grafana to monitor the host and each card. Unfortunately, due to the lack of support in Linux, the exporter is also not able to export the memory junction temperature.
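
To illustrate the gap, here is a minimal sketch of the kind of exporter logic involved, not our actual exporter. It asks nvidia-smi for both `temperature.gpu` and `temperature.memory` and turns the rows into Prometheus text-format lines. The metric name `gpu_temperature_celsius` and the `sensor` label are made up for this example, and the sketch assumes the memory field comes back as a non-numeric placeholder such as N/A on cards where the driver does not expose that sensor, in which case no memory sample is emitted.

```python
import subprocess

# nvidia-smi query fields; temperature.memory is documented for HBM memory.
QUERY = "index,temperature.gpu,temperature.memory"


def read_nvidia_smi():
    """Run nvidia-smi and return its raw CSV output (needs the NVIDIA driver)."""
    return subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout


def to_prometheus(csv_text):
    """Convert nvidia-smi CSV rows into Prometheus text-format sample lines."""
    lines = []
    for row in csv_text.strip().splitlines():
        index, core, memory = [field.strip() for field in row.split(",")]
        for sensor, value in (("core", core), ("memory", memory)):
            # Skip sensors the driver does not report (non-numeric, e.g. N/A).
            if value.isdigit():
                # Hypothetical metric and label names, chosen for this sketch.
                lines.append(
                    f'gpu_temperature_celsius{{gpu="{index}",sensor="{sensor}"}} {value}'
                )
    return lines


if __name__ == "__main__":
    print("\n".join(to_prometheus(read_nvidia_smi())))
```

With a memory junction reading exposed through nvidia-smi or NVML, the same loop would emit a `sensor="memory"` sample per card and we could alert on it in Grafana.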
