Request: GPU Memory Junction Temperature via nvidia-smi or NVML API

I run/manage a rendering farm with a bunch of RTX 30XX cards. They are slightly overclocked and run 24/7. We try to keep them as cool as possible, but from time to time we have crashes because some cards start to run hotter.

We had to run a 12-hour benchmark on Windows with HWiNFO64 to understand what was going on. We saw the memory junction temperature climb to around 100 °C (max: 108 °C) before the card crashed.

We understand that this is related to GDDR6X memory and how the chips work.
However, we are unable to monitor the memory temperature properly because we run our cluster under Linux.

Our goal is to keep our investment in these cards, but without proper monitoring, and with the risk of losing cards to high temperatures, it is hard to justify not moving to AMD or older NVIDIA cards.

Note: we use Prometheus Exporters/Grafana to monitor the host and each card. Unfortunately, due to the lack of support in Linux, the exporter is also not able to export the memory junction temperature.
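To illustrate the gap, here is a minimal sketch of what an exporter on our side could look like. It assumes the `nvidia-smi --query-gpu` fields `temperature.gpu` and `temperature.memory`; on our cards the memory field comes back as not available, so only the core temperature gets exported. The metric name `gpu_temperature_celsius` and the `sensor` label are made up for this example and are not what any existing exporter uses.

```python
import subprocess

# Fields requested from nvidia-smi; temperature.memory is expected to be
# unavailable on the cards discussed here (assumption based on our setup).
QUERY = "index,temperature.gpu,temperature.memory"


def read_nvidia_smi():
    """Run nvidia-smi and return its raw CSV output, one line per GPU."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def to_prometheus(csv_text):
    """Convert nvidia-smi CSV rows into Prometheus text-format samples.

    Any field that is not a plain integer (for example a "not available"
    placeholder) is skipped, so a missing sensor produces no sample.
    """
    lines = []
    for row in csv_text.strip().splitlines():
        index, core, memory = [field.strip() for field in row.split(",")]
        for sensor, value in (("core", core), ("memory", memory)):
            if value.isdigit():
                # Hypothetical metric name and labels, for illustration only.
                lines.append(
                    f'gpu_temperature_celsius{{gpu="{index}",sensor="{sensor}"}} {value}'
                )
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_prometheus(read_nvidia_smi()))
```

With this approach the `sensor="memory"` series simply never appears for our cards, which is exactly the data we are missing.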

6 Likes

damn. seems Jensen and his team don't give a s**t.

3 Likes

+1.

I think this is a must.

3 Likes

+1 for this cc @kayccc

3 Likes

This is necessary for DL trainers. +1 request.

3 Likes

+1
More than 2 months have passed. NVIDIA, please take action.

3 Likes

+1 here - this should definitely be made available for Linux, especially when we know there are potential issues here.

3 Likes

+1 Please

3 Likes

We are currently tracking it under internal bug number 3269484.

9 Likes

That's great! Are you able to share any more information (when did you start addressing this/when was the ticket raised, any progress on the fix, any ETA for the fix)?

5 Likes

It was filed some time ago and is being prioritized. I hear your concerns and am making sure it gets addressed.

13 Likes

+1
Critical for deep learning

5 Likes

Please prioritise this. I've been stuck for a month without being able to use my expensive card because I can't check VRAM temperatures when training models and don't want to burn my card. This should be of the highest priority, since we have no idea what our hardware is being exposed to under full load.

4 Likes

+1
This functionality is needed.

3 Likes

+1 Fully agree

4 Likes

Knowing if you hit 110 °C or more is really important, especially if you're overclocking. We need to be able to control the temperature in Linux :0

5 Likes

+10086 Fully agree

4 Likes

Would like this please!

4 Likes

+1. Critical for long renders

5 Likes

Please, please implement this in the Linux driver.

It is a much-needed feature for Linux users, and many projects depend on it.

5 Likes